Bluesky AI Data Control: Users to Decide How Data is Used

Bluesky Addresses AI Data Usage and User Consent
During a presentation at the SXSW conference in Austin on Monday, Jay Graber, CEO of Bluesky, revealed that the social network is developing a system to give users control over how their data is used by generative AI.
Although the company has no plans to train its own AI systems on user-generated content, the growing demand for data to train artificial intelligence models makes a clear AI policy necessary for the emerging social platform.
AI Training on Existing Content
Bluesky has already seen its publicly available content used to train AI systems. Last year, 404 Media identified a dataset of 1 million Bluesky posts hosted on Hugging Face.
In contrast, Bluesky’s competitor, X, is actively utilizing user posts to train its AI chatbot, Grok, through its affiliated company xAI.
A policy change implemented last fall permitted third parties to use X users' posts for AI training. That decision, along with political developments following the U.S. elections and the growing influence of X owner Elon Musk, drove a further migration of users from X to Bluesky.
Bluesky’s Growth and Proposed Framework
As a result, the open-source, decentralized alternative to X has grown substantially, reaching more than 32 million users in roughly two years.
Graber explained at SXSW that Bluesky is collaborating with partners to establish a framework that lets users consent to, or opt out of, having their data used in generative AI applications.
“We really believe in user choice,” Graber stated, emphasizing that users will have the ability to define how their Bluesky content can be utilized.
Comparison to Search Engine Scraping
“It could be something similar to how websites specify whether they want to be scraped by search engines or not,” she elaborated.
“Search engines are still capable of scraping websites regardless of these settings, as websites are inherently public on the internet. However, the ‘robots.txt’ file is generally respected by many search engines,” she continued.
“Widespread adoption and support from users, companies, and regulators are crucial for the success of this framework. But I believe it represents a viable path forward.”
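For context, the robots.txt mechanism Graber references is a plain-text file served at a site's root that crawlers voluntarily consult before fetching pages. The sketch below uses Python's standard urllib.robotparser to show how a well-behaved crawler performs that check; the GPTBot user agent (OpenAI's crawler) and example.com URL are used purely for illustration.

```python
from urllib import robotparser

# robots.txt is advisory: a crawler must choose to consult it.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Ask whether a given user agent may fetch a given URL.
if rp.can_fetch("GPTBot", "https://example.com/posts/123"):
    print("Crawling permitted by robots.txt")
else:
    print("Site asks this crawler to stay out")
```

As Graber notes, nothing technically prevents a crawler from ignoring the file; the mechanism works only because major crawlers choose to honor it, which is the same adoption problem Bluesky's framework would face.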
GitHub Proposal and Collaboration
The proposed system, currently available on GitHub, would allow users to grant or deny consent at the account level or for individual posts. Other companies would then be expected to honor these preferences.
“We’ve been working on this with others in the industry who share concerns about the impact of AI on data privacy,” Graber added. “I think it’s a constructive step in the right direction.”
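Bluesky has not published final details of the consent format, but the two-tier granularity Graber describes could be resolved along these lines. The field names, values, and deny-by-default behavior below are illustrative assumptions, not the actual proposal.

```python
from typing import Optional

# Hypothetical preference values; names are illustrative, not from the proposal.
ALLOW, DENY = "allow", "deny"

def genai_consent(account_pref: Optional[str], post_pref: Optional[str]) -> str:
    """Resolve whether a post may be used for generative AI training.

    In this sketch, a per-post setting overrides the account-wide
    default, and the absence of any signal is treated conservatively
    as a denial (an assumption, not a stated Bluesky policy).
    """
    if post_pref is not None:
        return post_pref
    if account_pref is not None:
        return account_pref
    return DENY

# Example: the account allows AI use, but this particular post opts out.
print(genai_consent(account_pref=ALLOW, post_pref=DENY))  # -> "deny"
```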