Bluesky Data & AI Training Debate - What Users Are Saying

Bluesky Considers User Control Over Data Scraping
Bluesky, a social networking platform, recently published a proposal on GitHub outlining potential new user controls. The controls would let users specify whether their posts and data may be used for purposes such as generative AI training and public archiving.
Discussion at South by Southwest
CEO Jay Graber first discussed the proposal during a presentation at South by Southwest earlier this week, but it drew renewed attention after she shared details about it on Bluesky itself, where the plans sparked concern among some users.
User Concerns and Perceived Reversal
Some users read the proposal as a departure from Bluesky’s earlier commitments, which included assurances that user data would not be sold to advertisers and that posts would not be used for AI training. One user, Sketchette, voiced strong disapproval, saying the platform’s appeal lay precisely in that kind of privacy.
Addressing Existing Scraping Practices
Graber pointed out that generative AI companies are already scraping publicly available data from across the internet, and emphasized that, like any website, everything on Bluesky is public. Bluesky therefore wants to establish a “new standard” to govern how that scraping happens.
Drawing Parallels to Robots.txt
This proposed standard would work much like the robots.txt file, which websites use to tell web crawlers which parts of a site they may access. Ongoing debates over AI training and copyright have put robots.txt under a sharper spotlight, in particular because it is not legally enforceable.
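In practice, a well-behaved crawler fetches a site’s robots.txt and checks whether it may request a given path before scraping it. A minimal sketch using Python’s standard-library robots.txt parser (the domain and user-agent string are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder domain).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks permission before fetching a page.
# "ExampleAIBot" stands in for whatever user-agent string a scraper announces.
if rp.can_fetch("ExampleAIBot", "https://example.com/posts/123"):
    print("robots.txt allows this fetch")
else:
    print("robots.txt disallows this fetch")
```

The check is entirely voluntary: nothing in robots.txt itself prevents a crawler from ignoring it, which is the enforcement gap the article returns to below.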
Proposed User Data Control Categories
The proposal outlines four distinct categories of user control over data. Users of the Bluesky app, or of applications built on the AT Protocol, would be able to manage these permissions in their settings; a hypothetical sketch of how such preferences might be represented follows the list. The categories are:
- Generative AI
- Protocol bridging (connecting different social networks)
- Bulk datasets
- Web archiving (like the Internet Archive’s Wayback Machine)
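The exact record format lives in the GitHub proposal itself; purely as an illustration of the idea (none of these field names are taken from the proposal), a user’s choices across the four categories might be represented as a simple record like this:

```python
from dataclasses import dataclass

@dataclass
class UserDataPreferences:
    """Hypothetical per-user consent flags for the four proposed categories.

    Field names are illustrative only; the real schema is defined in
    Bluesky's GitHub proposal, not here.
    """
    allow_generative_ai: bool = False  # use of posts for generative AI training
    allow_bridging: bool = True        # bridging posts to other social networks
    allow_bulk_datasets: bool = False  # inclusion in bulk dataset exports
    allow_web_archiving: bool = True   # archiving, e.g. by the Wayback Machine

# Example: a user who opts out of AI training but allows web archiving.
prefs = UserDataPreferences(allow_generative_ai=False, allow_web_archiving=True)
```

Whatever shape the final format takes, the idea is that each account carries explicit consent signals that applications and data consumers on the protocol can read.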
Respecting User Intent
If a user opts out of allowing their data to be used for generative AI training, the proposal stipulates that AI companies and research teams should respect that preference, both when scraping websites and when obtaining data through bulk transfers over the protocol.
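As a rough, hypothetical sketch of what honoring that preference could look like on the consumer side (the post structure and field name below are invented for illustration, not taken from the proposal):

```python
# Hypothetical post records paired with their authors' declared preferences.
# The field name is illustrative; the real signal would come via the protocol.
posts = [
    {"text": "hello world", "author_allows_generative_ai": False},
    {"text": "feel free to train on this", "author_allows_generative_ai": True},
]

def filter_for_ai_training(records):
    """Keep only posts whose authors have opted in to generative AI training."""
    return [r for r in records if r.get("author_allows_generative_ai")]

training_corpus = filter_for_ai_training(posts)
print(len(training_corpus))  # 1: only the opted-in post is retained
```

The same filter would apply whether the posts were scraped from the web or obtained through a bulk data transfer.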
Positive Reception from Industry Observers
Molly White, author of the Citation Needed newsletter and the Web3 is Going Just Great blog, characterized it as “a good proposal.” She was surprised by the negative reactions, noting that Bluesky is not actively encouraging AI scraping but rather trying to introduce a consent mechanism.
Reliance on Ethical Compliance
White acknowledged a potential weakness: the scheme relies on scrapers voluntarily honoring these signals. She pointed out that some companies have previously ignored robots.txt or engaged in copyright infringement to obtain data.