Bluesky Data & AI Training Debate - What Users Are Saying

Bluesky Considers User Control Over Data Scraping
Bluesky, a social networking platform, recently published a proposal on GitHub outlining potential new user controls. The controls would let users specify whether their posts and data may be used for purposes such as generative AI training and public archiving.
Discussion at South by Southwest
CEO Jay Graber first discussed the proposal during a presentation at South by Southwest earlier this week, but it drew renewed attention after she shared details about it on Bluesky itself, where the plans sparked concern among some users.
User Concerns and Perceived Reversal
Some users read the proposal as a departure from Bluesky’s earlier commitments, which included assurances that user data would not be sold to advertisers and that posts would not be used for AI training. One user, Sketchette, voiced strong disapproval, saying the platform’s appeal lay precisely in that kind of privacy.
Addressing Existing Scraping Practices
Graber pointed out that generative AI companies are already scraping publicly available data from across the internet, and emphasized that, like any website, everything on Bluesky is public. Bluesky therefore wants to establish a “new standard” to govern how that scraping happens.
Drawing Parallels to Robots.txt
This proposed standard would work much like the robots.txt file, which websites use to tell web crawlers which parts of a site they may access. Ongoing debates over AI training and copyright have put robots.txt under a sharper spotlight, in particular because it is not legally enforceable.
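In practice, a well-behaved crawler fetches a site’s robots.txt and checks whether it may request a given path before scraping it. A minimal sketch using Python’s standard-library robots.txt parser (the domain and user-agent string are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (example.com is a placeholder domain).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks permission before fetching a page.
# "ExampleAIBot" stands in for whatever user-agent string a scraper announces.
if rp.can_fetch("ExampleAIBot", "https://example.com/posts/123"):
    print("robots.txt allows this fetch")
else:
    print("robots.txt disallows this fetch")
```

The check is entirely voluntary: nothing in robots.txt itself prevents a crawler from ignoring it, which is the enforcement gap the article returns to below.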
Proposed User Data Control Categories
The proposal outlines four distinct categories of user control over data. Users of the Bluesky app, or of applications built on the AT Protocol, would be able to manage these permissions in their settings; a hypothetical sketch of how such preferences might be represented follows the list. The categories are:
- Generative AI
- Protocol bridging (connecting different social networks)
- Bulk datasets
- Web archiving (like the Internet Archive’s Wayback Machine)
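The exact record format lives in the GitHub proposal itself; purely as an illustration of the idea (none of these field names are taken from the proposal), a user’s choices across the four categories might be represented as a simple record like this:

```python
from dataclasses import dataclass

@dataclass
class UserDataPreferences:
    """Hypothetical per-user consent flags for the four proposed categories.

    Field names are illustrative only; the real schema is defined in
    Bluesky's GitHub proposal, not here.
    """
    allow_generative_ai: bool = False  # use of posts for generative AI training
    allow_bridging: bool = True        # bridging posts to other social networks
    allow_bulk_datasets: bool = False  # inclusion in bulk dataset exports
    allow_web_archiving: bool = True   # archiving, e.g. by the Wayback Machine

# Example: a user who opts out of AI training but allows web archiving.
prefs = UserDataPreferences(allow_generative_ai=False, allow_web_archiving=True)
```

Whatever shape the final format takes, the idea is that each account carries explicit consent signals that applications and data consumers on the protocol can read.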
Respecting User Intent
If a user opts out of allowing their data to be used for generative AI training, the proposal stipulates that AI companies and research teams should respect that preference, both when scraping websites and when obtaining data through bulk transfers over the protocol.
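As a rough, hypothetical sketch of what honoring that preference could look like on the consumer side (the post structure and field name below are invented for illustration, not taken from the proposal):

```python
# Hypothetical post records paired with their authors' declared preferences.
# The field name is illustrative; the real signal would come via the protocol.
posts = [
    {"text": "hello world", "author_allows_generative_ai": False},
    {"text": "feel free to train on this", "author_allows_generative_ai": True},
]

def filter_for_ai_training(records):
    """Keep only posts whose authors have opted in to generative AI training."""
    return [r for r in records if r.get("author_allows_generative_ai")]

training_corpus = filter_for_ai_training(posts)
print(len(training_corpus))  # 1: only the opted-in post is retained
```

The same filter would apply whether the posts were scraped from the web or obtained through a bulk data transfer.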
Positive Reception from Industry Observers
Molly White, author of the Citation Needed newsletter and the Web3 is Going Just Great blog, characterized it as “a good proposal.” She was surprised by the negative reactions, noting that Bluesky is not actively encouraging AI scraping but rather trying to introduce a consent mechanism.
Reliance on Ethical Compliance
White acknowledged a potential weakness: the scheme relies on scrapers voluntarily honoring these signals. She pointed out that some companies have previously ignored robots.txt or engaged in copyright infringement to obtain data.