AI Data Licensing Protocol Launched by RSS Co-Creator

The AI Industry Confronts its Data Licensing Challenges
Following Anthropic’s $1.5 billion copyright settlement, the artificial intelligence sector is actively addressing issues surrounding its training data. Currently, approximately 40 additional lawsuits are pending, seeking compensation for the use of data without proper licensing – including a case involving Midjourney and the depiction of Superman.
Without a standardized licensing framework, AI firms could encounter a substantial increase in copyright litigation, potentially hindering the industry’s progress.
Introducing Real Simple Licensing (RSL)
A collaborative effort by technologists and web publishers has resulted in the creation of a system designed to facilitate data licensing on a large scale, contingent upon adoption by AI companies. This initiative, known as Real Simple Licensing (RSL), already has the backing of prominent web publishers such as Reddit, Quora, and Yahoo. The central question now revolves around whether this growing support will be sufficient to initiate negotiations with leading AI laboratories.
Eckart Walther, a co-founder of RSL and the original creator of the RSS standard, explained that the primary objective was to establish a training-data licensing system capable of operating across the entire internet. “Machine-readable licensing agreements are essential for the functioning of the internet,” Walther stated to TechCrunch. “RSL is specifically designed to provide this solution.”
While organizations like the Dataset Providers Alliance have long advocated for more transparent data collection practices, RSL represents the first comprehensive attempt to develop both the technical and legal infrastructure necessary to implement such practices effectively.
The RSL Protocol and Collective Licensing
The RSL Protocol defines specific licensing terms that publishers can apply to their content, determining whether AI companies require a customized license or adherence to Creative Commons stipulations. Websites participating in the system will incorporate these terms into their “robots.txt” file, using a standardized format for easy identification of data usage rights.
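As a rough illustration of how a crawler might discover such terms, the sketch below scans a robots.txt body for a licensing pointer. The `License:` directive name, the example URL, and the function name `find_license_urls` are assumptions made for illustration; the article does not describe the actual RSL syntax.

```python
from typing import List


def find_license_urls(robots_txt: str, directive: str = "license") -> List[str]:
    """Return the value of every `<directive>: <value>` line in a robots.txt body.

    The directive name is a hypothetical placeholder, not confirmed RSL syntax.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip().lower() == directive and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content for a participating site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
# Hypothetical licensing pointer
License: https://example.com/license.xml
"""

print(find_license_urls(EXAMPLE_ROBOTS))
```

A crawler built along these lines would fetch the referenced file and read the licensing terms before deciding whether it may use the site's content.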
Legally, the RSL team has formed a collective licensing organization, the RSL Collective, to negotiate terms and distribute royalties, mirroring the functions of ASCAP for musicians or MPLC for films. This aims to provide a single point of contact for royalty payments and enable rights holders to establish terms with numerous potential licensees simultaneously.
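The article does not say how the RSL Collective would divide royalties among its members. Purely to illustrate what a collective payout could look like, the sketch below splits a pool in proportion to a per-publisher usage count. The pro-rata rule, the function name `split_pool`, and all figures are hypothetical.

```python
from typing import Dict


def split_pool(pool_cents: int, usage: Dict[str, int]) -> Dict[str, int]:
    """Split a royalty pool (in cents) pro rata by non-negative usage counts.

    Hypothetical rule for illustration only. Cents lost to rounding down are
    handed out one at a time to the publishers with the largest remainders,
    so the payouts always sum to the full pool.
    """
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage counts must sum to a positive number")

    # Integer share for each publisher, rounded down.
    shares = {name: pool_cents * count // total for name, count in usage.items()}

    # Distribute the cents that rounding down left over.
    leftover = pool_cents - sum(shares.values())
    by_remainder = sorted(
        usage, key=lambda name: (pool_cents * usage[name]) % total, reverse=True
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Made-up usage counts for two made-up publishers.
print(split_pool(1000, {"publisher_x": 3, "publisher_y": 1}))
```

Working in integer cents avoids floating-point drift, which matters when many small payouts must add up exactly to the amount collected.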
Growing Support for the RSL Initiative
A diverse range of web publishers has already joined the RSL Collective, including Yahoo, Reddit, Medium, O’Reilly Media, Ziff Davis (parent company of Mashable and CNET), Internet Brands (owner of WebMD), People Inc., and The Daily Beast. Other entities, such as Fastly, Quora, and Adweek, are endorsing the standard without formally joining the collective.
Significantly, the RSL Collective includes publishers that have already secured individual licensing agreements, notably Reddit, which reportedly earns around $60 million annually from Google for data usage in training. Companies retain the option to negotiate bespoke deals within the RSL framework, similar to how artists can establish unique licensing terms while still receiving royalties through ASCAP.
However, for smaller publishers lacking the leverage to negotiate individual agreements, RSL’s collective terms are likely to be their primary avenue for compensation.
Challenges in Tracking and Attribution
Determining when royalties are due for specific training data presents unique challenges for AI models. The process is relatively straightforward for products like Google’s AI Search Abstracts, which source data from the web in real-time and provide clear attribution for each piece of information.
However, if training data ingestion isn’t logged, verifying whether a specific document was used in an LLM can be exceedingly difficult, particularly if publishers request payment per inference rather than a flat fee, as offered by some RSL licenses.
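To make the logging point concrete, here is a minimal sketch of an ingestion log that records a content fingerprint for each document added to a training set, so that a later query can check whether a given text was ingested. The class `IngestionLog` and its methods are hypothetical; nothing here is described by RSL or by any AI lab.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Optional


class IngestionLog:
    """Hypothetical in-memory log of documents ingested for training."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}

    @staticmethod
    def fingerprint(text: str) -> str:
        # SHA-256 of the exact text; any edit to the text changes the digest.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def record(self, url: str, text: str) -> str:
        """Log a document at ingestion time and return its fingerprint."""
        digest = self.fingerprint(text)
        # Keep the first record if the same text is ingested again.
        self._entries.setdefault(
            digest,
            {"url": url, "ingested_at": datetime.now(timezone.utc).isoformat()},
        )
        return digest

    def lookup(self, text: str) -> Optional[dict]:
        """Return the log entry for this exact text, or None if never logged."""
        return self._entries.get(self.fingerprint(text))


log = IngestionLog()
log.record("https://example.com/article", "Full text of a licensed article.")
print(log.lookup("Full text of a licensed article.") is not None)
print(log.lookup("Some text that was never ingested."))
```

Exact-match hashing only finds byte-identical copies. Detecting edited or excerpted versions would need fuzzier matching, and tying individual inferences back to specific documents is a much harder problem that this sketch does not attempt.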
Optimism and the Path Forward
Despite these challenges, RSL’s creators are confident that AI companies can manage the complexities of tracking and attribution. “Some existing licensing agreements already require reporting capabilities, demonstrating its feasibility,” says Doug Leeds, a co-founder of RSL and former CEO of IAC Publishing. “Absolute perfection isn’t necessary; a reasonably accurate system for compensating rights holders is sufficient.”
The crucial question remains whether AI companies will embrace the RSL system. While companies like Scale AI and Mercor demonstrate a willingness to pay for data, the web has historically been viewed as a source of inexpensive, lower-quality data. The availability of datasets like Common Crawl may make it difficult to extract royalties for data that labs are accustomed to obtaining for free.
Furthermore, the distinction between legitimate web scraping and machine-enhanced browsing, as highlighted by the recent dispute between Cloudflare and Perplexity, isn’t always clear.
Looking Ahead
Leeds addressed this concern by referencing recent statements from AI leaders advocating for a system like RSL, particularly Sundar Pichai’s remarks at last year’s DealBook Summit. The RSL team intends to hold these leaders accountable for their public endorsements. “They have publicly stated the need for a system like this,” Leeds emphasized. “We require a protocol, a functioning system.”
A viable solution may now be within reach.