EleutherAI Releases Large Open-Source Dataset for AI Training
EleutherAI, an AI research organization, has unveiled a substantial collection of text data intended for training artificial intelligence models. The dataset is notable for consisting entirely of licensed and open-domain text.
Dataset Creation and Collaboration
The creation of this dataset, formally named the Common Pile v0.1, spanned approximately two years. It was a collaborative effort involving AI startups such as Poolside and Hugging Face, alongside contributions from numerous academic institutions.
With a total size of 8 terabytes, the Common Pile v0.1 served as the foundation for training two new AI models developed by EleutherAI: Comma v0.1-1T and Comma v0.1-2T. EleutherAI asserts that these models perform comparably to models built using unlicensed, copyrighted data.
Legal Landscape and AI Training
Currently, several AI companies, including OpenAI, are facing legal challenges over their AI training methodologies. These practices often involve scraping the web, including copyrighted materials such as books and academic research papers, to construct training datasets.
While some companies have established licensing agreements with content providers, many rely on the U.S. legal principle of fair use as a defense against potential copyright infringement claims.
Transparency Concerns in AI Research
EleutherAI contends that these ongoing lawsuits have “significantly reduced” the transparency exhibited by AI companies. This lack of openness, the organization believes, is detrimental to the broader AI research community, hindering the understanding of model functionality and potential weaknesses.
Stella Biderman, EleutherAI’s executive director, articulated this concern in a blog post published on Hugging Face. She stated that researchers within certain companies have cited legal proceedings as a reason for their inability to publish research focused on data-intensive areas.
Details of the Common Pile v0.1
The Common Pile v0.1 is available for download through Hugging Face’s AI development platform and GitHub. Its creation involved consultation with legal professionals and draws upon resources like 300,000 public domain books digitized by the Library of Congress and the Internet Archive.
Furthermore, EleutherAI leveraged Whisper, OpenAI’s open-source speech-to-text model, to transcribe audio content for inclusion in the dataset.
Performance of Comma Models
EleutherAI posits that Comma v0.1-1T and Comma v0.1-2T demonstrate the effectiveness of the Common Pile v0.1. The organization claims these models, each containing 7 billion parameters and trained on a subset of the dataset, rival the performance of Meta’s initial Llama AI model in benchmarks assessing coding, image understanding, and mathematical abilities.
Parameters, often referred to as weights, are the internal components within an AI model that dictate its behavior and responses.
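The idea that weights dictate a model's behavior can be made concrete with a toy example (a minimal illustrative sketch, unrelated to the Comma models themselves): a one-neuron linear model whose entire "behavior" is determined by its weights and bias.

```python
# Illustrative sketch: "parameters" (weights) are the learned numbers that
# fully determine how a model maps inputs to outputs. Here, a single
# linear neuron stands in for a model.

def predict(x, weights, bias):
    """Weighted sum of the inputs plus a bias term: the simplest 'model'."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# The same input produces different outputs under different parameter
# settings -- changing the weights changes the model's behavior.
x = [1.0, 2.0]
print(predict(x, weights=[0.5, 0.5], bias=0.0))   # 1.5
print(predict(x, weights=[2.0, -1.0], bias=1.0))  # 1.0
```

Training a large model amounts to adjusting billions of such numbers (7 billion in each Comma model) until the outputs match the training data well.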
The Future of Openly Licensed Data
“We believe the prevailing notion that unlicensed text is essential for achieving high performance is not substantiated,” Biderman explained in her post. “As the volume of openly licensed and public domain data expands, we anticipate improvements in the quality of models trained on such content.”
Addressing Past Practices
The Common Pile v0.1 can also be viewed as an attempt by EleutherAI to rectify past actions. The organization previously released The Pile, an open dataset that included copyrighted material and has since drawn criticism and legal scrutiny.
Ongoing Commitment to Open Datasets
EleutherAI has pledged to release open datasets on a more frequent basis, continuing its collaboration with research and infrastructure partners.
Updated 9:48 a.m. Pacific: Biderman clarified on X that EleutherAI contributed to the release of the datasets and models, but emphasized the involvement of numerous partners, including the University of Toronto, which played a leading role in the research.