MLCommons and Hugging Face Release Massive Speech Dataset for AI

New Public Dataset for AI Speech Research Released
A collaborative effort between MLCommons, a nonprofit dedicated to AI safety, and Hugging Face, an AI development platform, has resulted in the release of a substantial collection of public domain voice recordings. This resource is intended to facilitate advancements in AI research.
Extensive Collection of Audio Data
The dataset, formally named Unsupervised People’s Speech, comprises over a million hours of audio content. It encompasses recordings in at least 89 different languages. MLCommons initiated this project to bolster research and development within the field of speech technology.
In a recent blog post, the organization said that supporting natural language processing research for languages other than English is crucial to making communication technologies accessible on a global scale.
Potential Research Applications
Researchers can leverage this dataset in several key areas. These include improving speech models for languages with limited resources, refining speech recognition across diverse accents and dialects, and exploring innovative applications in speech synthesis.
Potential Risks and Considerations
While the goals of this project are commendable, it’s important to acknowledge the potential risks associated with utilizing large AI datasets like Unsupervised People’s Speech.
Data Bias Concerns
One significant concern is the potential for data bias. The recordings come from Archive.org, whose contributor base is predominantly English-speaking, and American in particular. Consequently, the majority of the audio within the dataset features American-accented English.
This composition could lead to AI systems trained on this data exhibiting biases. For instance, they might struggle with transcribing English spoken by non-native speakers or generating synthetic voices in languages other than English.
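For developers who want to gauge this skew before training, one simple check is to measure how the audio hours are split across languages. The following is a minimal Python sketch of that idea, not part of the MLCommons release: the record layout (the `language` and `duration_s` fields) and the sample values are assumptions made up for illustration, and the real dataset's metadata may be organized differently.

```python
from collections import Counter


def language_shares(records):
    """Return each language's share of total audio duration (0.0 to 1.0).

    `records` is an iterable of dicts with hypothetical keys
    "language" (a language code) and "duration_s" (length in seconds).
    """
    seconds = Counter()
    for rec in records:
        seconds[rec["language"]] += rec["duration_s"]
    total = sum(seconds.values())
    if total == 0:
        return {}
    return {lang: secs / total for lang, secs in seconds.items()}


# Made-up sample metadata, for illustration only.
sample = [
    {"language": "en", "duration_s": 5400.0},
    {"language": "en", "duration_s": 1800.0},
    {"language": "es", "duration_s": 1200.0},
    {"language": "sw", "duration_s": 600.0},
]

shares = language_shares(sample)
# Print languages from largest to smallest share of audio time.
for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {share:.1%}")
```

Weighting by duration rather than by file count matters here, since a handful of long recordings can dominate a corpus even when many short clips exist in other languages.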
Consent and Licensing Issues
Another potential issue involves the consent of individuals whose voices are included. There's a possibility that some recordings were contributed without full awareness of their intended use in AI research, including commercial applications.
Although MLCommons asserts that all recordings are either in the public domain or licensed under Creative Commons, the possibility of errors in licensing information exists.
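Developers who want to double-check licensing on their own could flag recordings whose license field is missing or unexpected and set them aside for manual review. The sketch below illustrates that approach in Python under stated assumptions: the `id` and `license` fields, the license strings, and the allow-list are all hypothetical and would need to be adapted to whatever metadata the dataset actually provides.

```python
# Hypothetical allow-list of license prefixes, compared in lowercase.
ALLOWED_PREFIXES = ("public-domain", "cc0", "cc-by")


def flag_license_issues(records):
    """Return the ids of records whose license is missing or not allow-listed.

    `records` is an iterable of dicts with hypothetical keys
    "id" and "license" (the license may be absent or None).
    """
    flagged = []
    for rec in records:
        lic = (rec.get("license") or "").strip().lower()
        # An empty string matches no prefix, so missing licenses are flagged.
        if not lic.startswith(ALLOWED_PREFIXES):
            flagged.append(rec["id"])
    return flagged


# Made-up sample metadata, for illustration only.
sample_records = [
    {"id": "a", "license": "CC-BY-4.0"},
    {"id": "b", "license": None},
    {"id": "c", "license": "public-domain"},
    {"id": "d", "license": "all-rights-reserved"},
    {"id": "e"},
]

needs_review = flag_license_issues(sample_records)
print(needs_review)
```

A check like this only catches missing or unrecognized labels. It cannot tell whether a label that looks valid was applied correctly in the first place, which is the kind of error the paragraph above describes.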
Existing Issues with Dataset Licensing
Analysis from MIT has revealed that numerous publicly available AI training datasets lack proper licensing information and contain inaccuracies. Advocates for creators, such as Ed Newton-Rex, CEO of Fairly Trained, argue against requiring creators to actively “opt out” of these datasets.
Newton-Rex points out that many creators lack a viable method for opting out. Furthermore, even for those who can, the process is often confusing and incomplete. He contends that placing the burden of opting out on creators is unfair, given that AI systems trained on their work can go on to compete with them.
Ongoing Maintenance and Developer Caution
MLCommons has committed to ongoing updates, maintenance, and quality improvements for Unsupervised People’s Speech. However, developers are advised to proceed with caution, considering the potential flaws inherent in the dataset.