
MLCommons Releases 86,000-Hour Speech Dataset for AI Research

December 3, 2020

Developing a machine learning system requires substantial data, but acquiring this data can often be challenging. MLCommons was established to bring together various companies and organizations to create extensive, publicly available datasets for AI training. This collaborative effort aims to enable researchers worldwide to work more effectively and collectively propel the advancement of this evolving field. Its initial project, the People’s Speech Dataset, is significantly larger and more diverse than existing alternatives.

MLCommons is a newly formed nonprofit organization closely associated with MLPerf. MLPerf has successfully gathered input from numerous companies and academic institutions to establish industry-standard benchmarks for evaluating machine learning performance. Through this process, the team identified a critical shortage of openly accessible datasets suitable for widespread use.

To compare models meaningfully – for instance, a model developed by Google against one from Amazon or one from UC Berkeley – they must all be evaluated on the same test data. In the realm of computer vision, the ImageNet dataset serves as a widely adopted and frequently cited resource for researchers and practitioners. However, no comparably comprehensive dataset currently exists for tasks like speech-to-text.

“Benchmarks are crucial for fostering constructive and quantifiable discussions about progress. It became clear that to truly advance the industry, we need readily available datasets – but many are restricted by licensing issues or don’t represent the current state of the art,” explained David Kanter, co-founder and executive director of MLCommons.

While major corporations possess vast collections of voice data, this data is typically proprietary and may be subject to legal limitations preventing its use by others. Publicly available datasets exist, but their limited size – often containing only a few thousand hours of data – restricts their usefulness, as modern competitive systems demand considerably more.

“Creating large datasets is beneficial because it enables the creation of benchmarks, but it also accelerates progress for the entire community. We may not be able to match the resources available internally to large companies, but we can significantly narrow the gap,” Kanter stated. MLCommons was founded to facilitate the creation, management, and accessibility of these necessary data resources and connections.

The People’s Speech Dataset was compiled from diverse sources, including approximately 65,000 hours of English audiobooks with corresponding text. An additional 15,000 hours were gathered from various online sources, encompassing a range of acoustic environments, speakers, and speaking styles, such as conversational speech versus narration. Furthermore, 1,500 hours of English audio were sourced from Wikipedia, and 5,000 hours of synthetic speech generated by GPT-2 were incorporated (“A bit of the snake eating its own tail,” Kanter quipped). The dataset includes speech in fifty-nine languages, although English constitutes the majority of the content.

While the goal is to maximize diversity – recognizing that a virtual assistant designed for Portuguese cannot be effectively built using only English data – it’s also important to establish a baseline understanding of data requirements. For example, is 10,000 hours of data sufficient for developing a functional speech-to-text model? Does increasing this to 20,000 hours significantly improve development speed, efficiency, or overall performance? What quantity of data is needed to achieve proficiency in American English while also supporting Indian and British accents?

The prevailing consensus regarding datasets is that “larger is better,” and companies like Google and Apple are working with datasets far exceeding a few thousand hours. This understanding led to the inclusion of 86,000 hours in the initial release of this dataset. This is intended to be the first of many iterations, with future versions expanding to encompass more languages and accents.

“Once we confirm that we can deliver value, we will release updates and be transparent about the dataset’s current state,” explained Peter Mattson, another MLCommons co-founder and head of Google’s Machine Learning Metrics Group. “We also need to develop methods for quantifying diversity. The industry is actively seeking this; there’s a substantial return on investment in supporting an organization dedicated to dataset construction expertise.”

The organization also intends to encourage sharing and innovation within the field through MLCube, a new standard for exchanging models that simplifies the process and reduces ambiguity. Despite machine learning being a highly active area of research and development, sharing an AI model for testing, execution, or modification isn’t as straightforward as it should be.

MLCube is designed as a wrapper for models that defines and standardizes key aspects, such as dependencies, input and output formats, and hosting requirements. While artificial intelligence is inherently complex, the tools for its creation and evaluation are still in their early stages of development.
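To give a sense of what such a wrapper looks like in practice, here is a minimal sketch of an MLCube-style configuration file. The field names, image name, task name, and paths below are illustrative assumptions for a hypothetical speech-to-text model, not a verbatim copy of the MLCube specification:

```yaml
# Hypothetical mlcube.yaml sketch — names and values are illustrative,
# not an authoritative MLCube specification.
name: speech_to_text_demo
description: Example wrapper around a speech-to-text model
docker:
  # Container image pinning the model's code and dependencies
  image: example-org/speech-model:0.1.0
tasks:
  transcribe:
    parameters:
      inputs:
        audio_dir: {type: directory, default: audio/}   # input audio files
      outputs:
        transcripts: {type: directory, default: out/}   # generated transcripts
```

With a definition along these lines, running someone else's model reduces to a single command (e.g. `mlcube run --task=transcribe`), so a consumer never has to reconstruct the author's environment, dependencies, or I/O conventions by hand.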

The dataset is expected to be available soon on the MLCommons website, licensed under CC-BY, permitting commercial use. Several reference models trained on the dataset will also be released.

Tags: AI, speech data, MLCommons, speech recognition, dataset, AI research