For companies that use ML, labeled data is the key differentiator

The Shift to Data-Centric Programming in AI
Artificial intelligence is fundamentally reshaping the software industry, driving a transition from traditional logic-based programming to a data-centric approach. In this new landscape, data has become an indispensable resource. The volume of training data a company amasses directly correlates with the performance and capabilities of its AI-driven products.
Tesla's Advantage and the Learning Paradigms
The significant lead Tesla holds in the development of advanced driver assistance systems (ADAS) can be attributed to its unparalleled data collection efforts. Having amassed data from over 10 billion miles driven, Tesla surpasses competitors like Waymo, which has approximately 20 million miles of data. However, any organization considering machine learning (ML) implementation must carefully evaluate a key technical decision: whether to employ supervised or unsupervised learning techniques.
These two approaches differ significantly in their methodology. Supervised learning trains models on examples that humans have already labeled with the correct answers, while unsupervised learning feeds raw, unlabeled data directly into models with the expectation that patterns will emerge on their own.
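The contrast can be sketched in a few lines of toy code. The clustering routine, the nearest-label rule, and the 1-D "sensor readings" below are all invented stand-ins for illustration, not how production ADAS models work:

```python
# Toy contrast between the two paradigms, using invented 1-D readings.

# Unsupervised: hand the model raw data and hope structure emerges.
def cluster_two_groups(values, n_iters=10):
    """Tiny 1-D k-means with k=2: no labels, just pattern discovery."""
    centers = [min(values), max(values)]
    for _ in range(n_iters):
        groups = [[], []]
        for v in values:
            groups[abs(v - centers[0]) > abs(v - centers[1])].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Supervised: learn from examples a human has already labeled.
def nearest_label(labeled_examples, query):
    """Predict by copying the label of the closest labeled example."""
    return min(labeled_examples, key=lambda ex: abs(ex[0] - query))[1]

raw = [2, 3, 2.5, 30, 28, 31]                   # unlabeled readings
print(cluster_two_groups(raw))                  # discovers two clusters on its own

labeled = [(2, "pedestrian"), (30, "vehicle")]  # human-annotated pairs
print(nearest_label(labeled, 29))               # -> "vehicle"
```

The supervised function only works because a human supplied the labels first, which is exactly why labeling matters so much in the sections that follow.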
Elon Musk draws a parallel between unsupervised learning and the human brain’s ability to interpret raw sensory input. He has indicated that achieving functional unsupervised learning for ADAS remains a substantial, unresolved challenge.
Supervised Learning: The Current Standard
Currently, supervised learning represents the most viable approach for the majority of ML applications. A 2021 O’Reilly report on AI adoption revealed that 82% of companies surveyed utilize supervised learning, compared to 58% employing unsupervised learning. Gartner forecasts that supervised learning will continue to be favored through 2022, asserting that the majority of economic value currently derived from ML comes from supervised learning use cases.
A critical component of supervised learning is the process of data labeling – transforming raw data into a usable format. For example, in the context of Tesla’s ADAS, human annotators meticulously identify and label objects within images, such as pedestrians, traffic signals, and other vehicles.
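A single labeled example might look something like the sketch below. The schema, file name, and class list are hypothetical illustrations, not Tesla's actual annotation format:

```python
# A minimal sketch of one labeled training example for an ADAS-style
# image. The schema and class names are invented for illustration.

def make_annotation(image_id, boxes):
    """Bundle an image reference with its human-drawn bounding boxes."""
    allowed = {"pedestrian", "traffic_signal", "vehicle"}
    for box in boxes:
        assert box["label"] in allowed, f"unknown class: {box['label']}"
        x, y, w, h = box["bbox"]
        assert w > 0 and h > 0, "boxes must have positive size"
    return {"image_id": image_id, "boxes": boxes}

example = make_annotation(
    "frame_000123.jpg",
    [
        {"label": "pedestrian", "bbox": (412, 220, 38, 90)},  # x, y, w, h in pixels
        {"label": "vehicle",    "bbox": (600, 300, 150, 80)},
    ],
)
print(len(example["boxes"]))  # 2
```

Multiplied across millions of frames, producing records like this is the "meticulous" human work the text describes.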
Peter Levine, a partner at Andreessen Horowitz, emphasizes that raw data, despite its abundance, typically requires substantial modification before it can be utilized by an ML system. Data must be aggregated, transformed, cleaned, augmented, and, crucially, labeled before integration with frameworks like PyTorch or TensorFlow.
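The preparation steps Levine describes can be sketched as plain functions chained together before any framework sees the data. The records, field names, and toy labeling rule below are simplified stand-ins:

```python
# A sketch of the pipeline: aggregate, clean, augment, label.
# Records and field names here are invented for illustration.

def aggregate(*sources):
    """Pool raw records from several sources into one list."""
    return [rec for src in sources for rec in src]

def clean(records):
    """Drop records that are missing the fields downstream code needs."""
    return [r for r in records if r.get("pixels") is not None]

def augment(records):
    """Cheap augmentation: add a horizontally flipped copy of each image."""
    flipped = [{**r, "pixels": r["pixels"][::-1]} for r in records]
    return records + flipped

def label(records, annotate):
    """Attach a label to each record via a caller-supplied function."""
    return [{**r, "label": annotate(r)} for r in records]

camera_a = [{"pixels": [1, 2, 3]}, {"pixels": None}]  # one corrupt frame
camera_b = [{"pixels": [4, 5, 6]}]

dataset = label(
    augment(clean(aggregate(camera_a, camera_b))),
    annotate=lambda r: "vehicle" if sum(r["pixels"]) > 10 else "background",
)
print(len(dataset))  # 2 clean frames + 2 flips = 4 labeled records
```

Only after the final `label` step is `dataset` something a training loop in PyTorch or TensorFlow could consume.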
The Bottleneck of Data Labeling
Data labeling often consumes up to 80% of the resources allocated to an average ML project. It is also a frequent point of failure, with 70% of companies reporting difficulties in labeling their data effectively. Traditionally, data labeling has relied on a labor-intensive approach, scaling linearly with the number of workers employed – a method that doesn’t lend itself to efficient growth.
Fortunately, AI itself offers a solution: utilizing ML to pre-label data, allowing human workers to focus on verifying and refining the computer’s output, and addressing complex edge cases. This accelerates the process and reduces costs.
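The model-assisted workflow described above can be sketched as a simple confidence-threshold router. The model, threshold, and data here are invented stand-ins, not any vendor's actual pipeline:

```python
# A minimal sketch of model-assisted labeling: a model pre-labels each
# item, high-confidence predictions pass straight through, and the rest
# are queued for a human. Model and threshold are invented stand-ins.

def pre_label(items, model, threshold=0.9):
    """Split items into auto-accepted labels and a human review queue."""
    auto, review = [], []
    for item in items:
        label, confidence = model(item)
        if confidence >= threshold:
            auto.append((item, label))
        else:
            review.append(item)  # humans handle the hard edge cases
    return auto, review

# Stand-in "model": confident about extreme values, unsure near 50.
def toy_model(x):
    return ("vehicle" if x > 50 else "background", abs(x - 50) / 50)

auto, review = pre_label([5, 48, 95], toy_model)
print(len(auto), len(review))  # 2 auto-labeled, 1 sent to a human
```

The economics follow directly: humans now touch only the uncertain fraction of the data instead of every item.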
Although computers surpassed human-level performance in image recognition more than five years ago, the data annotation market has only recently experienced rapid growth. Valued at $695.5 million in 2019, it is projected to exceed $6 billion by 2027.
Key Players in the Data Annotation Market
Scale AI is a prominent player in this expanding market. A recent $325 million funding round valued the company at $7 billion. Scale AI demonstrated its capabilities by improving Toyota’s annotation throughput tenfold within weeks. Chris Abshire, a senior partner at Toyota AI Ventures, highlighted the importance of “easily obtaining data, and then extracting value from that data with minimal human intervention” as a key objective for many AI startups.
Data annotation extends beyond the automotive industry. John Deere’s AI subsidiary, Blue River Technology, leverages supervised learning to enhance the precision of its smart sprayers in distinguishing between weeds and crops. Utilizing the Labelbox platform, a competitor to Scale AI, Blue River Technology reduced its labeling time by nearly 50%, accelerating iteration and reducing expenses. Emma Bassein, Blue River’s director of data and machine learning, reported a 25% reduction in cost per label in 2020.
Different Approaches to Data Labeling
Scale and Labelbox represent distinct approaches to data labeling. Scale operates as a service provider, labeling data on behalf of its clients, relieving them of the task entirely. This model is particularly popular among enterprises requiring large-scale training datasets, such as self-driving car companies.
Labelbox, conversely, offers a platform-based solution, empowering data owners with the tools to annotate their data while maintaining control. This approach is favored by organizations prioritizing data quality over sheer volume.
Data Quality: A Critical Factor
Data quality is the second most significant challenge for companies engaged in AI, and data labeling plays a vital role in ensuring it. Data quality encompasses factors such as volume, diversity, accuracy, and bias. For instance, insufficient data representing rainy conditions could compromise the performance of ADAS technology in inclement weather.
A robust training data platform can proactively identify and rectify these issues before deployment, preventing potential failures. The labeling process can also reveal biases within the data, which could lead to discriminatory outcomes – as exemplified by Amazon’s recruiting model that exhibited gender bias.
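One of the checks described above, flagging underrepresented conditions such as rain before training, can be sketched as a simple coverage audit. The condition tags and the 10% threshold are invented for illustration:

```python
# A sketch of one data-quality check: count how well each condition is
# represented and flag gaps (e.g. too few rainy frames) before training.
# Tags and the minimum-share threshold are invented stand-ins.
from collections import Counter

def coverage_gaps(records, required, min_fraction=0.1):
    """Return condition tags that fall below a minimum share of the data."""
    counts = Counter(tag for r in records for tag in r["conditions"])
    total = len(records)
    return sorted(
        tag for tag in required
        if counts.get(tag, 0) / total < min_fraction
    )

frames = (
    [{"conditions": ["clear", "day"]}] * 90
    + [{"conditions": ["rain", "day"]}] * 5
    + [{"conditions": ["clear", "night"]}] * 5
)
print(coverage_gaps(frames, required={"clear", "rain", "night"}))
# -> ['night', 'rain']  (both under 10% of frames)
```

The same counting idea applied to attributes like gender is how a labeling pass can surface the kind of skew behind the Amazon recruiting example.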
For companies adopting supervised learning, a strategy for rapid data labeling is essential. Just as attracting top software engineers has been crucial for writing effective code, the new paradigm demands the generation of high-quality data to develop superior AI models.