AI Startups & Data Control: Why They're Building Their Own Infrastructure

October 16, 2025

The Rise of Curated Data in AI Training

This summer, Taylor and a housemate took part in an unusual training exercise for an AI vision model. They wore GoPro cameras strapped to their foreheads while making art and doing daily chores. The goal was to capture the same actions from multiple perspectives, with the footage precisely synchronized for the AI system.

Challenges and Compensation

The work was physically taxing. Taylor said prolonged camera use often brought on headaches, and the camera left a noticeable red mark where it pressed against her forehead. Even so, the compensation was substantial, allowing Taylor to dedicate most of her day to creating art.

Initially, they aimed to deliver five hours of synchronized footage daily. Taylor soon found that producing it required a seven-hour commitment each day to allow for breaks and physical recovery.

Turing's Approach to Vision Model Training

Taylor, who requested anonymity, was contracted through Turing, an AI company, TechCrunch reported. Turing’s objective wasn’t to develop an AI capable of painting, but rather to cultivate more general skills in sequential problem-solving and visual reasoning.

Unlike large language models, Turing’s vision model is trained exclusively on video, most of which the company collects itself.

Turing is actively engaging a diverse range of professionals – including artists like Taylor, chefs, construction workers, and electricians – to contribute to this data collection effort. Sudarshan Sivaraman, Turing’s Chief AGI Officer, emphasized to TechCrunch that this manual collection is crucial for achieving a sufficiently varied dataset.

A Shift in AI Data Strategies

Turing’s work exemplifies a broader trend within the AI industry regarding data handling. Previously, training datasets were often freely scraped from the internet or assembled using low-cost annotators. Now, companies are investing significantly in carefully curated data.

With the fundamental capabilities of AI already established, proprietary training data is increasingly viewed as a key competitive advantage. Furthermore, many companies are choosing to manage the data collection process internally rather than outsourcing it.

Fyxer's Focus on Data Quality

Fyxer, an email management company utilizing AI for sorting and drafting replies, provides another illustration of this shift. Founder Richard Hollingsworth found that employing numerous small models, each trained on highly specific data, yielded the best results.

While Fyxer builds on a pre-existing foundation model, the core principle is the same as in Turing’s approach: data quality matters more than sheer quantity.

The Value of Human Expertise

This realization led to some unconventional hiring decisions. In its early stages, Fyxer often had more executive assistants than engineers and managers, as these assistants were essential for training the model.

“We needed to train on the fundamentals of whether an email should be responded to,” Hollingsworth explained to TechCrunch. “It’s a very people-oriented problem, and finding great people is very hard.”

Although the pace of data collection remained constant, Hollingsworth became increasingly selective about datasets, favoring smaller, more meticulously curated sets during post-training refinement. He reiterated that data quality is the defining factor in performance.

Synthetic Data and the Importance of a Strong Foundation

Synthetic data is also becoming more prevalent. It expands the range of potential training scenarios, but it also amplifies the impact of any flaws in the original dataset. Turing estimates that 75% to 80% of its data is synthetic, derived from the initial GoPro videos.

Consequently, maintaining the highest possible quality in the original dataset is even more critical. As Sivaraman stated, “If the pre-training data itself is not of good quality, then whatever you do with synthetic data is also not going to be of good quality.”

Data as a Competitive Moat

Beyond quality control, there’s a significant competitive advantage to maintaining in-house data collection. For Fyxer, this dedicated effort serves as a strong barrier against competitors.

Hollingsworth believes that while open-source models are readily available, securing expert annotators and training a functional product is far harder. “We believe that the best way to do it is through data,” he told TechCrunch, “through building custom models, through high-quality, human-led data training.”

Correction: A previous version of this piece referred to Turing by an incorrect name. TechCrunch regrets the error.
