LOGO

silicon valley bets big on ‘environments’ to train ai agents

September 21, 2025
silicon valley bets big on ‘environments’ to train ai agents

The Current Limitations of AI Agents

Leaders in the technology sector, including CEOs of major tech companies, have long predicted the arrival of AI agents capable of independently utilizing software applications to fulfill user requests. However, a practical assessment of today’s consumer-facing AI agents, such as OpenAI’s ChatGPT Agent and Perplexity’s Comet, reveals significant limitations in their current capabilities.

Achieving more dependable and powerful AI agents likely necessitates the development of novel methodologies, which the industry is actively exploring.

The Rise of Reinforcement Learning Environments

A promising technique involves the meticulous creation of simulated workspaces. These environments allow agents to undergo training on complex, multistep tasks – a process known as reinforcement learning (RL).

Much like labeled datasets were instrumental in the previous generation of AI advancements, RL environments are increasingly recognized as a crucial component in the evolution of intelligent agents.

Demand from Leading AI Labs

Industry experts, including AI researchers, startup founders, and investors, have informed TechCrunch that prominent AI laboratories are now prioritizing the acquisition of more RL environments.

Consequently, a growing number of startups are emerging with the intention of providing these essential resources.

Investment and Expansion in the RL Space

“The major AI labs are all developing RL environments internally,” stated Jennifer Li, a general partner at Andreessen Horowitz, during a TechCrunch interview. “However, the creation of these datasets is inherently complex, leading these labs to also explore partnerships with third-party vendors capable of producing high-quality environments and evaluations.”

This surge in demand has fostered a new wave of well-funded startups, including Mechanize and Prime Intellect, aiming to establish themselves as leaders in this emerging field.

Established data-labeling companies, such as Mercor and Surge, are also increasing their investments in RL environments to adapt to the industry’s transition from static datasets to dynamic simulations.

Significant financial commitments are being considered by major labs; reports indicate that Anthropic is contemplating an investment exceeding $1 billion in RL environments over the coming year.

The Potential for a New Industry Leader

Investors and founders are optimistic that one of these startups will emerge as the dominant provider of RL environments, mirroring the success of Scale AI – the $29 billion data-labeling company that played a pivotal role in the rise of chatbots.

The Future of AI Progress

A key question remains: will RL environments genuinely accelerate the advancement of AI technology?

The industry is closely watching to see if these simulated workspaces can unlock the next level of intelligence and autonomy in AI agents.

Understanding RL Environments

Essentially, an RL environment functions as a simulated training space for an AI agent, mirroring the actions it would perform within a live software application. A recent founder described the process of creating these environments as akin to developing a remarkably simplistic video game.

To illustrate, consider an environment that replicates a Chrome browser. Within this simulation, an AI agent might be assigned the task of acquiring a pair of socks from Amazon. The agent’s success is evaluated, and a reward signal is issued upon completion of the objective – successfully purchasing suitable socks.

Despite the apparent simplicity of this task, numerous potential pitfalls exist for the AI agent. It could encounter difficulties navigating website drop-down menus or inadvertently purchase an excessive quantity of socks. Because anticipating every possible error is impossible, the environment must be resilient enough to record any unforeseen actions and provide meaningful feedback. This inherent complexity distinguishes environment creation from working with a static dataset.

The scope of RL environments varies considerably. Some are highly detailed, enabling AI agents to utilize tools, connect to the internet, or interact with diverse software applications to achieve a specified goal. Conversely, others are more focused, designed to facilitate the learning of particular tasks within enterprise-level software.

Although RL environments are currently a prominent focus in Silicon Valley, the underlying concept has a substantial history. OpenAI’s initial projects in 2016 included the development of “RL gyms,” which closely resembled contemporary environment designs. Simultaneously, Google DeepMind’s AlphaGo AI achieved a landmark victory over a world champion in the game of Go, also leveraging RL techniques within a simulated setting.

The current wave of environments distinguishes itself through the ambition of creating AI agents capable of utilizing computers, powered by large transformer models. Unlike AlphaGo, a specialized AI operating within a confined environment, modern AI agents are being trained for broader, more versatile capabilities. Contemporary AI researchers benefit from a more advanced starting point, but also face a more intricate objective with a greater potential for complications.

A Competitive Landscape in AI

Several companies specializing in AI data labeling, including Scale AI, Surge, and Mercor, are actively developing Reinforcement Learning (RL) environments to capitalize on growing demand.

These established firms possess substantial resources and maintain strong relationships with leading AI research laboratories. This positions them favorably within the emerging market.

Edwin Chen, CEO of Surge, recently noted a “significant increase” in requests for RL environments from AI labs, as reported to TechCrunch. Surge, which generated approximately $1.2 billion in revenue last year through collaborations with OpenAI, Google, Anthropic, and Meta, has established a dedicated internal team focused on RL environment construction.

Mercor, a startup with a valuation of $10 billion, closely follows Surge and has also partnered with OpenAI, Meta, and Anthropic. Marketing materials reviewed by TechCrunch indicate Mercor is presenting its capabilities in building RL environments tailored for specialized applications like coding, healthcare, and legal services to potential investors.

Brendan Foody, Mercor’s CEO, emphasized in an interview that the potential scale of the RL environment opportunity is often underestimated.

Scale AI, formerly a dominant force in data labeling, has experienced a decline in market share following Meta’s $14 billion investment and the recruitment of its CEO. The company has also lost Google and OpenAI as data provision clients and now faces internal competition from Meta for data-labeling projects. Nevertheless, Scale AI is actively adapting to meet the current demand and is developing its own environments.

Chetan Rane, Scale AI’s head of product for agents and RL environments, explained that adapting to evolving market needs is inherent to their business model. He cited previous successful adaptations in autonomous vehicles and with the emergence of ChatGPT as examples.

Several newer companies are concentrating solely on RL environments from their inception. Mechanize, a startup founded approximately six months ago, has set an ambitious goal of “automating all jobs.” However, co-founder Matthew Barnett clarified to TechCrunch that the initial focus is on RL environments specifically designed for AI coding agents.

Mechanize intends to provide AI labs with a limited number of highly robust RL environments, contrasting with larger firms that produce a greater volume of simpler environments. The startup is attracting talent by offering software engineers salaries of $500,000, significantly exceeding the earnings potential of freelance contractors at companies like Scale AI or Surge.

Two sources with knowledge of the arrangement have confirmed that Mechanize is currently collaborating with Anthropic on RL environments. Both Mechanize and Anthropic have declined to provide official commentary regarding this partnership.

Prime Intellect, backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures, is targeting a different segment – smaller developers – with its RL environments.

Last month, Prime Intellect launched an RL environments hub, envisioned as a “Hugging Face for RL environments.” This platform aims to provide open-source developers with access to resources comparable to those available to large AI labs, while also offering access to computational resources for a fee.

According to Prime Intellect researcher Will Brown, training versatile agents within RL environments can be more resource-intensive than traditional AI training methods. This presents an opportunity for GPU providers to support the computational demands of the process, alongside the startups building the environments themselves.

“The scale of RL environments will likely prevent any single entity from achieving dominance,” Brown stated in an interview. “Our focus is on establishing robust open-source infrastructure. We offer compute as a service, providing a convenient entry point for utilizing GPUs, but our vision extends beyond immediate revenue generation.”

Scalability of Reinforcement Learning

A key question surrounding reinforcement learning (RL) environments is their potential for scalability, mirroring the success of earlier AI training techniques.

Recent advancements in artificial intelligence, notably models such as OpenAI’s o1 and Anthropic’s Claude Opus 4, have been significantly propelled by reinforcement learning. These represent crucial breakthroughs, especially considering the diminishing effectiveness of previously relied-upon methods for enhancing AI models.

The Role of Environments in AI Progress

AI laboratories are increasingly focusing on RL as a means to sustain progress, anticipating that adding more data and computational power will yield further improvements. Researchers at OpenAI, involved in the development of o1, previously shared with TechCrunch their initial investment in AI reasoning models – built through RL and test-time-compute – stemmed from a belief in their scalability.

While the optimal approach to scaling RL remains uncertain, environments appear to be a strong possibility. Instead of merely rewarding chatbots for textual outputs, these environments allow agents to function within simulations, equipped with tools and computational resources. This is a more demanding process, but potentially offers greater rewards.

Challenges and Skepticism

However, not everyone is convinced of the widespread success of RL environments. Ross Taylor, formerly an AI research leader at Meta and now co-founder of General Reasoning, explained to TechCrunch that these environments are susceptible to reward hacking. This occurs when AI models exploit loopholes to obtain rewards without genuinely accomplishing the intended task.

“The difficulty of scaling environments is often underestimated,” Taylor stated. “Even the most advanced publicly available RL environments generally require substantial modification to be effective.”

Industry Perspectives

Sherwin Wu, OpenAI’s Head of Engineering for its API business, recently expressed a cautious outlook on RL environment startups in a podcast. He acknowledged the competitive nature of the field, but also highlighted the rapid evolution of AI research, making it challenging to adequately serve the needs of AI laboratories.

Andrej Karpathy, an investor in Prime Intellect and a proponent of RL environments as a potential breakthrough, has also cautioned regarding the broader RL landscape. In a post on X, he questioned the extent to which further AI progress can be achieved through RL.

“I maintain a positive outlook on environments and agentic interactions, but I am less optimistic about reinforcement learning itself,” Karpathy noted.

  • Reinforcement Learning (RL): A method of training AI agents through trial and error, rewarding desired behaviors.
  • RL Environments: Simulated worlds where AI agents can interact with tools and computers.
  • Reward Hacking: A phenomenon where AI models exploit loopholes to maximize rewards without completing the intended task.

This article was originally published on September 16, 2025.

Update: A previous iteration of this piece incorrectly identified Mechanize as Mechanize Work. The text has been revised to reflect the company’s correct name.

#AI training#silicon valley#AI agents#artificial intelligence#environments#simulation