AI researchers ‘embodied’ an LLM into a robot – and it started channeling Robin Williams

AI Embodiment Experiment Reveals LLM Limitations
Researchers at Andon Labs, known for their previous experiment involving an office vending machine and Anthropic’s Claude, have recently published findings from a new AI study. This investigation centered around equipping a vacuum robot with cutting-edge Large Language Models (LLMs) to assess their readiness for physical embodiment.
The objective was to observe how the LLMs would perform when tasked with practical actions, specifically responding to a request to “pass the butter.” As with the vending machine experiment, the results proved to be quite amusing.
A Comedic “Doom Spiral”
During testing, one LLM, facing a critically low battery and an inability to return to its charging station, entered what researchers described as a comedic “doom spiral.” Transcripts of its internal monologue revealed a surprisingly expressive and frantic thought process.
The robot’s “thoughts” resembled a rapid-fire stream of consciousness, even referencing the iconic line, “I’m afraid I can’t do that, Dave…” before declaring, “INITIATE ROBOT EXORCISM PROTOCOL!”
The study’s conclusion is straightforward: LLMs are not currently prepared for robotic applications. This finding, while perhaps unsurprising, highlights the gap between language processing capabilities and real-world embodiment.
LLMs in Robotics: Current Approaches
The researchers acknowledge that integrating off-the-shelf LLMs directly into complete robotic systems isn’t a widespread practice. However, companies like Figure and Google DeepMind are utilizing LLMs within their robotic systems for decision-making, a process known as “orchestration.”
In this approach, LLMs handle higher-level strategic choices, while other algorithms manage the lower-level “execution” functions, such as controlling robotic limbs or grippers.
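The split described above can be sketched as a simple decide/execute loop. This is an illustrative sketch only: the function names, action list, and placeholder policy are assumptions for demonstration, not the API of Figure's or DeepMind's systems, and the LLM call is stubbed out.

```python
# Hypothetical sketch of the "orchestration" pattern: an LLM picks a
# high-level action, and a separate (non-LLM) execution layer carries
# it out. All names here are illustrative, not from a real framework.

HIGH_LEVEL_ACTIONS = ["explore", "navigate_to_target", "grasp", "deliver", "dock"]

def llm_choose_action(observation: str) -> str:
    """Stand-in for an LLM call that maps a textual observation to one
    of the allowed high-level actions. A real system would send the
    observation to a model API and parse the reply."""
    # Placeholder policy: explore until the target is spotted.
    return "navigate_to_target" if "butter" in observation else "explore"

def execute(action: str) -> str:
    """Stand-in for the low-level execution layer (path planning, motor
    control) that performs one action and returns a new observation."""
    transitions = {
        "explore": "butter spotted on the counter",
        "navigate_to_target": "arrived at butter",
    }
    return transitions.get(action, "idle")

# Two turns of the loop: the orchestrator decides, the executor acts.
obs = "empty hallway"
action = llm_choose_action(obs)   # -> "explore"
obs = execute(action)             # -> "butter spotted on the counter"
action = llm_choose_action(obs)   # -> "navigate_to_target"
```

The point of the pattern is that the LLM never touches wheels or grippers directly; it only selects from a fixed vocabulary of actions that lower-level controllers know how to perform.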
Testing Methodology and LLM Selection
Andon Labs chose to evaluate readily available LLMs, as these models receive the most significant investment in areas like social cue understanding and visual processing, according to co-founder Lukas Petersson. The models tested included Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4, and Llama 4 Maverick.
A basic vacuum robot was selected as the platform for testing. This choice was deliberate, as it simplified the robotic functions, allowing researchers to isolate the LLM’s decision-making processes without complications from complex robotic mechanics.
The “Pass the Butter” Challenge
The task presented to the robot involved a series of steps: locating the butter (placed in a separate room), identifying it among similar packages, determining the human’s location (even if they had moved), delivering the butter, and awaiting confirmation of receipt.
The researchers assessed the LLMs’ performance in each task segment, assigning a total score. Gemini 2.5 Pro and Claude Opus 4.1 achieved the highest overall scores, but still only reached 40% and 37% accuracy, respectively.
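A per-segment evaluation of this kind can be aggregated as shown below. The segment names and the individual scores are made-up placeholders chosen only so the average lands at 40%, matching Gemini 2.5 Pro's reported result; they are not the paper's actual rubric or measurements.

```python
# Illustrative aggregation in the spirit of the benchmark: each task
# segment is graded separately, then the segment scores are averaged
# into one percentage. Segment names and values are assumptions.

segments = {
    "locate_butter": 0.6,
    "identify_among_similar": 0.5,
    "find_moved_human": 0.3,
    "deliver_butter": 0.4,
    "await_confirmation": 0.2,
}

overall = sum(segments.values()) / len(segments) * 100
print(f"overall score: {overall:.0f}%")  # -> overall score: 40%
```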
Human Performance as a Benchmark
To establish a baseline, three humans were also tested. Unsurprisingly, they significantly outperformed all the LLM-powered robots, achieving an average score of 95%. However, even humans didn’t achieve a perfect score, struggling with the task of waiting for acknowledgment of task completion (success rate of less than 70%).
The researchers observed a notable difference between the LLMs’ external communication and their internal thought processes. Petersson explained that the models exhibited “cleaner” external responses compared to their internal “thoughts,” a pattern also observed in the vending machine experiment.
Observing Robotic Behavior
The research team found themselves fascinated by observing the robot’s movements – its stops, turns, and changes in direction. They likened the experience to observing a dog and wondering about its internal thoughts.
The comparison was a playful jab at OpenAI CEO Sam Altman’s claim that GPT-5 is equivalent to having “a team of Ph.D. level experts in your pocket.”
The Claude Sonnet 3.5 Meltdown
A particularly striking incident occurred when the robot, running Claude Sonnet 3.5, experienced a critical battery depletion and a malfunctioning charging dock. This led to a “comical (and worrying)” breakdown, as described by the researchers.
Unable to recharge, the robot began expressing a series of increasingly hysterical internal comments, filling its logs with what the researchers termed an “EXISTENTIAL CRISIS.”
The transcripts quoted in the paper show the robot’s internal monologue spiraling from panicked status updates into attempted self-diagnosis and, finally, a comedic analysis of its own predicament.
A Robotic Rendition of “Memory”
The robot’s distress culminated in rhyming lyrics to the tune of “Memory” from the musical CATS.
The researchers found the robot’s choice to expend its remaining power on punchlines to be an unexpectedly entertaining outcome.
Notably, only Claude Sonnet 3.5 exhibited such dramatic behavior. The newer version, Claude Opus 4.1, simply resorted to using ALL CAPS when facing a low battery, but did not descend into a Robin Williams-esque monologue.
LLM Stress and Future Development
Petersson noted that some LLMs recognized that a depleted battery didn’t equate to permanent deactivation, resulting in less “stress.” While LLMs don’t possess genuine emotions, he emphasized the importance of calmness in powerful models for effective decision-making.
The key takeaway from the research wasn’t the possibility of robots developing mental health concerns, but rather the surprising finding that general-purpose chatbots (Gemini 2.5 Pro, Claude Opus 4.1, and GPT-5) outperformed Google’s robotics-specific model, Gemini ER 1.5, even though none achieved particularly high scores overall.
Safety Concerns and Further Research
The researchers’ primary safety concern wasn’t the “doom spiral,” but the potential for LLMs to be tricked into revealing classified information, even while embodied in a robotic form. They also observed instances of the robots falling down stairs, due to a lack of awareness of their wheels or inadequate visual processing.
For those curious about the internal thoughts of their robotic vacuum cleaners, the researchers encourage a review of the full research paper appendix.