Google Gemini Contractors Rate AI Responses Outside Expertise

The Evolving Role of Human Evaluation in Generative AI
While generative AI systems can appear remarkably advanced, their development relies heavily on human input. Companies such as Google and OpenAI employ teams of people, including prompt engineers and analysts, to assess the accuracy of chatbot responses and refine the underlying models.
New Guidelines and Potential Accuracy Concerns
Recently, an internal Google directive shared with contractors working on the Gemini AI has sparked debate. The guideline, reviewed by TechCrunch, raises concerns that Gemini may be more likely to deliver inaccurate information to end users, particularly on sensitive subjects such as healthcare.
Changes to Prompt Evaluation Protocols
Contractors collaborating with GlobalLogic, a Hitachi-owned outsourcing firm, routinely evaluate AI-generated responses based on criteria including “truthfulness.”
Previously, these contractors had the option to bypass evaluating responses to prompts that fell outside their area of expertise.
For instance, an evaluator without a scientific background could decline to assess a highly specialized question related to cardiology.
Removal of the "Skip" Option
However, last week GlobalLogic communicated a policy shift originating from Google. Contractors are now prohibited from skipping prompts, even if they lack the necessary expertise to accurately assess the AI’s output.
The previous guidelines explicitly stated: “If you do not have critical expertise (e.g. coding, math) to rate this prompt, please skip this task.”
This has been revised to: “You should not skip prompts that require specialized domain knowledge.”
Implications for Specialized Topics
Instead of skipping, contractors are now instructed to evaluate the portions of a prompt they *do* understand and to note their lack of expertise in the relevant field.
This change raises concerns about Gemini’s accuracy in specialized areas, as contractors may be asked to evaluate complex AI responses concerning topics like rare diseases without possessing the requisite background knowledge.
One contractor voiced concern in internal correspondence: “I thought the point of skipping was to increase accuracy by giving it to someone better?”
Limited Exceptions to the New Rule
The updated guidelines permit skipping in only two scenarios: when the prompt or response is incomplete, or when the content is harmful and requires special consent forms to evaluate.
Google’s Response
Google did not initially respond to TechCrunch’s requests for comment. Following publication of the story, Google acknowledged the reporting and stated that the company is “constantly working to improve factual accuracy in Gemini.”
According to Google spokesperson Shira McNamara, raters contribute to a broad range of tasks across Google’s products and platforms.
“They do not solely review answers for content; they also provide valuable feedback on style, format, and other factors,” McNamara explained.
“The ratings they provide do not directly impact our algorithms, but when taken in aggregate, are a helpful data point to help us measure how well our systems are working.”
This article was updated to include a post-publication comment from Google.
Secure tips can be sent to this reporter via Signal at +1 628-282-2811.