OpenAI Bot Overwhelms Website: A DDoS-Like Attack

OpenAI Bot Causes Website Disruption for Triplegangers
On Saturday, Oleksandr Tomchuk, CEO of Triplegangers, received notification that his company’s e-commerce platform was inaccessible. Initial assessment indicated a distributed denial-of-service (DDoS) attack was underway.
Further investigation revealed the source of the disruption: a bot originating from OpenAI, engaged in a persistent attempt to scrape the entirety of Triplegangers’ extensive website.
Extensive Data Scraping Attempt
“We maintain a catalog of over 65,000 products, with a dedicated page for each item,” Tomchuk explained to TechCrunch. “Each of these pages features at least three photographic images.”
OpenAI’s bot was generating “tens of thousands” of server requests, aiming to download the complete dataset, encompassing hundreds of thousands of images and their corresponding detailed descriptions.
“The bot utilized 600 different IP addresses to perform the scraping, and our log analysis from the previous week suggests the actual number may be considerably higher,” Tomchuk stated, referring to the addresses the bot used in its attempt to pull the site’s data.
“These crawlers were overwhelming our site’s capacity,” he continued. “The activity effectively functioned as a DDoS attack.”
Triplegangers’ Business Model
The Triplegangers website is central to the company’s operations. The firm, comprised of seven employees, has dedicated over a decade to building what it describes as the largest database of “human digital doubles” available online.
This database consists of 3D image files created from scans of real human models. The company sells these 3D object files, alongside photographs – encompassing elements like hands, hair, skin, and complete body scans – to 3D artists, video game developers, and anyone requiring digitally recreated, authentic human characteristics.
The Role of Robots.txt
Triplegangers’ terms of service explicitly prohibit the unauthorized scraping of its images by bots. However, this alone proved insufficient. Effective protection requires a correctly configured robots.txt file, containing specific directives instructing OpenAI’s crawler, GPTBot, to refrain from accessing the site.
OpenAI operates additional bots, including ChatGPT-User and OAI-SearchBot, each requiring its own set of exclusion tags, as detailed on OpenAI’s crawler information page.
Originally designed to inform search engines which areas of a website should not be indexed, the Robots Exclusion Protocol (robots.txt) is intended to be respected by crawlers. OpenAI states that it honors properly configured robots.txt files, though it acknowledges a potential delay of up to 24 hours for its bots to recognize updates.
As Tomchuk discovered, the absence of a correctly implemented robots.txt file is interpreted by OpenAI and other entities as permission to scrape data freely. It is not an opt-in system.
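For reference, the directives themselves are short. A minimal robots.txt along the following lines – served from the site root, and assuming a site wants to exclude all three of OpenAI’s documented crawlers from every page – is the sort of file the protocol expects:

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

Each User-agent line names one crawler token from OpenAI’s documentation, and “Disallow: /” asks that crawler to stay off the entire site; compliance, as noted above, is voluntary.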
Financial Implications and Voluntary Compliance
Compounding the issue, the OpenAI bot’s activity occurred during U.S. business hours, causing a disruption to Triplegangers’ operations. Tomchuk also anticipates a significantly increased bill from Amazon Web Services (AWS) due to the substantial CPU usage and data transfer generated by the bot.
It’s important to note that robots.txt relies on voluntary compliance from AI companies. Last summer, Perplexity, another AI startup, faced criticism following a Wired investigation suggesting it was not adhering to robots.txt directives.
Uncertainty Regarding Data Acquisition
By midweek, after several days of disruption, Triplegangers had implemented a correctly configured robots.txt file. The company also set up a Cloudflare account to block OpenAI’s GPTBot, along with several other identified crawlers such as Barkrowler – an SEO crawler – and Bytespider, used by TikTok. Tomchuk also expressed optimism about blocking crawlers from additional AI model companies.
He reported that the website did not experience any crashes on Thursday morning. However, Tomchuk currently lacks a definitive method to ascertain precisely what data OpenAI managed to acquire, nor does he have a means to request its removal.
Attempts to contact OpenAI directly have been unsuccessful, and OpenAI has yet to release the previously announced opt-out mechanism, as recently highlighted by TechCrunch. This presents a particularly complex challenge for Triplegangers.
Concerns Over Rights and Data Usage
“Our business inherently deals with sensitive rights issues, as we scan images of real individuals,” Tomchuk explained. Regulations like Europe’s GDPR stipulate that “a photograph of any person on the web cannot be utilized without consent.”
Triplegangers’ website proved to be a particularly attractive target for AI crawlers. Companies like Scale AI, valued in the billions, have emerged by employing individuals to meticulously tag images for AI training purposes. The Triplegangers site features photos with detailed tagging, including ethnicity, age, the presence of tattoos or scars, and various body types.
Ironically, it was the aggressive nature of the OpenAI bot’s scraping activity that initially alerted Triplegangers to its vulnerability. Had the bot operated with greater restraint, Tomchuk believes the issue might have gone unnoticed.
The Burden of Opt-Out
“A concerning loophole appears to be exploited by these companies, requiring website owners to proactively block them by updating their robots.txt files with specific tags,” Tomchuk stated. This places the responsibility on business owners to understand and implement these blocking measures.
Tomchuk emphasizes the importance of informing other small online businesses that the only way to detect whether an AI bot is extracting copyrighted material from their website is through active monitoring. He is not alone in experiencing these issues; other website owners have recently told Business Insider how OpenAI bots crashed their sites and drove up their AWS bills.
Escalating Problem of AI Crawlers
The scale of the problem has significantly increased in 2024. Recent research conducted by DoubleVerify, a digital advertising firm, revealed an 86% rise in “general invalid traffic” during 2024 – traffic that does not originate from genuine users – attributable to AI crawlers and scrapers.
Despite this increase, “the majority of websites remain unaware that they have been scraped by these bots,” Tomchuk cautions. “Consequently, daily monitoring of log activity is now essential to identify these bots.”
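As an illustration of that kind of monitoring – a minimal sketch rather than a hardened tool, assuming a standard combined-format web server access log at a hypothetical path – a short script can tally requests by crawler user agent so that bots like GPTBot or Bytespider stand out:

    import re
    from collections import Counter

    # Hypothetical log location; adjust to wherever your server writes access logs.
    LOG_PATH = "/var/log/nginx/access.log"
    # User-agent substrings worth watching; extend the list as new crawlers appear.
    WATCHED = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

    counts = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            # In the combined log format, the user agent is the last quoted field.
            quoted = re.findall(r'"([^"]*)"', line)
            if not quoted:
                continue
            user_agent = quoted[-1]
            for name in WATCHED:
                if name in user_agent:
                    counts[name] += 1

    for name, hits in counts.most_common():
        print(f"{name}: {hits} requests")

Run daily against the previous day’s log, a tally like this is enough to surface the kind of request volumes Tomchuk describes.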
The entire process can be likened to a protection racket: AI bots will acquire data unless preventative measures are in place.
“Permission should be sought before data is scraped, rather than simply taking it,” Tomchuk asserts.