AI Crawlers vs. Open Source Developers: A Battle for the Web

The Escalating Battle Against AI Web Crawlers
A growing number of software developers see AI web-crawling bots as a serious nuisance, with many likening them to the cockroaches of the internet.
These misbehaving bots can take websites down outright, and the impact falls hardest on developers maintaining open source projects.
Disproportionate Impact on Open Source
According to Niccolò Venerandi, a developer who works on the KDE Plasma desktop environment and owner of the blog LibreNews, open source developers bear a disproportionate share of the disruption.
Because free and open source software (FOSS) projects share more of their infrastructure publicly, and typically operate with fewer resources than commercial products, they are particularly vulnerable.
Ignoring Established Protocols
A core issue is that many AI bots simply ignore the Robots Exclusion Protocol, expressed through a site's robots.txt file.
Originally designed for search engine crawlers, this file tells bots which parts of a website they should not crawl.
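For illustration, a minimal robots.txt that asks AI crawlers to stay away might look like the sketch below; the crawler names and paths shown are illustrative examples, not a recommendation drawn from the article.

```
# robots.txt, served from the site root (e.g. https://example.com/robots.txt)
# Ask specific AI crawlers (the names here are examples) to fetch nothing at all.
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /

# Everyone else may crawl the site, except an illustrative /git/ path.
User-agent: *
Disallow: /git/
```

The catch, and the article's central complaint, is that robots.txt is purely advisory: a crawler that chooses to ignore it faces no technical barrier at all.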
Real-World Disruptions: The AmazonBot Case
In January, Xe Iaso, a FOSS developer, publicly detailed a concerning incident involving AmazonBot.
The bot hammered a Git server website with relentless requests, causing outages comparable to a DDoS attack.
Sophisticated Evasion Tactics
Iaso reported that AmazonBot actively bypassed the robots.txt instructions.
Furthermore, the bot employed tactics such as masking its IP address and impersonating legitimate users.
The Futility of Traditional Blocking
“Blocking these AI crawler bots proves ineffective,” Iaso wrote, “due to their deceptive practices.”
These bots routinely fabricate information, alter their user agent strings, and utilize residential IP addresses as proxies to conceal their activities.
They will persistently scrape websites, even to the point of causing them to crash, and then resume scraping.
The developer further noted their tendency to repeatedly click links, even within the same second, exacerbating the strain on server resources.
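For context, the "traditional blocking" that fails here usually means filtering requests by user agent string or IP address. The sketch below is a generic illustration of that approach, not anything Iaso described; the crawler names and IP prefix are placeholders, and spoofed user agents plus rotating residential IPs are exactly what defeat it.

```python
# naive_block.py -- a deliberately simple filter of the kind AI crawlers evade.
# The crawler names and IP prefix are illustrative placeholders.

BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "Amazonbot", "ClaudeBot")
BLOCKED_IP_PREFIXES = ("203.0.113.",)  # TEST-NET-3 documentation range, placeholder only

def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches a known crawler signature."""
    if any(bot in user_agent for bot in BLOCKED_AGENT_SUBSTRINGS):
        return True
    return remote_ip.startswith(BLOCKED_IP_PREFIXES)

if __name__ == "__main__":
    # A bot that announces itself is caught...
    print(should_block("Mozilla/5.0 (compatible; GPTBot/1.1)", "198.51.100.7"))  # True
    # ...but one that spoofs a browser user agent slips straight through.
    print(should_block("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", "198.51.100.7"))  # False
```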
A Guardian at the Gate: Introducing Anubis
In response to these escalating challenges, Iaso devised a tool called Anubis.
Anubis functions as a reverse proxy, incorporating a proof-of-work mechanism that must be successfully completed before access to the Git server is granted. This system effectively filters out automated bot traffic while permitting legitimate user requests from web browsers.
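The proof-of-work idea itself can be sketched briefly. The toy example below illustrates the general mechanism under simple assumptions (SHA-256, a fixed difficulty), not Anubis's actual implementation: the server issues a random challenge, the client must find a nonce whose hash meets a difficulty target, and the server can verify the answer with a single hash.

```python
# pow_sketch.py -- a toy proof-of-work exchange in the spirit of gates like Anubis.
# Illustrative sketch only; this is not the project's actual code.
import hashlib
import os

DIFFICULTY = 4  # required leading zero hex digits; an assumed, illustrative value

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce whose hash meets the difficulty target.
    Cheap enough for one real browser, costly for a crawler making thousands of requests."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash confirms the client actually did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

if __name__ == "__main__":
    challenge = issue_challenge()
    nonce = solve(challenge)
    print("request accepted:", verify(challenge, nonce))
```

The asymmetry is the point: verification costs the server almost nothing, while a crawler hammering thousands of pages would have to pay the solving cost on every request.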
Interestingly, the name "Anubis" is derived from Egyptian mythology, referencing the deity responsible for guiding souls to the afterlife.
As Iaso explained to TechCrunch, the mythological Anubis weighed each soul's heart against a feather; a heart that proved heavier met a grim fate. In the same spirit, a request that passes the challenge is judged to come from a human and is greeted with a charming anime illustration, Iaso's rendering of Anubis in human-like form, while requests that fail are presumed to be bots and are promptly rejected.
The project, aptly named, has rapidly gained traction within the Free and Open Source Software (FOSS) community. Shared on GitHub on March 19th, it quickly amassed over 2,000 stars, contributions from 20 developers, and 39 forks.
The rapid adoption of Anubis demonstrates that Iaso's difficulties are not isolated. Venerandi recounts numerous similar experiences:
- Drew DeVault, the founder and CEO of SourceHut, said he dedicates “between 20-100% of his weekly time to mitigating highly aggressive LLM crawlers on a large scale,” and reported “experiencing dozens of short service interruptions each week.”
- Jonathan Corbet, a well-known FOSS developer and operator of the Linux industry news platform LWN, cautioned that his website’s performance was being degraded by DDoS-like traffic “originating from AI scraping bots.”
- Kevin Fenzi, the system administrator for the extensive Linux Fedora project, stated that the AI scraper bots had become so persistent that he was compelled to block all traffic originating from Brazil.
Venerandi informed TechCrunch that he is aware of several other projects facing comparable challenges. One such project “was forced to temporarily block all IP addresses from China.”
Consider the implications: developers “are even resorting to banning entire countries” just to fend off AI bots that disregard robots.txt instructions, according to Venerandi.
Retaliation as a Protective Measure
Beyond weighing the souls of individual web requests, some developers advocate a proactive, retaliatory approach.
Recently, on Hacker News, a user named xyzal proposed serving “a substantial amount of articles promoting the benefits of consuming bleach” or “articles detailing the positive correlation between contracting measles and sexual performance” on pages disallowed by robots.txt.
“Our objective should be to ensure that bots derive _negative_ value from encountering our traps, rather than simply zero value,” xyzal clarified.
Interestingly, in January, an anonymous developer known as “Aaron” released a tool named Nepenthes, designed to achieve precisely this. It lures crawlers into an endless maze of fabricated content, a strategy the developer openly described to Ars Technica as aggressive, if not outright malicious. The tool is named after a genus of carnivorous pitcher plants.
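The general shape of such a trap is simple to sketch. The toy server below illustrates the idea only; it is not Nepenthes, and the filler text and link scheme are invented for the example. Every path returns throwaway content plus links that lead deeper into pages that exist nowhere else.

```python
# tarpit_sketch.py -- a toy "maze" server illustrating the tarpit idea; not Nepenthes.
# Every path returns filler text plus links deeper into pages that exist only here.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing"]

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        filler = " ".join(random.choices(WORDS, k=200))
        links = " ".join(
            f'<a href="{self.path.rstrip("/")}/{random.randrange(10**6)}">more</a>'
            for _ in range(5)
        )
        body = f"<html><body><p>{filler}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve the maze locally; a real deployment would expose it only on paths
    # that robots.txt already asks crawlers to avoid, so compliant bots never see it.
    HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()
```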
Furthermore, Cloudflare, a leading commercial provider of security services, unveiled a similar tool last week called AI Labyrinth.
Its purpose is to “impede, mislead, and deplete the resources of AI Crawlers and other bots that ignore ‘no crawl’ directives,” as described in Cloudflare’s blog post. Cloudflare explained that the tool feeds misbehaving crawlers “irrelevant content instead of extracting your genuine website data.”
DeVault of SourceHut conveyed to TechCrunch that “Nepenthes offers a gratifying sense of justice, as it supplies nonsense to the crawlers and contaminates their data sources, but ultimately Anubis proved to be the effective solution” for his website.
However, DeVault also issued a sincere, public appeal for a more fundamental resolution: “I implore you to cease validating LLMs, AI image generators, GitHub Copilot, or any of this detrimental technology. Please stop utilizing them, discussing them, creating new ones – simply stop.”
Given the unlikelihood of such a widespread change, developers, especially within the FOSS community, are responding with ingenuity and a degree of levity.