
AI Crawlers Drive Up Wikimedia Commons Bandwidth Usage by 50%

April 2, 2025

Wikimedia Foundation Faces Surge in Bandwidth Usage Due to AI Scraping

The Wikimedia Foundation, the organization behind Wikipedia and numerous other collaborative knowledge projects, said Wednesday that bandwidth consumed by multimedia downloads from Wikimedia Commons has grown 50% since January 2024.

The increase, detailed in a blog post published Tuesday, is driven not by greater human demand for information but by automated scraping bots harvesting data to train artificial intelligence (AI) models.

Unprecedented Traffic and Associated Costs

The Foundation stated that its infrastructure is equipped to handle sudden traffic spikes from human users during significant events. However, the volume of traffic generated by these scraper bots is unprecedented, creating escalating risks and financial burdens.

Wikimedia Commons is a freely accessible repository of images, videos, and audio files, available under open licenses or in the public domain.

Disparity in Traffic Patterns

Analysis shows that bots generate roughly two-thirds (65%) of the most resource-intensive traffic, yet account for only about 35% of overall pageviews.

The difference comes down to caching: frequently accessed content is stored in caches close to users, while less popular content must be served from the core data center, which is considerably more expensive. Bots disproportionately request this long-tail material.

“Human readers generally concentrate on specific, often related, subjects,” Wikimedia explains. “Conversely, crawler bots tend to systematically access a larger number of pages, including those with lower popularity.”
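To make the cost asymmetry concrete, here is a minimal simulation sketch. None of it reflects Wikimedia's actual infrastructure: the catalog size, cache size, and request distributions are invented for illustration. It shows how a crawler that walks the whole catalog misses the cache far more often, and therefore triggers far more expensive origin fetches, than a popularity-skewed human audience:

```python
import random
from collections import OrderedDict

# Illustrative numbers only; real CDN capacities and costs differ.
CATALOG = 1_000_000   # distinct media files available
CACHE_SIZE = 10_000   # files the edge cache can hold
REQUESTS = 100_000    # requests to simulate per traffic pattern

def miss_rate(sampler):
    """Replay requests through a simple LRU cache; return the miss rate."""
    cache = OrderedDict()
    misses = 0
    for _ in range(REQUESTS):
        item = sampler()
        if item in cache:
            cache.move_to_end(item)       # cache hit: refresh LRU position
        else:
            misses += 1                   # cache miss: costly origin fetch
            cache[item] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False) # evict least recently used
    return misses / REQUESTS

# Human-like traffic: heavily skewed toward a small set of popular files.
human = lambda: min(int(random.paretovariate(1.2)), CATALOG)
# Crawler-like traffic: walks the catalog uniformly, popular or not.
bot = lambda: random.randrange(CATALOG)

print(f"human-like miss rate:   {miss_rate(human):.1%}")
print(f"crawler-like miss rate: {miss_rate(bot):.1%}")
```

On a typical run the human-like pattern misses only rarely, while the uniform crawler misses almost every time, mirroring the cost profile Wikimedia describes.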

Impact on Site Reliability and Resources

Consequently, the Wikimedia Foundation’s site reliability team is dedicating significant time and resources to blocking these crawlers. This is essential to prevent disruptions for regular users.

On top of that work, the bot traffic drives up the cloud-service costs the Foundation must cover.
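Wikimedia has not published its blocking rules, but a common first line of defense is filtering on User-Agent strings and request rates. The following is a hypothetical sketch of that idea; the agent patterns and limits are illustrative, not Wikimedia's actual configuration:

```python
import time
from collections import defaultdict

# Hypothetical patterns and limits; operators curate these from observed traffic.
BLOCKED_AGENTS = ("GPTBot", "CCBot", "Bytespider")
RATE_LIMIT = 10           # max requests per client per window
WINDOW_SECONDS = 1.0

_recent = defaultdict(list)   # client IP -> timestamps of recent requests

def should_block(client_ip: str, user_agent: str) -> bool:
    """Return True if a request looks like abusive crawler traffic."""
    # Rule 1: known scraper User-Agent strings are refused outright.
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        return True
    # Rule 2: anything exceeding a simple per-IP rate limit is refused too,
    # since many scrapers spoof browser User-Agents.
    now = time.monotonic()
    window = [t for t in _recent[client_ip] if now - t < WINDOW_SECONDS]
    window.append(now)
    _recent[client_ip] = window
    return len(window) > RATE_LIMIT
```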

A Growing Threat to the Open Internet

This situation exemplifies a broader trend threatening the open internet. Last month, software engineer and open-source advocate Drew DeVault lamented that AI crawlers ignore “robots.txt” files, the standard by which sites tell automated clients which pages they may crawl.
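robots.txt is purely advisory, so compliance is voluntary. A well-behaved crawler consults it before fetching anything, as in this minimal sketch using Python's standard library (the URLs and user-agent name are placeholders); the complaint is that many AI scrapers skip this check entirely:

```python
from urllib import robotparser

# Placeholder site; a real crawler would do this for each host it visits.
parser = robotparser.RobotFileParser()
parser.set_url("https://example.org/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# A polite crawler asks permission before fetching each page.
url = "https://example.org/some/page"
if parser.can_fetch("MyCrawler/1.0", url):
    print("allowed to fetch:", url)
else:
    print("disallowed by robots.txt:", url)
```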

Gergely Orosz, who writes The Pragmatic Engineer newsletter, likewise reported last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.

Defensive Measures and Ongoing Challenges

While open-source infrastructure is particularly exposed, developers are fighting back with tools designed to slow, trap, or block crawlers, and some technology companies are pitching in as well.

For example, Cloudflare recently launched AI Labyrinth, which uses AI-generated decoy content to slow down and confuse crawlers.
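Cloudflare has not published AI Labyrinth's internals, but the general tarpit idea can be sketched as follows: serve a misbehaving crawler auto-generated pages whose links lead only to more generated pages, so it burns resources without ever reaching real content. Everything here, from the maze paths to the filler text, is invented for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import hashlib

def maze_page(path: str) -> str:
    """Deterministically generate a page of filler text and decoy links."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = "".join(
        f'<a href="/maze/{seed[i:i + 8]}">more</a> ' for i in range(0, 40, 8)
    )
    return f"<html><body><p>Lorem ipsum {seed}</p>{links}</body></html>"

class Tarpit(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every maze URL resolves to another page of maze-only links,
        # so a crawler that keeps following them never escapes.
        body = maze_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Tarpit).serve_forever()
```

In practice such pages would be served only to clients already flagged as abusive, so human visitors never see them.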

The Potential for Increased Restrictions

However, this remains a continuous cycle of adaptation and counter-adaptation. Ultimately, it could compel many publishers to implement access restrictions, such as logins and paywalls, which would negatively impact all web users.

Key Takeaways

  • Bandwidth consumption on Wikimedia Commons has increased significantly due to AI scraping.
  • Scraper bots are driving up costs and straining infrastructure.
  • The open internet is facing a growing challenge from automated data collection.
  • Defensive measures are being developed, but the situation requires ongoing attention.
Tags: Wikimedia Commons, AI crawlers, bandwidth, AI, image scraping, open access