Perplexity AI Scraping Controversy: Websites Blocked

Perplexity AI Accused of Circumventing Website Restrictions

According to internet infrastructure provider Cloudflare, the AI startup Perplexity is crawling and scraping website content, even from sites that have explicitly told it not to.

Cloudflare's Research Findings

On Monday, Cloudflare published research detailing its observations. According to the research, Perplexity is ignoring websites' blocks and concealing its crawling and scraping activity.

Cloudflare’s researchers contend that Perplexity is intentionally masking its identity when attempting to scrape web pages, effectively working to bypass a website’s stated preferences.

The Role of Data in AI Development

AI products, including those developed by Perplexity, depend on large amounts of data gathered from the internet.

AI startups have long scraped text, images, and videos from online sources, often without explicit permission, to make their products work.

Websites have increasingly tried to counter this practice with the robots.txt file, a web standard that tells search engines and AI companies which pages they may crawl and index and which they may not.
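To illustrate how a robots.txt file expresses those preferences, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name ExampleAIBot, the example.com URL, and the rules themselves are hypothetical and are not taken from Cloudflare's report. The sketch also shows why the mechanism depends on honest self-identification: the rules are matched against whatever name the crawler announces.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # A generic browser-style user agent string (illustrative only).
    browser_ua = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))  # the named crawler is disallowed
    print(is_allowed(ROBOTS_TXT, browser_ua, url))      # falls under the wildcard rule
```

A well-behaved crawler runs a check like this before each fetch. Nothing in the file enforces the answer, so compliance is voluntary.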

Perplexity's Methods of Circumvention

Perplexity appears to be deliberately bypassing these restrictions by modifying the “user agent” of its bots.

A user agent is a string sent with each request that identifies a website visitor's software and device, such as the browser and its version.

Perplexity is also reportedly changing the autonomous system numbers (ASNs) its requests come from, according to Cloudflare's findings. An ASN is a numerical identifier for a large network on the internet.

“This activity was noted across a large number of domains – tens of thousands – and involved millions of requests daily,” stated Cloudflare in its published post.

“We were able to identify this crawler through a combination of machine learning techniques and network signal analysis.”

Perplexity's Response

Perplexity spokesperson Jesse Dwyer characterized Cloudflare’s blog post as a “sales pitch.”

In an email to TechCrunch, Dwyer asserted that the screenshots included in the post “demonstrate that no content was actually accessed.”

A subsequent email from Dwyer claimed that the bot identified in the Cloudflare blog post “does not belong to us.”

Cloudflare's Actions

Cloudflare first noticed the behavior after complaints from its customers.

These customers reported that Perplexity continued to crawl and scrape their sites even though they had added rules to their robots.txt files and specifically blocked Perplexity’s known bots.

Cloudflare then ran its own tests and confirmed that Perplexity was circumventing these blocks.

“Our observations revealed that Perplexity employs not only its declared user-agent but also a generic browser designed to mimic Google Chrome on macOS when its declared crawler is blocked,” Cloudflare explained.
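The weakness Cloudflare describes can be sketched with a simple server-side filter that blocks requests by user agent. This is an illustration only, not Cloudflare's detection method, which the company says relies on machine learning and network signals. The token exampleaibot and both user agent strings are hypothetical.

```python
# Hypothetical blocklist of lowercase tokens found in declared crawler user agents.
BLOCKED_AGENT_TOKENS = ("exampleaibot",)


def should_block(user_agent: str) -> bool:
    """Block a request only if its User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    declared_ua = "Mozilla/5.0 (compatible; ExampleAIBot/1.0)"
    # A generic Chrome-on-macOS string, as a crawler might send to avoid the filter.
    generic_ua = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    print(should_block(declared_ua))  # matches the blocklist
    print(should_block(generic_ua))   # indistinguishable from an ordinary browser here
```

Because the header is supplied by the client, a filter like this stops only crawlers that identify themselves.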

The company has since removed Perplexity’s bots from its list of verified bots and added new techniques to block them.

Cloudflare's Broader Stance on AI Crawlers

Cloudflare has recently taken a harder line on AI crawlers.

Last month, the company announced a marketplace that lets website owners and publishers charge AI scrapers for access to their sites.

Cloudflare’s CEO, Matthew Prince, expressed concerns that AI is disrupting the established business model of the internet, particularly for publishers.

Last year, Cloudflare also released a free tool designed to prevent bots from scraping websites for the purpose of AI training.

Previous Accusations

This is not the first time Perplexity has been accused of unauthorized scraping.

Last year, news organizations, including Wired, alleged that Perplexity was plagiarizing their content.

Weeks later, Perplexity’s CEO, Aravind Srinivas, was unable to provide a clear definition of plagiarism during an interview with TechCrunch’s Devin Coldewey at the Disrupt 2024 conference.
