Perplexity Accused of Ignoring AI Scraping Blocks to Harvest Website Content

AI startup Perplexity is under fire after internet infrastructure company Cloudflare accused it of actively evading website restrictions to scrape content without permission.

In a blog post published Monday, Cloudflare said it detected Perplexity’s bots circumventing site protections by disguising their identity and using deceptive techniques to harvest data. Specifically, the AI firm allegedly ignored robots.txt directives, a widely accepted web standard that lets site owners tell bots which pages not to access, and used generic browser headers to impersonate legitimate traffic, such as Google Chrome on macOS.
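To illustrate why the user-agent string matters here, the following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the `ExampleAIBot` name, and the URLs are hypothetical and are not taken from Cloudflare's post or from any real site. The sketch shows only how a well-behaved crawler would consult robots.txt: the file is advisory, so a crawler that presents a generic browser string is matched only against the wildcard rules, and nothing in the file itself stops a crawler that chooses not to check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler everywhere,
# and keep every other agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that declares itself honestly is told to stay away.
declared = is_allowed("ExampleAIBot", "https://example.com/article")

# The same request under a generic browser string matches only the
# wildcard group, so the named block no longer applies to it.
generic = is_allowed(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "https://example.com/article",
)

print(declared, generic)
```

Because compliance depends on the crawler identifying itself and honoring the answer, site owners who want enforcement typically pair robots.txt with network-level blocking.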

“This activity was observed across tens of thousands of domains and millions of requests per day,” Cloudflare wrote. “We were able to fingerprint this crawler using a combination of machine learning and network signals.”

The accusation comes amid growing tensions between AI companies hungry for training data and website operators seeking to protect their content. Perplexity, like many AI firms, relies heavily on large-scale data scraping to power its generative AI models, often pulling from websites without explicit consent.

Cloudflare said it launched an investigation after numerous customers complained about persistent crawling by Perplexity, even after setting up blocks specifically designed to stop its bots. Tests conducted by Cloudflare confirmed the evasive behavior, including the use of alternate IP addresses and spoofed user agents to dodge detection.

Perplexity has denied the allegations. Spokesperson Jesse Dwyer dismissed Cloudflare’s findings as a “sales pitch,” telling TechCrunch that the evidence shown “proves no content was accessed.” He also claimed the bot identified by Cloudflare “isn’t even ours.”

In response, Cloudflare has removed Perplexity’s bots from its list of verified crawlers and deployed new countermeasures to block them more effectively.

This isn’t the first time Perplexity has faced accusations of unauthorized scraping. The company has previously been scrutinized for pulling in content from websites that explicitly prohibit such activity.

The conflict highlights a broader industry debate about data usage, consent, and the sustainability of web publishing in the AI era. Cloudflare, which supports millions of websites, has recently taken a firmer stance. In June, it launched a marketplace that allows site owners to charge AI firms for scraping, and it has offered free tools to block unwanted crawlers.

“AI is breaking the business model of the internet,” Cloudflare CEO Matthew Prince said last month. “We have to draw the line somewhere.”

Last year, publications like Wired accused Perplexity of plagiarizing their content. When pressed weeks later at TechCrunch Disrupt 2024, CEO Aravind Srinivas struggled to clearly define what the company considers plagiarism during a live interview with reporter Devin Coldewey.
