AI startup Perplexity is under fire after internet infrastructure company Cloudflare accused it of actively evading website restrictions to scrape content without permission.
In a blog post published Monday, Cloudflare said it detected Perplexity’s bots circumventing site protections by disguising their identity and using deceptive techniques to harvest data. Specifically, the AI firm allegedly ignored robots.txt directives, a widely accepted web standard that lets site owners tell bots which pages not to access, and used generic browser headers to impersonate legitimate traffic, such as Google Chrome on macOS.
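To illustrate why a disguised user agent matters, here is a minimal sketch of how a robots.txt check works, using Python's standard urllib.robotparser module. The crawler name "ExampleBot", the rules, and the URLs are hypothetical and chosen only for illustration; they are not taken from Cloudflare's report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# A generic browser-style user agent string (illustrative).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://example.com/articles/1"

# A crawler that announces itself honestly matches the rule written for it.
declared = is_allowed(ROBOTS_TXT, "ExampleBot", article)   # False

# The same request under a browser-style identity matches only the "*" group.
disguised = is_allowed(ROBOTS_TXT, BROWSER_UA, article)    # True

print(declared, disguised)
```

The rules are matched against whatever name the client presents, and compliance is voluntary: robots.txt states the site owner's wishes but does not enforce them. A site owner who wants to stop a crawler that does not identify itself has to rely on other signals, such as the network fingerprinting Cloudflare describes.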
“This activity was observed across tens of thousands of domains and millions of requests per day,” Cloudflare wrote. “We were able to fingerprint this crawler using a combination of machine learning and network signals.”
The accusation comes amid growing tensions between AI companies hungry for training data and website operators seeking to protect their content. Perplexity, like many AI firms, relies heavily on large-scale data scraping to power its generative AI models, often pulling from websites without explicit consent.
Cloudflare said it launched an investigation after numerous customers complained about persistent crawling by Perplexity, even after setting up blocks specifically designed to stop its bots. Tests conducted by Cloudflare confirmed the evasive behavior, including the use of alternate IP addresses and spoofed user agents to dodge detection.
Perplexity has denied the allegations. Spokesperson Jesse Dwyer dismissed Cloudflare’s findings as a “sales pitch,” telling TechCrunch that the evidence shown “proves no content was accessed.” He also claimed the bot identified by Cloudflare “isn’t even ours.”
In response, Cloudflare has removed Perplexity’s bots from its list of verified crawlers and deployed new countermeasures to block them more effectively.
This isn’t the first time Perplexity has faced accusations of unauthorised scraping. The company has previously been scrutinised for pulling in content from websites that explicitly prohibit such activity.
The conflict highlights a broader industry debate about data usage, consent, and the sustainability of web publishing in the AI era. Cloudflare, which supports millions of websites, has recently taken a firmer stance. In June, it launched a marketplace that allows site owners to charge AI firms for scraping, and it has offered free tools to block unwanted crawlers.
“AI is breaking the business model of the internet,” Cloudflare CEO Matthew Prince said last month. “We have to draw the line somewhere.”
Last year, publications like Wired accused Perplexity of plagiarizing their content. When pressed weeks later at TechCrunch Disrupt 2024, CEO Aravind Srinivas struggled to clearly define what the company considers plagiarism during a live interview with reporter Devin Coldewey.