Anthropic’s crawler is ignoring websites’ anti-AI scraping policies


The ClaudeBot web crawler that Anthropic uses to scrape training data for AI models like Claude has hammered iFixit’s website about a million times in a 24-hour period, seemingly violating the repair company’s Terms of Use in the process.

“If any of those requests accessed our terms of service, they would have told you that use of our content is expressly forbidden. But don’t ask me, ask Claude!” said iFixit CEO Kyle Wiens on X, posting images that show Anthropic’s chatbot acknowledging that iFixit’s content was off limits. “You’re not only taking our content without paying, you’re tying up our devops resources. If you want to have a conversation about licensing our content for commercial use, we’re right here.”

iFixit’s Terms of Use policy states that “reproducing, copying or distributing” any content from the website is “strictly prohibited without the explicit prior written permission” from the company, with specific inclusion of “training a machine learning or AI model.” When Anthropic was questioned on this by 404 Media, however, the AI company linked back to an FAQ page that says its crawler can only be blocked via a robots.txt file extension.

Wiens says iFixit has since added the crawl-delay extension to its robots.txt. We have asked Wiens and Anthropic for comment and will update this story if we hear back.
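For reference, here is a minimal robots.txt sketch of the two approaches mentioned above. The ClaudeBot user-agent name comes from Anthropic's own documentation; the specific rules are illustrative, and Crawl-delay is a non-standard directive that only some crawlers honor.

```
# Block Anthropic's crawler from the entire site
User-agent: ClaudeBot
Disallow: /

# Or, instead of blocking, throttle it by asking for a pause
# between requests (Crawl-delay is non-standard; illustrative value)
# User-agent: ClaudeBot
# Crawl-delay: 10
```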

iFixit doesn’t appear to be alone: Read the Docs co-founder Eric Holscher and Freelancer.com CEO Matt Barrie said in Wiens’ thread that their sites had also been aggressively scraped by Anthropic’s crawler. Nor does this appear to be new behavior for ClaudeBot, with several months-old Reddit threads reporting a dramatic increase in Anthropic’s web scraping. In April this year, the Linux Mint web forum attributed a site outage to strain caused by ClaudeBot’s scraping activities.

Disallowing crawlers via robots.txt files is also the opt-out method of choice for many other AI companies like OpenAI, but it doesn’t provide website owners with any flexibility to denote what scraping is and isn’t permitted. Another AI company, Perplexity, has been known to ignore robots.txt exclusions entirely. Still, it is one of the few options available for companies to keep their data out of AI training materials, which Reddit has applied in its recent crackdown on web crawlers.
