Cloudflare just hit the AI web crawlers with a thunderbolt: pay for content or be blocked

The delivery network through whose servers about a fifth of all internet traffic passes is blocking AI web crawlers

John Naughton, Columnist

The big news of the month is that a large tech company has declared war on the AI industry. On 1 July, Cloudflare, a leading cybersecurity and content delivery network (CDN) provider, through whose servers about a fifth of all internet traffic passes, declared “content independence day”. From that day onwards, AI web crawlers – the bots that tech companies use to scrape online content – will, by default, be unable to access sites running on Cloudflare’s servers without permission.

Why is this a big deal? Several reasons. First, CDNs that run datacentres in different locations across the globe are an important part of the internet’s global architecture. As the web grew, companies wanted to serve content locally – to reduce latency, that is, delays in transmission – but maintaining servers in every region was expensive and logistically complicated. Getting your stuff hosted and served by CDNs was a no-brainer.

Second, websites are vulnerable things. If you’re a big corporation, your website becomes a target for distributed denial of service (DDoS) attacks designed to overwhelm it, and CDNs have become pretty good at defending against these virtual assaults. Two months ago, for example, Cloudflare blocked the largest DDoS attack ever recorded: a staggering 7.3 terabits per second, which in about 45 seconds delivered 37.4 terabytes – or the equivalent of flooding a network with more than 9,350 full-length HD movies – directed at one of its customers.

The moral for any chief executive is clear: if you like sleeping at night, use a CDN.

In the last few years, a new kind of threat to online-hosted content has materialised: AI web crawlers. These are internet bots run by AI companies that systematically browse the web, gathering all the content they can find for training generative AI systems such as large language models (LLMs).

Mostly this hoovering has been done without the permission of content creators or owners and rationalised by various kinds of legalistic cant about “fair use” and the like; or with fatuous arguments that creators could always opt out if they didn’t like their intellectual property being summarily appropriated.

All of which explains why Cloudflare’s decision is significant. It now – by default – blocks AI web crawlers from scraping content from its clients’ websites without permission or compensation. In other words, it neatly inverts the cynical logic of the AI giants and their accomplices in the British and other governments. Instead of IP owners having to opt out of being mugged, the assailants have to ask politely – and perhaps also pay up.

Or, as Steven Vaughan-Nichols of the Verge puts it: “On behalf of its 2 million-plus customers, 20% of the web, Cloudflare now blocks AI crawlers … Additionally, Cloudflare promises to detect ‘shadow’ scrapers – bots that attempt to evade detection – by using behavioural analysis and machine learning. What’s good for the AI goose is good for the gander.”

This is good news from an equity viewpoint, but it’s also significant in a wider context, because what’s now becoming evident is that AI’s strip-mining is rapidly undermining the original business model of the web. In that model, you created a website and published content on it. Search engines then indexed the web and made your site findable. People could then visit the site and by doing so provide whatever returns – financial or otherwise – you hoped to get from it.

Increasingly, though, people are using chatbots for search rather than traditional search engines such as Google, Bing and co. And instead of getting a list of sites that may be relevant to their inquiry, they receive a neatly packaged answer to it. In some cases (such as, say, the Perplexity AI tool) the bot provides a list of the sites from which it has compiled its reply. But other bots seem to be less scrupulous about their research.

So what seems to be happening is that LLMs are rapidly becoming “answer machines”. This is convenient for users, obviously, but it also means that they have to trust the process by which the chatbot extracted data from the sources it used – and that it hasn’t “hallucinated” or made things up.

Paradoxically, it may also lead to the drying up of those very sources on which the chatbot has relied, for it turns out that many websites are now being oppressed by importuning chatbots. “The AI boom,” reports Wired, “has produced a corresponding boomlet in AI-focused web crawlers, and these bots scrape webpages with a frequency that can mimic a DDoS attack, straining servers and knocking websites offline.” It’s high time for AI companies to pay per crawl. Three cheers for Cloudflare!

What I’m reading

The book of Luke

Emerson, AI, and the Force by Neal Stephenson is a characteristically elegant Substack essay.

Intellectual heavyweights

A lovely essay by the economist Brad DeLong is Big Serious Books Can Really Be Your Intellectual Friends.

Focus group

Attention Is All You Need is a really thoughtful long view by Kevin Munger on what’s happened to our media ecosystem.

Photograph by Richard B Levine
