One company’s devious plan to stop AI web scrapers from stealing your content


AI is stealing your content. We know this is how AI companies have built their highly-valued businesses – by scraping the web and using your data to train their chatbots.

Web scraping isn’t new. In the past, websites could rely on simple protocols like robots.txt to define what could, and could not, be used by web crawlers. Those guidelines were respected by the companies doing the scraping to, say, build results for search engines. AI companies, however, are not abiding by this social contract and are ignoring those instructions.

Cloudflare, a global network service that helps some of the biggest websites in the world deliver content to users, has devised a new plan to deal with AI companies’ web scrapers. And the idea is as positively devious as it is ingenious. 

In a new blog post, Cloudflare has shared how it’s now “trapping misbehaving bots in an AI labyrinth.” Bots that ignore the rules laid out in protocols such as robots.txt, a simple text file that specifies what web crawlers are allowed to do on a site, will be deliberately led astray in order to waste the time and resources of the company running them.
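For readers who haven't seen robots.txt in action, here is a minimal sketch of how a well-behaved crawler might check the file before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the bot names (`ExampleAIBot`, `SearchBot`) and the example.com URLs are made up for illustration and don't come from Cloudflare's post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is banned from the whole site,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SearchBot", "https://example.com/articles/1"),
        ("SearchBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(parser, agent, url))
```

Nothing in this check is enforced by the website. A crawler that never asks the question can simply fetch the page anyway, which is why robots.txt only works as long as crawlers choose to honor it.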

“AI-generated content has exploded…at the same time, we’ve also seen an explosion of new crawlers used by AI companies to scrape data for model training,” Cloudflare said in its post. “AI Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see.”

Cloudflare says it previously just blocked AI web crawlers and scrapers. However, doing so alerted those behind the bots that their access had been denied, and as a result they would shift strategies in order to continue their scraping campaigns.

So, Cloudflare came up with an idea to build a honeypot: a series of fake webpages created with AI-generated content.

The fact that Cloudflare is using AI-generated content to fight AI web scrapers isn’t just for schadenfreude. When an AI model is trained on AI-generated content, the model itself degrades. The industry even has a term for it: “model collapse.” Cloudflare is essentially making sure that bots that break the rules are punished for doing so.

Cloudflare’s post gets into the technical details of building the AI labyrinth. The gist is that Cloudflare designed it so that a human visitor should never see these AI-generated honeypot pages. Even if one did, a human would notice the “AI-generated nonsense” on them. Bots, however, would fall down the rabbit hole, wasting computational resources as they go deeper and deeper through the multiple pages of AI-generated content.
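To make the rabbit-hole idea concrete, here is a toy sketch of a maze of decoy pages and a crawler that blindly follows every link. This is not Cloudflare's implementation. The function names, the three-links-per-page setting and the placeholder filler text are all assumptions made for illustration, and the sketch does not model how the decoy links are kept out of sight of human visitors.

```python
import hashlib

LINKS_PER_PAGE = 3  # assumption: every decoy page links to three more decoys


def child_paths(path):
    """Derive deterministic child decoy paths from the current path."""
    children = []
    for i in range(LINKS_PER_PAGE):
        digest = hashlib.sha256(f"{path}:{i}".encode()).hexdigest()[:8]
        children.append(f"{path.rstrip('/')}/{digest}")
    return children


def render_decoy(path):
    """Build a decoy HTML page: placeholder filler text plus links going deeper."""
    links = "".join(f'<a href="{child}">more</a>' for child in child_paths(path))
    return f"<html><body><p>Filler text for {path}</p>{links}</body></html>"


def crawl(start, budget):
    """Toy crawler that follows every link; returns how many pages it fetched."""
    queue = [start]
    fetched = 0
    while queue and fetched < budget:
        path = queue.pop(0)
        render_decoy(path)  # stands in for an HTTP request to the decoy page
        fetched += 1
        queue.extend(child_paths(path))
    return fetched


if __name__ == "__main__":
    print(crawl("/decoy", budget=50))
```

Because every page links to more pages, the maze never runs out. The toy crawler stops only when its own budget is spent, so however much it is willing to spend on the decoys is wasted.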

Cloudflare customers can opt in to the AI labyrinth right now to protect their content from web scrapers.




