Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Nemexis

Smack-Fu Master, in training
24
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?

As explained in the article, because they try to have the most up-to-date information AND because the latest models sometimes re-check their sources when a query comes in.
 
Upvote
2 (2 / 0)
At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act and winning in court.

It's not even particularly far-fetched: these companies are actively evading attempts to block their traffic, and case law does support treating an organization's DDoS of a website as a violation of the law...
But they can just pay Donnie Trump and no one will enforce the court orders.
 
Upvote
2 (3 / -1)

phuzz

Ars Centurion
217
Subscriptor
Why this isn't illegal, I don't know.
Because they have more money than you.
Look at the recent case with Meta, where they've basically admitted to downloading pirated eBooks and their only defence was "we set our torrent client to not upload, so it's all good bro"
Laws aren't what's written down, they're what's enforced, and with enough money for lawyers there won't be any meaningful enforcement.
 
Upvote
24 (24 / 0)

clb2c4e

Smack-Fu Master, in training
50
Could the 'proof of work' solution be changed to actual work? E.g. mining Bitcoin or running SETI@home-style searches.

Then the companies that are affected could get a little money or useful work out of each request, while for the individual user it's hopefully only a small extra step.

The thought being that the cost would then be pushed back onto the AI companies.

I have no knowledge in this sector so this might simply be impossible.
 
Upvote
4 (4 / 0)
At least Jon Steinkek will be inspired by the sight of another too-readily encouraged plunder and destruction of the former ecology to mint The GTX of Wrath, widely regarded as the first great American NFT.
Nahhh, he'll probably just build a robot that breaks necks accidentally but doesn't die when it gets shot.
 
Upvote
2 (2 / 0)

ambivalent

Smack-Fu Master, in training
78
I can confirm I've had to implement protective measures for the company I work for due to unreasonable hyperaggressive AI bot scans, and it's a reasonably large hosting service - if you think it's bad against a single domain, imagine the cumulative effect against tens of thousands. Let's call them what they are - data vampires. They consume all the bandwidth and compute resource they're allowed to get away with and give absolutely nothing in return.
 
Upvote
16 (16 / 0)
In addition to the "who's doing this" list: we kinda suspect that the cases where we're seeing botnet-like behaviour - small numbers of requests from big ranges of residential IP addresses - probably aren't being done by significantly-sized AI companies directly. The most plausible theory I've seen to explain those is that, right now, it may be more lucrative to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails. So we suspect at least some of the kinds of people who build and buy access to botnets are doing this. It'd be fascinating to know whether there's been a noticeable drop in the amount of other nefarious things being done by botnets recently.
Hmm, I've just realized the lady who says she works for a debt collector, but stumbles over her name and doesn't give the exact same one each time, hasn't called me in a while. I'm starting to miss Miss Bergum/Bergman/Berman/Mmbermn from Asset Recovery Solutions/Systems/Services.
 
Upvote
3 (3 / 0)
Handy, yes, but as I'm sure others have pointed out, it's regularly ignored.

When it comes to AIs, they don't bother asking. They execute a knockless entry and fuck with everything inside, looking for anything they can grab.

Why this isn't illegal, I don't know.
Everyone knows it's not illegal if it's done with computers. Unless you're a university student. I guess that's why all the tech bros dropped out of college!
 
Upvote
8 (8 / 0)
Arguing from the anthropic principle, it's the worst timeline ever that still allows enough civilization for us to complain about the timeline on a site like Ars.
You say that, but they could be doing a "thousand flowers" style campaign - a year from now we're all in an El Salvadorian prison and they're making us watch TCL films Clockwork Orange style.
 
Upvote
4 (4 / 0)

chateauarusi

Smack-Fu Master, in training
79
It remains to be seen whether the current incarnation of "AI" will indeed produce some number of transformative use cases, as argued by advocates and promoters, or if it'll be a technological dead end. I am deeply skeptical, but I'm reluctantly willing to reserve judgment.

What stories like this make clear, though, is that the current human implementation of the technology sucks festering goat anus.
 
Upvote
6 (6 / 0)

snowcone

Ars Scholae Palatinae
612
I was setting up a bunch of Linux (Mint) desktops this weekend and I swear I was getting the most ridiculously low download speeds from the repos I was downloading from. GitHub was also slow.
I was getting between 2 and 4 Mbps, meanwhile I could download from Steam at around 70 Mbps.

It seems like a trend recently; I've seen this on 3 different internet connections. Either this is some weird peering issue (I live in South Africa) or I'm competing for bandwidth with these damn bots...
 
Upvote
5 (5 / 0)

Wheels Of Confusion

Ars Legatus Legionis
70,964
Subscriptor
I was setting up a bunch of Linux (Mint) desktops this weekend and I swear I was getting the most ridiculously low download speeds from the repos I was downloading from. GitHub was also slow.
I was getting between 2 and 4 Mbps, meanwhile I could download from Steam at around 70 Mbps.

It seems like a trend recently; I've seen this on 3 different internet connections. Either this is some weird peering issue (I live in South Africa) or I'm competing for bandwidth with these damn bots...
I got the same when updating an MX install this weekend. Kbps speeds from the repos. I live in the US. I hadn't thought to connect it with AI trawlers.
 
Upvote
2 (2 / 0)

Bigdoinks

Ars Scholae Palatinae
874
I've been wondering whether the solution is to present code to the bot/user that only a bot would know how to handle. So if someone's just doing straight HTML pulls with basic linkbacks from other HTML pages, let it through. If they're actually loading the javascript tags, add in a delay in the response (so legitimate users on newer browsers will get access, but it'll have a performance hit). And anything that fits a known AI bot profile... it gets sent to Cloudflare's maze or similar.
I think you have it the other way around. Give the bots a minified JS bundle and a boilerplate HTML file and let them sort it out.

The JS Beaver has seen its shadow: Client-side React is cool again! /s
 
Upvote
2 (2 / 0)

J.King

Ars Praefectus
4,132
Subscriptor
Last week I went through my Forgejo server's access logs and wrote out a robots.txt file, using Forgejo's own as a starting point. Fortunately it cut things down a fair bit, which mattered because the main motivator for doing this was that I actually needed to read the access log to debug a problem with some client software accessing another one of my domains, and the constant flood of AI bot requests was making that impossible.

There's still the odd burst of garbage, of course, not to mention the genuinely misbehaved robots, but it's nuts that even "well-behaved" bots which identify themselves and respect robots.txt are not actually well-behaved these days.
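For anyone wanting to try the same thing, a minimal sketch of that kind of robots.txt might look like the below - the user-agent names are just the commonly published ones, not an exhaustive or current list (the ai.robots.txt project mentioned elsewhere in this thread keeps a fuller one):

```
# Sketch only: the self-identifying AI crawlers change constantly,
# so treat this list as illustrative rather than complete.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /

# Everyone else may index, but stay out of expensive endpoints.
User-agent: *
Disallow: /search/
Crawl-delay: 10
```

And of course this only does anything against bots that bother to read it in the first place.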
 
Upvote
7 (7 / 0)

Windhaven

Smack-Fu Master, in training
45
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client side blocking script. (That's also why those users can't use other sites, because they don't work with their browsers)

Beyond a bigger VPS and sucking up the traffic I'm not sure what else I can do. (although I'll investigate ai.robots.txt as it looks handy)
Other people have suggested CDNs like Cloudflare, but those might have settings/opinions about security that mean people running ancient browsers won't be able to connect.

Just spitballing (this would be a moderate bit of technical work, so might be impractical to do), but could you move the main site to something modern and protected, and then allow registered users to generate a code that grants read-only access to a legacy mirror, with any request lacking the code being blocked?
Yeah, it would be similar to using logins, but you'd have less risk of stuff being leaked, and moderately fewer security concerns from sending read-only codes rather than full logins.

(FWIW - I’m a science major who mainly programs to analyze data and reads enough tech articles to sound like I know what I’m doing - feel free to correct me if this is a bad idea)
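To make that concrete (purely hypothetical - the secret, the helper names, and the mirror URL are all made up for illustration): the modern site could hand a registered user a signed, expiring code, and the JavaScript-free mirror would only need to check it from the query string, e.g. http://mirror.example/page?code=...

```
# Hypothetical sketch of the "read-only code" idea: the main site issues a
# signed, expiring code and the legacy mirror verifies it without needing
# any JavaScript on the client.
import hashlib
import hmac
import time

SECRET = b"change-me"  # shared between the main site and the mirror

def issue_code(username: str, valid_days: int = 30) -> str:
    """Called on the modern, protected site for a logged-in user."""
    expires = str(int(time.time()) + valid_days * 86400)
    msg = f"{username}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]
    return f"{username}:{expires}:{sig}"

def check_code(code: str) -> bool:
    """Called on the legacy mirror before serving a read-only page."""
    try:
        username, expires, sig = code.split(":")
        expiry = int(expires)
    except ValueError:
        return False
    msg = f"{username}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(sig, expected) and expiry > time.time()
```

The nice part is that a leaked code only exposes read access and expires on its own, unlike a full login.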
 
Upvote
2 (2 / 0)

marsilies

Ars Legatus Legionis
23,256
Subscriptor++
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?
The article notes one possible reason:
This pattern suggests ongoing data collection rather than one-time training exercises, potentially indicating that companies are using these crawls to keep their models' knowledge current.

Going into more detail, when a user asks ChatGPT or Google Search a question, the service may go and poll the source(s) again to make sure its answer is current. So it's not for training; it's for feeding into the LLM as part of the prompt to generate a response. They could be doing this on demand, or, for something frequently asked, they might limit themselves to "only" checking once every 6 hours.

As for only downloading changed content, how would you know it's changed unless you check it? Websites don't broadcast which pages have changed, and who would they broadcast that info to? Maybe something could be put in a page's metadata to indicate when it last changed, but then websites could lie about that to stop bot crawling. And then there are dynamically generated pages, where the page doesn't really exist between web requests.

One thing that could potentially work is if the crawlers all relied on a singular service that checked only for page differences, so they could use that as a reference for polling for new content. But that would still rely on a crawler that was crawling the entire internet on a pretty regular basis.
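For what it's worth, HTTP already has something close to that metadata - the Last-Modified/ETag headers plus conditional requests - although it only helps when servers set those headers honestly, and it does nothing for the dynamically generated case. A rough sketch of a crawler using it (the bot name and URLs are placeholders, and a real crawler would persist the cache):

```
# Sketch of conditional revalidation with ETag / Last-Modified, assuming the
# server actually sets those headers. Unchanged pages come back as a bodyless
# 304, so re-checking them costs very little bandwidth.
import requests

cache = {}  # url -> (etag, last_modified, body)

def fetch(url: str) -> str:
    headers = {"User-Agent": "ExampleBot/0.1 (+https://example.org/bot)"}
    cached = cache.get(url)
    if cached:
        etag, last_modified, _ = cached
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304 and cached:
        return cached[2]  # not modified: reuse the stored copy

    cache[url] = (resp.headers.get("ETag"),
                  resp.headers.get("Last-Modified"),
                  resp.text)
    return resp.text
```

None of which helps, of course, when the crawler operators simply don't care how much load they generate.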
 
Upvote
4 (4 / 0)
I work at an academic library, and our archive collections have been absolutely hammered by AI bots for the past year. They don't care about robots.txt, and they change names constantly so they can't be easily detected and blocked.

We finally started using various Cloudflare mechanisms, which sucks because those aren't free and our budget is tight. So we had to reduce the collections/services budgets just to keep the websites -- used daily by researchers all over the world -- up and running.
 
Upvote
25 (25 / 0)

AdamWill

Ars Scholae Palatinae
853
Subscriptor++
Given the impact this is having on sites' users, for those sites that are not providing a wide public service, why isn't one possible solution/mitigation to create a login system and either limit access to logged-in users or slow down/limit requests from users who aren't logged in?

I get that there is an administrative overhead of adding/managing users, and it also makes each request quite a bit more costly... but are we not at the point where, at least in some cases, the crawlers make that administrative overhead worth it?
It is an option. Sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. That's also the most common objection to using Cloudflare's service; once you involve Cloudflare in your hosting, you're effectively giving up on the Internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

It's likely to happen in some cases anyway, because at this point we have no good options. You just have to pick a bad one, whether that's requiring logins, using some kind of trap/tarpit thing (self-hosted, like Anubis), or just giving up and handing your site to a third-party operator like Cloudflare.
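If someone did go the softer route of just slowing down anonymous traffic, the core of it is nothing more than a rate limiter keyed on something like client IP. A toy sketch - in practice you'd do this at the reverse proxy (e.g. nginx's limit_req) rather than in the application, and IP-keyed limits do get blunted by the residential-proxy botnets mentioned upthread:

```
# Toy token-bucket limiter for anonymous clients, keyed by IP.
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second
BURST = 10.0  # bucket capacity (max burst of requests)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_ip: str, logged_in: bool) -> bool:
    if logged_in:
        return True  # authenticated users bypass the limiter
    bucket = buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # caller should answer 429 Too Many Requests
```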
 
Upvote
12 (12 / 0)

scrimbul

Ars Tribunus Militum
2,440
It is an option. Sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. That's also the most common objection to using Cloudflare's service; once you involve Cloudflare in your hosting, you're effectively giving up on the Internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

It's likely to happen in some cases anyway, because at this point we have no good options. You just have to pick a bad one, whether that's requiring logins, using some kind of trap/tarpit thing (self-hosted, like Anubis), or just giving up and handing your site to a third-party operator like Cloudflare.
There are other cultural reasons why balkanization of the Internet was inevitable, even with decentralized protocols.

It may be temporary, but at some point either the whole world tightly regulates public scraping, or undersea cables start getting cut and more firewalls start going up.

LLMs are going to get less useful over time, not more, specifically because information doesn't want to be free to be checked by the entire world every 5 seconds.
 
Upvote
7 (7 / 0)

Psyborgue

Account Banned
7,564
Subscriptor++
Honestly sounds like we need something like Blackwall at this point. AI crawler hits your site? Hit back harder and force them offline.
If they're only making outgoing requests, it probably won't do much. Feeding them plausible junk can work, though, and now Cloudflare can do that for you.
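The junk-feeding approach (Cloudflare's AI Labyrinth, or self-hosted tarpits like Nepenthes) boils down to serving plausible-looking generated pages whose links only lead to more generated pages. A toy sketch of the idea, not any particular product's implementation:

```
# Toy tarpit page generator: deterministic filler text plus links that only
# lead to more generated pages, so a crawler that follows them gets nowhere.
import hashlib
import random

WORDS = ("lorem", "ipsum", "archive", "vintage", "manual", "errata",
         "appendix", "catalogue", "revision", "supplement")

def junk_page(path: str) -> str:
    # Seed from the path so the same URL always returns the same "content".
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest(), "big")
    rng = random.Random(seed)
    filler = " ".join(rng.choice(WORDS) for _ in range(200))
    links = " ".join(
        f'<a href="/maze/{rng.getrandbits(32):08x}">more</a>' for _ in range(10)
    )
    return f"<html><body><p>{filler}</p><p>{links}</p></body></html>"
```

The catch being that you're still paying to serve the junk, just far less than serving your real pages.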
 
Upvote
0 (0 / 0)

Psyborgue

Account Banned
7,564
Subscriptor++
All of this in order to bring us something no one wants nor asked for.
I want it, asked for it, but still believe robots.txt should be obeyed. I would like to see that codified in law or enforced with the laws already on the books. It’s accessing a computer without permission — after being explicitly denied even.
 
Upvote
1 (1 / 0)

RuralRob

Seniorius Lurkius
26
Subscriptor
Raise your hands if you remember the incredible benefits offered by download managers; being able to have multiple connections downloading different parts of the same file was incredible.

I will still never forget the day Safari for Windows was released; the web was never faster than it was then. Page loads on pretty much all sites were functionally instant (like McMaster-Carr today - seriously, give that site a try, it is mind-bogglingly fast). That is, for a few months, until Chrome released with about the same performance and web developers were free to add bloat, slowing everything down again.

Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...
 
Upvote
9 (9 / 0)

terrydactyl

Ars Tribunus Angusticlavius
6,766
Subscriptor
Having thoroughly ravaged the natural world for anything of profit, transnational corporations, backed by billionaires looking for even larger and more fashionable hoards of wealth, set their eyes on the digital commons, hellbent on squeezing all value from society before it collapsed. Thus ended the golden age of human access to both the natural and virtual world.
That reminds me of the line:

When it comes time to hang all the capitalists, they will vie for the rope contract.
 
Upvote
10 (10 / 0)

42Kodiak42

Ars Scholae Palatinae
813
Could the 'proof of work' solution be changed to actual work? E.g. mining Bitcoin or running SETI@home-style searches.

Then the companies that are affected could get a little money or useful work out of each request, while for the individual user it's hopefully only a small extra step.

The thought being that the cost would then be pushed back onto the AI companies.

I have no knowledge in this sector so this might simply be impossible.
No and no. (For starters, mining Bitcoin isn't useful work and does more harm than good, but there's a wider categorical reason beyond that specific example.)

Many distributed work schemes need a decent level of cooperation from the 'workers.' How many people are going to skimp on purely voluntary SETI@home work? Not many; there's no reason to do it wrong, so SETI@home only needs to account for a handful of bad actors who will be malicious for the sake of it. With a "proof-of-work" scheme like those employed by blockchains, the vast majority of workers are involved to make some gain out of it, they get nothing if they're caught skimping, and they're easy to catch because there aren't enough skimpers to outweigh those doing the actual work.

It doesn't work nearly as well as a blocking system when you need to block this much traffic. In some of these cases, only 3% of the access requests are from genuine users willing to complete the proof of work. The other 97% are coming from sources malicious enough to disregard your wish not to be crawled, and they may well be malicious enough to send a canned, incorrect response to the proof-of-work challenge. That restricts you to types of work that can be checked faster than they can be solved, and even then, the parties you want to stop aren't providing you with any valuable work in the first place, so you're not actually getting anything out of the crawler DDoS no matter what you try.
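To make the "checked faster than solved" constraint concrete, here's a toy version of the hash-puzzle challenge that tools like Anubis use (simplified; real implementations differ in the details): the server verifies an answer with a single SHA-256, while an honest client has to grind through roughly 2^20 attempts - and none of that grinding produces anything of value, which is exactly why swapping in "useful" work doesn't buy the site anything.

```
# Toy hash-based proof-of-work, illustrating the cheap-to-verify /
# expensive-to-solve asymmetry. Solving takes ~2**BITS hash attempts;
# verifying takes exactly one.
import hashlib
import os

BITS = 20  # difficulty: required number of leading zero bits

def leading_zero_bits(digest: bytes) -> int:
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count

def solve(challenge: bytes) -> int:
    """What the visitor's browser does: a few seconds of brute force."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= BITS:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """What the server does: a single hash."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= BITS

challenge = os.urandom(16)       # issued per visitor by the server
assert verify(challenge, solve(challenge))
```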
 
Last edited:
Upvote
7 (7 / 0)