Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Nemexis · Mar 26, 2025

CommanderZulu said:
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?

As explained in the article, because they try to have the most "up to date" information AND because latest models sometime re-check their sources when a query comes in

JMTronicHobbyist · Mar 26, 2025

Megahedron said:
At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act and winning in court.

It's not even particularly farfetched--these companies are actively evading attempts to block their traffic, and case law does support organizations conducting a DDOS attack on a website as violating the law...

But they can just pay Donnie Trump and no one will enforce the court orders.

phuzz · Mar 26, 2025

Fatesrider said:
Why this isn't illegal, I don't know.

Because they have more money than you.
Look at the recent case with Meta, where they've basically admitted to downloading pirated eBooks and their only defence was "we set our torrent client to not upload, so it's all good bro"
Laws aren't what's written down, they're what's enforced, and with enough money for lawyers there won't be any meaningful enforcement.

clb2c4e · Mar 26, 2025

Could the 'proof of work' solution be changed to actual work? E.g. mining Bitcoin, searching for seti.

Then the companies that are affected could make a little money or manpower off of reach request but for the individual user its hopefully a small extra step.

The thought being is then the cost would be put more back on the ai companies.

I have no knowledge in this sector so this might simply be impossible.

JMTronicHobbyist · Mar 26, 2025

fuzzyfuzzyfungus said:
At least Jon Steinkek will be inspired by the sight of another too-readily encouraged plunder and destruction of the former ecology to mint The GTX of Wrath; widely regarded as the first great american NFT.

Nahhh, he'll probably just build a robot that breaks necks accidentally but doesn't die when it gets shot.

ambivalent · Mar 26, 2025

I can confirm I've had to implement protective measures for the company I work for due to unreasonable hyperaggressive AI bot scans, and it's a reasonably large hosting service - if you think it's bad against a single domain, imagine the cumulative effect against tens of thousands. Let's call them what they are - data vampires. They consume all the bandwidth and compute resource they're allowed to get away with and give absolutely nothing in return.

JMTronicHobbyist · Mar 26, 2025

AdamWill said:
In addition to the "who's doing this" list: we kinda suspect that the cases where we're seeing botnet-like behaviour - small numbers of requests from big ranges of residential IP addresses - probably aren't being done by significantly-sized AI companies directly. The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this. It'd be fascinating to know if there's been a noticeable drop in the amount of other nefarious things being done by botnets recently.

Hmm, I've just realized the lady who says she works for a debt collector but stumbles on and doesn't give the exact same name each time hasn't called me in a while. I'm starting to miss Miss Bergum/Bergman/Berman/Mmbermn from Asset Recovery Solutions/Systems/Services.

JMTronicHobbyist · Mar 26, 2025

Fatesrider said:
Handy, yes, but I'm sure others have pointed out, it's regularly ignored.

When it comes to AI's, they don't bother asking. They execute a knockless entry, and fuck with everything inside looking for all it can grab.

Why this isn't illegal, I don't know.

Everyone knows it's not illegal if it's done with computers. Unless you're a university student. I guess that's why all the tech bros dropped out of college!

JMTronicHobbyist · Mar 26, 2025

HalfFlat said:
Arguing from the anthropic principle, it's the worst timeline ever that stills allows enough civilization to allow us to complain about the timeline on a site like Ars.

You say that, but they could be doing a "thousand flowers" style campaign - a year from now we're all in an El Salvadorian prison and they're making us watch TCL films Clockwork Orange style.

chateauarusi · Mar 26, 2025

It remains to be seen whether the current incarnation of "AI" will indeed produce some number of transformative use cases, as argued by advocates and promoters, or if it'll be a technological dead end. I am deeply skeptical, but I'm reluctantly willing to reserve judgment.

What stories like this make arguable, though, is that the current human implementation of the technology sucks festering goat anus.

snowcone · Mar 26, 2025

I was setting up a bunch of Linux (mint) desktops this weekend and i swear i was getting the most ridiculously low download speeds from the repos i was downloading from. github was also slow.
i was getting between 2 and 4 mbps, meanwhile i could download from Steam at around 70mbps.

It seems like a trend recently, i've seen this on 3 different internet connections. Either this is some weird peering issue (I live in South Africa) or i'm competing for bandwidth with these damn bots...

Trondal · Mar 26, 2025

lionman said:
Isn't capitalism beautiful?? Only this fucking commies disagree with this.

Do you suggest reforming capitalism, or something else entirely?

blank0 · Mar 26, 2025

Doesn’t this show that an “opt-out” approach, like the one proposed by UK government, is going to be completely theoretical and just means no copyright protection at all? After all these sites already opted out and it means absolutely nothing to the AI industry.

Wheels Of Confusion · Mar 26, 2025

snowcone said:
I was setting up a bunch of Linux (mint) desktops this weekend and i swear i was getting the most ridiculously low download speeds from the repos i was downloading from. github was also slow.
i was getting between 2 and 4 mbps, meanwhile i could download from Steam at around 70mbps.

It seems like a trend recently, i've seen this on 3 different internet connections. Either this is some weird peering issue (I live in South Africa) or i'm competing for bandwidth with these damn bots...

I got the same when updating an MX install this weekend. Kbps speeds from the repos. I live in the US. I hadn't thought to connect with AI trawlers.

ForbiddenBarn · Mar 26, 2025

I wonder if "Amazon" is bots directly run by Amazon or bots running on AWS infrastructure by random companies?

Bigdoinks · Mar 26, 2025

adespoton said:
I've been wondering whether the solution is to present code to the bot/user that only a bot would know how to handle. So if someone's just doing straight HTML pulls with basic linkbacks from other HTML pages, let it through. If they're actually loading the javascript tags, add in a delay in the response (so legitimate users on newer browsers will get access, but it'll have a performance hit). And anything that fits a known AI bot profile... it gets sent to Cloudflare's maze or similar.

I think you have it the other way around. Give the bots a minified JS bundle and a boilerplate HTML file and let them sort it out.

The JS Beaver has seen it's shadow: Client-side React is cool again! /s

J.King · Mar 26, 2025

Last week I went through my Forgejo server's access logs and wrote out a robots.txt file using Forgejo's own as a starting point. Fortunately it cut down on things a fair bit, because the main motivator for my doing this was that I actually needed to read the access log to debug a problem with some client software accessing another one of my domains. The constant flood of AI bot requests was making it impossible.

There's still the odd burst of garbage, of course, not to mention the genuinely misbehaved robots, but it's nuts that even "well-behaved" bots which identify themselves and respect robots.txt are not actually well-behaved these days.

marsilies · Mar 26, 2025

David Mayer said:
I will still never forget the day that Safari for windows released, the web was never faster than it was then, page loads on pretty much all sites were functionally instant (like McMaster-Carr today, seriously give that site a try, it is mindbogglingly fast)..

I watched a video about how fast the McMaster-Carr site is, and some of the tricks they use:

View: https://www.youtube.com/watch?v=-Ln-8QM8KhQ

Windhaven · Mar 26, 2025

plectrum said:
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client side blocking script. (That's also why those users can't use other sites, because they don't work with their browsers)

Beyond a bigger VPS and sucking up the traffic I'm not sure what else I can do. (although I'll investigate ai.robots.txt as it looks handy)

Other people have suggested CDNs like Cloudflare, but they might have settings/opinions about security that means that people running ancient browsers won’t be able to connect.

Just spitballing (this would be a moderate bit of technical work, so might be impractical to do), but could you move the main site to something modern and protected, and then allow registered to generate a code that could be used on a legacy mirror that requires the code to read-only access the site, otherwise the request is blocked?
Yeah, it would be similar to using logins, but you’d have less risk of stuff being leaked, and moderately less security concerns from sending read-only codes rather than full logins.

(FWIW - I’m a science major who mainly programs to analyze data and reads enough tech articles to sound like I know what I’m doing - feel free to correct me if this is a bad idea)

marsilies · Mar 26, 2025

CommanderZulu said:
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?

The article notes one possible reason:

This pattern suggests ongoing data collection rather than one-time training exercises, potentially indicating that companies are using these crawls to keep their models' knowledge current.

Going into more detail, when a user asks ChatGPT or Google Search a question, they go and poll the source(s) again to make sure their answer is current. So it's not for training, it's for feeding into the LLM as part of the prompt to generate a response. So they could be doing this on-demand, or for something frequently asked, they limit to "only" checking once every 6 hours.

As for only downloading changed content, how would you know it's changed unless you check it? Websites don't broadcast what pages have changed, and who would they be broadcasting that info to? Maybe something could be put in the metadata on a page to indicate when it last changed, but then websites could lie about that to stop bot crawling. And then there's dynamically generated pages, where it doesn't really exist between web requests.

One thing that could potentially work is if the crawlers all relied on a singular service that checked only for page differences, so they could use that as a reference for polling for new content. But that would still rely on a crawler that was crawling the entire internet on a pretty regular basis.

bigsnake499 · Mar 26, 2025

What about when you detect an AI agent you give them some nonsensical content, thereby poisoning the well?Or make them run a little snippet of a program that you know leads to an infinite loop?

foobarian · Mar 26, 2025

Hopefully the EU will do something. They're one of the few entities both willing and able to do anything to the big tech companies.

foobarian · Mar 26, 2025

bigsnake499 said:
What about when you detect an AI agent you give them some nonsensical content, thereby poisoning the well?Or make them run a little snippet of a program that you know leads to an infinite loop?

That's already being done. Hasn't solved the larger problem.

Reaperman2 · Mar 26, 2025

I work at an academic library, and our archive collections have been absolutely hammered by AI bots for the past year. They don't care about robots.txt, and they change names constantly so they can't be easily detected and blocked.

We finally started using various Cloudflare mechanisms, which sucks because those aren't free and our budget is tight. So we had to reduce the collections/services budgets just to keep the websites -- used daily by researchers all over the world -- up and running.

Aurich · Mar 26, 2025

Killdozer77 said:
If this were a William Gibson novel someone would have hired actual ninja assassins to fix this problem by now.

We're trying, but it takes longer to defrost them out of the vats than you might like.

AdamWill · Mar 26, 2025

DonColeman said:
Given the impact this is having on sites "users", for those sites that are not providing a wide public service, why is not one possible solution/mitigation creating a login system, and either limiting access to logged in users, or slow down/limit requests not from logged-in users?

I get that there is an administrative overhead of adding/managing users and it also makes each request quite a bit more costly... but are we not at the point where the crawlers and the administrative overhead of responding to them, at least in some cases, is worth it?

It is an option. sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. that's also the most common objection to using cloudflare's service; once you involve cloudflare in your hosting you're effectively giving up on the internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

it's likely to happen in some cases though anyway, because at this point we have no good options. you just have to pick a bad one, whether it's requiring logins or using some kind of trap/tarpit thing, whether self-hosted like anubis or just giving up and giving your site to a third-party operator like cloudflare.

scrimbul · Mar 26, 2025

AdamWill said:
It is an option. sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. that's also the most common objection to using cloudflare's service; once you involve cloudflare in your hosting you're effectively giving up on the internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

it's likely to happen in some cases though anyway, because at this point we have no good options. you just have to pick a bad one, whether it's requiring logins or using some kind of trap/tarpit thing, whether self-hosted like anubis or just giving up and giving your site to a third-party operator like cloudflare.

There are other cultural reasons balkanization of the Internet was inevitable even with decentralized protocols.

It can be temporary but at some point the whole world tightly regulates public scraping or undersea cables start getting cut and more firewalls start going up.

LLMs are going to get less useful over time, not more, specifically because information doesn't want to be free to be checked by the entire world every 5 seconds.

MagicDot · Mar 26, 2025

All of this in order to bring us something no one wants nor asked for.

Psyborgue · Mar 26, 2025

Many of these sites like freedesktop.org have reliability issues under ordinary conditions. Getting pounded by misbehaving robots can’t help.

Psyborgue · Mar 26, 2025

Dakel said:
Honestly sounds like we need something like Blackwall at this point. AI crawler hits your site? Hit back harder and force them offline.

If they’re only making outgoing requests it probably won’t do much. Feed them plausible junk can work and now Cloudflare can do it for you.

Psyborgue · Mar 26, 2025

MagicDot said:
All of this in order to bring us something no one wants nor asked for.

I want it, asked for it, but still believe robots.txt should be obeyed. I would like to see that codified in law or enforced with the laws already on the books. It’s accessing a computer without permission — after being explicitly denied even.

deet · Mar 26, 2025

What do these companies think will happen when all the free work they're scraping now goes behind walls to defend against the abuse?

AmitY · Mar 26, 2025

It seems that some solutions are a part of AI being a DDoS.
A real solution like gray-rocking the crawler would be unpopular because it is a thing of the past like selling CD's to people interested in content.

RuralRob · Mar 26, 2025

David Mayer said:
Raise your hands if you remember the incredible benefits offered by download managers, being able to have multiple connections downloading different parts of the same file was incredible.

I will still never forget the day that Safari for windows released, the web was never faster than it was then, page loads on pretty much all sites were functionally instant (like McMaster-Carr today, seriously give that site a try, it is mindbogglingly fast), that is, for a few months until Chrome released with about the same performance and web developers were free to add bloat slowing everything down again.

Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...

Wheels Of Confusion · Mar 26, 2025

RuralRob said:
Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...

You pay for what you get, but then again you also get what you pay for. And their delivery is top-notch: Amazon just can't compete with them.

deet · Mar 26, 2025

Wheels Of Confusion said:
You pay for what you get, but then again you also get what you pay for. And their delivery is top-notch: Amazon just can't compete with them.

And it will be dark times indeed if and when they ever change their style of art.

terrydactyl · Mar 26, 2025

danbert2000 said:
Having thoroughly ravaged the natural world for anything of profit, transnational corporations,backed by billionaires looking for even larger and more fashionable hoards of wealth, set their eyes on the digital commons, hellbent on squeezing all value from society before it collapsed. Thus ended the golden age of human access to both the natural and virtual world.

That reminds me of the line:

When it come time to hang all the capitalists, they will vie for the rope contract.

reekmon · Mar 26, 2025

The tragedy of the commons: if it's free it will get abused into oblivion (see email)

My simplistic question: why not put a data limit on each user who successfully logs in? I assume the vast majority of users are not going to suck up ALL of a data set every time they login

marsilies · Mar 26, 2025

RuralRob said:
Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...

The physical McMaster-Carr catalog, which they still print, is one of Adam Savage's (of MythBusters fame) favorite references:

View: https://www.youtube.com/watch?v=8kbu34dk92s

42Kodiak42 · Mar 26, 2025

clb2c4e said:
Could the 'proof of work' solution be changed to actual work? E.g. mining Bitcoin, searching for seti.

Then the companies that are affected could make a little money or manpower off of reach request but for the individual user its hopefully a small extra step.

The thought being is then the cost would be put more back on the ai companies.

I have no knowledge in this sector so this might simply be impossible.

No and no. (for starters, mining bitcoin isn't actual work and does more harm than good, but there's a wider categorical reason beyond that specific example)

Many distributed work schemes need a decent level of cooperation from the 'workers.' How many people are going to skimp on purely voluntary SETI-At-Home work? Not many, there's no reason to do it wrong, so SETI-At-Home only needs to account for a handful of bad actors who will be malicious for its own sake. With a "proof-of-work" scheme, like those employed by blockchains, you still have a vast majority of workers involved to make some gains out of it, they don't get anything if they're caught skimping there, and they're easy to catch because there aren't enough skimpers to outweigh those doing the actual work.

It doesn't quite work that well as a blocking system when you need to block this much traffic. In some of these cases, only 3% of the access requests are from genuine users willing to complete the proof of work. 97% of access requests are coming from sources malicious enough to disregard your wishes not-to-be-crawled, they may very well be malicious enough to send a canned, incorrect response to the proof-of-work challenge, restricting you to types of work that can be checked faster than they can be solved, but even then, the guys you want to stop aren't providing you with any valuable work in the first place, so you're not actually getting anything out of the crawler DDOS no matter what you try.

Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Centurion

Smack-Fu Master, in training

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Legatus Legionis

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Ars Praefectus

Ars Legatus Legionis

Smack-Fu Master, in training

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Tribunus Militum

Director of Many Things

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Scholae Palatinae

Account Banned

Account Banned

Account Banned

Ars Praefectus

Wise, Aged Ars Veteran

Seniorius Lurkius

Ars Legatus Legionis

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Praetorian

Ars Legatus Legionis

Ars Scholae Palatinae

nproxy.org