Devs say AI crawlers dominate traffic, forcing blocks on entire countries

AdamWill

Ars Scholae Palatinae
854
Subscriptor++
The tragedy of the commons: if it's free it will get abused into oblivion (see email)

My simplistic question: why not put a data limit on each user who successfully logs in? I assume the vast majority of users are not going to suck up ALL of a data set every time they login 🤷‍♂️
they're not logging in at all.
 
Upvote
5 (5 / 0)

clb2c4e

Smack-Fu Master, in training
50
No and no. (For starters, mining Bitcoin isn't actual work and does more harm than good, but there's a wider categorical reason beyond that specific example.)

Many distributed work schemes need a decent level of cooperation from the 'workers.' How many people are going to skimp on purely voluntary SETI@home work? Not many; there's no reason to do it wrong, so SETI@home only needs to account for the handful of bad actors who are malicious for its own sake. With a proof-of-work scheme like those employed by blockchains, the vast majority of workers are involved to get something out of it; they get nothing if they're caught skimping, and they're easy to catch because there aren't enough skimpers to outweigh those doing the actual work.

It doesn't work quite as well as a blocking system when you need to block this much traffic. In some of these cases, only 3% of access requests come from genuine users willing to complete the proof of work. The other 97% come from sources malicious enough to disregard your wish not to be crawled, so they may very well be malicious enough to send a canned, incorrect response to the proof-of-work challenge. That restricts you to types of work that can be checked faster than they can be solved, and even then, the parties you want to stop aren't providing you with any valuable work in the first place, so you're not actually getting anything out of the crawler DDoS no matter what you try.
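To make the "checked faster than they can be solved" part concrete, here's a minimal hashcash-style sketch in Python (the difficulty value is made up, and this isn't any particular blocker's implementation): the client grinds through nonces, while the server verifies with a single hash.

    import hashlib
    import os

    DIFFICULTY = 20  # required leading zero bits; an arbitrary example value

    def meets_target(challenge: bytes, nonce: int) -> bool:
        # One SHA-256 call: cheap for the server no matter how long solving took.
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: bytes) -> int:
        # Expensive for the client: on average ~2**DIFFICULTY attempts.
        nonce = 0
        while not meets_target(challenge, nonce):
            nonce += 1
        return nonce

    challenge = os.urandom(16)             # server issues a fresh random challenge
    nonce = solve(challenge)               # client burns CPU finding a valid nonce
    assert meets_target(challenge, nonce)  # server accepts with one hash

The catch is exactly what's described above: the asymmetry only costs an attacker CPU time, and a crawling operation with a big enough budget can simply pay it.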
Bother. Well, hopefully the AI bubble bursts rather soon, then.
 
Upvote
0 (0 / 0)

vvax56nM

Wise, Aged Ars Veteran
112
LLMs are going to get less useful over time, not more, specifically because information doesn't want to be free to be checked by the entire world every 5 seconds.
Even if that weren't the case, LLMs are going to be seriously enshittified once investors start actually demanding some return on their investment. I'm sure all of the AI companies are hard at work figuring out how to introduce product placement into LLM answers.
 
Upvote
3 (3 / 0)

jacs

Ars Centurion
296
Subscriptor
Even if that weren't the case, LLMs are going to be seriously enshittified once investors start actually demanding some return on their investment. I'm sure all of the AI companies are hard at work figuring out how to introduce product placement into LLM answers.
What really has them drooling is the idea that "soon" they'll have the ability to do product placement in real time (on live TV, even!) on demand. That's where the real money will be.
 
Upvote
3 (3 / 0)

vvax56nM

Wise, Aged Ars Veteran
112
What really has them drooling is the idea that "soon" they'll have the ability to do product placement in real time (on live TV, even!) on demand. That's where the real money will be.
Just wait till you have a Neuralink implanted and they can put ads directly into your brain. (shudders)
 
Upvote
1 (1 / 0)

OrvGull

Ars Legatus Legionis
10,669
The physical McMaster-Carr catalog, which they still print, is one of Adam Savage's (of MythBusters fame) favorite references:


https://www.youtube.com/watch?v=8kbu34dk92s

I'm pretty sure I spotted a McMaster-Carr catalog on a shelf in one of the labs in Big Hero 6. That movie was surprisingly good about showing real test equipment and such, so I think it was verisimilitude, not product placement. Few fabrication labs would be without a copy.
 
Upvote
2 (2 / 0)

OrvGull

Ars Legatus Legionis
10,669
The tragedy of the commons: if it's free it will get abused into oblivion (see email)

My simplistic question: why not put a data limit on each user who successfully logs in? I assume the vast majority of users are not going to suck up ALL of a data set every time they login 🤷‍♂️
For the most part, these bots aren't logging in.

You could limit by IP, but that just rewards spreading the traffic across more proxies.
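For reference, "limit by IP" usually means something like a per-address token bucket. A rough Python sketch with made-up limits, which also shows why it stops helping once the same load is spread across thousands of addresses:

    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second per IP (made-up limit)
    BURST = 10.0  # bucket capacity (made-up limit)

    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(ip: str) -> bool:
        # Refill the caller's bucket based on elapsed time, then spend one token.
        b = buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # over the limit: serve a 429 instead

A crawler spread across tens of thousands of residential proxies never drains any single bucket, which is exactly the problem.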
 
Upvote
1 (1 / 0)

danieltien

Smack-Fu Master, in training
15
Glad to see Nepenthes got a shout-out. This arms race will continue to escalate, with bots talking to bots. This is why the older forms of this problem, such as email spam and spam phone calls, have never really been solved.

Embrace, extend, extinguish. Witness the extinguishing of the web.
Honestly, I want to see Nepenthes or a fork of it start taking measures to poison the data rogue AI crawlers take. If their behavior doesn't result in a pain response of some kind, behaviorally, nothing will ever change.
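For anyone curious, a tarpit of that kind is basically an endpoint that serves endless machine-generated filler plus links back into itself, so a misbehaving crawler wastes its crawl budget and ingests junk. The sketch below is just the general idea in Python/Flask, not Nepenthes' actual code; the route, word list, and numbers are all made up:

    import random
    from flask import Flask  # assumes Flask is installed; any framework works

    app = Flask(__name__)
    WORDS = ["lorem", "ipsum", "widget", "flange", "sprocket"]  # filler vocabulary

    @app.route("/maze/<int:page>")
    def maze(page: int):
        # Meaningless text plus links that only lead deeper into the maze.
        babble = " ".join(random.choices(WORDS, k=200))
        links = "".join(
            f'<a href="/maze/{random.randint(0, 10**9)}">more</a> ' for _ in range(10)
        )
        return f"<html><body><h1>Page {page}</h1><p>{babble}</p>{links}</body></html>"

    # Real tarpits also drip the response out slowly, and keep well-behaved
    # crawlers away by disallowing the path in robots.txt.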
 
Upvote
3 (3 / 0)

mgforbes

Ars Praetorian
442
Subscriptor++
Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...
And a shout-out as well for DigiKey, another extremely fast and clean interface. Whaddaya got, how many ya got, how much do ya want for 'em. I don't care about your stock price, your CEO's cat blog or your corporate headquarters interior decor.
 
Upvote
2 (2 / 0)
At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act and winning in court.

That was my initial thought. The problem is that most big AI startups are so well funded they will simply be able to outspend most companies when it comes to legal costs.

I think the more likely scenario is that CDN companies will find a technical way to block AI crawlers at scale, and your only option for hosting will be via one of those CDNs.

I do wonder how companies like Squarespace deal with this.
 
Upvote
0 (0 / 0)

scrimbul

Ars Tribunus Militum
2,460
Honestly, I want to see Nepenthes or a fork of it start taking measures to poison the data rogue AI crawlers take. If their behavior doesn't result in a pain response of some kind, behaviorally, nothing will ever change.
We should start seeing more of: "This will probably violate the CFAA if you do anything but try to feed LLMs misinformation in their training data, so use at your own risk, but here's how you either seize control of their data center servers and cloud instances, or simply force the crawlers to download spyware that reports back everything about their home networks, with no obfuscation..."

Until the foreign subcontractors doing the scraping have to start practicing basic cybersecurity or watch their people get doxxed, this won't end, since regulating it has no hope.
 
Upvote
1 (1 / 0)
Problem is, legit companies are going to pull back once that inevitably happens. Sketchier ones will throw money at fines to make it go away. And truly scummy ones will simply attack from foreign nations, just like spammers do today. It's all happened before and will happen again.

Dead internet is dead.
Do we dial up again?
 
Upvote
0 (0 / 0)

Jim Salter

Ars Legatus Legionis
16,905
Subscriptor++
I can confirm this, sadly. This morning, my wife let me know that our business' billing application was refusing logins. When I began troubleshooting, I discovered that this was an untrapped error caused by ENOSPC (no space on disk), which in turn was caused by Amazon and OpenAI hammering the jesus out of a mediawiki site on the same server.

[Attached charts: 2024-stats.png and 2025-stats.png]

That server went from about 600 megabytes of bandwidth per month as of Jan-Nov 2024, to 25 gigabytes of bandwidth per month in February and March!

The scrapers are hitting RecentChanges on the mediawiki site, just like they do on the "big" wiki, and as a result are consuming very nearly all the CPU and storage throughput available on the entire system as well:

[Attached chart: amazon beating hell out of my webserver.png]

Here's a chart showing the number of httpd access-log entries from each user-agent string seen in March 2025. The top two are OpenAI's GPTBot and Amazon's Amazonbot. Scope those bars!
[Attached chart: March 2025_ access log entries vs. User Agent at freebsdwiki.net.png]
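For anyone who wants to pull the same numbers out of their own logs, counting hits per user agent is only a few lines. A quick Python sketch assuming the common combined log format (the log path is a placeholder):

    import re
    from collections import Counter

    # In combined log format the user agent is the last quoted field on the line.
    UA_RE = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open("/var/log/httpd-access.log") as log:  # placeholder path
        for line in log:
            match = UA_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    for agent, hits in counts.most_common(10):
        print(f"{hits:>8}  {agent}")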
 
Upvote
5 (5 / 0)


Jim Salter

Ars Legatus Legionis
16,905
Subscriptor++
I wonder if "Amazon" is bots directly run by Amazon or bots running on AWS infrastructure by random companies?
It's run directly by Amazon. "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to more accurately answer questions for customers."

https://developer.amazon.com/en/amazonbot

I can confirm even more directly, because I am the proud owner of a mediawiki site that Amazon's bots have been hammering relentlessly, to the point that a website on the same server crashed today due to the load from Amazon and from OpenAI. This means I can see the IP addresses of the Amazonbot log entries, and yes, they all come not only from Amazon space, but from Amazon CRAWL space specifically. They belong to Amazon, and Amazon's bidding is what they are doing.
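If you want to check that sort of thing yourself, the usual trick is forward-confirmed reverse DNS: resolve the IP to a hostname, make sure it sits under the domain the bot operator documents, then resolve that hostname back and confirm it returns the same IP. A rough Python sketch; the suffix below is a placeholder, not Amazon's actual crawl domain:

    import socket

    def confirm_crawler(ip: str, expected_suffix: str) -> bool:
        # Reverse-resolve, check the documented domain, then forward-confirm.
        try:
            hostname = socket.gethostbyaddr(ip)[0]
            if not hostname.endswith(expected_suffix):
                return False
            return socket.gethostbyname(hostname) == ip
        except OSError:
            return False

    # Placeholder suffix; use whatever the bot operator actually documents.
    print(confirm_crawler("192.0.2.10", ".example-crawler.example.com"))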
 

[Attachment: amazon beating hell out of my webserver.png]
Upvote
4 (4 / 0)
The article notes one possible reason:


Going into more detail: when a user asks ChatGPT or Google Search a question, it goes and polls the source(s) again to make sure the answer is current. So it's not for training; it's for feeding into the LLM as part of the prompt used to generate a response. They could be doing this on demand, or, for something frequently asked, limiting it to "only" checking once every six hours.

As for only downloading changed content, how would you know it's changed unless you check it? Websites don't broadcast which pages have changed, and who would they broadcast that info to? Maybe something could be put in a page's metadata to indicate when it last changed, but then websites could lie about that to stop bot crawling. And then there are dynamically generated pages, where the content doesn't really exist between requests.

One thing that could potentially work is if the crawlers all relied on a single service that checked only for page differences, so they could use it as a reference when polling for new content. But that would still rely on a crawler crawling the entire internet on a pretty regular basis.

I thought this was already mostly solved with search engines and older non-search-engine bots (like Skype's URL checkers or preview generators).
Like:
  • IndexNow-like protocols like https://searchengineland.com/indexn...ndex-to-push-content-to-search-engines-375247 (yes, something like this should be standardized)
  • sites providing a more or less real last-update date / cache lifetime in headers, and crawlers respecting it (see the sketch below)
  • crawlers providing actual user agents and honoring robots.txt (at least for requests that aren't directly user-initiated); by the way, as far as I remember, robots.txt does have a (non-standard) option for a site to request a crawl delay
  • using sensible defaults (multiple times per day for the same page is not sensible)
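On the "last-update date in headers" point: plain HTTP already has this in the form of conditional requests. A crawler that remembers ETag / Last-Modified can re-check a page for the cost of a 304 instead of a full download. A quick sketch using the requests library (the URL is a placeholder):

    import requests

    url = "https://example.com/wiki/SomePage"  # placeholder URL

    first = requests.get(url, timeout=10)
    etag = first.headers.get("ETag")
    last_modified = first.headers.get("Last-Modified")

    # Next poll: ask "has this changed?" instead of re-downloading everything.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    second = requests.get(url, headers=headers, timeout=10)
    if second.status_code == 304:
        print("Not modified; nothing new to fetch.")
    else:
        print(f"Changed; fetched {len(second.content)} bytes.")

Of course this only works when both sides cooperate, which is the whole problem here.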
 
Upvote
1 (1 / 0)
Also, some developments in anti-blocking tech are not driven by a desire to overload sites; they come from groups of users in certain countries who want access but are denied it outright (because the site wants to play political games), or who in some cases can't go through a VPN because of their situation (because some governments want to play with censorship).
So non-datacenter VPNs and more complex tools get used.
Sometimes I think it would be nice for something like the Interledger protocol to be fully standardized and supported at all levels (including by automated tools). Basically, if you feel you need to access this URL, you could pay a very small amount and get it, without having to fight access-prevention tools. Payment would have to be anonymized and non-revocable (so it has to be something like crypto, or the payment processor itself becomes the problem). This would remove some of the reasons to build workarounds for access.


P.S.
Yes, I'm from one of the countries where this matters.
My regular home router setup now divides domains into three groups:
  • domains that can be accessed directly
  • domains where the local government plays with censorship; traffic is routed to the nearest Cloudflare WARP node (GeoIP resolves to the same city for me, and the physical Cloudflare node is in the same country)
  • domains that geoblock from the remote side; traffic is routed via a remote VPN (right now it's Opera VPN; a personal VLESS server is planned)
I also use some semi-automated tools (mostly FanFicFare for Calibre; if that doesn't cover something, a site-specific tool that integrates with an actual browser is planned).
 
Upvote
0 (0 / 0)