Devs say AI crawlers dominate traffic, forcing blocks on entire countries

David Mayer

Wise, Aged Ars Veteran
1,106
Raise your hand if you used dial-up in the '90s, when downloading a 40 MB file could take six hours or more. There were also no download managers early on, so if you lost the connection for some reason you had to start over.
Raise your hands if you remember the benefits offered by download managers; being able to have multiple connections downloading different parts of the same file was incredible.

I will still never forget the day Safari for Windows was released. The web was never faster than it was then; page loads on pretty much all sites were functionally instant (like McMaster-Carr today; seriously, give that site a try, it is mind-bogglingly fast). That lasted only a few months, until Chrome was released with about the same performance and web developers were free to add bloat that slowed everything down again.
 
Upvote
50 (50 / 0)
I was reading that on DeVault's blog a few days ago, and that got me curious. (By which I mean that I trust him when he says there's a problem, but he's well known for his loud rants.)

So what I was wondering is: why exactly would major players in the LLM field be using residential proxies to pound on the internet at large, and on modest open source players in particular?

(In other words, using residential proxies is a clear indication of a willingness to work around restrictions, legal or technical; that big companies would condone such behavior is at least surprising, not to mention ethically beyond murky.)
 
Last edited:
Upvote
42 (42 / 0)

AdamWill

Ars Scholae Palatinae
854
Subscriptor++
In addition to the "who's doing this" list: we kinda suspect that the cases where we're seeing botnet-like behaviour - small numbers of requests from big ranges of residential IP addresses - probably aren't being done by significantly-sized AI companies directly. The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this. It'd be fascinating to know if there's been a noticeable drop in the amount of other nefarious things being done by botnets recently.
 
Upvote
50 (50 / 0)
I was reading that on DeVault's blog a few days ago, and that got me curious. (By which I mean that I trust him when he says there's a problem, but he's well known for his loud rants.)

So what I was wondering is: why exactly would major players in the LLM field be using residential proxies to pound on the internet at large, and on modest open source players in particular?

I suspect that the ratio of IPs to bandwidth is favorable for applications where you expect the target to eventually try to blacklist you, or at least want to.

Per unit of bandwidth you can obviously do better with a real internet connection from a well-placed DC, but a random residential dynamic (or CGNAT exit) IP could be anyone. Plus, CDNs and operators of general-public-type sites are presumably more worried about having their own or their customers' sites just 'stop working' for Mr. and Ms. Comcast and Timmy Verizon because they've started blacklisting random residential IPs.

By contrast, if you are using 'respectable' static IPs, your activity is relatively visible and potentially attributable (or, depending on who actually owns the IP range, it runs the risk of your hosting or cloud services provider getting some nastygrams and not wanting Azure or AWS public IPs to become known for having a 50/50 shot at actually being able to download from common OSS projects). And pink-contract/'bulletproof hosting' guys don't really represent a pool of legitimate users that anyone is worried about cutting off, so anyone who can manage to block them will probably be delighted to.
 
Upvote
9 (9 / 0)
Not content with enshittifying their own online services, big tech companies are now sending out bots to force the enshittification of other people's too.

I suspect that increasing people's appetite for bot slop by DDOSing organic alternatives into the ground isn't the primary motivation; but it would not at all surprise me if some of these guys consider it to be a happy side effect to the degree it happens.

If it becomes increasingly untenable to run an OSS project on the public internet, we'll need more vibe coding!
 
Upvote
19 (19 / 0)

AdamWill

Ars Scholae Palatinae
854
Subscriptor++
I was reading that on DeVault's blog a few days ago, and that got me curious. (By which I mean that I trust him when he says there's a problem, but he's well known for his loud rants.)

So what I was wondering is: why exactly would major players in the LLM field be using residential proxies to pound on the internet at large, and on modest open source players in particular?

(In other words, using residential proxies is a clear indication of a willingness to work around restrictions, legal or technical; that big companies would condone such behavior is at least surprising, not to mention ethically beyond murky.)
See my post above. We suspect they aren't, directly. There are likely middlemen. Conveniently, these operators will be extremely hard to find and sue or prosecute, and proving that any large and easier-to-pin-down companies bought datasets from them will similarly be frustrating.
 
Upvote
28 (28 / 0)

Fatesrider

Ars Legatus Legionis
22,964
Subscriptor
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client-side blocking script. (That's also why those users can't use other sites: they don't work with their browsers.)

Beyond a bigger VPS and sucking up the traffic, I'm not sure what else I can do (although I'll investigate ai.robots.txt, as it looks handy).
Handy, yes, but as I'm sure others have pointed out, it's regularly ignored.

When it comes to AIs, they don't bother asking. They execute a knockless entry and fuck with everything inside, looking for all they can grab.

Why this isn't illegal, I don't know.
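For reference, the ai.robots.txt project is essentially a community-maintained robots.txt listing known AI crawler user agents. A trimmed sketch of what deploying it looks like (the agent names below are a small illustrative sample, not the full maintained list):

```
# Sample entries in the style of the ai.robots.txt project; check the project
# for the current, complete list of AI crawler user agents.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /

# Everyone else is still welcome.
User-agent: *
Disallow:
```

Of course, as noted, this only helps against crawlers that actually honor robots.txt.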
 
Upvote
19 (19 / 0)

bretayn

Ars Centurion
216
Subscriptor
This is what I deployed to a phpBB site I have (yes, I know) which was having these problems.
It made a huge difference (traffic down to 1/100th of before) although I'm getting complaints from a very small number of legitimate users who are having problems posting. Far more effective than blocking entire DNS ranges.

Good to know! Thank you. And I would 100% recommend using any WAF in front of phpBB. It’s still written like a legacy PHP app.
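Even short of a full WAF, a plain per-IP rate limit in front of phpBB takes a lot of the pressure off. A minimal nginx sketch (zone name, rates, hostname, and upstream address are all made up; tune to your own traffic):

```nginx
# Shared state keyed by client IP: 10 MB of counters, 30 requests/minute allowed.
limit_req_zone $binary_remote_addr zone=perip:10m rate=30r/m;

server {
    listen 80;
    server_name forum.example.com;          # hypothetical hostname

    location / {
        # Allow a short burst (a page plus its assets), reject sustained hammering.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;

        proxy_pass http://127.0.0.1:8080;   # wherever phpBB is actually served
    }
}
```

The trade-off is exactly the one described above: set the limit too low and a handful of legitimate users will trip it.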
 
Upvote
0 (0 / 0)

adamsc

Ars Praefectus
4,034
Subscriptor++
The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this.

This fits well with the general profligacy everyone sees with traffic (you care a lot more about load if you have to pay for it) and I think it’s also very plausible as a liability shield. These companies know that they are going to be sued many times, and they want a lot of data which is rate-limited or behind a login system of some sort, where having a Facebook IP in the logs is potentially very expensive because it demonstrates knowledge and intent. Bury the same act under a couple layers of subcontractors and then it’s just a regrettable failure of the top-level contractor to manage their subs, and the number on the end of a likely settlement loses a few zeros.
 
Upvote
18 (18 / 0)

Tomcat From Mars

Wise, Aged Ars Veteran
129
Subscriptor
Raise your hand if you used dial-up in the '90s, when downloading a 40 MB file could take six hours or more. There were also no download managers early on, so if you lost the connection for some reason you had to start over.
I spent the better part of a weekend downloading the Diablo demo. After it was done I played it for about an hour, then went to the store and bought it.
 
Upvote
5 (5 / 0)

JBinFla

Wise, Aged Ars Veteran
126
I spent last summer playing whack-a-mole with my self-hosted website. It's a small site my family and I visit for photos, notes, and such. I was getting hammered with multiple requests every second; Amazon was the biggest culprit. I viewed logs and customized my robots.txt, but no dice. You can't block all their IPs, and if you set your throttling too low (100 hits a minute, say), legitimate traffic can get throttled depending on asset loads. It's crazy, but I guess that's the price if you want to self-host.

I added a page to my terms of service and started logging. It basically says my site is free for non-commercial use, but if the data is to be included in any AI, LLM, or other machine learning training data, it can be licensed for such use for USD $1,000,000.00 per page. I have hundreds of pages. I have little oddball things in there that only exist on my site and are otherwise nonsensical. If I find those in any of the big AIs, I will see if I can find one of those lawyers who work for a cut of the winnings, no money out of pocket, and see if I can make it stick.
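If anyone else is doing the same log spelunking, counting requests per client IP and per user agent is usually enough to see who is hammering you. A rough Python sketch, assuming a standard combined-format access log at a hypothetical path:

```python
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path; adjust for your server

# Combined log format: ip - - [time] "METHOD /path HTTP/x.x" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) .*?"[A-Z]+ [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

ips, agents = Counter(), Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if m:
            ips[m.group(1)] += 1
            agents[m.group(2)] += 1

print("Top client IPs:")
for ip, n in ips.most_common(10):
    print(f"  {n:8d}  {ip}")

print("Top user agents:")
for ua, n in agents.most_common(10):
    print(f"  {n:8d}  {ua}")
```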
 
Upvote
40 (40 / 0)

Hacker Uno

Ars Praetorian
429
Subscriptor++
Back in the day, when the cybersec consulting company I ran had crawlers sucking up content in violation of the robots.txt file, I found that posting content which includes recursive zip and gz files would often kill crawlers as they unzipped the files and tried to crawl through those contents. We had one company threaten to sue us because we continually crashed their crawler, which was violating the robots.txt file. They were so bold as to claim that the robots.txt file is not legally enforceable and that everything on the internet was public domain that they had the right to crawl. I told them to "Sue me, and I'll own your company," and that's the last I heard from them.
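For the curious, the core trick is serving something tiny on the wire that becomes enormous when a naive crawler decompresses it. A minimal single-layer sketch in Python (not the recursive variety described above; the file name and sizes are arbitrary, and you would only want to serve something like this from paths your robots.txt already disallows):

```python
import gzip

# ~10 GiB of zeros compresses down to roughly 10 MB on disk.
CHUNK = b"\0" * (1024 * 1024)   # 1 MiB of zeros
TOTAL_CHUNKS = 10 * 1024        # ~10 GiB uncompressed

with gzip.open("decoy.gz", "wb", compresslevel=9) as out:
    for _ in range(TOTAL_CHUNKS):
        out.write(CHUNK)
```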
 
Upvote
64 (64 / 0)

JaeTLDR

Seniorius Lurkius
11
Subscriptor++
Upvote
1 (15 / -14)

Wheels Of Confusion

Ars Legatus Legionis
71,090
Subscriptor
Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating "Anubis," a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site.
... does this maybe result in covert crypto mining via the visitor's computing resources under the guise of thwarting AI scrapers?
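For context on what that proof-of-work actually is: it's a hash puzzle, not mining. The server hands the browser a random challenge and the browser grinds nonces until a SHA-256 digest falls below a difficulty target; in a scheme like this nothing mineable is produced, just spent CPU. A toy sketch of the idea (not Anubis's actual code):

```python
import hashlib
import os

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Find a nonce so sha256(challenge + nonce) has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

challenge = os.urandom(16)      # issued by the server per visitor
nonce = solve(challenge, 20)    # ~a million hashes on average at 20 bits
assert verify(challenge, nonce, 20)
```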
 
Upvote
6 (6 / 0)

_fluffy

Ars Scholae Palatinae
847
I recently decided enough was enough with one particular botnet and brought out the big guns: https://beesbuzz.biz/articles/9050-Blocking-abusive-webcrawlers

At the bottom is a little script where, if you give it an IP address you want gone, it'll generate ufw rules to block the entire netblock. My server went from constant 94% CPU usage down to around 3%, and everything works great again.
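Not the same script, but the general shape is easy to sketch. A Python version of the idea, assuming the third-party ipwhois package for the netblock lookup (the linked script does its own whois work, so treat this as an approximation):

```python
# pip install ipwhois  (assumption: uses the third-party ipwhois package, not the linked script)
import sys

from ipwhois import IPWhois

def ufw_rules_for(ip: str) -> list[str]:
    """Look up the network announcing `ip` and emit ufw rules that drop the whole block."""
    result = IPWhois(ip).lookup_rdap(depth=0)
    cidr = result.get("asn_cidr") or result["network"]["cidr"]
    # RDAP results can list several CIDRs separated by ", "; "insert 1" puts the
    # deny rule ahead of any existing allow rules.
    return [f"ufw insert 1 deny from {block}" for block in cidr.split(", ")]

if __name__ == "__main__":
    for offender in sys.argv[1:]:
        for rule in ufw_rules_for(offender):
            print(rule)
```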
 
Upvote
16 (16 / 0)

yababom

Ars Scholae Palatinae
615
I spent last summer playing whack-a-mole with my self-hosted website. It's a small site my family and I visit for photos, notes, and such. I was getting hammered with multiple requests every second; Amazon was the biggest culprit. I viewed logs and customized my robots.txt, but no dice. You can't block all their IPs, and if you set your throttling too low (100 hits a minute, say), legitimate traffic can get throttled depending on asset loads. It's crazy, but I guess that's the price if you want to self-host.

I added a page to my terms of service and started logging. It basically says my site is free for non-commercial use, but if the data is to be included in any AI, LLM, or other machine learning training data, it can be licensed for such use for USD $1,000,000.00 per page. I have hundreds of pages. I have little oddball things in there that only exist on my site and are otherwise nonsensical. If I find those in any of the big AIs, I will see if I can find one of those lawyers who work for a cut of the winnings, no money out of pocket, and see if I can make it stick.
I think you should word it as an implicit agreement: "your access of these pages indicates your acceptance of the licensing terms at the agreed rate of USD $1,000,000.00 per page."

--not a lawyer
 
Upvote
22 (22 / 0)

bthest

Smack-Fu Master, in training
40
As much as this really sucks, it is really funny to be running a forum these days with <5 active users and seeing every single thread have >10000 pageviews. The numerical absurdity of it all gives me energy.
Sounds like a business opportunity. Or have advertisers wised up to view counts?
 
Upvote
8 (8 / 0)

Arstotzka

Ars Scholae Palatinae
978
Subscriptor++
How do they do that??
They care about their users.

A lot of time and effort has been invested in that website, with a focus on true UX. It isn't hard, but it means not grabbing every JavaScript framework, optimizing the database you have instead of leaving it mostly default and letting a generic ORM spit out lowest-common-denominator queries, aggressively caching and prefetching, etc. Constant refinement.
 
Upvote
26 (26 / 0)

dwmcrobb

Smack-Fu Master, in training
12
The AI arms race has been going on for years. I have to block large swaths of IPv4 from accessing my personal web site, and I've done so for a long time due to ludicrous amounts of scraping from big tech. From my blog from about a year ago, here's what is fairly typical. Note that the scrapers, including those from big tech, are about as smart as a bag of rocks. They don't stop knocking despite being blocked for months or years. They'll fetch the same PDF multiple times in the same day despite the PDF having not changed for years (or decades).

For many sites, just blocking Facebook, Amazon, Google and Microsoft (and OpenAI) will likely get you back to sane levels of traffic.

[attached image]
 
Upvote
24 (24 / 0)

tecnomentis

Seniorius Lurkius
1
Subscriptor
I think this kind of behavior should affect the decision about the legality of freely using copyrighted material for training AI. If we required AI companies to sign deals for using copyrighted material, there would be less of this sort of traffic. Laws should be based on their effects on society, and this is proof of how giving AI companies free rein is damaging the rest of the internet.
 
Upvote
17 (17 / 0)

ghub005

Ars Tribunus Angusticlavius
8,643
I’m old enough to remember when people operated mail servers as open relays by default.

This reminds me of that time. Unfortunately a small group of people will always abuse free resources because they have no economic incentive to protect or conserve them.

In case you haven’t heard of this concept before, it’s called the Tragedy of the Commons.
 
Upvote
22 (23 / -1)

DonColeman

Smack-Fu Master, in training
54
Subscriptor
Given the impact this is having on sites' users, for those sites that are not providing a wide public service, why isn't one possible solution/mitigation to create a login system and either limit access to logged-in users or slow down/limit requests that aren't from logged-in users?

I get that there is administrative overhead in adding and managing users, and that it also makes each request quite a bit more costly... but are we not at the point where, at least in some cases, that overhead is worth it compared with the cost of the crawlers and of responding to them?
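The "slow down requests that aren't from logged-in users" half doesn't take much code. A toy Flask sketch (the route, limits, and session key are made up, and a real deployment would use something sturdier than an in-memory counter):

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"      # required for session cookies

ANON_LIMIT = 30                    # anonymous requests allowed per window
WINDOW = 60.0                      # window length in seconds
hits = defaultdict(deque)          # client IP -> recent request timestamps

@app.before_request
def throttle_anonymous():
    if session.get("user_id"):     # logged-in users skip the throttle entirely
        return
    now = time.monotonic()
    recent = hits[request.remote_addr]
    while recent and now - recent[0] > WINDOW:
        recent.popleft()
    if len(recent) >= ANON_LIMIT:
        abort(429)                 # Too Many Requests
    recent.append(now)

@app.route("/")
def index():
    return "hello"
```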
 
Upvote
6 (6 / 0)

OrvGull

Ars Legatus Legionis
10,669
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.
You might consider using a more abbreviated request log format or even turning off request logging altogether, unless you're in a jurisdiction that requires it.
 
Upvote
7 (7 / 0)

spacechannel5

Smack-Fu Master, in training
18
Shameless plug: I've been working on a library in Python that intends to be a good citizen when automating HTTP(S) requests to a website. It has support for robots.txt and sitemaps baked directly into it, so if you try to request a page that is disallowed it will raise a meaningful error. It's in alpha at the moment, but it should release into beta with more comprehensive documentation, some bugfixes, and QoL improvements very soon.

https://github.com/ethicrawl/ethicrawl
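Not ethicrawl's API (see the repo for that), but for anyone curious what the core idea looks like, here is the same behavior sketched with only the standard library: refuse to fetch anything that robots.txt disallows for your user agent.

```python
from urllib import request, robotparser

USER_AGENT = "polite-example-bot/0.1"   # hypothetical user agent string

class DisallowedError(Exception):
    """Raised when robots.txt forbids fetching the requested URL."""

_robots_cache: dict[str, robotparser.RobotFileParser] = {}

def polite_get(url: str) -> bytes:
    origin = "/".join(url.split("/", 3)[:3])        # e.g. "https://example.com"
    rp = _robots_cache.get(origin)
    if rp is None:
        rp = robotparser.RobotFileParser(origin + "/robots.txt")
        rp.read()
        _robots_cache[origin] = rp
    if not rp.can_fetch(USER_AGENT, url):
        raise DisallowedError(f"robots.txt disallows {url} for {USER_AGENT}")
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        return resp.read()
```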
 
Upvote
9 (10 / -1)

laughinghan

Seniorius Lurkius
17
Subscriptor++
The Read the Docs project reported that blocking AI crawlers immediately decreased their traffic by 75 percent, going from 800GB per day to 200GB per day. This change saved the project approximately $1,500 per month in bandwidth costs, according to their blog post "AI crawlers need to be more respectful."

Doesn’t this primarily reflect AWS's egregious bandwidth pricing? On Hetzner, even if it were all charged as overage, that extra 600GB/day would be about $20/mo; accounting for the 20TB monthly allowance included with even their cheapest $4.60/mo VPS, 800GB/day would be <$6/mo.

By the way, this also isn’t quite what the blog post says. A lot of the traffic was absorbed by their CDN; $1,500 was an estimate of the cost if it had all hit the origin server instead. A lot still made it through, though, because the crawlers hit a lot of infrequently accessed, uncached content. They don’t provide an estimate of the actual costs incurred by crawler traffic.
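To spell out the back-of-envelope math (the only new number here is a per-TB overage rate inferred from the ~$20/mo figure above, not an official price):

```python
# Rough check of the Hetzner comparison above.
extra_gb_per_day = 600
extra_tb_per_month = extra_gb_per_day * 30 / 1000    # ~18 TB/month
assumed_overage_usd_per_tb = 1.1                      # inferred, not quoted pricing

print(f"~{extra_tb_per_month:.0f} TB/month of crawler traffic")
print(f"~${extra_tb_per_month * assumed_overage_usd_per_tb:.0f}/month if all billed as overage")
```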
 
Last edited:
Upvote
6 (6 / 0)

laughinghan

Seniorius Lurkius
17
Subscriptor++
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.

Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client-side blocking script. (That's also why those users can't use other sites: they don't work with their browsers.)

Beyond a bigger VPS and sucking up the traffic, I'm not sure what else I can do (although I'll investigate ai.robots.txt, as it looks handy).
That sucks! Surely sucking it up isn’t your only option? Would a CDN help (Cloudflare, Fastly, & others have free plans)? A login system also wouldn’t necessarily require clients to support JS, just cookies normally.
 
Upvote
5 (5 / 0)
Raise your hands if you remember the benefits offered by download managers; being able to have multiple connections downloading different parts of the same file was incredible.
We still do that! It's so important, such an incredible win, that the technique is now its own separate protocol, called torrents. So the benefits are unavailable to HTTP and HTML. Facepalm.
 
Upvote
-2 (3 / -5)