AI bots hungry for data are taking down sites by accident, but humans are fighting back.
See full article...
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?
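For reference, HTTP already has a standard mechanism for "only downloading changed content": conditional requests using the ETag / Last-Modified validators a server hands out. Whether any given crawler bothers to use it is another matter. A minimal sketch in Python (the URL is just a placeholder):

```python
# Sketch of a conditional GET: revalidate a page instead of re-downloading it.
import requests

URL = "https://example.com/some/page"  # placeholder URL

# First fetch: remember the validators the server sent back.
first = requests.get(URL, timeout=30)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later revisit: ask "has this changed?" rather than pulling the whole body again.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(URL, headers=headers, timeout=30)
if second.status_code == 304:
    print("Not modified; nothing new to download.")
else:
    print(f"Changed; got {len(second.content)} bytes.")
```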
But they can just pay Donnie Trump and no one will enforce the court orders.

At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act, and winning in court. It's not even particularly far-fetched: these companies are actively evading attempts to block their traffic, and case law does support treating an organization that conducts a DDoS attack on a website as violating the law...
Because they have more money than you.

Why this isn't illegal, I don't know.
Nahhh, he'll probably just build a robot that breaks necks accidentally but doesn't die when it gets shot.

At least Jon Steinkek will be inspired by the sight of another too-readily encouraged plunder and destruction of the former ecology to mint The GTX of Wrath; widely regarded as the first great american NFT.
Hmm, I've just realized the lady who says she works for a debt collector but stumbles on and doesn't give the exact same name each time hasn't called me in a while. I'm starting to miss Miss Bergum/Bergman/Berman/Mmbermn from Asset Recovery Solutions/Systems/Services.

In addition to the "who's doing this" list: we kinda suspect that the cases where we're seeing botnet-like behaviour - small numbers of requests from big ranges of residential IP addresses - probably aren't being done by significantly-sized AI companies directly. The most plausible theory I've seen to explain those is that it may be more lucrative right now to use a botnet (or one of those scummy phone apps that does nefarious things in the background) to generate a data corpus and shop it around to AI companies who won't ask many questions about where it came from than it is to use the same botnet to send spam/phishing emails, so we suspect at least some of the kinds of people who build and buy access to botnets are doing this. It'd be fascinating to know if there's been a noticeable drop in the amount of other nefarious things being done by botnets recently.
Everyone knows it's not illegal if it's done with computers. Unless you're a university student. I guess that's why all the tech bros dropped out of college!

Handy, yes, but as I'm sure others have pointed out, it's regularly ignored.
When it comes to AIs, they don't bother asking. They execute a knockless entry and fuck with everything inside, looking for whatever they can grab.
Why this isn't illegal, I don't know.
You say that, but they could be doing a "thousand flowers" style campaign - a year from now we're all in an El Salvadoran prison and they're making us watch TCL films Clockwork Orange style.

Arguing from the anthropic principle, it's the worst timeline ever that still allows enough civilization for us to complain about the timeline on a site like Ars.
Do you suggest reforming capitalism, or something else entirely?

Isn't capitalism beautiful?? Only those fucking commies disagree with this.
I got the same when updating an MX install this weekend. Kbps speeds from the repos. I live in the US. I hadn't thought to connect it with AI trawlers.

I was setting up a bunch of Linux (Mint) desktops this weekend and I swear I was getting the most ridiculously low download speeds from the repos I was downloading from. GitHub was also slow.

I was getting between 2 and 4 Mbps, meanwhile I could download from Steam at around 70 Mbps.

It seems like a trend recently; I've seen this on 3 different internet connections. Either this is some weird peering issue (I live in South Africa) or I'm competing for bandwidth with these damn bots...
I think you have it the other way around. Give the bots a minified JS bundle and a boilerplate HTML file and let them sort it out.

I've been wondering whether the solution is to present code to the bot/user that only a bot would know how to handle. So if someone's just doing straight HTML pulls with basic linkbacks from other HTML pages, let it through. If they're actually loading the JavaScript tags, add in a delay in the response (so legitimate users on newer browsers will get access, but it'll have a performance hit). And anything that fits a known AI bot profile... it gets sent to Cloudflare's maze or similar.
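A rough sketch of the quoted idea, assuming a small Flask app: clients that run a tiny JavaScript snippet get a cookie and the real page, clients that never run it pay a latency tax (or could be shunted off to a maze/tarpit instead). The cookie name and delay are illustrative, and any bot that executes JavaScript walks straight through this; it only filters the lazier crawlers.

```python
import time

from flask import Flask, request

app = Flask(__name__)

CHALLENGE_PAGE = """<!doctype html>
<script>
  /* Real browsers run this, set the cookie, and reload to the actual page. */
  document.cookie = "js_ok=1; path=/; max-age=86400";
  location.reload();
</script>
<noscript>This site asks your browser to run a small script before loading.</noscript>
"""


@app.route("/")
def index():
    if request.cookies.get("js_ok") != "1":
        time.sleep(3)  # non-JS clients get a delayed challenge page
        return CHALLENGE_PAGE
    return "Actual page content"
```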
I put together a robots.txt file using Forgejo's own as a starting point. Fortunately it cut down on things a fair bit, because the main motivator for my doing this was that I actually needed to read the access log to debug a problem with some client software accessing another one of my domains. The constant flood of AI bot requests was making it impossible. Of course, a lot of the bots that fetch robots.txt
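For anyone who hasn't done this, the general shape of such a file is below. The user-agent strings are a few of the published AI crawler names (the ai.robots.txt project mentioned later in the thread maintains a much longer list), and of course this only helps against bots that honor robots.txt at all:

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
```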
are not actually well-behaved these days.

I watched a video about how fast the McMaster-Carr site is, and some of the tricks they use:

I will still never forget the day that Safari for Windows released; the web was never faster than it was then. Page loads on pretty much all sites were functionally instant (like McMaster-Carr today - seriously, give that site a try, it is mindbogglingly fast)...
Other people have suggested CDNs like Cloudflare, but they might have settings/opinions about security that mean people running ancient browsers won't be able to connect.

I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers fill up the disc with logs more rapidly than the system can rotate them. It's a tiny VPS and a few GB of storage was previously not a problem.
Unfortunately it's in the awkward position where some of its users are visiting with archaic browsers and can't run any JavaScript at all, let alone any client side blocking script. (That's also why those users can't use other sites, because they don't work with their browsers)
Beyond a bigger VPS and sucking up the traffic I'm not sure what else I can do. (although I'll investigate ai.robots.txt as it looks handy)
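One cheap mitigation for the log problem specifically (separate from blocking the bots) is size-triggered log rotation. A sketch of a logrotate rule, assuming an nginx access log and that logrotate itself is run often enough (e.g. hourly from cron); the path and numbers are illustrative:

```
/var/log/nginx/access.log {
    # rotate as soon as the file reaches 50 MB instead of waiting a full day
    size 50M
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # ask nginx to reopen its log files after rotation
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}
```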
I don't understand why OpenAI would re-download entire web archives multiple times a day. Can't they just download it once and train their AI on that copy? Why download the same data over and over again? Or have a smarter way of only downloading changed content?

The article notes one possible reason:

This pattern suggests ongoing data collection rather than one-time training exercises, potentially indicating that companies are using these crawls to keep their models' knowledge current.
That's already being done. Hasn't solved the larger problem.

What about, when you detect an AI agent, giving them some nonsensical content, thereby poisoning the well? Or making them run a little snippet of a program that you know leads to an infinite loop?
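A minimal sketch of the tarpit/poisoning idea (not a hardened defense, and not how any particular product implements it): clients whose User-Agent matches known AI crawlers get an endless, slowly streamed page of junk text and self-referencing links, while everyone else gets the real content. The bot names and timing here are illustrative only.

```python
import random
import time

from flask import Flask, Response, request

app = Flask(__name__)

# Illustrative list; real deployments match against published crawler UA strings.
AI_BOT_PATTERNS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def looks_like_ai_bot(user_agent: str) -> bool:
    return any(pattern.lower() in user_agent.lower() for pattern in AI_BOT_PATTERNS)


def junk_pages():
    """Yield an endless drip of meaningless HTML with links back into the maze."""
    page = 0
    while True:
        page += 1
        text = " ".join(random.choices(WORDS, k=50))
        yield f'<p>{text}</p><a href="/maze/{page}">more</a>\n'
        time.sleep(2)  # slow the crawler down; it pays the cost, not you


@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def serve(path):
    if looks_like_ai_bot(request.headers.get("User-Agent", "")):
        return Response(junk_pages(), mimetype="text/html")
    return f"Real content for {path or 'the home page'}"
```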
We're trying, but it takes longer to defrost them out of the vats than you might like.

If this were a William Gibson novel someone would have hired actual ninja assassins to fix this problem by now.
It is an option. Sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. That's also the most common objection to using Cloudflare's service; once you involve Cloudflare in your hosting you're effectively giving up on the internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

Given the impact this is having on sites' users, for those sites that are not providing a wide public service, why isn't one possible solution/mitigation to create a login system and either limit access to logged-in users, or slow down/limit requests not from logged-in users?

I get that there is an administrative overhead of adding/managing users and it also makes each request quite a bit more costly... but are we not at the point where the crawlers, and the administrative overhead of responding to them, make it worth it, at least in some cases?
There are other cultural reasons balkanization of the Internet was inevitable even with decentralized protocols.

It is an option. Sysadmins have been reluctant to do it so far because it goes against the philosophy of the open web. That's also the most common objection to using Cloudflare's service; once you involve Cloudflare in your hosting you're effectively giving up on the internet as it was designed, and substantially compromising the privacy of yourself and your users (though it is very effective).

It's likely to happen in some cases anyway, though, because at this point we have no good options. You just have to pick a bad one, whether that's requiring logins, using some kind of trap/tarpit thing (self-hosted, like Anubis), or just giving up and handing your site to a third-party operator like Cloudflare.
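A rough sketch of the "slow down or limit requests that aren't from logged-in users" idea, assuming a Flask app and an in-memory counter; a real deployment would use proper session auth and shared storage like Redis, and the limits here are arbitrary.

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request, session

app = Flask(__name__)
app.secret_key = "change-me"  # hypothetical; use a real secret in practice

ANON_LIMIT = 30      # max anonymous requests per window, per client IP
ANON_WINDOW = 60.0   # window length in seconds
recent = defaultdict(deque)


@app.before_request
def throttle_anonymous():
    if session.get("user"):          # logged-in users are not throttled
        return
    now = time.monotonic()
    q = recent[request.remote_addr]
    while q and now - q[0] > ANON_WINDOW:
        q.popleft()                  # drop requests outside the window
    if len(q) >= ANON_LIMIT:
        abort(429)                   # Too Many Requests
    q.append(now)


@app.route("/")
def index():
    return "Hello"
```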
If they’re only making outgoing requests it probably won’t do much. Feed them plausible junk can work and now Cloudflare can do it for you.Honestly sounds like we need something like Blackwall at this point. AI crawler hits your site? Hit back harder and force them offline.
I want it, asked for it, but still believe robots.txt should be obeyed. I would like to see that codified in law or enforced with the laws already on the books. It's accessing a computer without permission — after being explicitly denied even.

All of this in order to bring us something no one wants nor asked for.
Raise your hands if you remember the incredible benefits offered by download managers; being able to have multiple connections downloading different parts of the same file was incredible.
I will still never forget the day that Safari for Windows released; the web was never faster than it was then. Page loads on pretty much all sites were functionally instant (like McMaster-Carr today - seriously, give that site a try, it is mindbogglingly fast). That is, for a few months, until Chrome released with about the same performance and web developers were free to add bloat, slowing everything down again.
You pay for what you get, but then again you also get what you pay for. And their delivery is top-notch: Amazon just can't compete with them.

Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...
And it will be dark times indeed if and when they ever change their style of art.

You pay for what you get, but then again you also get what you pay for. And their delivery is top-notch: Amazon just can't compete with them.
That reminds me of the line:

Having thoroughly ravaged the natural world for anything of profit, transnational corporations, backed by billionaires looking for even larger and more fashionable hoards of wealth, set their eyes on the digital commons, hellbent on squeezing all value from society before it collapsed. Thus ended the golden age of human access to both the natural and virtual world.
The physical McMaster-Carr catalog, which they still print, is one of Adam Savage's (of MythBusters fame) favorite references:

Arrrgh! Curse you for showing me this McMaster-Carr website. I can wander aimlessly in it for hours...
No and no. (For starters, mining Bitcoin isn't actual work and does more harm than good, but there's a wider categorical reason beyond that specific example.)

Could the 'proof of work' solution be changed to actual work? E.g. mining Bitcoin, or searching for SETI.

Then the companies that are affected could make a little money or manpower off of each request, but for the individual user it's hopefully a small extra step.

The thought being that the cost would then be put back more on the AI companies.

I have no knowledge in this sector so this might simply be impossible.
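For context, here is the general shape of the hash-based proof-of-work challenge that tools like Anubis rely on (a sketch of the idea only, not Anubis's actual implementation): the server issues a random challenge, the client must find a nonce whose SHA-256 hash has enough leading zero bits, and the server can verify the answer with a single hash. The cost is negligible for one human visitor but adds up for a crawler hitting millions of pages; the work is deliberately useless so there is no incentive to farm it.

```python
import hashlib
import os

DIFFICULTY_BITS = 18  # tune so a browser solves it in well under a second


def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: bytes) -> int:
    """What the client does: brute-force a nonce. Costs CPU time."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1


def verify(challenge: bytes, nonce: int) -> bool:
    """What the server does: a single hash, practically free."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS


if __name__ == "__main__":
    challenge = os.urandom(16)
    nonce = solve(challenge)
    print("solved with nonce", nonce, "valid:", verify(challenge, nonce))
```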