On Tuesday, the Wikimedia Foundation announced that relentless AI scraping is putting strain on Wikipedia’s servers. Automated bots seeking training data for large language models (LLMs) have been vacuuming up terabytes of data, driving the bandwidth used to download multimedia content up by 50 percent since January 2024. It’s a scenario familiar across the free and open source software (FOSS) community, as we’ve previously detailed.
The Foundation hosts not only Wikipedia but also platforms like Wikimedia Commons, which offers 144 million media files under open licenses. For decades, this content has powered everything from search results to school projects. But since early 2024, AI companies have dramatically increased automated scraping through direct crawling, APIs, and bulk downloads to feed data-hungry models. This exponential growth in non-human traffic has imposed steep technical and financial costs, often without the attribution that helps sustain Wikimedia’s volunteer ecosystem.
The impact isn’t theoretical. The foundation says that when former US President Jimmy Carter died in December 2024, his Wikipedia page predictably drew millions of views. But the real stress came when users simultaneously streamed a 1.5-hour video of a 1980 debate from Wikimedia Commons. The surge doubled Wikimedia’s normal network traffic, temporarily maxing out several of its Internet connections. Wikimedia engineers quickly rerouted traffic to reduce congestion, but the event revealed a deeper problem: The baseline bandwidth had already been consumed largely by bots scraping media at scale.
This behavior is increasingly familiar across the FOSS world. Fedora’s Pagure repository blocked all traffic from Brazil after similar scraping incidents we’ve covered. GNOME’s GitLab instance implemented proof-of-work challenges to filter excessive bot access. Read the Docs dramatically cut its bandwidth costs after blocking AI crawlers.
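To give a sense of how a proof-of-work gate raises the price of bulk scraping, here is a minimal hashcash-style sketch in Python. It illustrates the general technique only, not GNOME’s actual implementation; the difficulty setting and function names are assumptions made for the example.

```python
import hashlib
import secrets

DIFFICULTY = 20  # assumed: require 20 leading zero bits, ~1M hash attempts on average


def issue_challenge() -> str:
    """Server side: hand each new visitor a random challenge string."""
    return secrets.token_hex(16)


def is_valid(challenge: str, nonce: int) -> bool:
    """Check whether the hash of challenge:nonce clears the difficulty bar."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    # Treat the digest as a 256-bit integer; the top DIFFICULTY bits must be zero.
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce. A second or so of CPU is trivial for one
    human visitor but adds up quickly for a crawler issuing thousands of requests."""
    nonce = 0
    while not is_valid(challenge, nonce):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = issue_challenge()
    nonce = solve(challenge)
    print(f"nonce={nonce}, valid={is_valid(challenge, nonce)}")
```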
Wikimedia’s internal data explains why this kind of traffic is so costly for open projects. Unlike humans, who tend to view popular and frequently cached articles, bots crawl obscure and less-accessed pages, forcing Wikimedia’s core datacenters to serve them directly. Caching systems designed for predictable, human browsing behavior don’t work when bots are reading the entire archive indiscriminately.
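A toy simulation makes the caching argument concrete. The catalog size, cache size, and LRU eviction policy below are illustrative assumptions rather than Wikimedia’s actual architecture: a skewed, human-like access pattern hits the cache almost every time, while an indiscriminate bot sweep almost always misses and falls through to the origin servers.

```python
import random
from collections import OrderedDict

ARTICLES = 1_000_000   # assumed catalog size
CACHE_SIZE = 10_000    # cache holds roughly 1% of the catalog
REQUESTS = 200_000     # length of each simulated access stream


def hit_rate(stream) -> float:
    """Fraction of requests served by a simple LRU cache."""
    cache, hits = OrderedDict(), 0
    for article in stream:
        if article in cache:
            hits += 1
            cache.move_to_end(article)      # mark as most recently used
        else:
            cache[article] = True
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / REQUESTS


# Human-like traffic: heavily skewed toward a small set of popular pages.
human = (int(random.paretovariate(1.2)) % ARTICLES for _ in range(REQUESTS))
# Bot-like traffic: an indiscriminate sweep across the whole catalog.
bot = (random.randrange(ARTICLES) for _ in range(REQUESTS))

print(f"human-like hit rate: {hit_rate(human):.0%}")  # close to 100%
print(f"bot-like hit rate:   {hit_rate(bot):.0%}")    # roughly 1%
```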
As a result, Wikimedia found that bots account for 65 percent of the most expensive requests to its core infrastructure despite making up just 35 percent of total pageviews. This asymmetry is a key technical insight: serving a bot request costs far more, on average, than serving a human one, and the difference adds up fast.
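A bit of back-of-the-envelope arithmetic on those two published percentages shows how lopsided the per-request cost is; the two shares are the only inputs, and everything else is derived:

```python
# Figures published by the Wikimedia Foundation.
bot_share_of_expensive_requests = 0.65
bot_share_of_pageviews = 0.35

# Expensive requests generated per unit of pageview share, bots vs. humans.
bot_rate = bot_share_of_expensive_requests / bot_share_of_pageviews                 # ~1.86
human_rate = (1 - bot_share_of_expensive_requests) / (1 - bot_share_of_pageviews)   # ~0.54

print(f"A bot pageview is ~{bot_rate / human_rate:.1f}x more likely "
      f"to reach the expensive core infrastructure than a human pageview.")
# Prints roughly 3.4x.
```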