How crawlers impact the operations of the Wikimedia projects – Diff
Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.
The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: we are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to driving new users to participate in the movement, and it is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.
A view behind the scenes: The Jimmy Carter case
When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter’s 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about one hour a small number of Wikimedia’s connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were swiftly able to address this by changing the paths our internet connections go through to reduce the congestion. But still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?
Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons catalog of openly licensed images to feed AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.
The graph below shows that the base bandwidth demand for multimedia content has been growing steadily since early 2024 – and there’s no sign of this slowing down. This increase in baseline usage means that we have less room to accommodate exceptional events when a traffic surge might occur: a significant amount of our time and resources goes into responding to non-human traffic.
Multimedia bandwidth demand for the Wikimedia Projects.
65% of our most expensive traffic comes from bots
The Wikimedia Foundation serves content to its users through a global network of datacenters. This enables us to provide a faster, more seamless experience for readers around the world. When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn’t been requested in a while, its content needs to be served from the core datacenter. The request then “travels” all the way from the user’s location to the core datacenter, which looks up the requested page and serves it back to the user, while the page is also cached in the regional datacenter for any subsequent user.
While human readers tend to focus on specific – often similar – topics, crawler bots tend to “bulk read” larger numbers of pages and also visit the less popular ones. This means these types of requests are more likely to get forwarded to the core datacenter, which makes them much more expensive in terms of the resources they consume.
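To make the caching behaviour described above concrete, here is a minimal sketch of a regional cache in front of a core datacenter. It is a toy model, not a description of Wikimedia’s actual caching layer: the `RegionalCache` class, the least-recently-used eviction policy, the catalog and cache sizes, and the two request patterns (popularity falling off as 1/rank for humans, a single pass over every page for a crawler) are all assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


class RegionalCache:
    """Toy LRU cache standing in for a regional datacenter (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.core_fetches = 0  # requests forwarded to the core datacenter

    def request(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # cache hit: served regionally
            return
        self.core_fetches += 1  # cache miss: fetched from the core datacenter
        self.pages[page_id] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page


def miss_rate(requests, capacity):
    """Fraction of requests that had to be forwarded to the core datacenter."""
    cache = RegionalCache(capacity)
    for page_id in requests:
        cache.request(page_id)
    return cache.core_fetches / len(requests)


NUM_PAGES = 10_000     # assumed catalog size
CACHE_CAPACITY = 500   # assumed regional cache size
NUM_REQUESTS = 10_000  # assumed number of human requests

rng = random.Random(0)
# Assumed human pattern: a few popular pages receive most of the requests.
weights = [1 / rank for rank in range(1, NUM_PAGES + 1)]
human_requests = rng.choices(range(NUM_PAGES), weights=weights, k=NUM_REQUESTS)
# Assumed crawler pattern: one pass over every page, popular or not.
crawler_requests = list(range(NUM_PAGES))

human_miss = miss_rate(human_requests, CACHE_CAPACITY)
crawler_miss = miss_rate(crawler_requests, CACHE_CAPACITY)
print(f"human miss rate:   {human_miss:.2f}")
print(f"crawler miss rate: {crawler_miss:.2f}")
```

In this toy setup every crawler request misses the cache, because no page is requested twice, while repeated human requests for popular pages are served regionally. Real traffic is messier, but the direction of the effect is the one the post describes.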
While undergoing a migration of our systems, we noticed that only a fraction of the expensive traffic hitting our core datacenters was behaving the way web browsers usually do, interpreting JavaScript code. When we took a closer look, we found that at least 65% of this resource-consuming traffic to the website comes from bots, a disproportionate amount given that bots account for only about 35% of overall pageviews. This high usage is also causing constant disruption for our Site Reliability team, which has to block overwhelming traffic from such crawlers before it causes issues for our readers.
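A rough way to read these two percentages together is to ask how much more likely a bot pageview is to be expensive than a human one. The sketch below does that arithmetic. It assumes that “expensive traffic” and “pageviews” are counted in comparable units, which the post does not state, and the function name `relative_expense_ratio` is made up for illustration.

```python
def relative_expense_ratio(bot_share_expensive, bot_share_pageviews):
    """How many times likelier a bot pageview is to be expensive than a human one.

    Assumes both shares are fractions of comparable request counts.
    """
    bot_rate = bot_share_expensive / bot_share_pageviews
    human_rate = (1 - bot_share_expensive) / (1 - bot_share_pageviews)
    return bot_rate / human_rate


# Figures from the post: at least 65% of expensive traffic, about 35% of pageviews.
ratio = relative_expense_ratio(0.65, 0.35)
print(f"{ratio:.1f}")  # prints 3.4
```

Under that assumption, a bot pageview is roughly three to four times as likely as a human pageview to reach the core datacenters. Since the post says “at least 65%”, the real ratio may be higher.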
Wikimedia is not alone in facing this challenge. As noted in our 2025 global trends report, technology companies are racing to scrape websites for human-created and verified information. Content publishers, open source projects, and websites of all kinds report similar issues. Moreover, crawlers tend to access any URL. Within the Wikimedia infrastructure, we are observing scraping not only of the Wikimedia projects, but also of key systems in our developer infrastructure, such as our code review platform or our bug tracker. All of that consumes time and resources that we need to support the Wikimedia projects, contributors, and readers.
Our content is free, our infrastructure is not: Establishing responsible use of infrastructure
Delivering trustworthy content also means supporting a “knowledge as a service” model, where we acknowledge that the whole internet draws on Wikimedia content. But this has to happen in ways that are sustainable for us: How can we continue to enable our community, while also putting boundaries around automatic content consumption? How might we funnel developers and reusers into preferred, supported channels of access? What guidance do we need to incentivise responsible content reuse?
We have started to work towards addressing these questions systemically, and have set a major focus on establishing sustainable ways for developers and reusers to access knowledge content in the Foundation’s upcoming fiscal year. You can read more in our draft annual plan: WE5: Responsible Use of Infrastructure. Our content is free, our infrastructure is not: We need to act now to re-establish a healthy balance, so we can dedicate our engineering resources to supporting and prioritizing the Wikimedia projects, our contributors and human access to knowledge.
Edit 1 April 2026: There is now a follow-up post, one year later: Quo Vadis, Crawlers? Progress and what’s next on safeguarding our infrastructure
Photo credits
1024px-A_millipede_insect
Treysam
CC BY-SA 4.0
Multimedia_bandwith_demand_for_the_Wikimedia_Projects
Chris Danis
CC BY 4.0