Wikipedia Workload Analysis for Decentralized Hosting
Authors:
Guido Urdaneta
Guillaume Pierre
Maarten van Steen
Source:
Elsevier Computer Networks
53(11), pp. 1830-1845, July 2009.
Abstract
We study an access trace containing a sample of Wikipedia's
traffic over a 107-day period, aiming to identify appropriate
replication and distribution strategies in a fully decentralized
hosting environment. We perform a global analysis of the whole
trace and a detailed analysis of the requests directed to the
English edition of Wikipedia. In our study, we classify client
requests and examine aspects such as the number of read and save
operations, significant load variations, and requests for
nonexistent pages. We also review proposed decentralized wiki
architectures and discuss how they would handle Wikipedia's
workload. We conclude that decentralized architectures must
focus on techniques for handling read operations efficiently
while maintaining consistency and dealing with issues typical
of decentralized systems, such as churn, unbalanced loads and
malicious participating nodes.
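As a concrete illustration of the request classification mentioned in the abstract, the following minimal Python sketch tallies read and save operations in a trace file. The file name, the whitespace-separated field layout (sequence number, timestamp, URL, save flag), and the "save" flag value are assumptions modeled loosely on the publicly released WikiBench-style traces, not necessarily the exact format analyzed in the paper.

    # Hedged sketch: classify trace lines into read/save/other requests.
    # The field layout below is an assumption; adapt the parsing to the
    # actual format of the trace you download.
    from collections import Counter
    from urllib.parse import urlparse

    def classify(line: str) -> str:
        """Return a coarse request class for one trace line (assumed format)."""
        parts = line.split()
        if len(parts) < 4:
            return "malformed"
        url, flag = parts[2], parts[3]
        if flag == "save":
            return "save"              # a page edit was submitted
        if urlparse(url).path.startswith("/wiki/"):
            return "read"              # a regular page view
        return "other"                 # images, API calls, etc.

    counts = Counter()
    with open("wikipedia-trace.txt") as trace:   # hypothetical file name
        for line in trace:
            counts[classify(line)] += 1
    print(counts)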
The article preprint, in PDF (803,873 bytes).
Attention: some numbers in the original article are unfortunately
wrong. Please read the corrigendum for accurate information.
The article at ScienceDirect (registration may be necessary).
The access trace used for this article is freely available online.
See also: the Decentralized Wikipedia project.
Bibtex Entry
@Article{urdaneta2009wikipedia,
  author  = {Guido Urdaneta and Guillaume Pierre and Maarten van Steen},
  title   = {Wikipedia Workload Analysis for Decentralized Hosting},
  journal = {Elsevier Computer Networks},
  volume  = {53},
  number  = {11},
  pages   = {1830--1845},
  month   = {July},
  year    = {2009},
  note    = {\url{http://www.globule.org/publi/WWADH_comnet2009.html}}
}