Analytics/Archive/Data/Pagecounts-raw
See also the pageviews API, available since the end of 2015, and other sources of pageview data.
This page contains historical information. It may be outdated or unreliable.
NOTE: This dataset has been deprecated since 2016-08-01; see this thread.
pagecounts-raw holds the desktop sites' pageview data (separately for every page) for the timespan from 2007 to 2016, in the same format that webstatscollector used to emit and based on the same pageview definition (which differs from the newer definition introduced in 2014/15).
The dataset also contains the "projectcounts" aggregate pageview data, counting traffic to an entire project (e.g. English Wikipedia or Romanian Wikivoyage), which, in contrast to this page-level data, includes mobile views.
This stream is owned by the Analytics Engineering Team.
For the timespan from September 2014 on, it is recommended to use pagecounts-all-sites instead, which includes mobile views.
For the timespan from May 2015 on, the pageviews dataset is recommended, which is based on an improved pageview definition (and also includes mobile views).
Contained data
The dataset consists of files with names
${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000
The pagecounts or pageviews files are gzipped text files holding hourly per-page aggregates of pageviews, and the projectcounts or projectviews files are plain text files holding hourly per-domain-name aggregates at the project level. The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
The time used in the filename is in the UTC timezone and refers to the end of the aggregation period, not the beginning.
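As a rough illustration of the naming scheme and the end-of-period convention, the following Python sketch maps a file name to the hour it covers and back. The function names aggregation_period and relative_path are made up for this example, and the sketch assumes names follow the pattern above exactly.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches names such as "pagecounts-20140901-130000.gz" or
# "projectcounts-20140901-130000" (pattern taken from the naming scheme above).
_NAME_RE = re.compile(r"(pagecounts|projectcounts)-(\d{8})-(\d{2})0000(?:\.gz)?$")


def aggregation_period(file_name):
    """Return (kind, start, end) for an hourly file name; times are in UTC."""
    match = _NAME_RE.search(file_name)
    if match is None:
        raise ValueError(f"not an hourly pagecounts/projectcounts name: {file_name!r}")
    kind, day, hour = match.groups()
    end = datetime.strptime(day + hour, "%Y%m%d%H").replace(tzinfo=timezone.utc)
    # The timestamp in the name marks the END of the aggregated hour.
    return kind, end - timedelta(hours=1), end


def relative_path(kind, end):
    """Build the relative path for the hour ending at `end` (a UTC datetime)."""
    suffix = ".gz" if kind == "pagecounts" else ""
    return f"{end:%Y}/{end:%Y-%m}/{kind}-{end:%Y%m%d}-{end:%H}0000{suffix}"
```

Under this reading, a file stamped 20140901-000000 covers the last hour of 2014-08-31 and is stored under 2014/2014-09/.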
Both page and project files are made up of lines having 4 space-separated fields:
domain_code page_title count_views total_response_size
Field name
Description
domain_code
Domain name of the request, abbreviated.
The domain coding scheme in pageviews, pagecounts-all-sites, and pagecounts-raw is kept compatible on purpose, thus retaining quirks and inconsistencies in the coding scheme (and perhaps adding to the confusion with newly added complexity). Our apologies if the scheme looks a bit complex (it is), but the codes are unambiguous and are primarily meant for machine reading.
Common trailing parts in the domain name have been abbreviated. The main inconsistency is that project 'wikipedia.org' doesn't add a suffix for the project name, whereas 'wikibooks.org' adds '.b', 'wiktionary.org' adds '.d', etc. (the original scheme predates Wikimedia's mobile site).
domain_code can now also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the domain name (just like with the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org". (Again, project Wikipedia is not coded in the abbreviation: 'en' stands for "en.wikipedia.org", and 'en.m' stands for "en.m.wikipedia.org".)
Domain trailing part | Coded as | Database name
.wikipedia.org | (no suffix) | *wiki (be careful about the other non-wikipedia sites using this, however)
.wikibooks.org | .b | *wikibooks
.wiktionary.org | .d | *wiktionary
.wikimediafoundation.org | .f | foundationwiki
.wikimedia.org | .m | Only the following domains are considered: commons.wikimedia.org (commonswiki), meta.wikimedia.org (metawiki), incubator.wikimedia.org (incubatorwiki), species.wikimedia.org (specieswiki), strategy.wikimedia.org (strategywiki), outreach.wikimedia.org (outreachwiki), usability.wikimedia.org (usabilitywiki), quality.wikimedia.org (qualitywiki)
.m.${WHITELISTED_PROJECT}.org | .mw | (See explanation below)
.wikinews.org | .n | *wikinews
.wikiquote.org | .q | *wikiquote
.wikisource.org | .s | *wikisource
.wikiversity.org | .v | *wikiversity
.wikivoyage.org | .voy | *wikivoyage
.mediawiki.org | .w | mediawikiwiki
.wikidata.org | .wd | wikidatawiki
page_title
For page-level files, it holds the title of the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). The page title may also be extracted from the title or page query parameters, e.g. /w/index.php?title=Main+Page. The title will be URL-decoded and will be formatted as a canonical DBkey, with spaces replaced by underscores.
For project-level files or when the title cannot be extracted, it is "-".
count_views
The number of times this page has been viewed in the respective hour.
total_response_size
The total response size caused by the requests for this page in the respective hour. This is a sum over field #7 of the Cache log format fields.
So for example a line
en Main_Page 42 50043
means 42 requests to "en.wikipedia.org/wiki/Main_Page", which
accounted in total for 50043 response bytes. And
de.m.voy Berlin 176 314159
would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin",
which accounted in total for 314159 response bytes.
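A minimal Python sketch of reading such a line follows. The PageCount type and the parse_line function are names invented for this example, and rejecting lines that do not have exactly four fields is a choice made here, not something the dataset guarantees.

```python
from typing import NamedTuple


class PageCount(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line):
    """Split one line into its four space-separated fields."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        # Assumption of this sketch: anything else is treated as malformed.
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    domain_code, page_title, views, size = parts
    return PageCount(domain_code, page_title, int(views), int(size))
```

For instance, parse_line("en Main_Page 42 50043") yields a record whose count_views is 42 and whose total_response_size is 50043.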
Each domain_code and page_title pair occurs at most once.
The file is sorted by domain_code and page_title.
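The domain_code abbreviations described earlier can be expanded back into full domain names. The following Python sketch is one possible reading of the coding table, not an official decoder: expand_domain_code is a made-up name, and the rule that a trailing ".m" means wikimedia.org only for the listed sites (and "mobile Wikipedia" otherwise) is an assumption derived from the table and the 'en.m' example.

```python
# Suffix codes taken from the coding table; ".mw" is handled separately.
PROJECT_SUFFIXES = {
    "b": "wikibooks.org", "d": "wiktionary.org", "f": "wikimediafoundation.org",
    "m": "wikimedia.org", "n": "wikinews.org", "q": "wikiquote.org",
    "s": "wikisource.org", "v": "wikiversity.org", "voy": "wikivoyage.org",
    "w": "mediawiki.org", "wd": "wikidata.org",
}
# Sites for which a trailing ".m" is read as wikimedia.org rather than "mobile".
WIKIMEDIA_SITES = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}


def expand_domain_code(code):
    """Expand e.g. 'de.m.voy' to 'de.m.wikivoyage.org'; return None for .mw aggregates."""
    first, *rest = code.split(".")
    if rest and rest[-1] == "mw":
        return None  # cross-project mobile aggregate, no single domain
    project = "wikipedia.org"  # Wikipedia has no suffix of its own
    if rest:
        last = rest[-1]
        if last == "m":
            # Assumption: ".m" is the wikimedia.org suffix only for the listed sites;
            # otherwise it is the mobile marker of a Wikipedia domain.
            if first in WIKIMEDIA_SITES:
                project, rest = PROJECT_SUFFIXES["m"], rest[:-1]
        elif last in PROJECT_SUFFIXES:
            project, rest = PROJECT_SUFFIXES[last], rest[:-1]
    # What remains may only be the mobile or zero marker.
    if rest not in ([], ["m"], ["zero"]):
        raise ValueError(f"unrecognized domain_code: {code!r}")
    return ".".join([first, *rest, project])
```

With this sketch, 'en' expands to en.wikipedia.org, 'en.m.v' to en.m.wikiversity.org, and 'commons.m' to commons.wikimedia.org.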
Data not included
This dataset does not contain per-language or per-title counts for a project's mobile site. See pagecounts-all-sites if you need them.
(note: this line should be moved from template to parent page)
So pagecounts-raw does not contain counts for mobile or zero sites. Use file version pagecounts-all-sites if you need them.
Aggregation for .mw
Note: anomaly retained for backward compatibility! These lines better belong in project-level files. Best to ignore .mw lines.
The .mw abbreviation aggregates the mobile sites across all projects per language. The page_title gets set to the language used.
So consider a given hour only sees the following requests:
(and assuming each request accounted for 100 bytes), the hour's page-level file would consist only of the line
en.mw en 3 300
The corresponding project-level file would be
en.mw - 3 300
So while the .mw abbreviation counts the mobile site, it throws Wikipedia, Wiktionary and the other projects into the same bucket. It also does not distinguish between page_titles.
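Since the advice is to ignore .mw lines, a consumer might filter them out while reading a file. The Python sketch below shows one way to do that; is_mw_aggregate and without_mw are hypothetical helper names used only for this example.

```python
def is_mw_aggregate(line):
    """True if the line's domain_code ends in '.mw' (cross-project mobile aggregate)."""
    domain_code = line.split(" ", 1)[0]
    return domain_code.endswith(".mw")


def without_mw(lines):
    """Yield only the lines that are not .mw aggregates."""
    for line in lines:
        if not is_mw_aggregate(line):
            yield line
```

The check looks only at the first field, so a page title that happens to end in ".mw" is not affected.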
Availability
dumps.wikimedia.org
The stream is available unsampled as gzipped hourly files from dumps.wikimedia.org.
The date in the file name refers to the end of the capturing period, not the beginning.
stat1004 and stat1007
Data from 2007 to 2016 is available as hourly files at /mnt/hdfs/wmf/data/archive/pagecounts-raw/ on stat1004.eqiad.wmnet. Also, the folder /mnt/data/pagecounts/incoming on stat1007 has hourly files with data from 2015 and 2016.
The date in the file name refers to the end of the capturing period, not the beginning.
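Once an hourly pagecounts file has been obtained from one of these locations, it can be streamed without unpacking it first. The Python sketch below looks up the view count for a single page; views_for_page is a made-up name and the path passed in is whatever local copy the reader has. It relies on the statement above that each domain_code and page_title pair occurs at most once per file.

```python
import gzip


def views_for_page(path, domain_code, page_title):
    """Return count_views for one (domain_code, page_title) pair, or 0 if absent."""
    wanted = f"{domain_code} {page_title} "
    # errors="replace" is a defensive choice of this sketch for undecodable bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith(wanted):
                # Fields: domain_code page_title count_views total_response_size
                return int(line.split(" ")[2])
    return 0
```

Stopping at the first match is safe only under the at-most-once assumption; summing over all matching lines would be the more cautious variant.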
There is also a Hive table called projectcounts_raw with data from 2007 to 2016 that may be related.
pagecounts-ez
Adapted from a post on Analytics-l, February 2018:
Another option is to download the data in lossless compressed form (see also Analytics/Data Lake/Traffic/Pagecounts-ez). The format is clever and doesn't lose granularity, and downloading it should be a lot quicker than pagecounts-raw (this is basically what stats.grok.se did with the data as well, so downloading this way should be equivalent).
Toolforge
Adapted from a post on Analytics-l, February 2018:
You can also work on Toolforge, a virtual cloud that's on the same network as the data, so getting the data is a lot faster and you can use our compute resources (free, of course): Portal:Toolforge (IRC support: #wikimedia-cloud). See also PAWS.
Events and known problems since 2014-03-01
Date from | Date until | Bug | Details
2014-09-02 ~16:19 | bug 70140 | HTTPS traffic from ulsfo gets counted twice.
2014-04-17 | 2014-07-07 | bug 67456 | Logs from SSL endpoints were not fed into webstatscollector, hence SSL traffic has not been counted by webstatscollector.
2014-07-07 ~16:25 | 2014-09-02 ~16:19 | bug 70295 | Requests to Special:CentralAutoLogin/* have been counted.
2014-07-08 19:00 | 2014-07-08 22:00 | bug 67694 | 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~10% packet loss during this period of time.
2014-07-13 19:00 | 2014-07-13 23:00 | bug 67694 | 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~25% packet loss during this period of time.
2014-07-29 01:35 | 2014-07-29 01:42 | bug 68796 | Most of esams missing between 2014-07-29T01:35:45 and 2014-07-29T01:42:00 due to a flapping network link (<=11% of total zero traffic around that time).
2014-08-16 ~22:43 | 2014-08-16 ~22:49 | bug 69663 | Root mount on oxygen became full, which caused services to panic, and udp2log dropped requests during that time.
2014-08-17 ~06:26 | 2014-08-17 ~06:30 | bug 69663 | Root mount on oxygen became full again, which caused services to panic, and udp2log dropped requests during that time.
2014-08-24 14:00 | 2014-08-27 21:00 | bug 70118 | Resource scarcity on gadolinium causing higher drop rates, and service restarts chopping off part of the data for some hours.
2014-08-28 16:01 | 2014-08-28 ~20:30 | bug 70136 | Permission errors on gadolinium prevented the writing of hourly files.
2014-10-08 22:00 | 2014-10-08 24:00 | bug 71879 | ULSFO having connectivity issues, leading to partial message loss.
2014-10-15 ~19:02:30 | bug 66352 | Pageviews to “undefined” and “Undefined” pages have been counted.
2014-10-15 ~19:02:30 | bug 71790 | Redirects have been counted.
2014-10-15 ~19:00:00 | 2014-10-15 ~19:02:30 | bug 72102 | No messages collected during deployment of new webstatscollector version.
2014-10-15 ~20:22:00 | 2014-10-15 ~20:23:00 | bug 72107 | No messages collected during restart of webstatscollector's filter.
2014-10-20 13:06 | 2014-10-20 13:27 | bug 72306 | ULSFO connectivity issues causing packet loss between 6% and 47% for ulsfo caches.
2014-10-21 ~10:30 | 2014-10-21 ~11:43 | bug 72355 | ULSFO connectivity issues causing packet loss for ulsfo caches.
2014-11-25 ~01:56 | 2014-12-04 14:03 | task T76390 | Change of HTTPS setup makes HTTPS requests from eqiad and esams (not ulsfo) get counted twice. On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.
2014-11-30 ~03:50 | 2014-11-30 ~10:13 | task T76334 | No data while analytics infrastructure suffered eqiad network issues. On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.
2015-01-01 00:00 | n/a | Switch from webstatscollector generated files to Hive generated files (If32afc, stripped-down variant of pagecounts-all-sites), see announcement ("You may see a slight increase in article counts. The webrequest data in HDFS is less lossy than the udp2log data").
2015-01-13 ~22:20 | 2015-01-13 ~23:18 | task T86973 | No data due to firewall problems.
Other notes:
In June 2014, tablet views were switched from the desktop site to the mobile site, causing mobile views to increase and correspondingly the desktop views (i.e. also the per-page numbers from pagecounts-raw) to drop.
Earlier issues (incomplete list):
2013-07-23 to 2013-07-24: some data loss (resulting in empty files for 17 hours)
Remarks about some Wikistats issues (2009-) that may or may not have affected pagecounts-raw too
Errata list from 2011 on Wikistats
See also
stats.grok.se FAQ
Notes about some mirrors of these datasets
Hence, the “project” in projectcounts is somewhat a misnomer, but kept for historical compatibility.