Analytics/Archive/EventLogging - Wikitech
This documentation is outdated. See the Event Platform documentation.
EventLogging (EL for short) is a platform for modelling, logging, and processing arbitrary analytic data. It consists of:
a MediaWiki extension that provides JavaScript and PHP APIs for logging events
a back-end written in Python, which aggregates events, validates them, and streams them to analytics clients.
This documentation is about the specific EventLogging instance that collects data on Wikimedia sites.
For users
Schemas
Here is the list of existing schemas. Many of them are active, but not all: some are still in development (not active yet), and others are obsolete and listed only for historical reference.
The schema's discussion page is the place to comment on the schema design and related topics. It contains a template that specifies the schema maintainer(s), the team and project the schema belongs to, its status (active, inactive, in development), and its purging strategy.
Creating a schema
There's thorough documentation on designing and creating a new schema here:
These are some special guidelines for creating a schema that Druid can digest easily:
Send events
See Extension:EventLogging/Programming for how to instrument your MediaWiki code.
Client-side events
Client-side events are logged using a web beacon with the project's hostname (e.g. en.wikipedia.org), the path beacon/event, and a query string containing all the event fields (with percent-encoded punctuation). Decoding the punctuation, the payload of one such request looks like:
"event": { "action": "abort", ... },
"schema": "Edit",
"revision": 1234,
"webHost": "en.wikipedia.org",
"wiki": "enwiki"
Because this data is sent through a URL, we can't use URLs that are longer than browsers can cope with. Therefore, EventLogging limits unencoded client-side events to 2000 characters.
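The sketch below is a minimal illustration (not the extension's actual code) of how an event capsule could be turned into such a beacon request: the capsule is JSON-stringified, percent-encoded into the query string, and its unencoded length is checked against the 2000-character limit mentioned above. The build_beacon_url helper is hypothetical; the field values are taken from the example above.

import json
from urllib.parse import quote

def build_beacon_url(hostname, capsule):
    # JSON-stringify the whole capsule, then percent-encode the punctuation.
    payload = json.dumps(capsule)
    # EventLogging limits unencoded client-side events to 2000 characters.
    assert len(payload) <= 2000, 'event too large, it would be rejected'
    return 'https://%s/beacon/event?%s' % (hostname, quote(payload))

capsule = {
    'event': {'action': 'abort'},
    'schema': 'Edit',
    'revision': 1234,
    'webHost': 'en.wikipedia.org',
    'wiki': 'enwiki',
}
print(build_beacon_url('en.wikipedia.org', capsule))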
Note that the beacon URL you choose does not actually affect the data logged; for simplicity, both the iOS app and the Android app log all their events to the meta.wikimedia.org beacon, even when the events relate to other projects.
Note that anyone could send events to these endpoints, but in production only events whose webHost is a Wikimedia one are processed. There are many clones of our sites running our code (like bad.wikipedia-withadds.com) that are, at this time, sending events to the existing beacon.
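If you want to double-check that your analysis only covers Wikimedia-hosted traffic, a hedged pyspark sketch along these lines can help; it assumes the capsule's webHost field is exposed as a webhost column in the refined table (verify with DESCRIBE before relying on it):

# pyspark2 -- illustrative only; the webhost column name is an assumption.
spark.sql("""
    SELECT webhost, COUNT(*) AS cnt
    FROM event.MobileWikiAppEdit
    WHERE year = 2017 AND month = 11 AND day = 20
    GROUP BY webhost
    ORDER BY cnt DESC
""").show(20, False)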
Accessing data
Data stored by EventLogging for the various schemas has varying degrees of privacy, including personally identifiable and sensitive information, so access to it requires an NDA. Also, by default, EL data is only kept for 90 days unless otherwise specified; see Analytics/Systems/EventLogging/Data retention.
See Analytics/EventLogging/Data representations for an explanation of where the data lives and how to access it.
Access
See: Analytics/Data access#EventLogging data and Analytics/Data access#Production_access
Hadoop & Hive
Raw JSON data is imported into HDFS from Kafka, and then further refined into Parquet-backed Hive tables. These tables live in two Hive databases, event and event_sanitized, and are stored in HDFS at hdfs:///wmf/data/event and hdfs:///wmf/data/event_sanitized.
event stores the original data for 90 days (data older than 90 days is automatically deleted). event_sanitized stores the sanitized data indefinitely. The sanitization process uses a whitelist that indicates which tables and fields can be stored indefinitely; see Analytics/Systems/EventLogging/Data retention and auto-purging. You can access all this data through Hive, Spark, or other Hadoop methods.
Data from a given hourly period is only refined into Hive two hours after the end of the period, to allow for late-arriving events.
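For example, anything older than 90 days has to be read from event_sanitized rather than event. A minimal pyspark sketch, assuming the schema you care about is whitelisted and therefore present in event_sanitized:

# pyspark2 -- illustrative; MobileWikiAppEdit is only an example table and is
# present in event_sanitized only if it is in the sanitization whitelist.
spark.sql("""
    SELECT COUNT(*) AS cnt
    FROM event_sanitized.MobileWikiAppEdit
    WHERE year = 2017 AND month = 1
""").show()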
Notes on data in Hive
A UDF has been provided in Hive to convert the dt field into a MediaWiki timestamp (phab:T186155). It can be used to join to MediaWiki-style timestamp strings as follows:
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION GetMediawikiTimestamp AS 'org.wikimedia.analytics.refinery.hive.GetMediawikiTimestampUDF';
SELECT GetMediawikiTimestamp('2019-02-20T12:34:56Z') AS timestamp;
OK
timestamp
20190220123456
NOTE: Not all EventLogging analytics schemas are 'refinable'. Some schemas specify invalid field names, e.g. with dots '.' in them, or have field type changes between different records. If this happens, it will not be possible to store the data in a Hive table, and as such it won't appear in the list of refined tables. If your schema has this problem, you should fix it. (Dashes '-' in field names are automatically converted to underscores '_' during the refine process, before the data is ingested into Hive; cf. phab:T216096#4955417.)
NOTE: Hadoop and Hive (in the JVM) are strongly typed, whereas the source EventLogging JSON data is not. This can cause problems when importing into Hive, as the refinement step needs to figure out what to do if it encounters type changes. TYPE CHANGES ARE NOT SUPPORTED. Please do not ever change the type of an EventLogging field. You may add new fields as you need and stop using old ones, but do not change types. Some type changes will be partially supported during the refinement stage: e.g. if the schema contains an integer but future data contains a decimal number, the refinement step will log a warning but still finish refinement. A record with an offending type-changed field will have all of its fields set to NULL (not just the offending field).
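One hedged way to spot affected data is to look for partitions with an unusual number of rows where a field your instrumentation always populates is NULL. A sketch (the userID field is just an example; pick a field that is mandatory in your schema):

# pyspark2 -- rows nulled out by the refine step show up as NULLs in columns
# that the client always sends.
spark.sql("""
    SELECT year, month, day, hour,
           SUM(IF(event.userID IS NULL, 1, 0)) AS null_rows,
           COUNT(*) AS all_rows
    FROM event.MobileWikiAppEdit
    WHERE year = 2017 AND month = 11
    GROUP BY year, month, day, hour
    ORDER BY null_rows DESC
""").show(10)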
Hive
EventLogging analytics data is imported into the event and event_sanitized databases in Hive. Note that the EventLogging schema fields are within the event column (a struct). You can access them using dot notation, e.g. event.userID.
Basic example:
SELECT
    event.userID,
    count(*) as cnt
FROM
    event.MobileWikiAppEdit
WHERE
    year = 2017
    AND month = 11
    AND day = 20
    AND hour = 19
GROUP BY
    event.userID
ORDER BY
    cnt DESC
LIMIT 10;
...
event.userid    cnt
NULL            1848
333333          87
222229          59
111113          29
111125          21
466534          17
433542          10
754324          ...
121346          ...
123452          ...
Cross-schema example:
SELECT
    nav.event.origincountry,
    srv.event.description,
    PERCENTILE(nav.event.responsestart, 0.50) AS responsestart_p50,
    PERCENTILE(nav.event.responsestart, 0.75) AS responsestart_p75,
    COUNT(*) AS count
FROM event.navigationtiming AS nav
JOIN event.servertiming AS srv
    ON nav.event.pageviewtoken = srv.event.pageviewtoken
WHERE nav.year = 2020 AND srv.year = 2020
    AND nav.month = ... AND srv.month = ...
    AND nav.day = 28 AND srv.day = 28
    AND nav.event.isoversample = false
GROUP BY
    nav.event.origincountry,
    srv.event.description
HAVING count > 1000;
Errors for schemas
Errors are available in the eventerror table in the event database:
Sample select:
select * from eventerror where event.schema like 'MobileWikiApp%' and year=2018 and month=11 and day=1 limit 10;
Spark
Spark can access data directly through HDFS, or as SQL tables in Hive. Refer to the Spark documentation for how to do so. Examples:
Spark 2 Scala SQL & Hive:
// spark2-shell
val query = """
SELECT
    event.userID,
    count(*) as cnt
FROM
    event.MobileWikiAppEdit
WHERE
    year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"""
val result = spark.sql(query)
result.limit(10).show()
...
+--------+----+
|  userID| cnt|
+--------+----+
|    null|1848|
|  333333|  87|
|  222229|  59|
|  111113|  29|
|  111125|  21|
|  466534|  17|
|  433542|  10|
|  754324| ...|
|  121346| ...|
|  123452| ...|
+--------+----+
Spark 2 Python SQL & Hive:
# pyspark2
query = """
SELECT
    event.userID,
    count(*) as cnt
FROM
    event.MobileWikiAppEdit
WHERE
    year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"""
result = spark.sql(query)
result.limit(10).show()
...
+--------+----+
|  userID| cnt|
+--------+----+
|    null|1848|
|  333333|  87|
|  222229|  59|
|  111113|  29|
|  111125|  21|
|  466534|  17|
|  433542|  10|
|  754324| ...|
|  121346| ...|
|  123452| ...|
+--------+----+
Spark 2 R SQL & Hive:
# spark2R
query <- "
SELECT
    event.userID,
    count(*) as cnt
FROM
    event.MobileWikiAppEdit
WHERE
    year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
ORDER BY cnt DESC
"
result <- collect(sql(query))
head(result, 10)
...
  userID  cnt
      NA 1848
  333333   87
  222229   59
  111113   29
  111125   21
  466534   17
  433542   10
  754324  ...
  121346  ...
  123452  ...
Hadoop. Archived Data
In 2017, some big EventLogging tables were archived from MariaDB to Hadoop. Tables were exported with Sqoop into Avro-format files, and Hive tables were created according to the corresponding schemas. Thus far we have the following tables archived in Hadoop, in the archive database:
mobilewebuiclicktracking_10742159_15423246
Edit_13457736_15423246
MobileWikiAppToCInteraction_10375484_15423246
MediaViewer_10867062_15423246
pagecontentsavecomplete_5588433_15423246
PageContentSaveComplete_5588433
PageCreation_7481635
PageCreation_7481635_15423246
PageDeletion_7481655
PageDeletion_7481655_15423246
You can query these tables just like any other table in Hive. A tip regarding dealing with binary types:
select * from Some_tbl where (cast(uuid as string) )='ed663031e61452018531f45b4b5502cb';
Caveat: this process does not preserve the data type for e.g. bigint or boolean fields. The archived Hive table will contain them as strings instead, which will need to be converted back (e.g. CAST(field AS BIGINT)).
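Putting the two tips together, here is a hedged pyspark sketch against one of the archived tables; apart from the table name and the uuid column shown above, any column referenced is a placeholder (run DESCRIBE on the table to see what it actually contains):

# pyspark2 -- illustrative only.
spark.sql("""
    SELECT *
    FROM archive.Edit_13457736_15423246
    WHERE CAST(uuid AS STRING) = 'ed663031e61452018531f45b4b5502cb'
""").show(1, False)

# Casting an archived (stringified) numeric column back to its original type,
# where event_someIntegerField is a hypothetical column name:
# SELECT CAST(event_someIntegerField AS BIGINT) FROM archive.Edit_13457736_15423246 LIMIT 10;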
Hadoop Raw Data
Raw EventLogging JSON data is imported hourly into Hadoop by Gobblin. It is unlikely that you will ever need to access this raw data directly; instead, use the refined event Hive tables as described above.
Raw data is written to directories named after each schema, in hourly partitions in HDFS, under /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_<schema>. There are myriad ways to access this data, including Hive and Spark. Below are a few examples; there may be many (better!) ways to do this.
For backup purposes, we keep 90 days of events coming from the eventlogging-client-side topic in /mnt/hdfs/wmf/data/raw/eventlogging_client_side/hourly/
Advantages of processing EL data in Hadoop (lightning talk slide)
Note that all EventLogging data in Hadoop is automatically purged after 90 days; the whitelist of fields to retain is not used, but this feature could be added in the future if there is sufficient demand.
Hive
Hive has a couple of built-in functions for parsing JSON. Since EventLogging records are stored as JSON strings, you can access this data by creating a Hive table with a single string column and then parsing that string in your queries:
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- Make sure you don't create tables in the default Hive database.
USE otto;

-- Create a table with a single string field
CREATE EXTERNAL TABLE CentralNoticeBannerHistory (
    json_string string
)
PARTITIONED BY (
    year int,
    month int,
    day int,
    hour int
)
STORED AS
    INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory';

-- Add a partition
ALTER TABLE CentralNoticeBannerHistory
ADD PARTITION (year=2015, month=9, day=17, hour=16)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/16';

-- Parse the single string field as JSON and select a nested key out of it
SELECT get_json_object(json_string, '$.event.l.b') as banner_name
FROM CentralNoticeBannerHistory
WHERE year=2015;
Spark
Spark Python (pyspark):
import json

data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/07")
records = data.map(lambda x: json.loads(x[1]))
records.map(lambda x: (x['event']['l'][0]['b'], 1)).countByKey()

Out[33]:
defaultdict(<class 'int'>, {'WMES_General_Assembly': 5})
MobileWikiAppFindInPage events with SparkSQL in Spark Python (pyspark 1):
# Load the JSON string values out of the compressed sequence file.
# Note that this uses * globs to expand to all data in 2016.
data = sc.sequenceFile(
    "/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*"
).map(lambda x: x[1])

# parse the JSON strings into a DataFrame
json_data = sqlCtx.jsonRDD(data)  # replace with sqlCtx.read.json(data) for pyspark 2

# Register this DataFrame as a temp table so we can use SparkSQL.
json_data.registerTempTable("MobileWikiAppFindInPage")

top_k_page_ids = sqlCtx.sql("""SELECT event.pageID, count(*) AS cnt
    FROM MobileWikiAppFindInPage
    GROUP BY event.pageID
    ORDER BY cnt DESC
    LIMIT 10""")

for r in top_k_page_ids.collect():
    print("%s %s" % (r.pageID, r.cnt))
Edit events with SparkSQL in Spark Scala (spark-shell):
// Load the JSON string values out of the compressed sequence file
// and parse them as a DataFrame.
val rawDataPath = "/wmf/data/raw/eventlogging/eventlogging_Edit/hourly/2015/10/21/16"
val edits = spark.read.json(
    spark.createDataset[String](
        spark.sparkContext.sequenceFile[Long, String](rawDataPath).map(_._2)
    )
)

// Register this DataFrame as a temp table so we can use SparkSQL.
edits.registerTempTable("edits")

// SELECT top 10 edited wikis
val top_k_edits = sqlContext.sql("""SELECT wiki, count(*) AS cnt
    FROM edits
    GROUP BY wiki
    ORDER BY cnt DESC
    LIMIT 10""")

// Print them out
top_k_edits.foreach(println)
Kafka
There are many Kafka tools with which you can read the EventLogging data streams. kafkacat is one that is installed on stat1007.
# Uses kafkacat CLI to print window ($1)
# seconds of data from $topic ($2)
function kafka_timed_subscribe {
    timeout $1 kafkacat -C -b kafka-jumbo1001 -t $2
}

# Prints the top K most frequently
# occurring values from stdin.
function top_k {
    sort | uniq -c | sort -nr | head -n $1
}

while true; do
    date
    echo '------------------------------'
    # Subscribe to eventlogging_Edit topic for 5 seconds
    kafka_timed_subscribe 5 eventlogging_Edit |
        # Filter for the "wiki" field
        jq .wiki |
        # Count the top 10 wikis that had the most edits
        top_k 10
    echo ''
done
Publishing data
See Analytics/EventLogging/Publishing for how to proceed if you want to publish reports based on EventLogging data, or datasets that contain EventLogging data.
Verify received events
Logstash has eventlogging EventError events. Validation errors are also visible in the application logs located at /srv/log/eventlogging/systemd. In production they also end up in the Kafka topic eventlogging_EventError, and there is also a Hive table named event.eventerror.
The processor is the component that handles validation, so, for example, eventlogging_processor-client-side- will have an error like the following if events are invalid:
Unable to validate: {"event": {"pagename": "Recentchanges", "namespace": null, "invert": false, "associated": false, "hideminor": false, "hidebots": true, "hideanons": false, "hideliu": false, "hidepatrolled": false, "hidemyself": false, "hidecategorization": true, "tagfilter": null}, "schema": "ChangesListFilters", "revision": 15876023, "clientValidated": false, "wiki": "nowikimedia", "webHost": "no.wikimedia.org", "userAgent": "Apple-PubSub/65.28"} cp1066.eqiad.wmnet 42402900 2016-09-26T07:01:42
This happens if client code has a bug and is sending events that are not valid according to the schema. We normally try to identify the schema at fault and pass that info back to the devs so they can fix it. See this ticket for an example of how we deal with these errors:
As of T205437, validation error logs are also available in Logstash for up to 30 days. A handy link to the associated Kibana search is available on a schema's talk page, provided that it's documented using the SchemaDoc template. Note well that access to Logstash requires a Wikimedia developer account with membership in a user group indicating that the user has signed an NDA.
User agent sanitization
Main article: Analytics/Systems/EventLogging/User agent sanitization
The userAgent field is sanitized immediately upon storage; the content is replaced with a parsed version in JSON format.
Data retention and purging
Main article: Analytics/Systems/EventLogging/Data retention
By default, all EventLogging data is deleted after 90 days to comply with our data retention guidelines. However, individual properties within schemas can be whitelisted so that the data is retained indefinitely; generally, all columns can be whitelisted except the clientIp and userAgent fields. This whitelist is maintained in the analytics/refinery repo as static_data/eventlogging/whitelist.yaml.
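As a sketch of how the whitelist could be inspected locally, the snippet below assumes the file is a YAML mapping from table name to a list of retained fields; verify the actual structure in the analytics/refinery repository before relying on it:

# Illustrative only: the whitelist.yaml structure (table -> list of kept fields)
# is an assumption, as is the example table/field pair.
import yaml

with open('static_data/eventlogging/whitelist.yaml') as f:
    whitelist = yaml.safe_load(f)

def is_retained(table, field):
    return field in whitelist.get(table, [])

print(is_retained('MobileWikiAppEdit', 'event_action'))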
Retiring a schema
When you no longer want to collect a particular data stream, there are a few cleanup steps you should take:
Remove the instrumentation code.
Mark the schema inactive by editing the SchemaDoc template on its talk page.
Remove its entries from the whitelist (so it's easy for others to review what's actively being retained).
Request the deletion of any previously whitelisted data if it's no longer necessary.
Operational support
Tier 2 support: Analytics/Tier2
Outages
Any outages that affect EventLogging will be tracked on Incident documentation (also listed below) and announced to the lists eventlogging-alerts@lists.wikimedia.org and ops@lists.wikimedia.org.
Alarms
Alarms at this time come to the Analytics team. We are working on being able to claim alarms in Icinga.
Contact
You can contact the Analytics team at: analytics@lists.wikimedia.org
For developers
Codebase
The EventLogging python codebase can be found at
Architecture
See Analytics/EventLogging/Architecture for EventLogging architecture.
Performance
On this page you'll find information about Event Logging performance, such as load tests and benchmarks:
Size limitation
There is a limitation on the size of individual EventLogging events, due to the underlying infrastructure (the limited size of URLs in Varnish's varnishncsa/varnishlog, as well as Wikimedia UDP packets). For the purpose of size limitation, an "entry" is a /beacon request URL containing urlencoded JSON-stringified event data. Entries longer than 1014 bytes are truncated. When an entry is truncated, it will fail validation because of parsing (as the result is invalid JSON).
This should be taken into account when creating a schema. Large schemas should be avoided, as should schema fields with long keys and/or values. Consider splitting up a very large schema, or replacing long fields with shorter ones.
To aid with testing the length of schemas, EventLogging's dev-server logs a warning into the console for each event that exceeds the size limit.
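The sketch below illustrates why a truncated entry fails validation: cutting a percent-encoded JSON payload at a byte limit almost always leaves invalid JSON behind. Only the 1014-byte figure comes from the section above; the payload itself is made up.

import json
from urllib.parse import quote, unquote

payload = quote(json.dumps({'schema': 'Edit', 'event': {'comment': 'x' * 2000}}))
truncated = payload[:1014]           # what would survive truncation
try:
    json.loads(unquote(truncated))   # parsing the truncated entry...
except ValueError as e:
    print('validation fails:', e)    # ...fails, so the event is lost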
Monitoring
You can use various tools to monitor operational metrics; read more on this dedicated page:
Testing
The EventLogging extension can be tested on Vagrant easily; that is described on mediawiki.org at Extension:EventLogging. The server side of EventLogging (the consumer of events) does not have a Vagrant setup for testing, but can be tested in the Beta Cluster:
How do I ...?
Visit the EventLogging how-to page. It contains some dev-ops tips and tricks for EventLogging, like deploying, troubleshooting, and restarting. Please add any step-by-step guides for EventLogging dev-ops tasks there.
Administration. On call
Here's a list of routine tasks to do when on call for EventLogging.
Data Quality Issues
Changes and Known Problems with Dataset
Date from | Date until | Task | Details
2020-06-18T20:00:00Z | 2020-06-19T22:00:00Z | T249261 | While attempting the first migration of legacy EventLogging streams to EventGate, Otto misconfigured the EventLogging extension's $wgEventLoggingServiceUri for non-group0 wikis, effectively causing SearchSatisfaction events to be disabled on all non-group0 wikis.
2019-09-23 | 2019-09-29 | T233718 | Many events emitted by MediaWiki are missing in Hive refined event database tables, including events from mediawiki_revision_create, mediawiki_page_create, etc. This was caused by a problem when importing data from Kafka via Camus, but at the time was only known to affect mediawiki_api_request and mediawiki_cirrussearch_request. Data for other mediawiki_* tables was not backfilled, and the raw data has since been deleted.
2017-11 | 2017-11 | T179625 | Canonical EventLogging data (parsed, validated, and stored in Kafka) did not match the EventCapsule schema. This was fixed, and data was transformed before insertion into MySQL for backwards compatibility. This helped standardize all event data so that it could be refined and made available in Hive.
2017-07-10 | 2017-07-12 | T170486 | Some data was not inserted in MySQL, but was backfilled for all schemas but page-create. During the backfill, bot events were also accidentally backfilled, resulting in extra data during this time.
2017-05-24 | onwards | T67508 | Do not accept data from bots on EventLogging unless the bot user agent matches "MediaWiki".
2017-03-29 | onwards | T153207 | Change userAgent field in event capsule.
2019-03-19 (14 to 22 hours) | | T218831 | The EventLogging MySQL consumer was restarting for several hours, during which it was not able to insert any data into the database.
2019-04-01 | | T219842 | Kafka Jumbo outage from 22:00 to midnight. Data loss during those hours.
2019-09-12 | | | Third-party domain data is not getting refined (so sites like w.upupming.site that run clones of our code do not send us their requests).
Incidents
Here's a list of all related incidents and their post-mortems. To add a new page to this generated list, use the "EventLogging/Incident_documentation" category.
For all the incidents (including ones not related to EventLogging) see: Incident documentation
Limits of the eventlogging replication script
The log database is replicated to the eventlogging slave databases via a custom script called eventlogging_sync.sh (stored in operations/puppet, for the curious). While working on a related task, we realized that the script was not able to replicate high-volume events in real time, showing a lot of replication lag (even days in the worst-case scenario). Please review the task for more info, or contact the Analytics team if you have more questions.
Ad blockers
Our client-side analytics instrumentation is subject to interference by any ad-blocking software the user has installed. See, for example, T240697 and T251464, in which no-JS editor counts were skewed by unaccounted-for ad blockers. Ad blockers typically work by comparing outgoing requests to a list of disallowed URL domains, paths, or other patterns. For example, ad blockers using the popular EasyPrivacy block list block requests from page scripts to paths matching /beacon/event? (affecting legacy EventLogging) as well as to the domain intake-analytics.wikimedia.org (affecting requests to the new event platform intake service).
The following ad blockers are known as of February 2021 to interfere with WMF analytics instrumentation when using default settings. (Note that most if not all ad blockers allow users to add block lists and custom rules that could result in WMF analytics requests being blocked.)
Name | Client platforms affected | Analytics intake systems affected | Notes
uBlock Origin | Web (desktop + mobile) | EventLogging, MEP | EasyPrivacy enabled by default
Brave (web browser) | Web (desktop + mobile) | MEP | Blocks requests to intake-analytics.wikimedia.org when using standard (default) privacy settings
See also
Analytics/EventLogging/Outages
Analytics/EventLogging/New pipeline
Analytics/EventLogging/Sanitization vs Aggregation
"EventLogging on Kafka". October 2015 lightning talk:
slides
video
Notes
wikimedia/puppet/modules/profile/manifests/analytics/refinery/job/refine.pp