Data Platform/Data Lake/Traffic/BotDetec

Why do we need bot detection?
Wikipedia's content is read by humans and by automated agents: scripts of widely varying sophistication. These automated scripts (normally called 'bots') can be as complicated as a major search engine crawler that "reads" all of Wikipedia and indexes it, or as simple as a small script someone writes to scrape a couple of pages of interest. We also see bot vandalism: scripts that, for example, try to push pages of a sexual nature onto the top-read list of a smaller Wikipedia site.
The first step to sort this traffic out is to mark the traffic requests in which bots self-identify as such via the user agent (one of the HTTP headers) using the word 'bot'. We label this kind of traffic 'spider'.
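As a rough illustration of this first step, the sketch below labels a request 'spider' when its user agent contains the word 'bot'. It is a simplified assumption, not the pipeline's actual rule: the pattern `SELF_IDENTIFIED_BOT` and the function `initial_agent_type` are hypothetical names, and the production regular expression is not shown on this page.

```python
import re

# Assumption: a simplified pattern. The production pipeline's actual
# regular expression is not reproduced on this page.
SELF_IDENTIFIED_BOT = re.compile(r"bot", re.IGNORECASE)


def initial_agent_type(user_agent: str) -> str:
    """Return 'spider' when the user agent self-identifies as a bot, else 'user'."""
    # An empty user agent does not self-identify, so it stays 'user' here.
    if user_agent and SELF_IDENTIFIED_BOT.search(user_agent):
        return "spider"
    return "user"
```

For example, a user agent such as `Mozilla/5.0 (compatible; Googlebot/2.1)` would be labelled 'spider', while a regular browser user agent would remain 'user'.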
This leaves us with a significant amount of traffic that, while automated in nature, is not marked as such. The biggest problem with mislabeling this traffic is not its overall effect on the pageview metric but its impact on any kind of top-pageview list. For example, see the incident with the United States Senate page, which appeared (in the midst of the COVID-19 pandemic) as the most visited page on English Wikipedia. For a while now, the community has been applying filtering rules to any "Top X" list that we compile[1], and we have incorporated some of these filters into our automated traffic detection.
We feel it is important to mention that we are not aiming to remove every single instance of bot traffic, but rather the most egregious cases of automated traffic that does not self-identify as such.
From our analysis (detailed below), the overall pageviews attributed to users drop by about 5% when we mark 'automated' traffic as such. This number gets as high as 8% in some instances but mostly stays around 5.5%. The effect is not equally distributed across sites.
A digested version of this explanation can be found on our tech blog: Identifying "fake" traffic on Wikipedia. Also, see below a zoomed-in view of our data pipelines through the prism of automated traffic detection.
Traffic bot detection flowchart 2023-01
Current state of the automated traffic detection in our pipeline
The dataset Raw Webrequest (Hive table wmf_raw.webrequest)
The raw webrequest dataset is the raw source. It includes all automated traffic.
In this HQL job we categorize the requests into 2 groups, "user" and "spider", stored in the agent_type field.
For that, we study the request user agents in 2 ways:
This detection was enough when the bots declared themselves properly through the user agent.
Then a job computes user activity stats (Hive table features.actor_hourly)
The notion of "actors" is loosely similar to web sessions. An actor is a string of interactions from a user (or a bot) with a wiki server. This HQL job prepares a summary of each user's activity during the hour (e.g., first page loaded, how many pages, ...) and, generally speaking, all the stats needed to run the later detection.
Only requests for "text" that are considered pageviews on our websites are counted (the pageview categorization is out of scope here; in short, it ensures that the request is something like a user opening a page on one of our MediaWiki sites). GoogleWebLight traffic is also excluded at this step for simplicity.
We keep 6 parameters (called features). For each active user session during the hour:
First & last interactions with the website
Number of pages seen
Number of pages seen per minute
Does the user use cookies
Length of the user agent
Number of distinct pages visited
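The features listed above could be computed along these lines. This is a minimal sketch under stated assumptions, not the HQL job itself: the `Pageview` record, the `hourly_features` function and all field names are hypothetical, and "does the user use cookies" is represented here as the share of requests that carry cookies.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Pageview:
    actor: str          # hypothetical actor identifier
    ts: datetime        # time of the request
    page_title: str
    has_cookies: bool
    user_agent: str


def hourly_features(pageviews):
    """Summarise one hour of pageviews into one feature row per actor."""
    by_actor = {}
    for pv in pageviews:
        by_actor.setdefault(pv.actor, []).append(pv)

    features = {}
    for actor, pvs in by_actor.items():
        first = min(p.ts for p in pvs)
        last = max(p.ts for p in pvs)
        # Assumption: count at least one minute so that an actor with a
        # single request still gets a finite rate.
        minutes = max((last - first).total_seconds() / 60.0, 1.0)
        features[actor] = {
            "first_interaction": first,
            "last_interaction": last,
            "pageview_count": len(pvs),
            "pageviews_per_minute": len(pvs) / minutes,
            "cookie_ratio": sum(p.has_cookies for p in pvs) / len(pvs),
            "user_agent_length": len(pvs[0].user_agent),
            "distinct_pages": len({p.page_title for p in pvs}),
        }
    return features
```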
Aggregate stats for the last 24 hours (Hive table features.actor_rollup_hourly)
Some user sessions span multiple hours. To get more realistic information about the session, we aggregate the data over a window covering the last 24 hours, ending on the computed hour. This is done in an HQL script.
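The 24-hour window can be pictured with the sketch below. It is an illustration only: the row layout and the `rollup_last_24h` function are hypothetical, and the window is assumed to include the computed hour plus the 23 hours before it.

```python
from datetime import datetime, timedelta


def rollup_last_24h(hourly_rows, target_hour):
    """Combine per-hour actor rows whose hour lies in (target_hour - 24h, target_hour]."""
    window_start = target_hour - timedelta(hours=24)
    rollup = {}
    for row in hourly_rows:
        # Assumption: the window is open at the start and closed at the end,
        # which gives exactly 24 hourly buckets.
        if not (window_start < row["hour"] <= target_hour):
            continue
        agg = rollup.get(row["actor"])
        if agg is None:
            rollup[row["actor"]] = {
                "pageview_count": row["pageview_count"],
                "first_interaction": row["first_interaction"],
                "last_interaction": row["last_interaction"],
            }
            continue
        agg["pageview_count"] += row["pageview_count"]
        agg["first_interaction"] = min(agg["first_interaction"], row["first_interaction"])
        agg["last_interaction"] = max(agg["last_interaction"], row["last_interaction"])

    # Recompute the rate over the whole window rather than averaging hourly rates.
    for agg in rollup.values():
        span = agg["last_interaction"] - agg["first_interaction"]
        minutes = max(span.total_seconds() / 60.0, 1.0)
        agg["pageviews_per_minute"] = agg["pageview_count"] / minutes
    return rollup
```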
Classify the pageviews_actor (Hive table predictions.actor_label_hourly)
An HQL script computes, every hour, a label for each session using the feature values computed with data from the 24 hours prior. The label can be "user", "automated", or "unclassified".
The categorization is performed with a decision tree. The decisions are applied in this order in the HQL file:
a mobile session with few pageviews is a user. (10 pageviews)
a session with a lot of pageviews is a bot. (800 pageviews)
a session with a high frequency of pageviews is a bot. (30 pageviews/minute)
a session mainly without cookies and with a high frequency of pageviews is a bot. (30 pageviews/minute)
a session with too many requests without cookies and a small page_title variability is a bot.
a session using a user agent that is too long or too short is a bot. (length outside 25-400 characters)
a session with a very low pageview rate is a user. (~0 pageviews/minute)
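The ordered rules above can be read as a chain of early returns. The sketch below is one possible reading, not the HQL file: `ActorFeatures`, `label_actor` and all constant names are hypothetical. The thresholds 10, 800, 30 and 25-400 come from the list above. The cut-offs for "mainly without cookies", "too many requests", "small page_title variability" and "~0 pageviews/minute" are not given on this page and are placeholders, as is the choice of strict versus inclusive comparisons.

```python
from dataclasses import dataclass


@dataclass
class ActorFeatures:
    is_mobile: bool
    pageview_count: int
    pageviews_per_minute: float
    cookie_ratio: float        # share of requests carrying cookies (hypothetical)
    distinct_pages: int
    user_agent_length: int


# Thresholds taken from the list of rules.
FEW_PAGEVIEWS = 10
MANY_PAGEVIEWS = 800
HIGH_RATE = 30.0
NO_COOKIE_HIGH_RATE = 30.0
UA_MIN_LENGTH, UA_MAX_LENGTH = 25, 400
# Placeholders: these values are NOT stated on this page.
MOSTLY_NO_COOKIES = 0.5
NO_COOKIE_MIN_PAGEVIEWS = 100
LOW_TITLE_VARIABILITY = 0.1
VERY_LOW_RATE = 0.05


def label_actor(f: ActorFeatures) -> str:
    """Apply the rules in order; the first matching rule decides the label."""
    if f.is_mobile and f.pageview_count < FEW_PAGEVIEWS:
        return "user"
    if f.pageview_count > MANY_PAGEVIEWS:
        return "automated"
    if f.pageviews_per_minute > HIGH_RATE:
        return "automated"
    # With the thresholds as listed, the previous rule already covers this
    # branch; it is kept to mirror the listed order.
    if f.cookie_ratio < MOSTLY_NO_COOKIES and f.pageviews_per_minute > NO_COOKIE_HIGH_RATE:
        return "automated"
    if (f.cookie_ratio < MOSTLY_NO_COOKIES
            and f.pageview_count >= NO_COOKIE_MIN_PAGEVIEWS
            and f.distinct_pages / f.pageview_count <= LOW_TITLE_VARIABILITY):
        return "automated"
    if not (UA_MIN_LENGTH <= f.user_agent_length <= UA_MAX_LENGTH):
        return "automated"
    if f.pageviews_per_minute < VERY_LOW_RATE:
        return "user"
    return "unclassified"
```

Because the rules are evaluated in order, a mobile session with few pageviews is labelled "user" even if a later rule (for example the user agent length check) would have flagged it.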
Studies were performed to evaluate the gain from using an ML model for classification in place of these heuristics. The conclusion was that it is not worth the added complexity. That said, the heuristics and their threshold values were determined from experience and with inspiration from the results of the ML model (a decision tree classifier).
To create the pageview actor dataset, we:
Create other datasets
Other datasets using pageview actors:
clickstream
pageviews hourly
...
Read the traffic page for more details. Also see the page about automated traffic correction in unique devices to see how it is used.
Effect of automated-traffic detection on Pageview metric - March 2020
The automated-traffic detection method changes the agent_type field of some pageviews from user to automated. Therefore the number of user pageviews decreases once the mechanism is deployed.
Effect on most-viewed pages list (top)
The bot detection changes are most visible in any list that compiles top pageviews (top pageviews per project, or per project and country). There are two reasons why the bot detection code impacts these lists significantly: the first is 'bot vandalism' and the second is 'bot spam'. 'Bot vandals' are bots whose only goal seems to be adding obscene, sexual or political content to the top pageview list of a given project. We had a recent instance of bot vandalism on Hungarian Wikipedia, where a significant percentage of the pages on the top pageview list were just bogus titles. We use data such as that from this event to verify the accuracy of our removal, which in this particular incident was very effective. There have been other incidents of lesser volume in terms of pageviews, where it is harder to detect that the traffic might be automated.
The second most common effect we see on top pageview lists is 'bot spam'. Some bots request a given page over and over with unknown intent; in some instances it seems the bot is trying to manipulate one of Wikipedia's top pageview lists to gain popularity for a topic. See, for example, a recent example in German Wikipedia. Normally the bot spam is temporary and the page soon disappears from the top pageview list. However, some pages have had sustained automated traffic for years. The Darth Vader page on the top list for English Wikipedia is a good example of one of those.
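To make the effect on a top list concrete, the sketch below ranks pages by pageviews while skipping selected agent types. It is an illustration only: `top_pages` is a hypothetical helper, and the page titles and counts in the usage note are invented placeholders, not real traffic data.

```python
from collections import Counter


def top_pages(rows, n, exclude_agent_types=("spider", "automated")):
    """Rank pages by summed view_count, skipping the excluded agent types.

    rows is an iterable of (page_title, agent_type, view_count) tuples.
    """
    counts = Counter()
    for page_title, agent_type, view_count in rows:
        if agent_type not in exclude_agent_types:
            counts[page_title] += view_count
    return counts.most_common(n)
```

With invented rows such as `("Page_A", "automated", 900)`, `("Page_A", "user", 50)`, `("Page_B", "user", 500)` and `("Page_C", "user", 300)`, Page_A leads the list when only 'spider' is excluded, but drops out of the top two once 'automated' is excluded as well.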
Over the month of March 2020, for all projects, the number of pageviews by agent_type is as follows:
| agent_type | sum(view_count) | share of total |
| --- | --- | --- |
| user | 16202727427 | 71.55% |
| spider | 5188363262 | 22.91% |
| automated | 1253707204 | 5.54% |
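The percentages follow directly from the counts: each share is the agent type's pageviews divided by the sum over all three agent types. A short check, using only the figures reported above (the variable names are illustrative):

```python
# Pageview counts for March 2020, all projects, as reported above.
counts = {
    "user": 16202727427,
    "spider": 5188363262,
    "automated": 1253707204,
}

total = sum(counts.values())

# Share of each agent type, as a percentage rounded to two decimals.
shares = {agent_type: round(100 * views / total, 2)
          for agent_type, views in counts.items()}

for agent_type, share in shares.items():
    print(f"{agent_type}: {share}%")
```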
Of the 5.54% of traffic labelled as automated, 84% is desktop, 16% is mobile web and less than half a percent is from the mobile app. This is consistent with the heuristics the community has been using to discard automated traffic when manually curating top pageview lists: for a few years now, pages with a very high percentage of desktop-only traffic have no longer been present on those lists.
The graphs below show that the amount of traffic labelled automated is stable, both overall and within each access_method.
Totals
Pageviews for all wikimedia projects by agent-type in March 2020
Pageviews for all wikimedia projects by agent-type (ratios) in March 2020
By access-method
Desktop pageviews for all wikimedia projects by agent-type in March 2020
Mobile-web pageviews for all wikimedia projects by agent-type in March 2020
Mobile-app pageviews for all wikimedia projects by agent-type in March 2020
See also a former research analysis about agents without cookies.
Impact per project
The impact per project depends on the size of the project.
On big projects the volume of traffic flagged as automated is relatively regular, similar to the global one (see graphs below).
On smaller projects, however, the amount of so-called automated traffic is a lot less regular, with spikes or plateau periods. It is interesting to note that removing the automated traffic from the user bucket makes the new user traffic not only somewhat smaller but also, in many cases, a lot more stable (a better split between signal and noise).
The number associated with each project is its rank in terms of number of pageviews for the month of March 2020, with user, spider and automated traffic included.
Top 10 projects by number of views
en.wikipedia (1)
Pageviews for en.wikipedia project by agent-type in March 2020
Pageviews for en.wikipedia project by agent-type (ratio) in March 2020
es.wikipedia (2)
Pageviews for es.wikipedia project by agent-type in March 2020
Pageviews for es.wikipedia project by agent-type (ratio) in March 2020
ja.wikipedia (3)
Pageviews for ja.wikipedia project by agent-type in March 2020
Pageviews for ja.wikipedia project by agent-type (ratio) in March 2020
de.wikipedia (4)
Pageviews for de.wikipedia project by agent-type in March 2020
Pageviews for de.wikipedia project by agent-type (ratio) in March 2020
commons.wikimedia (5)
Pageviews for commons.wikimedia project by agent-type in March 2020
Pageviews for commons.wikimedia project by agent-type (ratio) in March 2020
ru.wikipedia (6)
Pageviews for ru.wikipedia project by agent-type in March 2020
Pageviews for ru.wikipedia project by agent-type (ratio) in March 2020
fr.wikipedia (7)
Pageviews for fr.wikipedia project by agent-type in March 2020
Pageviews for fr.wikipedia project by agent-type (ratio) in March 2020
it.wikipedia (8)
Pageviews for it.wikipedia project by agent-type in March 2020
Pageviews for it.wikipedia project by agent-type (ratio) in March 2020
zh.wikipedia (9)
Pageviews for zh.wikipedia project by agent-type in March 2020
Pageviews for zh.wikipedia project by agent-type (ratio) in March 2020
pt.wikipedia (10)
Pageviews for pt.wikipedia project by agent-type in March 2020
Pageviews for pt.wikipedia project by agent-type (ratio) in March 2020
5 random smaller projects
gl.wikipedia (80)
Pageviews for gl.wikipedia project by agent-type in March 2020
Pageviews for gl.wikipedia project by agent-type (ratio) in March 2020
en.wikivoyage (81)
Pageviews for en.wikivoyage project by agent-type in March 2020
Pageviews for en.wikivoyage project by agent-type (ratio) in March 2020
wikisource (158)
Pageviews for wikisource project by agent-type in March 2020
Pageviews for wikisource project by agent-type (ratio) in March 2020
de.wikiquote (259)
Pageviews for de.wikiquote project by agent-type in March 2020
Pageviews for de.wikiquote project by agent-type (ratio) in March 2020
el.wikiquote (473)
Pageviews for el.wikiquote project by agent-type in March 2020
Pageviews for el.wikiquote project by agent-type (ratio) in March 2020
Code
Because the detection of automated traffic happens after pageviews are refined, the 'automated' marker in the 'agent_type' column is not present on the webrequest table; rather, the 'automated' marker is present on the pageview_hourly table. So records initially marked with agent_type='user' on webrequest might later be marked with agent_type='automated'. To find out whether a particular set of requests from a given actor in webrequest is marked as 'automated', it is necessary to calculate the actor signature. Predictions per actor signature are calculated daily. Sample code to calculate actor signatures looks as follows:
```sql
-- Example 1: I am wondering about this IP and whether its traffic might be of automated nature
use wmf;
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive-shaded.jar;
CREATE TEMPORARY FUNCTION get_actor_signature AS 'org.wikimedia.analytics.refinery.hive.GetActorSignatureUDF';

-- Collect the actor signatures of the requests under investigation
with questionable_requests as (
    select distinct
        get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature
    from webrequest
    where
        year=2020 and month=4 and day=20 and hour=1
        and is_pageview=1
        and pageview_info['project'] = 'es.wikipedia'
        and agent_type = 'user'
        and ip = 'some' -- placeholder: replace with the IP under investigation
    limit 100
)
-- Look up the label predicted for each of those signatures
select
    AL.label,
    AL.label_reason, -- this field explains why the label was assigned
    AL.actor_signature
from questionable_requests QR
join predictions.actor_label_hourly AL
    on (AL.actor_signature = QR.actor_signature)
where
    AL.year=2020 and AL.month=4 and AL.day=20 and AL.hour=1;

-- Example 2: For an hour of webrequest select just the traffic that is not labeled as automated
use wmf;
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive-shaded.jar;
CREATE TEMPORARY FUNCTION get_actor_signature AS 'org.wikimedia.analytics.refinery.hive.GetActorSignatureUDF';

-- Actor signatures labeled as "user" for the hour
with user_traffic as (
    select actor_signature
    from predictions.actor_label_hourly
    where year=2020 and month=4 and day=20 and hour=1
        and label = 'user'
)
select
    pageview_info['page_title'],
    get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map) AS actor_signature
from webrequest W
join user_traffic UT
    on (UT.actor_signature = get_actor_signature(ip, user_agent, accept_language, uri_host, uri_query, x_analytics_map))
where
    year=2020 and month=4 and day=20 and hour=1
    and is_pageview=1
    and pageview_info['project'] = 'es.wikipedia'
    and agent_type = 'user';
```
