Commons Impact Metrics/Data Model - Wikitech
Jump to content
From Wikitech
Commons Impact Metrics
This page describes the data model of the Commons Impact Metrics data product. It includes detailed descriptions of all dimensions and metrics as well as the detailed schemas of the 5 base datasets stored in Hive (Iceberg) and exported as
dumps
. It does not include information about the Commons Impact Metrics API (AQS service) or the corresponding Cassandra tables, although some of the definitions are shared. It also does not include contextual information on the project, caveats, nor how to access the data. For find more general information about the project, see
Commons Impact Metrics
Field glossary
The following table contains detailed definitions for all fields of the Commons Impact Metrics data model. They are valid for all datasets and datastores down the data pipeline: Hive (Iceberg), Cassandra, AQS and dumps.
Field name
Description
category
The name (or title) of a category, as it appears in the URL of the corresponding category page in Commons. For example:
Images_from_Memòria_Digital_de_Catalunya
is the category name of https://commons.wikimedia.org/wiki/Category:Images_from_Memòria_Digital_de_Catalunya.
This normalized version of the category name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores.
category_scope
It is either "shallow" or "deep". It always goes together with a category field and a metric count field.
shallow
Indicates that the associated metric count pertains to the associated category by itself, and not to the category tree.
deep
Indicates that the associated metric count pertains to the associated full category tree (the category itself and all its descendants).
parent_categories
or
categories
It goes together with a category or a media file field. It is a list of the categories that are a direct parent to the associated field. They use the same format as the
category
field described above. Note: When the associated field is a category, this field is called
parent_categories
. When the associated field is a media file, this field is called
categories
primary_categories
It goes together with a
category
or a
media_file
field. It is a list of the categories that are a top ancestor to the associated field. They use the same format as the
category
field described above. Note: Top ancestors are the ones listed in the Commons Impact Metrics allow-list by definition.
media_file
The name (or title) of a media file, as it appears in the URL of the corresponding media file page in Commons. For example:
A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg
is the media file name of https://commons.wikimedia.org/wiki/File:A_la_font_de_Torrentsenta_a_Gósol_(cropped).jpg.
This normalized version of the media file name (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores.
media_type
Indicates the type of media file. The values are directly imported from the
img_media_type
field in MediaWiki's
image
table, i.e. "BITMAP", "VIDEO", etc.
user_name
A user name, as it appears in the
actor_name
field of MediaWiki's
actor
table. It goes together with an edit event. If no actor is found for the edit,
user_name
is ":anonymous:". If the edit's actor has been suppressed,
user_name
is ":redacted:". The colon symbols are added to those special values to prevent collisions with existing user names.
edit_type
Indicates the type of edit to a media file. Either “create”, for the first revision of a media file page; or “update”, for all subsequent revisions of the media file page.
wiki
The canonical name of a wiki. It indicates the language and the family of the wiki in the format "
page_title
The title of a wiki page, as it appears in the URL of the corresponding page in some wiki. For example:
दिल्ली
is the page title of https://hi.wikipedia.org/wiki/दिल्ली. This normalized version of the page title (with underscores instead of spaces) is the standard that is used across all Commons Impact Metrics datasets and datastores. Note: This data product only reports on
namespace=0
wiki pages.
year_month
For the aggregated datasets,
Pageviews per Category
and
Pageviews per Media File
year_month
indicates the month for which the data is aggregated. For the snapshot based datasets,
Category Metrics Snapshot
and
Media File Metrics Snapshot
year_month
indicates the month at the end of which the snapshot was taken. In all cases the format is "yyyy-MM", i.e. "2024-01".
dt
The timestamp of an edit. It has different formats depending on the datastore. In Hive (Iceberg) it is of type
TIMESTAMP (granularity in seconds)
. In the dumps is has the format "yyyy-MM-dd'T'HH:mm:ss". It is not reported via AQS or dumps.
pageview_count
This metric is always associated with a given Commons media file, or a given Commons category. When it's associated with a media file, it counts the number of pageviews to wiki pages containing that media file. When it's associated with a category, it counts the number of pageviews to wiki pages containing any media file belonging to that category. More specifically, any media file directly associated with the category, if
category_scope="shallow"
; and any media file within the whole associated category tree, if
category_scope="deep"
The
pageview_count
metric can be broken down by
wiki
, by
page_title
and by
year_month
. It is also an additive metric, so it can be aggregated across those same 3 dimensions. For instance, if the
pageview_count
metric value for January is
P1
and for February is
P2
, then the combined value since start of January until end of February is
P1+P2
The
pageview_count
metric has a known caveat in the context of this project:
monthly drift
Read more about monthly drift
media_file_count
This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category.
media_file_count_deep
This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree. Meaning the media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.).
used_media_file_count
This metric is always associated with a given Commons category. It counts the number of media files directly belonging to that category, which appear in at least 1 wiki page.
used_media_file_count_deep
This metric is always associated with a given Commons category. It counts the number of media files belonging to that category tree, which appear in at least 1 wiki page. Meaning the used media files can belong to the category itself, or to any of its descendant categories (child, grandchild, etc.).
leveraging_wiki_count
This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wikis having at least 1 page that features that media file. When associated with a category, it counts the number of wikis having at least 1 page that features any media file directly belonging to the associated category.
leveraging_wiki_count_deep
This metric is always associated with a given Commons category. It counts the number of wikis having at least 1 page that features any media file belonging to the associated category tree (the category itself, its children, grandchildren, etc.).
leveraging_page_count
This metric is always associated with a given Commons media file, or a given Commons category. When associated with a media file, it counts the number of wiki pages (namespace=0) featuring that media file. When associated with a category, it counts the number of wiki pages (namespace=0) featuring any media file directly belonging to the associated category.
leveraging_page_count_deep
This metric is always associated with a given Commons category. It counts the number of wiki pages (namespace=0) featuring any media file belonging to the associated category tree (the category itself, its children, its grandchildren, etc.).
edit_count
This metric is always associated with a given wiki user name, or a given Commons category. When associated with a user name, it counts the number of edits performed by the associated user, to media files belonging to the allow-listed Commons categories. When associated with a category, it counts the number of edits to media files belonging to the associated category. More specifically, edits to media files directly associated with the category, if
category_scope="shallow"
; and edits to media files within the whole associated category tree, if
category_scope="deep"
Base datasets
The Commons Impact Metrics data product consists of 5 base datasets stored in Hive (Iceberg), and also exported in the form of
dumps
. Two of them (
Category metrics snapshot
and
Media file metrics snapshot
) are snapshot-based. Two of them (
Pageviews per category monthly
and
Pageviews per media file monthly
) are aggregated cubes. The last one (Edits) is event-based.
Snapshot-based datasets
Snapshot-based datasets capture the state of things at a given point in time. In Commons Impact Metrics, the snapshot is taken at the end of the month in question. The metrics included in snapshot-based datasets are absolute counts since the beginning of time, up to the instant of the snapshot. For instance, the number of media files contained inside a Commons category. Note that such metrics can not be aggregated over time.
Category metrics snapshot
This dataset stores metrics about Commons categories. Each row corresponds to a category (for a given snapshot). It can be an allow-listed (primary) category or one of its subcategories. Only allow-listed (primary) categories report on "deep" metrics, all categories report on "shallow" metrics. The metric values (int) are
not
aggregatable. All queries to this table should always filter or breakdown by
category
and
year_month
Schema
Field
Type
Description
category
string
The name of the category this row refers to.
parent_categories
list
The immediate ancestor category names of this row's category.
primary_categories
list
The top ancestor category names of this row’s category.
media_file_count
int
The number of media files contained in this category.
media_file_count_deep
int
The number of media files contained in this category tree. Only available for primary allow-listed categories.
used_media_file_count
int
The number of media files from this category featured in at least one wiki page.
used_media_file_count_deep
int
The number of media files from this category tree featured in at least one wiki page. Only available for primary allow-listed categories.
leveraging_wiki_count
int
The number of wikis featuring at least one of this category’s media files.
leveraging_wiki_count_deep
int
The number of wikis featuring at least one of this category tree’s media files. Only available for primary allow-listed categories.
leveraging_page_count
int
The number of pages featuring at least one of this category’s media files.
leveraging_page_count_deep
int
The number of pages featuring at least one of this category tree’s media files. Only available for primary allow-listed categories.
year_month
string
The month after which the snapshot was taken (YYYY-MM).
Sample queries
-- Get shallow metrics about a given category.
SELECT
media_file_count
used_media_file_count
leveraging_wiki_count
leveraging_page_count
FROM
category_metrics_snapshot
WHERE
year_month
"2024-01"
AND
category
"My_category"
-- Get a time series that shows the usage of a category tree over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
year_month
used_media_file_count_deep
FROM
category_metrics_snapshot
WHERE
catgegory
"My_allow_listed_category"
GROUP
BY
year_month
ORDER
BY
year_month
ASC
-- Get the category graph for an allow-listed category tree,
-- represented as edges of the form (parent_category, child_category).
SELECT
EXPLODE
parent_categories
AS
parent_category
category
AS
child_category
FROM
category_metrics_snapshot
WHERE
year_month
"2024-01"
AND
ARRAY_CONTAINS
primary_categories
"My_allow_listed_category"
Media file metrics snapshot
This dataset stores metrics about Commons media files. Each row corresponds to a media file (for a given snapshot). To reduce the size of the data, media files that are not featured in any wiki page do not appear in this dataset. The metric values (int) are
not
aggregatable. All queries to this table should always filter or breakdown by
media_file
and
year_month
Schema
Field
Type
Description
media_file
string
The name of the media file this row refers to.
media_type
string
The media type of the media file.
categories
list
The immediate ancestor category names of this row's media file.
primary_categories
list
The top ancestor category names of this row’s media file.
leveraging_wiki_count
long
The number of wikis featuring this row's media file.
leveraging_page_count
long
The number of wiki pages (across all wikis) featuring this row's media file.
year_month
string
The month after which the snapshot was taken (YYYY-MM).
Sample queries
-- Get metrics about a given media file.
SELECT
leveraging_wiki_count
leveraging_page_count
FROM
media_file_metrics_snapshot
WHERE
year_month
"2024-01"
AND
media_file
"My_media_file.jpg"
-- Get a time series that shows the leverage of a media file over time.
-- Note we are not filtering by year_month, but we are breaking down by it.
SELECT
year_month
leveraging_page_count
FROM
media_file_metrics_snapshot
WHERE
media_file
"My_media_file.jpg"
GROUP
BY
year_month
ORDER
BY
year_month
ASC
Aggregated cubes
Aggregated cube datasets store metric aggregations broken down by a set of dimensions. Ideally, the metrics are additive, and you can slice and dice the dataset aggregating across all dimensions including time. In Commons Impact Metrics, the aggregation is done at a monthly granularity, for data size reasons. And the only metric is
pageview_count
(see its definition in the metrics table above).
Pageviews per category monthly
This dataset stores pageview counts for Commons categories, as explained in the metric definitions section above. The
pageview_counts
are dimensioned by
wiki
page_title
and
year_month
. You can aggregate
pageview_counts
only across those 3 dimensions. You can not aggregate across the
category
dimension. Pageviews to a wiki page can be attributed to more than one category, when that page contains media files from several categories. So, aggregating across the
category
dimension will produce duplicate counts. You can't either aggregate across the
category_scope
dimension, since the
pageview_count
metric has different meanings for
category_scope="shallow"
vs.
category_scope="deep" (read more in the metric definitions section above)
. So, you should always filter or breakdown by both
category
and
category_scope
. Only allow-listed (primary) categories report pageviews for
category_scope="deep"
. All categories report pageviews for
category_scope="shallow"
. Pageviews to a wiki's Main page are not counted. Rows with
pageview_count=0
are omitted.
Schema
Field
Type
Description
category
string
The name of the category this row refers to.
category_scope
string
The scope of the category; either "shallow" or "deep".
primary_categories
list
The top ancestor category names of this row’s media file.
wiki
string
The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title
string
The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here.
pageview_count
long
Aggregated pageview count.
year_month
string
The month for which we aggregate the data (YYYY-MM).
Sample queries
-- Get all-time aggregated pageview counts for a given category tree.
SELECT
SUM
pageview_count
AS
pageview_count
FROM
pageviews_per_category_monthly
WHERE
category
"My_primary_category"
AND
category_scope
"deep"
-- Get a time series that shows the pageviews of a category over time.
SELECT
year_month
pageview_count
FROM
pageviews_per_category_monthly
WHERE
category
"My_category"
AND
category_scope
"shallow"
GROUP
BY
year_month
ORDER
BY
year_month
ASC
-- Get a 2023 rank of the wiki pages with most pageviews
-- containing media files from a given category tree.
SELECT
wiki
page_title
SUM
pageview_count
AS
pageview_count
FROM
pageviews_per_category_monthly
WHERE
year_month
BETWEEN
"2023-01"
AND
"2023-12"
AND
category
"My_primary_category"
AND
category_scope
"deep"
GROUP
BY
wiki
page_title
ORDER
BY
pageview_count
DESC
LIMIT
100
Pageviews per media file monthly
This dataset stores pageview counts for Commons media files, as explained in the metric definitions section above. The
pageview_counts
are dimensioned by
wiki
page_title
and
year_month
. You can aggregate
pageview_counts
only across those 3 dimensions. You can not aggregate across the
media_file
dimension. Pageviews to a wiki page can be attributed to more than one media file, when that page contains several media files. So, aggregating across the
media_file
dimension will produce duplicate counts. So, you should always filter or breakdown by
media_file
. Pageviews to a wiki's Main page are not counted. Rows with
pageview_count=0
are omitted.
Schema
Field
Type
Description
media_file
string
The name of the media file this row refers to.
categories
list
The parent category names of this row's media file.
primary_categories
list
The top ancestor category names of this row’s media file.
wiki
string
The canonical name of the visualized wiki. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title
string
The title of the visualized wiki page. Only pages featuring at least one media file of the corresponding category will appear here.
pageview_count
long
Aggregated pageview count.
year_month
string
The month for which we aggregate the data (YYYY-MM).
Sample queries
-- Get all-time aggregated pageview counts for a given media file.
SELECT
SUM
pageview_count
AS
pageview_count
FROM
pageviews_per_media_file_monthly
WHERE
media_file
"My_media_file.jpg"
-- Get a time series that shows the pageviews of a media file over time.
SELECT
year_month
pageview_count
FROM
pageviews_per_media_file_monthly
WHERE
media_file
"My_media_file.jpg"
GROUP
BY
year_month
ORDER
BY
year_month
ASC
-- Get a 2023 rank of the wiki pages with most pageviews
-- containing a given media file.
SELECT
wiki
page_title
SUM
pageview_count
AS
pageview_count
FROM
pageviews_per_media_file_monthly
WHERE
year_month
BETWEEN
"2023-01"
AND
"2023-12"
AND
media_file
"My_media_file"
GROUP
BY
wiki
page_title
ORDER
BY
pageview_count
DESC
LIMIT
100
Event-based datasets
In event-based datasets each row represents an event, something that happened in the context of the dataset. Event-based datasets always have a field containing the timestamp of the event, with very fine granularity, i.e. milliseconds. Usually, they don't have any metric value.
Edits
This is an event-based dataset. Each row corresponds to an edit event performed on a Commons media file belonging to an allow-listed category tree. The dataset does not contain any metric per se, but you can aggregate row counts across any set of dimensions.
Schema
Field
Type
Description
user_name
string
The user name of the user who performed the edit.
edit_type
string
Either “create” or “update”.
media_file
string
The name of the edited media file.
categories
list
The (parent) category names that the media file is directly associated with.
primary_categories
list
The top ancestor category names of this row’s media file.
dt
timestamp
The timestamp of the edit.
Sample queries
-- Get all-time aggregated edit counts for a given allow-listed category tree.
SELECT
SUM
edit_count
AS
edit_count
FROM
edits
WHERE
ARRAY_CONTAINS
primary_categories
"My_primary_category"
-- Get a time series that shows the edits to a media file over time.
SELECT
month
dt
AS
year_month
edit_count
FROM
edits
WHERE
media_file
"My_media_file.jpg"
GROUP
BY
month
dt
ORDER
BY
year_month
ASC
-- Get a 2023 rank of the user names with most edits
-- to media files belonging to a given category.
SELECT
user_name
SUM
edit_count
AS
edit_count
FROM
edits
WHERE
dt
BETWEEN
TO_TIMESTAMP
"2023-01"
AND
TO_TIMESTAMP
"2023-12-31 23:59:59.999"
AND
ARRAY_CONTAINS
categories
"My_category"
GROUP
BY
user_name
ORDER
BY
edit_count
DESC
LIMIT
100
Retrieved from "
Commons Impact Metrics/Data Model
Add topic
US