Instrument Configurator

Design Document

Authors: Sam Smith, Clare Ming

Date authored: 2024/02/05

In Instrument Configurator – Where Do We Build the Instrument Configurator?, we decided to build the instrument configuration management system (herein “instrument configurator”) as a publicly available, standalone app running in the dse-k8s cluster, together with a MediaWiki extension that adapts and delivers the output of the app. In this document, we describe the design of the instrument configurator in as much detail as possible.

In the context of this document and supporting documents, an “instrument” is a module that records collected user interaction data for the purpose of understanding how users are interacting with one or more features.

Instruments may also be referred to as “analytics instruments” or “user analytics instruments.”

The configurator will:

  1. Be publicly available at https://mpic.wikimedia.org
  2. Enable users to manage the configuration of an instrument, including enabling or disabling it manually
  3. Enable non-technical users to manage instruments across all Wikimedia-hosted sites (which is possible because the configurator is centralized), including but not limited to: the Wikipedias, the Beta Cluster, and the Wikimedia Portals
  4. Enable users to view the history of the configuration of an instrument
  5. Integrate with existing logging systems in order to maintain the current level of observability for instrument configuration changes, e.g. updating the SAL
  6. Allow superusers to disable one, many, or all instruments with one button
  7. Be performant

Image 1: The MetricsPlatform extension fetches instrument configuration from the app and delivers it to the client libraries running in the browser as a ResourceLoader config bundle. Source (Miro)

The backend will be built in Node.js 20. It will use the following projects:

The frontend will be built using Codex in order to provide an experience consistent with other Wikimedia-hosted apps.

All routes require authentication and authorization except where noted otherwise.

We propose authenticating and authorizing users using OpenID Connect, implemented in openid-client, and CAS-SSO as the OpenID Connect Issuer. Because the app will not make API requests to any third parties, we propose implementing the Authorization Code Flow and storing the user identity, session ID, and an HMAC in an httpOnly session cookie (herein “the session cookie”).
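
As a minimal sketch of the session cookie described above, the following shows one way to bind the user identity and session ID to an HMAC using Node's built-in crypto module. The cookie name, the payload encoding, the secret handling, and the Secure and SameSite attributes are assumptions made for illustration; the document only specifies the cookie's contents and that it is httpOnly.

```javascript
const crypto = require('node:crypto');

// Hypothetical secret; a real deployment would load this from the app's configuration.
const SESSION_SECRET = 'replace-with-a-long-random-secret';

// Serialize the user identity and session ID, then append an HMAC-SHA256 tag
// so the server can detect tampering when the cookie is sent back.
function encodeSessionCookie(user, sessionId, secret = SESSION_SECRET) {
  const payload = Buffer.from(JSON.stringify({ user, sessionId })).toString('base64url');
  const mac = crypto.createHmac('sha256', secret).update(payload).digest('base64url');
  return `${payload}.${mac}`;
}

// Returns the decoded session, or null if the value is malformed or the HMAC does not match.
function decodeSessionCookie(value, secret = SESSION_SECRET) {
  const [payload, mac] = String(value).split('.');
  if (!payload || !mac) return null;
  const expected = crypto.createHmac('sha256', secret).update(payload).digest();
  const actual = Buffer.from(mac, 'base64url');
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    return null;
  }
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}

// Set-Cookie value. HttpOnly comes from the text; the other attributes are assumed.
function sessionCookieHeader(value) {
  return `session=${value}; HttpOnly; Secure; SameSite=Lax; Path=/`;
}
```

Comparing the tags with crypto.timingSafeEqual rather than === avoids leaking how many leading bytes of a forged tag were correct.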

GET /

Redirects to GET /instruments.

GET /instruments

Shows the list of all instruments as an HTML table.

See "catalog" in Figma for the proposed design.

GET /instrument/$slug

Sanitize $slug. Fetch the details of the instrument with slug=$slug. If there is no instrument, then return an HTTP 404 Not Found response. Otherwise, show the details of an instrument.
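
The steps above can be sketched as follows. The slug rule (lowercase letters and digits separated by single hyphens, at most 255 characters) is an assumption, since the document does not define what a valid slug is, and the Map stands in for the database lookup.

```javascript
// Hypothetical slug rule: lowercase letters and digits separated by single hyphens.
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Returns the normalized slug, or null if the input cannot be a valid slug.
function sanitizeSlug(raw) {
  const slug = String(raw).trim().toLowerCase();
  return slug.length <= 255 && SLUG_PATTERN.test(slug) ? slug : null;
}

// `store` is an in-memory stand-in (a Map keyed by slug) for the database query.
function getInstrument(rawSlug, store) {
  const slug = sanitizeSlug(rawSlug);
  const instrument = slug === null ? undefined : store.get(slug);
  if (instrument === undefined) {
    return { status: 404, body: 'Not Found' };
  }
  return { status: 200, body: instrument };
}
```

Treating an unsanitizable slug the same as a missing instrument keeps the route to a single failure mode (404), which matches the behaviour the text describes.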

See "new instrument – modify" in Figma for the proposed design.

PUT /instrument/$slug

Sanitize $slug. Fetch the details of the instrument with slug=$slug. If there is no instrument, then return an HTTP 404 Not Found response. Validate the request body. If the request body is not a valid instrument configuration, then return an HTTP 400 Bad Request response. Otherwise, merge the request body with the instrument configuration and write the result to the instruments table (see the database schema below).
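
A rough sketch of the validate-then-merge step, assuming a shallow merge in which fields in the request body replace the stored values. The validation rules shown here (a sample_rate object with a default between 0 and 1, a name of at most 255 characters) are illustrative assumptions rather than the app's actual rules.

```javascript
// Returns a list of validation errors for a (partial) instrument configuration.
// The rules are illustrative assumptions, not the app's actual validation.
function validateInstrumentUpdate(body) {
  if (body === null || typeof body !== 'object' || Array.isArray(body)) {
    return ['body must be a JSON object'];
  }
  const errors = [];
  if ('sample_rate' in body) {
    const rates = body.sample_rate;
    const isRate = (r) => typeof r === 'number' && r >= 0 && r <= 1;
    if (rates === null || typeof rates !== 'object' || !isRate(rates.default)) {
      errors.push('sample_rate.default must be a number between 0 and 1');
    }
  }
  if ('name' in body) {
    const name = body.name;
    if (typeof name !== 'string' || name.length === 0 || name.length > 255) {
      errors.push('name must be a non-empty string of at most 255 characters');
    }
  }
  return errors;
}

// `existing` is the stored instrument (or undefined if the lookup found nothing).
function putInstrument(existing, body) {
  if (!existing) return { status: 404 };
  const errors = validateInstrumentUpdate(body);
  if (errors.length > 0) return { status: 400, errors };
  // Shallow merge: fields in the request body replace the stored values.
  return { status: 200, instrument: { ...existing, ...body } };
}
```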

Shows the HTML form to create an instrument (herein “the form”). Since the form is state-changing, a fresh double-submit cookie is sent with the response.

See "new instrument – launch" in Figma for the proposed design.

GET /modify/$slug

Shows the HTML form to modify an instrument. Since the form is state-changing, a fresh double-submit cookie is sent with the response.

See "new instrument – modify" in Figma for the proposed design.

Validate the double-submit cookie. If the double-submit cookie is invalid, then clear the session cookie and redirect to GET /. Otherwise, continue processing the request.

Validate the form. If the form is invalid, then show the form with validation errors, send a fresh double-submit cookie with the response (since the form is state-changing), and stop processing. Otherwise, continue processing the request.

Insert the corresponding row into the instruments table (see the database schema below). If the insertion fails, then show the form with an error message, send a fresh double-submit cookie with the response (since the form is state-changing), and stop processing. Otherwise, continue processing the request.

Add an entry to the SAL for the current day. If adding an entry fails, then log the failure and continue processing the request.

Redirect to GET /instrument/$slug.
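
The request-processing steps above can be sketched as a single function. This is a simplified illustration, not the app's implementation: the double-submit check compares a token sent as a cookie with the same token submitted in the form, and validateForm, insertInstrument, logToSal, and logger are hypothetical collaborators passed in by the caller.

```javascript
const crypto = require('node:crypto');

// Issue a fresh random token. The same value is sent as a cookie and
// embedded in the form as a hidden field.
function issueCsrfToken() {
  return crypto.randomBytes(32).toString('hex');
}

// The request is accepted only if both values are present and identical.
function isValidDoubleSubmit(cookieToken, formToken) {
  if (typeof cookieToken !== 'string' || typeof formToken !== 'string') return false;
  const a = Buffer.from(cookieToken);
  const b = Buffer.from(formToken);
  return a.length > 0 && a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Mirrors the order of the steps: cookie check, form validation, insert, SAL entry, redirect.
function handleCreate(req, deps) {
  if (!isValidDoubleSubmit(req.cookieToken, req.formToken)) {
    return { clearSession: true, redirect: '/' };
  }
  const errors = deps.validateForm(req.form);
  if (errors.length > 0) {
    return { render: 'form', errors, csrfToken: issueCsrfToken() };
  }
  try {
    deps.insertInstrument(req.form);
  } catch {
    return { render: 'form', errors: ['Could not save the instrument.'], csrfToken: issueCsrfToken() };
  }
  try {
    deps.logToSal(`Created instrument ${req.form.slug}`);
  } catch (err) {
    deps.logger(err); // A SAL failure is logged but does not fail the request.
  }
  return { redirect: `/instrument/${req.form.slug}` };
}
```

Note that only the SAL step is allowed to fail without aborting the request; every earlier failure returns a response immediately.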

DELETE /instrument/$id

Validate the double-submit cookie. If the double-submit cookie is invalid, then clear the session cookie and redirect to GET /. Otherwise, continue processing the request.

Remove the instrument whose id is given by the $id path variable.

PATCH /instrument/$id

Validate the double-submit cookie. If the double-submit cookie is invalid, then clear the session cookie and redirect to GET /. Otherwise, continue processing the request.

Toggle the instrument whose id is given by the $id path variable: enable it if it is currently disabled, and disable it if it is currently enabled.

GET /api/v1/instruments

Shows the list of all instruments as a JSON-encoded list.

This route does not require authentication or authorization.

The instrument configuration includes the default sampling rate and deviations from the default by wiki. Note well that dblists must be expanded by the app because there is no facility to expand dblists in a MediaWiki extension. This could increase the size of the response considerably.

For example, consider the following two representations for an instrument with a default sampling rate of 10% and a sampling rate of 1% for group2 wikis:

{
  "sampling_unit": "session",
  "sampling_rate": {
    "default": 0.1,
    // group2 -> 0.01
    "aawiki": 0.01,
    "abwiki": 0.01,
    "acewiki": 0.01
    // …
  }
}

{
  "sampling_unit": "session",
  "sampling_rate": {
    "default": 0.1,
    // group2 -> 0.01
    "0.01": [
      "aawiki",
      "abwiki",
      "acewiki"
      // …
    ]
  }
}

The group2 dblist contains 333 wikis, with an average name length of 8 bytes. Per Instrument Configurator – Design Document, sampling rates are 4 bytes. Including the required JSON formatting, the predicted size of the sampling_rate value in each of the representations is:

Representation                        Predicted size
First (one key per wiki)              5344 bytes
Second (wikis grouped by rate)        3688 bytes (~31% smaller)

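
The predicted sizes can be reproduced with the following arithmetic. This is one accounting that happens to match the figures, not necessarily the authors' own: it assumes minified JSON (no whitespace or comments), every sampling rate serialized in 4 bytes, and every wiki name exactly 8 bytes long.

```javascript
// Assumptions taken from the text: 333 wikis, 8-byte names on average, 4-byte sampling rates.
const WIKIS = 333;
const NAME_BYTES = 8;
const RATE_BYTES = 4;

const quoted = (bytes) => bytes + 2; // surrounding double quotes
const defaultEntry = quoted('default'.length) + 1 + RATE_BYTES; // "default":<rate>

// First representation: {"default":R,"<wiki>":R,...}
const perWikiEntry = quoted(NAME_BYTES) + 1 + RATE_BYTES; // "<wiki>":<rate>
const flatSize = 2 + defaultEntry + WIKIS * perWikiEntry + WIKIS; // braces, entries, commas

// Second representation: {"default":R,"<rate>":["<wiki>",...]}
const groupedSize =
  2 + defaultEntry + 1 +                     // braces, default entry, comma
  quoted(RATE_BYTES) + 1 + 2 +               // "<rate>" key, colon, brackets
  WIKIS * quoted(NAME_BYTES) + (WIKIS - 1);  // quoted names and the commas between them

const saving = 1 - groupedSize / flatSize;
console.log(flatSize, groupedSize, `${(saving * 100).toFixed(0)}%`); // 5344 3688 31%
```
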
GET /api/v1/experiments

Shows the list of all A/B tests as a JSON-encoded list. All of the considerations described above for the GET /api/v1/instruments route also apply to this route.

dump_instrument_configurations

There is a non-zero chance that the cluster the app is running in, or the Data Platform PostgreSQL cluster, could fail catastrophically.

The dump_instrument_configurations script will export the current instrument configurations to a format compatible with the static configuration scripts in operations/mediawiki-config. The script will be run weekly and its output will be captured in a text file in the Metrics Platform shared GDrive.

PHP (7.4.33)

N/A

The EventLogging MediaWiki extension (herein “EL”) will remain responsible for delivering stream configurations for instruments as a ResourceLoader config bundle. EL gets the stream configurations from the EventStreamConfigs MediaWiki extension (herein “ESC”).

We will update ESC to be able to fetch stream configurations from other sources via a hook. ESC will be responsible for merging the stream configurations.

The Metrics Platform MediaWiki extension will implement a hook handler for that hook. The hook handler will fetch the instrument configuration from the app (via the GET /api/v1/instruments route), adapt the response to match the expected format, and return the result.

The instrument configuration will be fetched from the WAN cache, the app, and then the default configuration in order. At any stage, if the lookup succeeds, then we do not move to the next source. Fetching should last no longer than 250 ms.

  1. WAN cache

If the cache lookup hits, then we verify the cached instrument configuration with the integrity hash cached with it. If the cached instrument configuration is verified, then the lookup is successful.

We expect a cache lookup to take no longer than 10 ms.

We propose a TTL of 1 minute. Ignoring all other caches, this would correspond to an average of 1440 requests per day per DC from the extension to the app.

We propose a stale TTL of 1 day.

We expect the size of the cached value to be no larger than O(1 MB). For example, the size of the JSON-encoded stream configuration for the android.product_metrics.article_link_preview_interaction stream is 1028 bytes. We expect fewer than 100 instruments to be configured at any one time, which would require ~100 kB to store.

  2. The app

We first generate a random number between 0 and 1. If the number is below a configurable threshold, $wgMetricsPlatformAppFetchProbability, then the instrument configuration will be fetched from GET /api/v1/instruments.

The timeout of the HTTP request will be configurable with a default value of 240 (ms), i.e. $wgMetricsPlatformAppFetchTimeout = 240; // (ms).

  3. The default configuration, which disables all instruments
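
The lookup order above can be sketched as follows. The sketch is written in JavaScript purely for illustration, even though the extension itself is PHP, and it is synchronous for brevity. The cache entry shape ({ json, hash }), the use of SHA-256 as the integrity hash, the write-back to the cache after a successful fetch, and the shape of the default configuration are all assumptions.

```javascript
const crypto = require('node:crypto');

const sha256 = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Fallback that enables no instruments (the shape is an assumption).
const DEFAULT_CONFIG = Object.freeze({ instruments: [] });

// `cache.get()` returns { json, hash } or undefined; `fetchFromApp(timeoutMs)` returns a
// JSON string or throws. `random` is injectable so the probabilistic step is testable.
function loadInstrumentConfig({ cache, fetchFromApp, fetchProbability, timeoutMs = 240, random = Math.random }) {
  // 1. WAN cache: accept the entry only if it matches its integrity hash.
  const cached = cache.get();
  if (cached && sha256(cached.json) === cached.hash) {
    return { source: 'cache', config: JSON.parse(cached.json) };
  }
  // 2. The app: only a configurable fraction of lookups is allowed through.
  if (random() < fetchProbability) {
    try {
      const json = fetchFromApp(timeoutMs);
      const config = JSON.parse(json);
      cache.set({ json, hash: sha256(json) });
      return { source: 'app', config };
    } catch {
      // Timeouts, HTTP errors, and malformed JSON all fall through to the default.
    }
  }
  // 3. The default configuration.
  return { source: 'default', config: DEFAULT_CONFIG };
}
```

Gating the app fetch on a random draw means that, after a cache miss, only a fraction of concurrent requests hit the app, which limits the load a cold cache can generate.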

ESC will be responsible for merging stream configurations from different sources. We propose that ESC:

  1. Does not deep-merge multiple stream configurations; and
  2. Gives static configuration highest priority
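
To illustrate the two proposed merge rules, here is a minimal sketch that assumes stream configurations are objects keyed by stream name. The function name and the example stream names are hypothetical, and ESC itself is a PHP extension; JavaScript is used here only for illustration.

```javascript
// Later sources override earlier ones per stream name, and the static configuration
// is applied last so that it always wins. Object.assign copies top-level keys only,
// so a stream defined in two sources is replaced wholesale rather than deep-merged.
function mergeStreamConfigs(dynamicSources, staticConfig) {
  const merged = {};
  for (const source of dynamicSources) {
    Object.assign(merged, source);
  }
  return Object.assign(merged, staticConfig);
}

const fromApp = {
  'example.stream_a': { sample: { unit: 'session', rate: 0.5 }, producers: {} },
  'example.stream_b': { sample: { unit: 'session', rate: 0.1 } },
};
const fromStaticConfig = {
  'example.stream_a': { sample: { unit: 'session', rate: 1 } },
};

// stream_a comes entirely from the static configuration (note that `producers` is gone);
// stream_b is only defined by the app, so it is kept as-is.
const result = mergeStreamConfigs([fromApp], fromStaticConfig);
```

Not deep-merging keeps the outcome easy to predict: for any given stream, exactly one source is responsible for its whole configuration.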

The following database schema will allow us to build the configurator UI mocks and satisfy the JSONSchema schema for the client library config:

Table: instruments

Field name                 Field type
id                         int unsigned primary key auto_increment
name                       varchar(255) not null
slug                       varchar(255) not null
description                text
creator                    varchar(255) not null
owner                      json default null
purpose                    json default null
created_at                 datetime default current_timestamp not null
updated_at                 datetime default current_timestamp not null
utc_start_dt               datetime not null
utc_end_dt                 datetime not null
task                       varchar(1000) not null
compliance_requirements    set('legal', 'gdpr') not null
sample_unit                varchar(255) not null
sample_rate                json not null
environments               set('development', 'staging', 'production', 'external') not null
security_legal_review      varchar(1000) not null
status                     boolean default false
was_activated              boolean default false
stream_name                varchar(255)
schema_title               varchar(255)
schema_type                varchar(255)
email_address              varchar(255)
type                       varchar(255)
features                   json default null

Table: contextual_attributes

Field name                 Field type
id                         int unsigned primary key auto_increment
contextual_attribute_name  varchar(255) not null

Table: instrument_contextual_attribute_lookup

Field name                 Field type
id                         int unsigned primary key auto_increment
instrument_id              int unsigned not null
contextual_attribute_id    int unsigned not null

Database Traffic Estimates

We estimate the write frequency for this database to be < 24 writes per day. We estimate the read frequency for this database to be < 1000 reads per day.

ResourceLoader modules are cached in our CDN for a minimum of 5 minutes and a maximum of 30 days. 5 minutes here comes from the cache lifetime of the ResourceLoader module manifest, which browsers use to determine which versions of which module need fetching.

As detailed in Instrument Configurator – Design Document, we are bundling the instrument config inside a ResourceLoader module. Thus we will only request the instrument config approximately once within any 5 minute window (or roughly 288 reqs/day). If we use a DC-local cache in the extension with a TTL of 1 minute, then, ignoring the CDN, we will only request the instrument config approximately once a minute (or roughly 1440 reqs/day).
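
The request rates quoted above follow directly from the refresh intervals, assuming at most one request per interval:

```javascript
const MINUTES_PER_DAY = 24 * 60;

// At most one request per refresh interval.
const requestsPerDay = (intervalMinutes) => MINUTES_PER_DAY / intervalMinutes;

console.log(requestsPerDay(5)); // 288  (CDN: 5-minute module manifest lifetime)
console.log(requestsPerDay(1)); // 1440 (DC-local cache: 1-minute TTL)
```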

N/A

  1. https://wikitech.wikimedia.org/wiki/Metrics_Platform
  2. MP Intro
  3. Metrics Platform FAQs
  4. Metrics Platform Stream Configuration Syntax
  5. Configurator UI Mocks
  6. Where Do We Build the Instrument Configurator
  7. T331514: [Goal] M1: Metrics Platform: Control Plane: Analytics instrumentation stream management UI
Feb 20, 2025
  • variants field has been renamed to features
Jan 8, 2025
Updated data model:
  • created_at and updated_at are now datetime
  • start_date and end_date have been renamed to utc_start_dt and utc_end_dt, respectively; both are now datetime
  • Added was_activated as a new boolean column
Nov 22, 2024
Updated data model:
  • Some fields are now optional (stream_name, schema_title, schema_type)
  • Some fields changed their data type to support multiple values (owner, purpose)
  • New field added to support A/B tests: variants
  • Added new endpoint: GET /api/v1/experiments
Aug 14, 2024
  • Updated used technologies
  • Added new routes: DELETE, PATCH
  • Changed data types for some columns according to the new database we use (MariaDB)
  • Added new fields to the instruments table: duration_amount, duration_description, status, stream_name, schema_title, schema_type, email_address and type
  • Fixed data types to the database schema (finally we are using MariaDB instead of PostgreSQL)
  • Added new tables: contextual_attributes and instrument_contextual_attribute_lookup
  • Removed unused table: instrument_sample_rates
Apr 19, 2024
Updated data model to add new fields and edit existing ones:
  • “task” - varchar(1000) not null
  • “security_legal_review” - varchar(1000) not null
  • “status” - char(20) not null

Mar 26, 2024
  • Addressed questions raised during the review from Service Ops, which are summarized here.
Feb 22, 2024
  • Update to reflect conversation with Andrew Otto about updating the EventStreamConfigs extension to allow for dynamic stream configuration.
  • Update to include steps to update the SAL
Feb 21, 2024
  • Initial draft complete.