Phabricator T388579

Status: Accepted

Date: April 10, 2025 Amended: May 13, 2025

Problem statement

We aim to make a flexible system that allows our editors to be creative and implement complex tools on top of the standard data storage and chart rendering we provide. In particular, we want content authors to be able to procedurally modify or generate tabular datasets as a transformation pipeline on their way into the chart renderer.

Examples of transform modules that could be created and maintained by power users:

  • unit conversions
  • statistical processing (sums, averages, etc)
  • reformatting input data
  • combining multiple datasets (column or row appends)
  • extracting from large datasets (column or row subsets)
  • procedural generation of mathematical datasets
  • one-off or configurable mashups combining data from different table layouts
  • querying other data sources available to Lua such as Wikibase

The previous Graphs system allowed for tabular data to be loaded and modified extensively in both Lua modules for template usage and in JavaScript, but the JavaScript side was considered too dangerous to keep using as-is from a security standpoint and the Lua side had to export JSON manually to the renderer.

Decision outcome

We are implementing a transform layer in JsonConfig for Data: pages via an extension of the existing Lua interface, and exposing transform selection into the Data:*.chart chart format pages in Charts, so each individual chart may use its own sandboxed Lua code for transforming and modifying data.

Initial version will use a transform funciton reference and arguments encapsulated in the Data:*.chart JSON, invoked as normal with optional argument overrides passed with the arg: prefix:

{{#chart:Weekly average temperatures.chart
|data=Weekly temperatures in Los Angeles, California.tab
|arg:units=C
}}

<!-- Using per-invocation arguments -->
{{#chart:Weekly average temperatures.chart
|data=Weekly temperatures in Los Angeles, California.tab
|arg:units=F
}}

with a Data:Weekly average temperatures.chart JSON referencing the raw data and providing defaults for data transformations:

{
	"license": "CC0-1.0",
	"version": 1,
	"type": "bar",
	"xAxis": {
		"title": {
			"en": "Day"
		}
	},
	"yAxis": {
		"title": {
			"en": "Temperature"
		}
	},
	"source": "Sample weekly temperature dataset.tab",
	"transform": {
		"module": "Weekly average temperature chart",
		"function": "convert_temps",
		"args": {
			"units": "C"
		}
	}
}

The sample data set Data:Sample weekly temperature dataset.tab holds data in Celsius:

{
	"license": "CC0-1.0",
	"description": {
		"en": "Sample monthly temperature data"
	},
	"schema": {
		"fields": [
			{
				"name": "month",
				"type": "localized",
				"title": {
					"en": "Month"
				}
			},
			{
				"name": "low",
				"type": "number",
				"title": {
					"en": "Low"
				}
			},
			{
				"name": "high",
				"type": "number",
				"title": {
					"en": "High"
				}
			}
		]
	},
	"data": [
		[
			{
				"en": "January"
			},
			5,
			20
		],
		[
			{
				"en": "July"
			},
			15,
			30
		]
	]
}

and a Module:Weekly average temperature chart Lua module something like:

local p = {}

local function celsius_to_fahrenheit(val)
	return val * 1.8 + 32
end

--
-- input data:
-- * tabular JSON strcuture with 1 label and 2 temp rows stored in C
--
-- arguments:
-- * units: "F" or "C"
--
function p.convert_temps(tab, args)
	if args.units == "C" then
		-- Stored data is in Celsius
		return tab
	elseif args.units == "F" then
		-- Have to convert if asked for Fahrenheit
		for _, row in ipairs(tab.data) do
			-- first column is month
			row[2] = celsius_to_fahrenheit(row[2])
			row[3] = celsius_to_fahrenheit(row[3])
		end
		return tab
	else
		error("Units must be either 'C' or 'F'")
	end
end

return p

This will be exposed internally via JCSingleton::getContentLoader() with new transform() function provide the relevant transform setup as a JCTransform.

The JSON data of the Data: page is converted to a Lua table, as when calling explicitly mw.ext.data.get("Data page", "_"), and the transform function returns it with any modifications to be converted back to JSON on the PHP side before it goes out to the requester. This is a clean separation of the transform from the rendering output, and doesn't require any manual JSON formatting or string manipulation like Graphs did.

Note that as with other wiki resources like templates and images and the tabular data sets, Lua modules can be edited out from under you -- if a previously-working chart invocation becomes invalid it will start falling over gracefully to showing a rendering error. Conditions like a missing or invalid Lua Module: page, or a missing/invalid function must be handled gracefully in this way, as will execution errors in the Lua environment. These will become detectable through the rendering error category for maintenance bots etc to flag.

To be consistent with the centralization of the Data: pages themselves, Lua modules are loaded and executed in the context of the wiki that is the data store (Commons on Wikimedia production systems). This has several advantages over executing them on the client wiki side:

  • no need to duplicate modules used across multiple wikis/languages
  • Data: and Module: pages live on the same wiki, simplifying editing/cross-linking/permissions management
  • a shared repository encourages sharing of common code and localization among power users editing the transform functions
  • compared to a remote fetch and local execution, encapsulating transforms on the centralized wiki means that the Lua functions are free to use shared code and data in additional Module: pages

Internally, when JCContentLoader->load() with a transform option is called, client wikis will make action=jsontransform API requests to the store wiki asking for a specific Data: page to load, a Lua Module: page to load from, a function name to execute, and a set of named arguments to pass in.

The action=jsontransform request will include a jtmodule, jtfunction, and jtargs parameters, where jtmodule and jtfunction are individual strings and args is a multi-string set allowing duplicates, meant to contain key=value pairs.

{
	jsontranslate: {
		data: {
			// tabular dataset JSON structure
			// ...
		},

		// TTL in seconds, may be shorter if certain functions are called
		expires: 86400,

		// Entries to store into `globaljsonlinks` tables for cache invalidation
		dependencies: [
			// Module:Weekly_average_temperature_chart with the Lua code
			{ "namespace": 828, "title": "Weekly_average_temperature_chart" }
			// Other modules, and other Data-namespace pages, may appear here
			// if they are referenced suitably. Note that namespaces are given
			// as the raw internal numbers here for the tracking table, as it
			// may be non-obvious where the remote namespaces come from.
		]
	}
}

The Lua Module: page itself, and any other pages listed as touched in the parser output's templatelinks, will be recorded and passed back as dependencies for the client wiki to record in globaljsonlinks for dependency tracking. This will require slight code changes in JsonConfig to fully support Module: and other namespaces, but the schema is already prepared for it.

NOTA BENE: as with images loaded from Commons, chart format and tabular Data: pages, and any Module: pages loaded for Lua transforms are not automatically protected when a remote page uses them, and it's possible for an unprotected dependency to be edited on Commons, introducing errors, breakage, or vandalism to an otherwise-protected page. It is recommended that we make it easier to expose which dependency pages a given chart instance uses so they can be easily found and fixed/protected if necessary. This will be followed up on with a separate ADR proposal.

Current and expiration timestamps are included for output cache management and are passed on to the consuming ParserOutput. Note that there is currently no caching of the transform itself, which is run fresh on each parse. This may be an opportunity for fragment caching in the future.

Internally JCContentLoader will handle the remote fetches, or the local executions through a JCTransform wrapper for the Lua execution environment.

(Remote transforms could be exposed to Lua Module: code as well in mw.ext.data.get() as an optional parameter, if it would be useful to expose centralized transform code to client wikis' modules as well. However this has not yet been specced and needs to be guarded against recursion if implemented.)

Expensive function counts and dependencies should be transferred from the execution environment to the calling Parser if possible, however only dependencies are a hard requirement (the execution itself counts as an expensive function).

Decision drivers

Lua execution is already well-sandboxed in MediaWiki's Scribunto extension, and we already have a way of exporting tabular data pages as table structures into Lua modules used for templates.

Adding a second place where Lua functions can be called doesn't significantly change our security surface; while this allows for arbitrary functions and parameters to be passed into any load, from a security perspective the API action (extension of action=jsontransform) is roughly equivalent to the API allowing aribtrary wikitext parse requests, where they can pass in the text {{#invoke:ModuleName|funcName|arg1|arg2}}. The system will work through a Parser object and maintains all the same performance counters as invocations through parsing, so sandboxing and rate limiting should be able to apply cleanly.

(Note that null values in Data: JSON do not transfer very cleanly to Lua right now, as Lua does not allow for storing nil in tables and this can cause null cells in a tabular data set to crop off the row early. There are ways around this if we find it necessary to support null values explicitly, such as replacing it with a sigil value or empty table. This is not expected to be a major problem to work around if we do find we need it.)

Other options considered

JavaScript-layer transforms were not seriously considered due to the difficulties of sandboxing arbitrary JavaScript in either browser or node contexts.

We considered a set of fixed filter functions which could be referenced by name and composed, but this did not seem flexible enough at best, and at worst would be creating our own domain-specific language.

We considered making transform selection totally selectable at the {{#chart:}} invocation time as well, but there seems to be consensus that it's better to make the module/function selection in the chart JSON and put parameter overrides at the invocation.

By using the existing facilities in the MediaWiki family via Scribunto/Lua we can keep transform execution centralized and safe.