Reference Check A/B Test Report
This R notebook analyzes the Reference Check A/B test using the data collected in collect_enwiki_refcheck_ab_test_data.ipynb. It is structured to be rerunnable for future experiments by updating the config cell.
Data collection: collect_enwiki_refcheck_ab_test_data.ipynb (outputs TSVs in data//).
Modular reruns: update experiment_name, buckets, wiki_list, and output_dir; the TSV loaders automatically point to data// from the collection notebook.
This report can be rendered via Quarto (front matter above), similar to other edit-check A/B reports.
The A/B test has one key performance indicator with two parts, two optional curiosities to explore if time allows, and four guardrails, the last of which contains two sub-parts. Mapping the metrics to the decision points: KPI 1, KPI 2, Guardrail 1, and Guardrail 2 are the metrics required to enable decisions, so this report focuses on those; we will share outputs on Guardrail 4 and the curiosities if time allows. This report does not cover blocks.
Summary
TL;DR: When Reference Check was shown, edits were much more likely to add a new reference, edits were more often constructive, and reverts declined, with a slight reduction in edit completion; these patterns are especially evident on mobile web.
Brief Summary: When Reference Check is shown, edits are significantly more likely to add a reference, especially on mobile web. Edits shown Reference Check are directionally more constructive and less likely to be reverted within 48 hours, with the strongest and most consistent improvements on mobile web. While Reference Check slightly reduces edit completion, the decrease is modest.
References added: Edits were more likely to add at least one reference when Reference Check was shown, with very large gains on mobile web and clear gains on desktop.
Constructive edits: Edits were more likely to be constructive (not reverted within 48 hours), with the strongest improvement on mobile web.
Reverts: Revert rates declined overall, with the largest reduction on mobile web.
Edit completion: Completion rates decreased slightly across platforms.
More references, fewer reverts, and improved constructiveness on mobile demonstrate the benefits of Reference Check on English Wikipedia and outweigh the observed reduction in completion.
Key Results
2.0.1 References Added or Acknowledged (KPI #1)
High-level: When Reference Check was shown, edits were far more likely to add a reference or to acknowledge/explain why they did not.
Why KPI #1b: Direct test–control comparisons for KPI #1 are hard to interpret because the “Decline” option exists only in test. We therefore focus on KPI #1b, which removes this imbalance.
KPI #1b — Reference Added (Shown / Eligible):
Desktop: ~2.2× more likely (30.7% → 68.2%)
Mobile web: ~17.5× more likely (2.8% → 48.9%)
Interpretation: The increase in reference inclusion is large and statistically significant across models and simpler comparisons.
KPI #1b — Reference Added (Availability / ITT):
Overall: 56.3% → 68.3% (+12.1 pp, +21.5%)
Desktop: 60.5% → 70.6% (+10.2 pp, +16.8%)
Mobile web: ~2.2× more likely (22.0% → 47.8%)
Interpretation: Even under a conservative ITT view, Reference Check increases the likelihood that constructive new-content edits include a reference, with the strongest lift on mobile web.
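The headline multipliers above can be sanity-checked directly from the quoted rates. A quick console check (the published pp/relative figures are rounded from unrounded data, so the last decimal may differ):

```r
# Quick arithmetic check of the KPI #1b lifts quoted above
rel_change <- function(control, test) (test - control) / control

round(0.682 / 0.307, 1)             # desktop multiplier (Shown / Eligible): 2.2
round(0.489 / 0.028, 1)             # mobile web multiplier (Shown / Eligible): 17.5
round((0.683 - 0.563) * 100, 1)     # ITT absolute difference in percentage points
round(rel_change(0.563, 0.683), 3)  # ITT relative change vs control
```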
2.0.2 Constructive Edits (Not Reverted Within 48 Hours) (KPI #2)
Desktop: 75.5% → 77.9% (+3.2% relative; the overall adjusted regression does not find a statistically significant across-platform effect)
Mobile web: 56.4% → 66.7% (+18.2% relative; within-platform contrast statistically significant)
Interpretation: Results consistently favor the test group, with the clearest improvements on mobile web. While cross-platform differences are not definitive, the pattern suggests larger gains on mobile. When adjusting for reference inclusion, the mobile effect attenuates, indicating that part of the benefit operates through added references.
2.0.3 Revert Rate Within 48 Hours (Lower Is Better) (Guardrail #1)
Overall: −14.5% relative (28.2% → 24.1%)
Desktop: −9.8% relative (24.5% → 22.1%)
Mobile web: −23.6% relative (43.6% → 33.3%)
Interpretation: Edits shown Reference Check were less likely to be reverted across analyses, with the strongest and most reliable reduction on mobile web. Although cross-platform differences are not statistically definitive, within-mobile contrasts and relax-based analyses show clear reductions. Edits that added a new reference were much less likely to be reverted, supporting the quality mechanism.
2.0.4 Edit Completion (SaveIntent → SaveSuccess) (Guardrail #2)
Overall: −4.8% relative (88.3% → 84.1%)
Desktop: −6.8% relative (94.0% → 87.6%)
Mobile web: −6.3% relative (74.1% → 69.4%)
Interpretation: Reference Check introduces measurable friction that reduces completion rates, but this trade-off coincides with higher-quality outcomes: increased reference inclusion, fewer reverts, and improved constructiveness on mobile web.
Overview
The Wikimedia Foundation’s Editing team is working on a set of improvements for the visual editor to help new volunteers understand and follow some of the policies necessary to make constructive changes to Wikipedia projects.
This work is guided by the Wikimedia Foundation Annual Plan, specifically by the Wiki Experiences 1.1 objective key result: Increase the rate at which editors with ≤100 cumulative edits publish constructive edits on mobile web by 4%, as measured by controlled experiments (by the end of Q2).
In this A/B test, the Editing team is evaluating the impact of Reference Check. Reference Check invites users who have added more than 50 new characters to an article-namespace page to include a reference in the edit they’re making, if they have not already done so, at the time they indicate their intent to save. More information about the tool’s features and project updates is available on the project page.
English Wikipedia Reference Check KPI hypothesis: the number of constructive edits newcomers publish will increase because a greater percentage of edits that add new content will include a reference or an explicit acknowledgement of why these edits lack references. KPI metrics for evaluation (T333714): 1) the proportion of published edits that add new content and include a reference or an explicit acknowledgement of why a citation was not added; 2) the proportion of published edits that add new content and are constructive (read: NOT reverted within 48 hours).
According to the Edit Check Reference-Check A/B test report, when Reference Check was shown, edits were 2.2× more likely to include a new reference and be constructive (i.e. not reverted within 48 hours) than otherwise. The English Wikipedia Reference Check A/B test examines how its numbers compare to this 2024 finding.
Code
# Load packages
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
  library(lubridate)
  library(ggplot2)
  library(dplyr)
  library(gt)
  library(IRdisplay)
  library(tidyr)
  library(relax)
  library(tibble)
  library(lme4)
  library(broom)
  library(broom.mixed)
  library(broom.helpers)
  # NOTE: We intentionally do NOT attach brms here.
  # In some environments, brms/rstan can fail to load due to binary toolchain issues.
  # The notebook runs brms models on a best-effort basis via safe_brm() and will skip if unavailable.
  set.seed(42)  # seed value lost in the source export; reconstructed
})

# Preferences
options(dplyr.summarise.inform = FALSE)

# Reduce default plot size (tablet-friendly)
# (width value lost in the source export; reconstructed)
options(repr.plot.width = 8, repr.plot.height = 5.5)

# Colorblind-friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Configuration (edit for reruns)
experiment_name <- "reference_check_ab_test_2025"
experiment_bucket_test <- "2025-09-editcheck-addReference-test"
experiment_bucket_control <- "2025-09-editcheck-addReference-control"
experiment_buckets <- c(experiment_bucket_test, experiment_bucket_control)
wiki_list <- c("enwiki")

# Bucket label normalization (collapse long labels to control/test)
# NOTE: bucket_map is set for 2025 RC; update for other experiments or set via env.
bucket_map <- c(
  "2025-09-editcheck-addReference-control" = "control",
  "2025-09-editcheck-addReference-test" = "test"
)

normalize_buckets <- function(df) {
  if (is.null(df) || !"test_group" %in% names(df)) return(df)
  df %>%
    mutate(test_group = {
      tg <- trimws(as.character(test_group))
      tg <- dplyr::case_when(
        tg == experiment_bucket_control ~ "control",
        tg == experiment_bucket_test ~ "test",
        TRUE ~ tg
      )
      tg <- recode(tg, !!!bucket_map, .default = tg)
      tg <- ifelse(grepl("addreference-control", tg, ignore.case = TRUE), "control",
            ifelse(grepl("addreference-test", tg, ignore.case = TRUE), "test", tg))
      tg
    })
}

normalize_platforms <- function(df) {
  if (is.null(df) || !"platform" %in% names(df)) return(df)
  df %>%
    mutate(platform = {
      pf <- trimws(as.character(pf <- as.character(platform)))
      # target label reconstructed; the exact grouping was lost in the source export
      pf <- dplyr::case_when(
        tolower(pf) %in% c("phone", "mobile") ~ "mobile-web",
        TRUE ~ pf
      )
      pf
    })
}

# Apply normalization after load too (in case downstream merges introduce new labels)
renorm_buckets <- function(df) normalize_platforms(normalize_buckets(df))

# Construct analysis groups per updated methodology
# - For KPI1/KPI2/Guardrail1: test = rows where RC was shown at least once; control = eligible-but-not-shown
make_rc_ab_group_published <- function(df) {
  if (is.null(df)) return(df)
  df <- df %>% apply_aliases() %>% renorm_buckets()
  need <- c("test_group", "was_reference_check_shown", "was_reference_check_eligible")
  if (!all(need %in% names(df))) return(df)
  df %>%
    mutate(
      # flag literals (TRUE) reconstructed; originals lost in the source export
      ab_group = dplyr::case_when(
        test_group == "test" & was_reference_check_shown == TRUE ~ "test",
        test_group == "control" & was_reference_check_eligible == TRUE &
          (is.na(was_reference_check_shown) | was_reference_check_shown != TRUE) ~ "control",
        TRUE ~ NA_character_
      ),
      test_group = ab_group
    ) %>%
    filter(!is.na(test_group)) %>%
    add_experience_group()
}

# Guardrail2 constraint: test group is shown-only; control group is not restricted by eligibility
make_rc_ab_group_completion <- function(df) {
  if (is.null(df)) return(df)
  df <- df %>% apply_aliases() %>% renorm_buckets()
  if (!all(c("test_group", "was_reference_check_shown") %in% names(df))) return(df)
  df %>%
    filter(!(test_group == "test" &
             (is.na(was_reference_check_shown) | was_reference_check_shown != TRUE)))
}

# Experience group for Guardrail #2 reporting
add_experience_group <- function(df) {
  if (is.null(df)) return(df)
  if (!("user_edit_count" %in% names(df)) || !("user_status" %in% names(df))) return(df)
  df %>%
    mutate(
      # edit-count thresholds reconstructed from the group labels
      experience_level_group = dplyr::case_when(
        user_edit_count == 0 & user_status == "registered" ~ "Newcomer",
        user_edit_count == 0 & user_status == "unregistered" ~ "Unregistered",
        user_edit_count > 0 & user_edit_count <= 100 ~ "Junior Contributor",
        user_edit_count > 100 ~ "Non-Junior Contributor",
        TRUE ~ NA_character_
      ),
      experience_level_group = factor(
        experience_level_group,
        levels = c("Unregistered", "Newcomer", "Junior Contributor", "Non-Junior Contributor")
      )
    )
}

# Column aliasing (plug-and-play for drifted names)
col_aliases <- list(
  was_reference_check_shown = c("reference_check_shown", "rc_shown"),
  was_reference_check_eligible = c("reference_check_eligible", "rc_eligible"),
  saved_edit = c("save_success", "saved"),
  was_reverted = c("reverted_48h", "mw_reverted"),
  was_reference_included = c("reference_added", "has_reference_added",
                             "has_reference", "reference_added_or_acknowledged"),
  has_reference_or_acknowledgement = c("has_reference_or_ack"),
  added_reference_or_acknowledgement = c("added_reference_or_ack")
)

apply_aliases <- function(df, aliases = col_aliases) {
  if (is.null(df)) return(df)
  for (nm in names(aliases)) {
    if (!(nm %in% names(df))) {
      cand <- aliases[[nm]]
      hit <- cand[cand %in% names(df)][1]
      if (!is.na(hit)) names(df)[names(df) == hit] <- nm
    }
  }
  df
}

# Timestamp candidates for coverage checks
ts_candidates <- c("event_dt", "rev_timestamp", "mw_timestamp",
                   "first_edit_time", "return_time", "dt", "timestamp")

# Default sanitize_counts (if not provided elsewhere)
if (!exists("sanitize_counts")) {
  sanitize_counts <- function(df, cols) {
    keep <- intersect(cols, names(df))
    if (length(keep) == 0) return(df)
    df %>%
      mutate(across(all_of(keep), ~ if (!is.numeric(.)) {
        as.character(.)
      } else {
        dplyr::case_when(
          is.na(.) ~ NA_character_,
          . < 50 ~ "<50",
          TRUE ~ scales::comma(., accuracy = 1)
        )
      }))
  }
}

# Data directory (hard-coded relative to this notebook)
output_dir <- file.path("data", experiment_name)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
# message("Data directory: ", normalizePath(output_dir, mustWork = FALSE))

# Expected input files from the collection notebook
files <- list(
  reference_check_save_data = file.path(output_dir, "reference_check_save_data.tsv"),
  constructive_retention_data = file.path(output_dir, "constructive_retention_data.tsv"),
  edit_completion_rate_data = file.path(output_dir, "edit_completion_rate_data.tsv"),
  reference_check_rejects_data = file.path(output_dir, "reference_check_rejects_data.tsv")
)
Code
# Helper: sanitize counts < 50 per publication guidelines
sanitize_counts <- function(df, cols) {
  keep <- intersect(cols, names(df))
  if (length(keep) == 0) return(df)
  df %>%
    mutate(across(all_of(keep), ~ ifelse(!is.na(.) & . < 50, "<50", as.character(.))))
}
Code
# Plot helpers (paste-check style)
# (element_text margin sizes reconstructed; originals lost in the source export)
pc_theme <- function() {
  ggplot2::theme_minimal(base_size = 11) +
    ggplot2::theme(
      legend.position = "bottom",
      panel.grid.minor = ggplot2::element_blank(),
      # Padding so titles/labels/annotations do not touch plot edges
      plot.margin = ggplot2::margin(10, 14, 10, 10),
      plot.title = ggplot2::element_text(margin = ggplot2::margin(b = 8)),
      plot.subtitle = ggplot2::element_text(margin = ggplot2::margin(b = 8)),
      axis.title.x = ggplot2::element_text(margin = ggplot2::margin(t = 8)),
      axis.title.y = ggplot2::element_text(margin = ggplot2::margin(r = 8))
    )
}
Code
# Table helpers: rates and relative change vs control by platform
make_rate_table <- function(df, value_col, group_cols = c("test_group", "platform")) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(
      rate = mean(.data[[value_col]], na.rm = TRUE),
      n = n(),
      .groups = "drop"
    )
}

# Build a self-explanatory change table: control/test rates + absolute diff (pp) + relative change + Ns
make_rel_change_dim <- function(rate_tbl, dim_col = "platform", rate_col = "rate") {
  if (is.null(rate_tbl) || nrow(rate_tbl) == 0) {
    return(tibble(
      !!rlang::sym(dim_col) := character(),
      control_rate = numeric(),
      test_rate = numeric(),
      abs_diff_pp = numeric(),
      rel_change = numeric(),
      n_control = numeric(),
      n_test = numeric()
    ))
  }
  if (!(dim_col %in% names(rate_tbl))) {
    stop("make_rel_change_dim: dim_col not found: ", dim_col)
  }
  if (!("test_group" %in% names(rate_tbl))) {
    stop("make_rel_change_dim: missing required column: test_group")
  }
  if (!(rate_col %in% names(rate_tbl))) {
    stop("make_rel_change_dim: rate_col not found: ", rate_col)
  }
  rate_tbl <- renorm_buckets(rate_tbl)

  # Work with a stable internal key so joins are never broken by tidy-eval
  rt <- rate_tbl %>% rename(dim = all_of(dim_col))

  # Rates wide
  wide_rate <- rt %>%
    select(dim, test_group, value = all_of(rate_col)) %>%
    tidyr::pivot_wider(names_from = test_group, values_from = value)
  if (!"control" %in% names(wide_rate)) wide_rate$control <- NA_real_
  if (!"test" %in% names(wide_rate)) wide_rate$test <- NA_real_

  out <- wide_rate %>%
    transmute(
      dim,
      control_rate = control,
      test_rate = test,
      abs_diff_pp = (test - control) * 100,
      rel_change = dplyr::if_else(is.na(control) | control == 0,
                                  NA_real_, (test - control) / control)
    )

  # Optional: pivot counts if present
  if ("n" %in% names(rt)) {
    wide_n <- rt %>%
      select(dim, test_group, n) %>%
      tidyr::pivot_wider(names_from = test_group, values_from = n)
    if (!"control" %in% names(wide_n)) wide_n$control <- NA_real_
    if (!"test" %in% names(wide_n)) wide_n$test <- NA_real_
    out <- out %>%
      left_join(wide_n %>% transmute(dim, n_control = control, n_test = test),
                by = "dim")
  }

  # Optional: pivot denominators if present (used in dismissal rate tables)
  if ("distinct_sessions" %in% names(rt)) {
    wide_denom <- rt %>%
      select(dim, test_group, distinct_sessions) %>%
      tidyr::pivot_wider(names_from = test_group, values_from = distinct_sessions)
    if (!"control" %in% names(wide_denom)) wide_denom$control <- NA_real_
    if (!"test" %in% names(wide_denom)) wide_denom$test <- NA_real_
    out <- out %>%
      left_join(
        wide_denom %>% transmute(dim,
                                 distinct_sessions_control = control,
                                 distinct_sessions_test = test),
        by = "dim"
      )
  }

  out %>% rename(!!rlang::sym(dim_col) := dim)
}

make_rel_change <- function(rate_tbl, rate_col = "rate") {
  make_rel_change_dim(rate_tbl, dim_col = "platform", rate_col = rate_col)
}
Analysis
4.0.1 Methodology reference
KPI 1: References (included or acknowledged). We evaluate whether a reference was included or the editor explicitly acknowledged missing references (via one of the four valid decline reasons). We compare outcomes across experiment groups and slices and report uncertainty via regression + Bayesian lift.
KPI 2: Constructive edits. We define an edit as constructive if it is not reverted within 48 hours (1 = not reverted within 48h). We estimate differences across groups and slices via regression + Bayesian lift.
Guardrail 1: Content quality. We examine 48-hour revert rates, including a breakdown stratified by whether a reference was included. We use relax for lift and may include prop.test as a lightweight audit check.
Guardrail 2: Edit completion. We define completion as the transition from save intent to a successful save (saveIntent → saveSuccess). We analyze completion rates via regression + Bayesian lift, with the test group limited to sessions where Reference Check was shown.
Modeling details: The primary approach is logistic regression (glm) plus Bayesian lift (relax; brms optional) for uncertainty. Mixed-effects models are not used in this enwiki-only report.
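As a concrete illustration of the primary approach, here is a minimal glm sketch. The outcome column and interaction structure shown are assumptions for illustration (based on the aliases defined in the setup cell); the exact formulas fitted per KPI may differ.

```r
# Illustrative only: a logistic regression of the form used for the KPIs.
# was_reference_included and test_group * platform are assumed names/structure.
fit <- glm(
  was_reference_included ~ test_group * platform,
  data = reference_check_save_data,
  family = binomial()
)
# Odds ratios with 95% CIs, as rendered by render_binom_model()
broom::tidy(fit, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
```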
Data basics overview
Buckets present and counts
Wiki coverage
Date span (min/max timestamps)
Code
# Column candidates (per collection notebook conventions)
reference_flag_candidates <- c(
  # Preferred: explicit reference OR acknowledgement flags (when present)
  "has_reference_or_acknowledgement",
  "added_reference_or_acknowledgement",
  # Fallback currently available in this dataset
  "was_reference_included",
  # Other historical variants
  "has_reference",
  "reference_added",
  "has_reference_added",
  "has_reference_or_ack",
  "added_reference_or_ack"
)

retention_flag_candidates <- c("retained_7_14d", "retained_14d", "retained", "returned")

# Helper to pick first available column
pick_first <- function(candidates, df) {
  hits <- intersect(candidates, names(df))
  if (length(hits) == 0) return(NULL)
  hits[[1]]
}
Code
load_tsv <- function(name) {
  path <- files[[name]]
  if (!file.exists(path)) {
    stop("Missing file: ", path, " — check working directory or RC_BASE_DIR")
  }
  df <- read.delim(path, sep = "\t", stringsAsFactors = FALSE, check.names = FALSE)
  attr(df, "_load_info") <- data.frame(
    dataset = name,
    rows = nrow(df),
    path = path,
    stringsAsFactors = FALSE
  )
  df
}

# Load all datasets
reference_check_save_data <- renorm_buckets(apply_aliases(load_tsv("reference_check_save_data")))
constructive_retention_data <- renorm_buckets(apply_aliases(load_tsv("constructive_retention_data")))
edit_completion_rate_data <- renorm_buckets(apply_aliases(load_tsv("edit_completion_rate_data")))
reference_check_rejects_data <- renorm_buckets(apply_aliases(load_tsv("reference_check_rejects_data")))

# Load info summary (optional, tidy display)
load_info <- do.call(rbind, lapply(
  list(reference_check_save_data, constructive_retention_data,
       edit_completion_rate_data, reference_check_rejects_data),
  function(df) attr(df, "_load_info")
))
if (!is.null(load_info)) {
  load_info %>%
    gt() %>%
    tab_header(title = "Loaded datasets") %>%
    cols_label(dataset = "Dataset", rows = "Rows", path = "Path") %>%
    opt_stylize() %>%
    # row padding value reconstructed; lost in the source export
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4)) %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
}
Loaded datasets

Dataset                        Rows     Path
reference_check_save_data      143492   data/reference_check_ab_test_2025/reference_check_save_data.tsv
constructive_retention_data    2454     data/reference_check_ab_test_2025/constructive_retention_data.tsv
edit_completion_rate_data      158800   data/reference_check_ab_test_2025/edit_completion_rate_data.tsv
reference_check_rejects_data   994      data/reference_check_ab_test_2025/reference_check_rejects_data.tsv
Code
# Bucket and wiki coverage checks (gt style)
datasets <- list(
  save = reference_check_save_data,
  retention = constructive_retention_data,
  completion = edit_completion_rate_data,
  rejects = reference_check_rejects_data
)

sanitize_counts_safe <- function(df, cols) {
  if (exists("sanitize_counts")) {
    sanitize_counts(df, cols)
  } else {
    df
  }
}

render_cov_table <- function(tbl, title, labels) {
  tbl %>%
    gt() %>%
    fmt_number(columns = where(is.numeric), decimals = 0, use_seps = TRUE) %>%
    tab_header(title = title) %>%
    cols_label(!!!labels) %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4))
}

render_gt_html <- function(gt_tbl) {
  # Force HTML rendering in Jupyter
  IRdisplay::display_html(gt::as_raw_html(gt_tbl))
}

render_or_note <- function(tbl, title, labels, empty_msg) {
  if (is.null(tbl) || nrow(tbl) == 0) {
    cat(empty_msg, "\n")
    return(invisible(NULL))
  }
  tbl %>%
    render_cov_table(title, labels) %>%
    render_gt_html()
}

bucket_tbl <- bind_rows(lapply(names(datasets), function(nm) {
  df <- datasets[[nm]]
  if (is.null(df) || !"test_group" %in% names(df)) return(NULL)
  df %>%
    count(test_group, name = "n", sort = TRUE) %>%
    mutate(dataset = nm) %>%
    relocate(dataset)
}), .id = NULL)

wiki_tbl <- bind_rows(lapply(names(datasets), function(nm) {
  df <- datasets[[nm]]
  if (is.null(df) || !"wiki" %in% names(df)) return(NULL)
  df %>%
    count(wiki, name = "n", sort = TRUE) %>%
    mutate(dataset = nm) %>%
    relocate(dataset)
}), .id = NULL)

date_tbl <- bind_rows(lapply(names(datasets), function(nm) {
  df <- datasets[[nm]]
  if (is.null(df)) return(NULL)
  ts_col <- intersect(c("event_dt", "rev_timestamp", "mw_timestamp",
                        "first_edit_time", "return_time"), names(df))
  if (length(ts_col) == 0) return(NULL)
  col <- ts_col[[1]]
  df %>%
    summarise(
      min = min(.data[[col]], na.rm = TRUE),
      max = max(.data[[col]], na.rm = TRUE)
    ) %>%
    mutate(dataset = nm, column = col) %>%
    relocate(dataset, column)
}), .id = NULL)

if (!is.null(bucket_tbl) && nrow(bucket_tbl) > 0) {
  bucket_tbl <- bucket_tbl %>%
    arrange(dataset, desc(n)) %>%
    sanitize_counts_safe("n")
  render_or_note(
    bucket_tbl,
    "Counts by group",
    list(dataset = "Dataset", test_group = "Experiment group", n = "Count (rows)"),
    "(No bucket info loaded)"
  )
} else {
  cat("(No bucket info loaded)\n")
}

if (!is.null(wiki_tbl) && nrow(wiki_tbl) > 0) {
  wiki_tbl <- wiki_tbl %>%
    arrange(dataset, desc(n)) %>%
    sanitize_counts_safe("n")
  render_or_note(
    wiki_tbl,
    "Counts by dataset",
    list(dataset = "Dataset", wiki = "Wiki", n = "Count (rows)"),
    "(No wiki info loaded)"
  )
} else {
  cat("(No wiki info loaded)\n")
}

if (!is.null(date_tbl) && nrow(date_tbl) > 0) {
  render_or_note(
    date_tbl,
    "Date span by dataset",
    list(dataset = "Dataset", column = "Timestamp column", min = "Min", max = "Max"),
    "(No date span info loaded)"
  )
} else {
  cat("(No date span info loaded)\n")
}
Counts by group

Dataset      Experiment group   Count (rows)
completion   test               79995
completion   control            78805
rejects      test               994
retention    control            1273
retention    test               1181
save         test               72267
save         control            71225
Counts by dataset

Dataset      Wiki     Count (rows)
completion   enwiki   158800
rejects      enwiki   994
retention    enwiki   2454
save         enwiki   143492
Date span by dataset

Dataset     Timestamp column   Min                        Max
retention   first_edit_time    2025-11-08T00:17:42.403Z   2025-12-08T23:59:21.629Z
4.1 Setup
Code
# Apply <50 sanitization to tables when present
sanitize_obj <- function(obj_name, cols) {
  if (exists(obj_name, inherits = FALSE)) {
    df <- get(obj_name, inherits = FALSE)
    assign(obj_name, sanitize_counts(df, cols), inherits = FALSE)
  }
}

sanitize_obj("kpi1_rates", c("n"))
sanitize_obj("kpi2_rates", c("n"))
sanitize_obj("completion_rates", c("n"))
sanitize_obj("dismiss_rates", c("n"))
sanitize_obj("ret_rates", c("n"))
sanitize_obj("kpi2_slices", c("n"))
sanitize_obj("completion_slices", c("n"))
sanitize_obj("completion_by_checks", c("n"))
sanitize_obj("per_wiki_constructive", c("n"))
sanitize_obj("per_wiki_completion", c("n"))
sanitize_obj("plat_join", c("total_sessions", "dismiss_sessions"))
sanitize_obj("us_join", c("total_sessions", "dismiss_sessions"))
sanitize_obj("df_kpi1_checks", c("n"))
sanitize_obj("df_kpi2_checks", c("n"))
Code
# KPI/guardrail quick summaries
sanitize_counts_safe <- function(df, cols) {
  if (exists("sanitize_counts")) sanitize_counts(df, cols) else df
}

require_cols <- function(df, cols, label) {
  missing <- setdiff(cols, names(df))
  if (length(missing) > 0) {
    message(label, ": missing columns: ", paste(missing, collapse = ", "))
    return(FALSE)
  }
  TRUE
}

render_pct_table <- function(df, title, labels, sanitize_cols = c("n"), note_text = NULL) {
  if (nrow(df) == 0) return(invisible(NULL))
  df <- sanitize_counts_safe(df, sanitize_cols)
  if ("pct" %in% names(df)) {
    df <- df %>% mutate(pct = scales::percent(pct, accuracy = 0.1))
  }
  gt_tbl <- df %>%
    gt() %>%
    tab_header(title = title) %>%
    cols_label(!!!labels) %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4))
  gt_tbl %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_text)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_text))
  }
}
Code
# KPI tables (platform control vs treatment)
render_rate_rel <- function(rate_tbl,
                            rel_tbl,
                            title_rate,
                            title_rel,
                            rate_labels,
                            note_rate = NULL,
                            note_rel = "Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.",
                            group_col = NULL,
                            within_group_order_col = NULL,
                            within_group_order = NULL) {
  if (is.null(rate_tbl) || nrow(rate_tbl) == 0) return(invisible(NULL))
  rate_tbl <- renorm_buckets(rate_tbl)
  rel_tbl <- renorm_buckets(rel_tbl)

  # Rate table formatting
  rate_tbl <- sanitize_counts_safe(rate_tbl, intersect(names(rate_tbl), c(
    "n", "n_control", "n_test",
    "distinct_sessions", "distinct_sessions_control", "distinct_sessions_test"
  )))
  if ("rate" %in% names(rate_tbl)) {
    rate_tbl <- rate_tbl %>% mutate(rate = scales::percent(rate, accuracy = 0.1))
  }

  # Optional: group rows (e.g., by platform) and order control/test within each group
  rate_labels2 <- rate_labels
  gt_rate_builder <- rate_tbl
  if (!is.null(group_col) && group_col %in% names(gt_rate_builder)) {
    if (!is.null(within_group_order_col) &&
        within_group_order_col %in% names(gt_rate_builder) &&
        !is.null(within_group_order)) {
      gt_rate_builder[[within_group_order_col]] <- factor(
        as.character(gt_rate_builder[[within_group_order_col]]),
        levels = within_group_order
      )
    }
    if (!is.null(within_group_order_col) &&
        within_group_order_col %in% names(gt_rate_builder)) {
      gt_rate_builder <- gt_rate_builder %>%
        arrange(.data[[group_col]], .data[[within_group_order_col]])
    } else {
      gt_rate_builder <- gt_rate_builder %>% arrange(.data[[group_col]])
    }
    # If we group by a column, we avoid labeling that grouped column (it is not displayed as a standard column)
    rate_labels2 <- rate_labels2[names(rate_labels2) != group_col]
    gt_rate <- gt_rate_builder %>%
      gt(groupname_col = group_col) %>%
      tab_header(title = title_rate) %>%
      cols_label(!!!rate_labels2) %>%
      opt_stylize() %>%
      tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4))
  } else {
    gt_rate <- gt_rate_builder %>%
      gt() %>%
      tab_header(title = title_rate) %>%
      cols_label(!!!rate_labels2) %>%
      opt_stylize() %>%
      tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4))
  }
  gt_rate %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_rate)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_rate))
  }

  # Change table formatting
  rel_tbl <- rel_tbl %>%
    mutate(
      control_rate = if ("control_rate" %in% names(.)) scales::percent(control_rate, accuracy = 0.1) else control_rate,
      test_rate = if ("test_rate" %in% names(.)) scales::percent(test_rate, accuracy = 0.1) else test_rate,
      abs_diff_pp = if ("abs_diff_pp" %in% names(.)) scales::number(abs_diff_pp, accuracy = 0.1) else abs_diff_pp,
      rel_change = if ("rel_change" %in% names(.)) scales::percent(rel_change, accuracy = 0.1) else rel_change
    )
  rel_tbl <- sanitize_counts_safe(rel_tbl, intersect(names(rel_tbl), c(
    "n_control", "n_test", "distinct_sessions_control", "distinct_sessions_test"
  )))

  # Build labels dynamically (only include cols that exist)
  rel_labels <- list(
    platform = "Platform",
    control_rate = "Control rate",
    test_rate = "Test rate",
    abs_diff_pp = "Absolute difference (pp)",
    rel_change = "Relative change vs control",
    n_control = "N (control)",
    n_test = "N (test)",
    distinct_sessions_control = "Distinct sessions (control)",
    distinct_sessions_test = "Distinct sessions (test)"
  )
  rel_labels <- rel_labels[names(rel_labels) %in% names(rel_tbl)]
  gt_rel <- rel_tbl %>%
    gt() %>%
    tab_header(title = title_rel) %>%
    cols_label(!!!rel_labels) %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(4))
  gt_rel %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_rel)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_rel))
  }
}
Code
# Model tidiers
# Wrap long strings at word boundaries (used to avoid ultra-wide gt titles)
wrap_50
<-
function
(x,
width =
50
) {
if
is.null
(x)
||
is.na
(x)
||
nzchar
(x))
return
(x)
paste
strwrap
(x,
width =
width),
collapse =
\n
# Stats-table styling (grey) to visually distinguish inferential outputs
style_stats_gt
<-
function
(gt_tbl) {
gt_tbl
%>%
tab_options
heading.background.color =
"#F2F2F2"
column_labels.background.color =
"#D9D9D9"
row_group.background.color =
"#F7F7F7"
table.font.size =
gt
::
px
13
),
data_row.padding =
gt
::
px
render_binom_model
<-
function
(model, title,
note_text =
NULL
) {
tidy_res
<-
if
inherits
(model,
"glmerMod"
)) {
broom.mixed
::
tidy
(model,
effects =
"fixed"
conf.int =
TRUE
conf.level =
0.95
exponentiate =
TRUE
else
broom
::
tidy
(model,
conf.int =
TRUE
conf.level =
0.95
exponentiate =
TRUE
tidy_res
<-
tidy_res
%>%
mutate
term =
dplyr
::
recode
(term,
(Intercept)
"Intercept"
))
%>%
select
(term, estimate, conf.low, conf.high, std.error, p.value)
tidy_res
<-
tidy_res
%>%
mutate
across
(estimate, conf.low, conf.high, std.error),
scales
::
number
(.x,
accuracy =
0.001
)),
p.value =
scales
::
pvalue
(p.value,
accuracy =
0.001
))
gt_tbl
<-
tidy_res
%>%
gt
()
%>%
tab_header
title =
wrap_50
(title))
%>%
cols_label
term =
"Term"
estimate =
"OR"
conf.low =
"CI low"
conf.high =
"CI high"
std.error =
"SE"
p.value =
"p-value"
%>%
opt_stylize
%>%
style_stats_gt
()
gt_tbl
%>%
gt
::
as_raw_html
()
%>%
IRdisplay
::
display_html
()
if
is.null
(note_text)) {
IRdisplay
::
display_markdown
paste0
"**Table note:** "
, note_text))
# GLM contrasts (e.g., within-platform test vs control)
# Computes an OR/CI/p for the log-odds difference implied by two single-row newdata frames.
tidy_glm_contrast_or <- function(model, newdata_control, newdata_test,
                                 label = "Treatment vs control") {
  if (is.null(model) || !inherits(model, "glm")) {
    stop("tidy_glm_contrast_or: model must be a glm")
  }
  if (is.null(newdata_control) || is.null(newdata_test)) {
    stop("tidy_glm_contrast_or: newdata must be provided")
  }
  # Model matrix rows (use predictors-only terms; newdata won't include the response column)
  tt <- stats::delete.response(stats::terms(model))
  Xc <- stats::model.matrix(tt, newdata_control)
  Xt <- stats::model.matrix(tt, newdata_test)
  if (nrow(Xc) != 1 || nrow(Xt) != 1) {
    stop("tidy_glm_contrast_or: newdata_control/test must each be 1 row")
  }
  L <- drop(Xt - Xc)
  beta <- stats::coef(model)
  V <- stats::vcov(model)
  est <- sum(L * beta)
  se <- sqrt(drop(t(L) %*% V %*% L))
  if (is.na(se) || se <= 0) {
    return(tibble::tibble(
      term = label,
      estimate = NA_real_,
      conf.low = NA_real_,
      conf.high = NA_real_,
      std.error = NA_real_,
      p.value = NA_real_
    ))
  }
  z <- est / se
  p <- 2 * stats::pnorm(abs(z), lower.tail = FALSE)
  or <- exp(est)
  ci <- exp(est + c(-1, 1) * stats::qnorm(0.975) * se)
  tibble::tibble(
    term = label,
    estimate = or,
    conf.low = ci[[1]],
    conf.high = ci[[2]],
    std.error = se,
    p.value = p
  )
}
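To make the delta-method arithmetic behind `tidy_glm_contrast_or` concrete, here is a minimal sketch in Python (the report itself is R): given a coefficient vector `beta`, its covariance matrix `V`, and a contrast vector `L` (the difference of two model-matrix rows), the log-odds difference is `L·beta` with standard error `sqrt(L' V L)`, and the OR is its exponential. The numbers below are made up for illustration.

```python
import math

def or_contrast(L, beta, V):
    # log-odds difference implied by the contrast vector
    est = sum(l * b for l, b in zip(L, beta))
    # quadratic form L' V L -> variance of the contrast
    var = sum(L[i] * V[i][j] * L[j]
              for i in range(len(L)) for j in range(len(L)))
    se = math.sqrt(var)
    z = est / se
    # two-sided normal p-value via erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    half = 1.959963984540054 * se  # qnorm(0.975)
    return {
        "or": math.exp(est),
        "ci": (math.exp(est - half), math.exp(est + half)),
        "p": p,
    }

# Contrast that isolates a single (hypothetical) treatment coefficient:
beta = [-1.2, 0.5]               # intercept, treatment effect
V = [[0.01, 0.0], [0.0, 0.04]]   # toy covariance matrix
res = or_contrast([0.0, 1.0], beta, V)
```

With these toy inputs the contrast reduces to the treatment coefficient itself, matching what the R helper returns for a single-dummy test-vs-control comparison.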
render_or_contrast_table <- function(df, title, note_text = NULL) {
  if (is.null(df) || nrow(df) == 0) return(invisible(NULL))
  out <- df %>%
    mutate(
      across(c(estimate, conf.low, conf.high, std.error),
             ~ scales::number(.x, accuracy = 0.001)),
      p.value = scales::pvalue(p.value, accuracy = 0.001)
    )
  gt_tbl <- out %>%
    gt() %>%
    tab_header(title = wrap_50(title)) %>%
    cols_label(
      term = "Contrast",
      estimate = "OR",
      conf.low = "CI low",
      conf.high = "CI high",
      std.error = "SE",
      p.value = "p-value"
    ) %>%
    opt_stylize() %>%
    style_stats_gt()
  gt_tbl %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_text)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_text))
  }
  invisible(df)
}
# brms helpers
pick_brms_backend <- function() {
  # Prefer cmdstanr when available (avoids many rstan toolchain issues)
  if (requireNamespace("cmdstanr", quietly = TRUE)) {
    ok <- tryCatch({
      v <- cmdstanr::cmdstan_version()
      !is.null(v)
    }, error = function(e) FALSE)
    if (isTRUE(ok)) return("cmdstanr")
  }
  "rstan"
}
safe_brm <- function(formula, data, prior,
                     # sampler settings below are assumed defaults; tune per environment
                     seed = 42, chains = 4, cores = 4, refresh = 0) {
  # Preflight: if brms or rstan can't be loaded, do NOT attempt compilation/sampling.
  can_load_ns <- function(pkg) {
    requireNamespace(pkg, quietly = TRUE) &&
      isTRUE(tryCatch({ loadNamespace(pkg); TRUE }, error = function(e) FALSE))
  }
  if (!can_load_ns("brms")) {
    message("brms fit skipped: brms namespace could not be loaded in this kernel/environment")
    return(NULL)
  }
  backend <- pick_brms_backend()
  # Only require the backend we actually intend to use.
  if (backend == "rstan") {
    if (!can_load_ns("rstan")) {
      message(
        "brms fit skipped (backend=rstan): rstan could not be loaded (common cause: binary/toolchain mismatch like TBB linkage).\n",
        "Continuing with glm + relax outputs.\n",
        "Recommended fix (outside this notebook): use a clean env where rstan loads, or rebuild/reinstall rstan+StanHeaders+tbb from a consistent toolchain."
      )
      return(NULL)
    }
  } else if (backend == "cmdstanr") {
    if (!can_load_ns("cmdstanr")) {
      message("brms fit skipped (backend=cmdstanr): cmdstanr namespace could not be loaded")
      return(NULL)
    }
    ok <- tryCatch({
      v <- cmdstanr::cmdstan_version()
      !is.null(v)
    }, error = function(e) FALSE)
    if (!isTRUE(ok)) {
      message(
        "brms fit skipped (backend=cmdstanr): CmdStan is not installed/configured in this environment.\n",
        "Continuing with glm + relax outputs.\n",
        "To run brms here, install CmdStan via cmdstanr::install_cmdstan() (plus a working C++ toolchain)."
      )
      return(NULL)
    }
  }
  tryCatch({
    brms::brm(
      formula = formula,
      family = brms::bernoulli(link = "logit"),
      data = data,
      prior = prior,
      seed = seed,
      chains = chains,
      cores = cores,
      refresh = refresh,
      backend = backend
    )
  }, error = function(e) {
    message(
      "brms fit skipped (backend=", backend, "): ", e$message,
      "\nContinuing with glm + relax outputs.",
      "\nTo run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain."
    )
    NULL
  })
}
# brms confirmation table: OR + (optional) lift in probability space
# - OR summary uses exp(b_test)
# - Lift summary uses posterior_epred() to compute Pr(test) - Pr(control) on the same covariates
render_brms_confirm_table <- function(fit,
                                      title,
                                      coef_name = "b_test_grouptest",
                                      newdata_control = NULL,
                                      newdata_test = NULL,
                                      note_text = NULL) {
  draws <- posterior::as_draws_df(fit)
  if (!(coef_name %in% names(draws))) {
    message("brms: coefficient not found in draws: ", coef_name)
    return(invisible(NULL))
  }
  # OR (odds ratio)
  or_draw <- exp(draws[[coef_name]])
  or_tbl <- tibble::tibble(
    quantity = "Effect",
    metric = "Multiplicative effect on odds (OR)",
    point = stats::median(or_draw, na.rm = TRUE),
    lower = stats::quantile(or_draw, 0.025, na.rm = TRUE),
    upper = stats::quantile(or_draw, 0.975, na.rm = TRUE),
    chance_positive = mean(or_draw > 1, na.rm = TRUE)
  )
  out <- or_tbl
  # Average lift in probability space (multi-check style): E[p(test) - p(control)]
  if (!is.null(newdata_control) && !is.null(newdata_test)) {
    ep_ctrl <- brms::posterior_epred(fit, newdata = newdata_control, re_formula = NA)
    ep_test <- brms::posterior_epred(fit, newdata = newdata_test, re_formula = NA)
    # each is draws x N; compute per-draw average lift
    lift_draw <- rowMeans(ep_test - ep_ctrl, na.rm = TRUE)
    lift_tbl <- tibble::tibble(
      quantity = "Function of parameter(s)",
      metric = "Average lift (Pr[test] - Pr[control])",
      point = stats::median(lift_draw, na.rm = TRUE),
      lower = stats::quantile(lift_draw, 0.025, na.rm = TRUE),
      upper = stats::quantile(lift_draw, 0.975, na.rm = TRUE),
      chance_positive = mean(lift_draw > 0, na.rm = TRUE)
    )
    out <- dplyr::bind_rows(or_tbl, lift_tbl)
  }
  gt_tbl <- out %>%
    mutate(
      quantity = factor(quantity, levels = c("Effect", "Function of parameter(s)"))
    ) %>%
    gt(groupname_col = "quantity") %>%
    tab_header(
      title = gt::md(paste0("**", wrap_50(title), "**")),
      subtitle = gt::md("Hierarchical Bayesian logistic regression (brms)")
    ) %>%
    cols_label(
      metric = "Quantity",
      point = "Point estimate",
      lower = "95% CrI low",
      upper = "95% CrI high",
      chance_positive = "Chance to win"
    ) %>%
    fmt_number(columns = c(point, lower, upper), decimals = 3) %>%
    fmt_number(columns = chance_positive, decimals = 3) %>%
    # Format the lift row as percent (probability points)
    fmt_percent(
      columns = c(point, lower, upper),
      rows = metric == "Average lift (Pr[test] - Pr[control])",
      decimals = 1
    ) %>%
    opt_stylize() %>%
    style_stats_gt()
  gt_tbl %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_text)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_text))
  }
  invisible(out)
}
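The posterior summary above reduces to simple arithmetic on draws: exponentiate the log-odds draws, then take the median, the 2.5%/97.5% quantiles, and the share of draws above 1 ("chance to win"). A Python sketch with synthetic draws standing in for brms output:

```python
import math
import random

random.seed(1)
# synthetic posterior draws of a log-odds coefficient, N(0.4, 0.2)
log_or_draws = [random.gauss(0.4, 0.2) for _ in range(20000)]
or_draws = sorted(math.exp(d) for d in log_or_draws)

def quantile(sorted_xs, q):
    # linear-interpolation quantile on a pre-sorted list
    i = q * (len(sorted_xs) - 1)
    lo, hi = int(i), min(int(i) + 1, len(sorted_xs) - 1)
    frac = i - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

summary = {
    "point": quantile(or_draws, 0.5),
    "lower": quantile(or_draws, 0.025),
    "upper": quantile(or_draws, 0.975),
    "chance_to_win": sum(x > 1 for x in or_draws) / len(or_draws),
}
```

With a true log-OR of 0.4 the median OR lands near exp(0.4) ≈ 1.49 and the chance-to-win near Φ(2) ≈ 0.977, mirroring the "point / CrI / chance to win" columns in the table.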
# Back-compat: OR-only table (kept for older cells)
render_brms_or_table <- function(fit, title,
                                 coef_name = "b_test_grouptest",
                                 note_text = NULL) {
  render_brms_confirm_table(
    fit = fit,
    title = title,
    coef_name = coef_name,
    newdata_control = NULL,
    newdata_test = NULL,
    note_text = note_text
  )
}
Code
# KPI slices and buckets
render_slice <- function(df, title, labels, rate_col = "rate", note_text = NULL) {
  if (is.null(df) || nrow(df) == 0) return(invisible(NULL))
  df <- renorm_buckets(df)
  num_cols <- intersect(c("n"), names(df))
  df <- sanitize_counts_safe(df, num_cols)
  if (rate_col %in% names(df)) {
    df[[rate_col]] <- scales::percent(df[[rate_col]], accuracy = 0.1)
  }
  gt_tbl <- df %>%
    gt() %>%
    tab_header(title = title) %>%
    cols_label(!!!labels) %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(2))
  gt_tbl %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_text)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_text))
  }
}
Code
render_prop_test <- function(ctrl_success, ctrl_total, tst_success, tst_total, title) {
  pt <- prop.test(c(ctrl_success, tst_success), c(ctrl_total, tst_total))
  df <- tibble(
    group = c("control", "test"),
    success = c(ctrl_success, tst_success),
    total = c(ctrl_total, tst_total),
    rate = c(ctrl_success / ctrl_total, tst_success / tst_total)
  )
  df <- sanitize_counts_safe(df, c("success", "total"))
  df$rate <- scales::percent(df$rate, accuracy = 0.1)
  meta <- tibble(
    metric = c("p_value", "statistic"),
    value = c(format(pt$p.value, digits = 3), format(pt$statistic, digits = 3))
  )
  df %>%
    gt() %>%
    tab_header(title = title) %>%
    cols_label(group = "Group", success = "Success", total = "Total", rate = "Rate") %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(2)) %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  meta %>%
    gt() %>%
    tab_header(title = paste0(title, " (prop.test)")) %>%
    cols_label(metric = "Metric", value = "Value") %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(2)) %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
}
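As a back-of-envelope check of the comparison that `render_prop_test` delegates to R's `prop.test`, here is a pooled two-proportion z-test in Python. Note `prop.test` applies a Yates continuity correction by default, so its p-value differs slightly from this sketch; the counts plugged in below are KPI #1 counts reported later in this document.

```python
import math

def two_prop_z(s1, n1, s2, n2):
    # pooled two-proportion z-test (no continuity correction)
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided normal p-value
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# control 368/1665 vs test 1507/1574 (KPI #1 counts from this report)
z, p = two_prop_z(368, 1665, 1507, 1574)
```

For a gap this large the test statistic is enormous and the p-value is effectively zero, which is why the report's emphasis is on effect sizes (ORs and lifts) rather than bare significance.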
Code
strip_ansi <- function(x) {
  # Remove ANSI escape sequences (e.g., tibble color codes)
  gsub("\u001b\\[[0-9;]*m", "", x, perl = TRUE)
}

render_relax <- function(df, title, metric_type = "proportion",
                         better = c("higher", "lower")) {
  better <- match.arg(better)
  tryCatch({
    res <- relax::analyze_relative_lift(df, metric_type = metric_type)
    # Prefer a compact Paste-Check-style gt table if result is tabular
    if (is.data.frame(res)) {
      # Standardize/augment columns to match our edit-check reporting conventions
      if ("chance_to_win" %in% names(res)) {
        # Interpreting 'Chance to Win' as P(Treatment > Control) for the metric.
        # So, for metrics where higher is better:
        #   P(Treatment better) = chance_to_win
        # For metrics where lower is better (e.g., revert rate):
        #   P(Treatment better) = 1 - chance_to_win
        res <- res %>%
          mutate(
            prob_treatment_better = dplyr::if_else(
              better == "higher", chance_to_win, 1 - chance_to_win
            )
          )
      }
      bayes_cols <- intersect(
        c("estimate_bayes", "chance_to_win", "prob_treatment_better",
          "cred_lower", "cred_upper"),
        names(res)
      )
      freq_cols <- intersect(
        c("estimate_freq", "p_value", "conf_lower", "conf_upper"),
        names(res)
      )
      gt_tbl <- res %>%
        gt() %>%
        tab_header(
          title = gt::md(paste0("**", title, "**")),
          subtitle = gt::md("Relative lift ((Treatment − Control) / Control)")
        )
      if (length(bayes_cols) > 0) {
        gt_tbl <- gt_tbl %>%
          tab_spanner(label = gt::md("**Bayesian Analysis**"),
                      columns = all_of(bayes_cols))
      }
      if (length(freq_cols) > 0) {
        gt_tbl <- gt_tbl %>%
          tab_spanner(label = gt::md("**Frequentist Analysis**"),
                      columns = all_of(freq_cols))
      }
      gt_tbl <- gt_tbl %>%
        cols_label(
          estimate_bayes = gt::md("Point Estimate"),
          chance_to_win = gt::md("Chance to Win"),
          prob_treatment_better = gt::md("P(Treatment better)"),
          cred_lower = gt::md("95% CrI Lower"),
          cred_upper = gt::md("95% CrI Upper"),
          estimate_freq = gt::md("Point Estimate"),
          p_value = gt::md("*p*-value"),
          conf_lower = gt::md("95% CI Lower"),
          conf_upper = gt::md("95% CI Upper")
        ) %>%
        # Convention: 3 decimals everywhere
        fmt_number(columns = everything(), decimals = 3) %>%
        tab_options(
          heading.background.color = "#F2F2F2",
          column_labels.background.color = "#D9D9D9",
          table.border.top.color = "lightgray",
          column_labels.border.bottom.color = "black",
          column_labels.border.bottom.width = gt::px(2),
          data_row.padding = gt::px(2)
        )
      gt_tbl %>%
        gt::as_raw_html() %>%
        IRdisplay::display_html()
      # Add a short interpretation line below
      if ("prob_treatment_better" %in% names(res)) {
        p <- res$prob_treatment_better[[1]]
        if (!is.null(p) && !is.na(p)) {
          how <- if (better == "higher") "Chance to Win" else "1 - Chance to Win"
          IRdisplay::display_markdown(paste0(
            "**Interpretation:** Based on `relax`, the posterior probability that treatment is better than control is ",
            scales::percent(p, accuracy = 0.1),
            " (computed as ", how, ")."
          ))
        }
      }
      return(invisible(res))
    }
    # Otherwise, print as clean text (no ANSI color)
    out <- capture.output(print(res))
    out <- strip_ansi(out)
    IRdisplay::display_html(paste0("<pre>", paste(out, collapse = "\n"), "</pre>"))
    invisible(res)
  }, error = function(e) {
    message(title, " relax error: ", e$message)
  })
}
4.1.1 Data proportions review
4.1.2 Exposure / delivery — how to read these tables
These exposure tables use two different denominators:
- All published edits (all rows in reference_check_save_data): not limited to new-content edits.
- Published new-content edits only: rows where is_new_content == 1.
Notes:
- An editing_session (editing_session_id) corresponds to a single edit in this dataset. It is the id that links together that edit's steps/events.
- "Reference Check exposure" is measured using VEFU (VisualEditorFeatureUse) via was_reference_check_shown == 1 (action = check-shown-presave).
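The eligibility-aware bucketing used in the exposure cells below can be sketched in a few lines (Python here; the report cell implements the same logic with dplyr's `case_when`). Order matters: an edit is "shown" only if it was both eligible and shown, and a shown-but-ineligible row falls through to "ineligible".

```python
def exposure_bucket(eligible: int, shown: int) -> str:
    # mirrors the case_when ordering: first matching branch wins
    if eligible == 1 and shown == 1:
        return "shown"
    if eligible == 1:
        return "eligible_not_shown"
    return "ineligible"

# (eligible, shown) example rows
rows = [(1, 1), (1, 0), (0, 0), (0, 1)]
buckets = [exposure_bucket(e, s) for e, s in rows]
```

The last example row, shown without being tagged eligible, lands in "ineligible" because the eligibility branch is checked first.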
Code
# (Deprecated) Exposure / delivery table
# This baseline version is kept only for historical reference.
# The next cell renders the updated and clarified "Reference Check exposure / delivery" table.
Code
# 1) Exposure / delivery (shown vs eligible)
if (!is.null(reference_check_save_data) &&
    require_cols(reference_check_save_data,
                 c("test_group", "was_reference_check_shown", "was_reference_check_eligible"),
                 "Exposure summary")) {
  exposure_summary <- reference_check_save_data %>%
    mutate(
      exposure_bucket = dplyr::case_when(
        was_reference_check_eligible == 1 & was_reference_check_shown == 1 ~ "shown",
        was_reference_check_eligible == 1 ~ "eligible_not_shown",
        TRUE ~ "ineligible"
      )
    ) %>%
    count(test_group, exposure_bucket) %>%
    group_by(test_group) %>%
    mutate(pct = n / sum(n)) %>%
    arrange(test_group, desc(n)) %>%
    ungroup() %>%
    # Drop eligible_not_shown for test only (tiny number)
    filter(!(test_group == "test" & exposure_bucket == "eligible_not_shown"))
  exposure_summary <- renorm_buckets(exposure_summary) %>%
    group_by(test_group)
  render_pct_table(
    exposure_summary,
    "Reference Check exposure / delivery (all published edits)",
    c(
      test_group = "Test group",
      exposure_bucket = "Bucket",
      n = "Count (edits)",
      pct = "Percent of published edits"
    ),
    note_text = paste(
      "Denominator = ALL published edits (rows) in `reference_check_save_data` within each test group (not limited to new-content edits).",
      "Exposure/shown is measured via VEFU only (`was_reference_check_shown == 1`, action=check-shown-presave).",
      "Bucket is eligibility-aware: shown = eligible & shown; eligible_not_shown = eligible & not shown; ineligible = not tagged eligible.",
      "(For display, eligible_not_shown is dropped for the test group.)"
    )
  )
  # Do not mutate `reference_check_save_data` here; downstream analysis helpers define shown/eligible groupings as needed.
} else {
  message("Exposure summary: required columns missing in reference_check_save_data")
}
Reference Check exposure / delivery (all published edits)

Test group | Bucket             | Count (edits) | Percent of published edits
control    | ineligible         | 69560         | 97.7%
control    | eligible_not_shown | 1665          | 2.3%
test       | ineligible         | 71435         | 98.8%
test       | shown              | 828           | 1.1%

Table note: Denominator = ALL published edits (rows) in reference_check_save_data within each test group (not limited to new-content edits). Exposure/shown is measured via VEFU only (was_reference_check_shown == 1, action=check-shown-presave). Bucket is eligibility-aware: shown = eligible & shown; eligible_not_shown = eligible & not shown; ineligible = not tagged eligible. (For display, eligible_not_shown is dropped for the test group.)

This table answers: "Of all edits/rows in this test group, what share were eligible & shown vs eligible & not shown vs ineligible?"

Note: Reference Check exposure is measured using VisualEditorFeatureUse (VEFU) events (check-shown-presave). A revision tag (editcheck-references-shown) exists as a secondary signal; we keep it only as an audit check and do not use it to define exposure/groups in this report. For simplicity and consistency with existing analyses, we rely on VEFU only and exclude the tiny number of tag-only cases.

The next table answers: "Among published new-content edits in the test bucket, what fraction had Reference Check shown at least once?"
Code
# Reference Check exposure among published new-content edits (test bucket)
# Denominator: published new-content edits only (`is_new_content == 1`).
if (require_cols(reference_check_save_data,
                 c("test_group", "is_new_content", "was_reference_check_shown"),
                 "Reference Check exposure (new-content edits)")) {
  exposure_nc <- reference_check_save_data %>%
    renorm_buckets() %>%
    filter(test_group == "test", is_new_content == 1) %>%
    mutate(
      exposure_bucket = dplyr::case_when(
        was_reference_check_shown == 1 ~ "shown",
        TRUE ~ "not_shown"
      )
    ) %>%
    count(exposure_bucket) %>%
    mutate(pct = n / sum(n))
  render_pct_table(
    exposure_nc,
    "Reference Check exposure among published new-content edits (test bucket)",
    c(
      exposure_bucket = "Bucket",
      n = "Count (edits)",
      pct = "Percent of published new-content edits"
    ),
    note_text = paste(
      "Denominator = published new-content edits (rows) in the test bucket (`test_group == 'test'` and `is_new_content == 1`).",
      "Exposure/shown is measured via VEFU only (`was_reference_check_shown == 1`, action=check-shown-presave)."
    )
  )
} else {
  message("Reference Check exposure (new-content edits): required columns missing in reference_check_save_data")
}

Reference Check exposure among published new-content edits (test bucket)

Bucket    | Count (edits) | Percent of published new-content edits
not_shown | 2558          | 61.9%
shown     | 1574          | 38.1%

Table note: Denominator = published new-content edits (rows) in the test bucket (test_group == 'test' and is_new_content == 1). Exposure/shown is measured via VEFU only (was_reference_check_shown == 1, action=check-shown-presave).
Code
# Reference Check exposure among published new-content edits (test bucket; by editing_session_id)
# Note: `editing_session_id` corresponds to a single edit in this dataset; it links together that edit's steps/events.
if (require_cols(reference_check_save_data,
                 c("test_group", "is_new_content", "editing_session", "was_reference_check_shown"),
                 "Reference Check exposure (editing_session_id)")) {
  df_test_nc <- reference_check_save_data %>%
    renorm_buckets() %>%
    filter(test_group == "test", is_new_content == 1)
  exposure_sessions <- df_test_nc %>%
    group_by(editing_session) %>%
    summarise(
      shown_any = as.integer(any(was_reference_check_shown == 1, na.rm = TRUE)),
      .groups = "drop"
    ) %>%
    mutate(
      exposure_bucket = dplyr::if_else(shown_any == 1L, "shown", "not_shown")
    ) %>%
    count(exposure_bucket) %>%
    mutate(pct = n / sum(n))
  render_pct_table(
    exposure_sessions,
    "Reference Check exposure among published new-content edits (test bucket; by editing_session_id)",
    c(
      exposure_bucket = "Bucket",
      n = "Count (edits; distinct editing_session_id)",
      pct = "Percent of published new-content edits"
    ),
    note_text = paste(
      "Denominator = distinct editing_session_id (edits) among published new-content edits in the test bucket",
      "(`test_group == 'test'` and `is_new_content == 1`).",
      "An edit is counted as exposed/shown if any event for that edit has `was_reference_check_shown == 1`",
      "(measured via VEFU; action=check-shown-presave)."
    )
  )
}

Reference Check exposure among published new-content edits (test bucket; by editing_session_id)

Bucket    | Count (edits; distinct editing_session_id) | Percent of published new-content edits
not_shown | 2558                                       | 61.9%
shown     | 1574                                       | 38.1%

Table note: Denominator = distinct editing_session_id (edits) among published new-content edits in the test bucket (test_group == 'test' and is_new_content == 1). An edit is counted as exposed/shown if any event for that edit has was_reference_check_shown == 1 (measured via VEFU; action=check-shown-presave).

This table answers: "Among published new-content sessions in the test bucket, what fraction had the check shown at least once?"
4.1.3 Diagnostic Review
Code
# KPI & guardrail computations
render_fallback <- function(df, title, labels, pct_col = NULL, note_text = NULL) {
  if (is.null(df) || nrow(df) == 0) return(invisible(NULL))
  df <- renorm_buckets(df)
  df <- sanitize_counts_safe(df, intersect(names(df), c("n", "success", "total")))
  if (!is.null(pct_col) && pct_col %in% names(df)) {
    df[[pct_col]] <- scales::percent(df[[pct_col]], accuracy = 0.1)
  }
  gt_tbl <- df %>%
    gt() %>%
    tab_header(title = title) %>%
    cols_label(!!!labels) %>%
    opt_stylize() %>%
    tab_options(table.font.size = gt::px(13), data_row.padding = gt::px(2))
  gt_tbl %>%
    gt::as_raw_html() %>%
    IRdisplay::display_html()
  if (!is.null(note_text)) {
    IRdisplay::display_markdown(paste0("**Table note:** ", note_text))
  }
}

# Helper to pick first available column
pick_first <- function(candidates, df) {
  hits <- intersect(candidates, names(df))
  if (length(hits) == 0) return(NULL)
  hits[[1]]
}
Code
# KPI #1: new content with reference included OR acknowledgement (valid decline reason)
if (!is.null(reference_check_save_data)) {
  df_base <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, df_base)
  if (!is.null(ref_flag) && all(c("test_group") %in% names(df_base))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        "editing_session" %in% names(df_base)) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      df_base <- df_base %>%
        left_join(ack_sessions, by = "editing_session")
    }
    kpi1 <- df_base %>%
      mutate(
        ref_included = dplyr::if_else(
          is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)
        ),
        ack_reason_selected = dplyr::if_else(
          is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)
        ),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      ) %>%
      count(test_group, has_ref_ack) %>%
      group_by(test_group) %>%
      mutate(pct = n / sum(n))
    render_fallback(
      kpi1,
      "KPI #1 (Reference Added or acknowledged why a citation was not added)",
      c(
        test_group = "Test group",
        has_ref_ack = "Outcome (0/1)",
        n = "Count (edits)",
        pct = "Percent of new-content edits"
      ),
      pct_col = "pct",
      note_text = "Percent of new-content edits = share of rows within each analysis group. Outcome=1 if reference included (control) or reference included / valid decline acknowledgement (test). Denominator = new-content edits in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
    )
  } else {
    message("KPI #1: missing reference flag or required columns; update ref_flag candidates if needed")
  }
} else {
  message("KPI #1: reference_check_save_data missing")
}
KPI #1 (Reference Added or acknowledged why a citation was not added)

Test group | Outcome (0/1) | Count (edits) | Percent of new-content edits
control    | 0             | 1297          | 77.9%
control    | 1             | 368           | 22.1%
test       | 0             | 67            | 4.3%
test       | 1             | 1507          | 95.7%

Table note: Percent of new-content edits = share of rows within each analysis group. Outcome=1 if reference included (control) or reference included / valid decline acknowledgement (test). Denominator = new-content edits in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control).
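The percentages in the KPI #1 table are plain within-group shares, and can be sanity-checked directly from the counts (a quick Python check, not part of the R pipeline):

```python
# KPI #1 counts from the table above, keyed by outcome (0/1)
counts = {
    "control": {0: 1297, 1: 368},
    "test": {0: 67, 1: 1507},
}
# outcome-1 rate = outcome-1 edits / all new-content edits in the group
rates = {g: c[1] / (c[0] + c[1]) for g, c in counts.items()}
```

This reproduces the 22.1% (control) and 95.7% (test) outcome-1 shares reported in the table.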
Code
# KPI #2: constructive (not reverted) among new-content edits
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 0, 1, 0)) %>%
    group_by(test_group) %>%
    summarise(
      rate = mean(constructive, na.rm = TRUE),
      n = n(),
      .groups = "drop"
    )
  render_fallback(
    kpi2,
    "KPI #2 (constructive)",
    c(test_group = "Test group", rate = "Constructive rate", n = "Count (edits)"),
    pct_col = "rate",
    note_text = "Constructive rate = mean(0/1) where 1 = not reverted within 48h. Denominator = new-content edits in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
  )
} else {
  message("KPI #2: required columns missing in reference_check_save_data")
}

# Guardrail #1: revert breakdown by reference included (new content only)
if (!is.null(reference_check_save_data)) {
  df_g1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, df_g1)
  if (!is.null(ref_flag) && all(c("test_group", "was_reverted") %in% names(df_g1))) {
    guardrail1 <- df_g1 %>%
      mutate(
        reference_included = dplyr::if_else(
          is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)
        )
      ) %>%
      count(test_group, reference_included, was_reverted) %>%
      group_by(test_group, reference_included) %>%
      mutate(pct = n / sum(n))
    render_fallback(
      guardrail1,
      "Guardrail #1: revert by reference included",
      c(
        test_group = "Test group",
        reference_included = "Reference included (0/1)",
        was_reverted = "Reverted (0/1)",
        n = "Count (edits)",
        pct = "Percent of edits"
      ),
      pct_col = "pct",
      note_text = "Percent of edits = share of rows within each (test group × reference included). Denominator = new-content edits in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
    )
  } else {
    message("Guardrail #1: missing reference flag or required columns; update ref_flag candidates if needed")
  }
} else {
  message("Guardrail #1: reference_check_save_data missing")
}

# Guardrail #2: completion (saveSuccess) using edit_completion_rate_data
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "was_reference_check_shown", "saved_edit") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  completion_summary <- ec_df %>%
    mutate(
      was_reference_check_shown = ifelse(was_reference_check_shown == 1, "shown", "not_shown"),
      saved_edit = ifelse(saved_edit == 1, "saveSuccess", "other")
    ) %>%
    count(test_group, was_reference_check_shown, saved_edit) %>%
    group_by(test_group, was_reference_check_shown) %>%
    mutate(pct = n / sum(n))
  render_fallback(
    completion_summary,
    "Guardrail #2: completion counts",
    c(
      test_group = "Test group",
      was_reference_check_shown = "Reference Check shown",
      saved_edit = "Action",
      n = "Count (events)",
      pct = "Percent of events"
    ),
    pct_col = "pct",
    note_text = "Percent of events = share of rows (events) in `edit_completion_rate_data` within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population; unreverted when available)."
  )
} else {
  message("Guardrail #2: required columns missing in edit_completion_rate_data")
}
KPI #2 (constructive)

Test group | Constructive rate | Count (edits)
control    | 71.8%             | 1665
test       | 75.9%             | 1574

Table note: Constructive rate = mean(0/1) where 1 = not reverted within 48h. Denominator = new-content edits in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control).

Guardrail #1: revert by reference included

Test group | Reference included (0/1) | Reverted (0/1) | Count (edits) | Percent of edits
control    | 0                        | 0              | 879           | 67.8%
control    | 0                        | 1              | 418           | 32.2%
control    | 1                        | 0              | 317           | 86.1%
control    | 1                        | 1              | 51            | 13.9%
test       | 0                        | 0              | 416           | 68.3%
test       | 0                        | 1              | 193           | 31.7%
test       | 1                        | 0              | 779           | 80.7%
test       | 1                        | 1              | 186           | 19.3%

Table note: Percent of edits = share of rows within each (test group × reference included). Denominator = new-content edits in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control).

Guardrail #2: completion counts

Test group | Reference Check shown | Action      | Count (events) | Percent of events
control    | not_shown             | other       | 7686           | 11.7%
control    | not_shown             | saveSuccess | 58283          | 88.3%
test       | shown                 | other       | 231            | 15.9%
test       | shown                 | saveSuccess | 1223           | 84.1%

Table note: Percent of events = share of rows (events) in edit_completion_rate_data within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population; unreverted when available).
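The report's lift tables follow the relative-lift convention ((Treatment − Control) / Control). Applied to the KPI #2 constructive rates above, a quick Python illustration (not part of the R pipeline):

```python
# KPI #2 constructive rates from the table above
control_rate = 0.718
test_rate = 0.759

# relative lift = (Treatment - Control) / Control
relative_lift = (test_rate - control_rate) / control_rate
```

This works out to roughly a +5.7% relative improvement in constructive rate for the test group, before any adjustment or uncertainty quantification.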
4.2 Key metrics
4.2.1 KPI #1 Reference Added or acknowledged why a citation was not added
Metric: Proportion of published edits that add new content and either include a reference or explicitly acknowledge why a citation was not added.
Methodology: We analyze published edits that add new content.
Test group: The test group includes editing sessions where Reference Check was shown at least once during the editing session. This corresponds to event.feature = "editCheck-addReference" and event.action = "check-shown-presave". Only published edits are included. An edit is counted if it either adds at least one net new reference or includes an explicit acknowledgement for missing references by selecting one of the four valid decline reasons.
Control group: The control group includes published edits identified as eligible but not shown Reference Check. An edit is counted if it includes at least one net new reference.
We compare proportions between experiment groups (control vs test) overall and by platform / user status / checks-shown buckets. For adjusted comparisons we use multivariable logistic regression (glm); for lift/uncertainty we also report Bayesian lift via relax when available.
Note: Defined similarly to KPI 1 in the Multi Check report.
Results: When Reference Check was shown, edits were far more likely to either add a reference or clearly acknowledge why they didn't. Given the current KPI 1 definition, direct test-control comparisons are difficult to interpret because the "Decline" option is available only in the test group. To ensure a fair comparison across test and control groups, we focus on KPI 1b, which compares control and test without the decline option.
Code
# KPI1 bar (Reference Added or acknowledged why a citation was not added), by test_group
# Updated per KPI #1 methodology:
# - population: new-content edits within analysis groups (shown test vs eligible-not-shown control)
# - outcome: test counts (reference included OR valid decline acknowledgement); control counts (reference included)
if (!is.null(reference_check_save_data)) {
  kpi1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, kpi1_df)
  if (!is.null(ref_flag) && all(c("test_group") %in% names(kpi1_df))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    # Add acknowledgement (selected decline reason) at the editing_session level, when available
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        "editing_session" %in% names(kpi1_df)) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      kpi1_df <- kpi1_df %>%
        left_join(ack_sessions, by = "editing_session")
    }
    kpi1_bar_df <- kpi1_df %>%
      mutate(
        ref_included = dplyr::if_else(
          is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)
        ),
        ack_reason_selected = dplyr::if_else(
          is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)
        ),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      ) %>%
      group_by(test_group) %>%
      summarise(
        rate = mean(has_ref_ack, na.rm = TRUE),
        n = n(),
        .groups = "drop"
      )
    kpi1_bar <- kpi1_bar_df %>%
      mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
      ggplot(aes(x = test_group, y = rate, fill = test_group)) +
      geom_col() +
      geom_text(aes(label = label), vjust = -0.2, size = 4) +
      scale_y_continuous(
        labels = scales::percent_format(),
        expand = expansion(mult = c(0, 0.12))
      ) +
      scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
      labs(
        title = "KPI1: Reference Added or acknowledged why a citation was not added",
        x = "Test group",
        y = "Percent of new-content edits"
      ) +
      pc_theme() +
      guides(fill = "none")
    print(kpi1_bar)
  } else {
    message("KPI1 plot: reference flag or required columns missing")
  }
} else {
  message("KPI1 plot: data not loaded")
}
Chart note (definition of Rate / denominator)
KPI 1 (Reference Added or acknowledged why a citation was not added):
Rate = mean(has_ref_ack), where has_ref_ack is a 0/1 flag. In the test group, outcome=1 if the edit either includes a reference (was_reference_included == 1) or the user selected one of the four valid decline reasons. In the control group, outcome=1 if the edit includes a reference.
Denominator = new-content edit rows in reference_check_save_data (is_new_content == 1) within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).
Code

# KPI #1 tables (platform control vs test and deltas)
# (Includes platform + user experience breakdowns)
# Updated per methodology:
# - test = new-content edits where RC was shown at least once
# - control = new-content edits eligible-but-not-shown
# - outcome: test counts (reference included OR valid decline acknowledgement); control counts (reference included)
if (!is.null(reference_check_save_data)) {
  kpi1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, kpi1_df)
  needed_cols <- c(
    "test_group", "platform", "is_new_content",
    "was_reference_check_shown", "was_reference_check_eligible"
  )
  missing <- setdiff(
    c(needed_cols, if (is.null(ref_flag)) character() else ref_flag),
    names(kpi1_df)
  )
  if (is.null(ref_flag)) {
    message("KPI #1 tables: reference flag not found in data")
  } else if (length(missing) > 0) {
    message("KPI #1 tables: missing columns: ", paste(missing, collapse = ", "))
  } else {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    # Add acknowledgement (selected decline reason) at the editing_session level, when available
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        ("editing_session" %in% names(kpi1_df))) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      kpi1_df <- kpi1_df %>% left_join(ack_sessions, by = "editing_session")
    }
    kpi1_df <- kpi1_df %>%
      mutate(
        ref_included = dplyr::if_else(is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)),
        ack_reason_selected = dplyr::if_else(is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      )
    kpi1_rates <- kpi1_df %>% make_rate_table("has_ref_ack")
    kpi1_rel <- make_rel_change(kpi1_rates)
    render_rate_rel(
      kpi1_rates, kpi1_rel,
      "KPI #1: Reference Added or acknowledged why a citation was not added (by platform)",
      "KPI #1: change vs control",
      c(test_group = "Test group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome). Test outcome=1 if reference included OR a valid decline reason was selected; control outcome=1 if reference included. Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform)."
    )
    # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
    if ("experience_level_group" %in% names(kpi1_df)) {
      kpi1_exp_df <- kpi1_df %>%
        filter(
          !is.na(experience_level_group),
          experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor")
        )
      kpi1_exp_rates <- make_rate_table(kpi1_exp_df, "has_ref_ack", group_cols = c("test_group", "experience_level_group"))
      kpi1_exp_rel <- make_rel_change_dim(kpi1_exp_rates, dim_col = "experience_level_group")
      render_rate_rel(
        kpi1_exp_rates, kpi1_exp_rel,
        "KPI #1: Reference Added or acknowledged why a citation was not added (by user experience)",
        "KPI #1: change vs control (by user experience)",
        c(test_group = "Test group", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_rate = "Rate = mean(0/1 outcome). Test outcome=1 if reference included OR a valid decline reason was selected; control outcome=1 if reference included. Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
      )
    } else {
      message("KPI #1 user experience tables: experience_level_group not available in reference_check_save_data")
    }
  }
} else {
  message("KPI #1 tables: data not loaded")
}
KPI #1: Reference Added or acknowledged why a citation was not added (by platform)

Test group   Platform     Rate    Count (edits)
control      desktop      26.5%   1344
control      mobile-web    3.7%    321
test         desktop      95.8%   1295
test         mobile-web   95.3%    279

Table note: Rate = mean(0/1 outcome). Test outcome=1 if reference included OR a valid decline reason was selected; control outcome=1 if reference included. Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).
KPI #1: change vs control

Platform     Control rate   Test rate   Absolute difference (pp)   Relative change vs control   N (control)   N (test)
desktop      26.5%          95.8%       69.3                       261.8%                       1344          1295
mobile-web    3.7%          95.3%       91.6                       2,450.4%                      321           279

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
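As a quick sanity check, the table-note formulas can be applied directly to the desktop row of the KPI #1 platform table above (rates hard-coded from the table; the table's relative change is computed from unrounded rates, so it differs slightly):

```r
# Check the KPI #1 desktop row against the table-note formulas
control_rate <- 0.265  # desktop control rate, from the table above
test_rate    <- 0.958  # desktop test rate

abs_diff_pp <- (test_rate - control_rate) * 100           # absolute difference in pp
rel_change  <- (test_rate - control_rate) / control_rate  # relative change vs control

round(abs_diff_pp, 1)       # 69.3 pp, matching the table
round(rel_change * 100, 1)  # ~261.5%; the table's 261.8% uses unrounded rates
```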
KPI #1: Reference Added or acknowledged why a citation was not added (by user experience)

Test group   User experience      Rate    Count (edits)
control      Unregistered         11.4%    255
control      Newcomer             21.3%    230
control      Junior Contributor   24.6%   1180
test         Unregistered         90.4%    260
test         Newcomer             97.7%    221
test         Junior Contributor   96.6%   1093

Table note: Rate = mean(0/1 outcome). Test outcome=1 if reference included OR a valid decline reason was selected; control outcome=1 if reference included. Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).
KPI #1: change vs control (by user experience)

User experience      Control rate   Test rate   Absolute difference (pp)   Relative change vs control   N (control)   N (test)
Unregistered         11.4%          90.4%       79.0                       694.8%                        255           260
Newcomer             21.3%          97.7%       76.4                       358.8%                        230           221
Junior Contributor   24.6%          96.6%       72.0                       293.1%                       1180          1093

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Code

# KPI #1 by checks shown (bucketed) + user_status slice
# Updated per methodology (shown test vs eligible-not-shown control; test counts ref OR acknowledgement)
if (!is.null(reference_check_save_data)) {
  kpi1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, kpi1_df)
  if (!is.null(ref_flag) && all(c("test_group", "n_checks_shown") %in% names(kpi1_df))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        ("editing_session" %in% names(kpi1_df))) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      kpi1_df <- kpi1_df %>% left_join(ack_sessions, by = "editing_session")
    }
    kpi1_df <- kpi1_df %>%
      mutate(
        ref_included = dplyr::if_else(is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)),
        ack_reason_selected = dplyr::if_else(is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      )
    df_kpi1_checks <- kpi1_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )) %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(has_ref_ack, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      df_kpi1_checks,
      "KPI #1 by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (edits)"),
      note_text = "Rate = mean(0/1 outcome) in the test group only. Outcome=1 if reference included OR a valid decline acknowledgement was selected. Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
    )
    # KPI #1 by platform and user experience (Unregistered / Newcomer / Junior Contributor)
    if (all(c("platform", "experience_level_group") %in% names(kpi1_df))) {
      kpi1_exp_slices <- kpi1_df %>%
        filter(
          !is.na(experience_level_group),
          experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor")
        ) %>%
        group_by(test_group, platform, experience_level_group) %>%
        summarise(rate = mean(has_ref_ack, na.rm = TRUE), n = n(), .groups = "drop")
      render_slice(
        kpi1_exp_slices,
        "KPI #1: Reference Added or acknowledged why a citation was not added by platform and user experience",
        c(test_group = "Test group", platform = "Platform", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_text = "Rate = mean(0/1 outcome). Test outcome=1 if reference included OR valid decline acknowledgement; control outcome=1 if reference included. Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
      )
    } else {
      message("KPI #1 user experience slice: required columns missing in reference_check_save_data")
    }
  } else {
    message("KPI #1 by checks: reference flag or required columns missing in reference_check_save_data")
  }
} else {
  message("KPI #1 by checks: data not loaded")
}
KPI #1 by checks shown (test group only)

Checks shown   Rate    Count (edits)
1              95.3%   1086
2              94.7%    208
3+             98.2%    280

Table note: Rate = mean(0/1 outcome) in the test group only. Outcome=1 if reference included OR a valid decline acknowledgement was selected. Denominator = new-content edits (rows) in reference_check_save_data within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.
KPI #1: Reference Added or acknowledged why a citation was not added by platform and user experience

Test group   Platform     User experience      Rate     Count (edits)
control      desktop      Unregistered          15.6%    180
control      desktop      Newcomer              26.2%    187
control      desktop      Junior Contributor    28.6%    977
control      mobile-web   Unregistered           1.3%     75
control      mobile-web   Newcomer               0.0%    <50
control      mobile-web   Junior Contributor     5.4%    203
test         desktop      Unregistered          91.7%    193
test         desktop      Newcomer              97.4%    191
test         desktop      Junior Contributor    96.4%    911
test         mobile-web   Unregistered          86.6%     67
test         mobile-web   Newcomer             100.0%    <50
test         mobile-web   Junior Contributor    97.8%    182

Table note: Rate = mean(0/1 outcome). Test outcome=1 if reference included OR valid decline acknowledgement; control outcome=1 if reference included. Denominator = new-content edits (rows) in reference_check_save_data within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience).
4.2.1.1 Confirming the impact of Reference Check
Code

# KPI #1 model (reference/acknowledgement present)
# Updated per methodology (shown test vs eligible-not-shown control; test counts ref OR acknowledgement)
if (!is.null(reference_check_save_data)) {
  df_kpi1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, df_kpi1)
  if (!is.null(ref_flag) && all(c("test_group", "platform") %in% names(df_kpi1))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        ("editing_session" %in% names(df_kpi1))) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      df_kpi1 <- df_kpi1 %>% left_join(ack_sessions, by = "editing_session")
    }
    df_kpi1 <- df_kpi1 %>%
      mutate(
        ref_included = dplyr::if_else(is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)),
        ack_reason_selected = dplyr::if_else(is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)),
        has_ref_ack = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        )
      )
    tryCatch({
      # Frequentist adjusted comparison (glm)
      m_kpi1 <- glm(has_ref_ack ~ test_group + platform, data = df_kpi1, family = binomial())
      render_binom_model(
        m_kpi1,
        "Table 1. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1 outcome among new-content edits.",
        note_text = "Outcome=1 if reference included (control) or reference included / valid decline acknowledgement (test). Adjusted for platform. OR>1 indicates higher odds of the outcome."
      )
      # Hierarchical Bayesian confirmation (brms): random intercept by user_id (if available)
      if (all(c("user_id", "experience_level_group") %in% names(df_kpi1))) {
        df_brm <- df_kpi1 %>%
          mutate(
            test_group = factor(test_group, levels = c("control", "test")),
            platform = factor(platform),
            experience_level_group = droplevels(experience_level_group)
          ) %>%
          filter(!is.na(user_id), !is.na(has_ref_ack), !is.na(test_group), !is.na(platform))
        if (dplyr::n_distinct(df_brm$user_id) > 1 && dplyr::n_distinct(df_brm$test_group) == 2) {
          if (!requireNamespace("brms", quietly = TRUE)) {
            message("KPI #1 brms: skipped (brms not available / cannot be loaded in this environment)")
          } else if (!exists("safe_brm", mode = "function")) {
            message("KPI #1 brms: skipped (safe_brm not defined; run the setup/helper cells first)")
          } else {
            priors <- c(
              brms::set_prior(prior = "std_normal()", class = "b"),
              brms::set_prior("cauchy(0, 5)", class = "sd")
            )
            fit_brm <- safe_brm(
              has_ref_ack ~ test_group + platform + experience_level_group + (1 | user_id),
              data = df_brm,
              prior = priors,
              seed = 42, chains = 4, cores = 4, refresh = 0  # integer argument values lost in extraction; typical values shown
            )
            if (!is.null(fit_brm)) {
              # Posterior-derived lift (probability space) + OR summary (multi-check style)
              nd_ctrl <- df_brm %>% mutate(test_group = factor("control", levels = c("control", "test")))
              nd_test <- df_brm %>% mutate(test_group = factor("test", levels = c("control", "test")))
              render_brms_confirm_table(
                fit = fit_brm,
                title = "Table 1B. Hierarchical Bayesian confirmation for KPI #1 outcome among new-content edits.",
                coef_name = "b_test_grouptest",
                newdata_control = nd_ctrl,
                newdata_test = nd_test,
                note_text = "Posterior-derived average lift is computed as the per-draw mean of Pr(outcome|test) − Pr(outcome|control) over the observed covariate distribution (platform + experience), using population-level predictions (re_formula = NA)."
              )
            }
          }
        } else {
          message("KPI #1 brms: skipped (insufficient variation in user_id or test_group)")
        }
      } else {
        message("KPI #1 brms: skipped (missing user_id or experience_level_group)")
      }
    }, error = function(e) {
      message("KPI #1 model error: ", e$message)
    })
  } else {
    message("KPI #1 model: reference flag or required columns missing")
  }
} else {
  message("KPI #1 model: data not loaded")
}
Table 1. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1 outcome among new-content edits.

Term                 OR       CI low   CI high   SE      p-value
Intercept            0.340    0.301    0.383     0.062   <0.001
test_grouptest       95.515   72.006   128.996   0.149   <0.001
platformmobile-web   0.273    0.198    0.374     0.162   <0.001

Table note: Outcome=1 if reference included (control) or reference included / valid decline acknowledgement (test). Adjusted for platform. OR>1 indicates higher odds of the outcome.
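The ORs in Table 1 are exponentiated logistic-regression coefficients. As a sketch (not the notebook's own computation), the test_grouptest row can be approximately recovered from its coefficient and SE; a Wald-style interval is assumed here, so the endpoints differ slightly from Table 1's:

```r
# Recover the test_grouptest row of Table 1 from its log-odds coefficient
b  <- log(95.515)  # coefficient on the log-odds scale (OR = exp(b))
se <- 0.149        # standard error, from Table 1

or      <- exp(b)
ci_low  <- exp(b - 1.96 * se)  # Wald-style 95% CI (assumed construction)
ci_high <- exp(b + 1.96 * se)

round(or, 3)                            # 95.515
c(round(ci_low, 1), round(ci_high, 1))  # roughly 71-128; Table 1's 72.0-129.0 likely uses profile CIs
```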
Start sampling
Running MCMC with 4 parallel chains...

Chain 3 finished in 15.8 seconds.
Chain 2 finished in 19.7 seconds.
Chain 4 finished in 20.5 seconds.
Chain 1 finished in 20.7 seconds.

All 4 chains finished successfully.
Mean chain execution time: 19.2 seconds.
Total execution time: 20.8 seconds.
Loading required package: rstan

Loading required package: StanHeaders

Error: package or namespace load failed for ‘rstan’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so':
/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so: undefined symbol: _ZN3tbb8internal26task_scheduler_observer_v37observeEb

brms fit skipped (backend=cmdstanr): unable to find required package ‘rstan’
Continuing with glm + relax outputs.
To run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain.
Code

# KPI #1 Bayesian lift (relax) — reference/acknowledgement
# Updated per methodology (shown test vs eligible-not-shown control; test counts ref OR acknowledgement)
if (!is.null(reference_check_save_data)) {
  kpi1_src <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  ref_flag <- pick_first(reference_flag_candidates, kpi1_src)
  if (!is.null(ref_flag) && all(c("test_group") %in% names(kpi1_src))) {
    valid_reasons <- c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    )
    if (!is.null(reference_check_rejects_data) &&
        all(c("editing_session", "reject_reason") %in% names(reference_check_rejects_data)) &&
        ("editing_session" %in% names(kpi1_src))) {
      ack_sessions <- reference_check_rejects_data %>%
        renorm_buckets() %>%
        filter(reject_reason %in% valid_reasons) %>%
        distinct(editing_session) %>%
        mutate(ack_reason_selected = 1L)
      kpi1_src <- kpi1_src %>% left_join(ack_sessions, by = "editing_session")
    }
    kpi1_df <- kpi1_src %>%
      transmute(
        ref_included = dplyr::if_else(is.na(.data[[ref_flag]]), 0L, as.integer(.data[[ref_flag]] == 1)),
        ack_reason_selected = dplyr::if_else(is.na(ack_reason_selected), 0L, as.integer(ack_reason_selected == 1)),
        outcome = dplyr::if_else(
          test_group == "test",
          as.integer((ref_included == 1) | (ack_reason_selected == 1)),
          ref_included
        ),
        variation = dplyr::case_when(
          test_group == "control" ~ "control",
          test_group == "test" ~ "treatment",
          TRUE ~ as.character(test_group)
        )
      )
    render_relax(kpi1_df, "KPI #1", metric_type = "proportion", better = "higher")
  } else {
    message("KPI #1 relax: reference flag or required columns missing")
  }
} else {
  message("KPI #1 relax: data not loaded")
}
KPI #1 — Relative lift ((Treatment − Control) / Control)

Analysis      Point Estimate   Chance to Win   P(Treatment better)   p-value   95% Lower   95% Upper
Bayesian      2.302            1.000           1.000                 —         1.975       2.629
Frequentist   3.332            —               —                     0.000     2.938       3.725

Table note: For the Bayesian analysis, the 95% interval is a credible interval (CrI); for the frequentist analysis, it is a confidence interval (CI).

Interpretation: Based on relax, the posterior probability that treatment is better than control is 100.0% (computed as Chance to Win).
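The relax internals are not shown in this notebook. The "Chance to Win" idea can be sketched with a minimal conjugate Beta-Binomial model; the counts below are illustrative, approximated by pooling the KPI #1 platform table, and are not the exact inputs relax used:

```r
set.seed(42)
# Approximate pooled counts from the KPI #1 platform table (illustrative only)
x_c <- 368;  n_c <- 1665   # control: outcome=1 edits / new-content edits
x_t <- 1506; n_t <- 1574   # test: outcome=1 edits / new-content edits

# Posterior draws for each rate under independent Beta(1, 1) priors
draws_c <- rbeta(1e5, 1 + x_c, 1 + n_c - x_c)
draws_t <- rbeta(1e5, 1 + x_t, 1 + n_t - x_t)

chance_to_win <- mean(draws_t > draws_c)     # posterior P(treatment better)
rel_lift <- quantile(draws_t / draws_c - 1,  # relative lift, as in the table above
                     c(0.025, 0.5, 0.975))
chance_to_win  # ~1.0, consistent with the reported Chance to Win
```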
4.2.2 KPI #1b New reference included

Metric: Proportion of published edits that add new content, are constructive (not reverted within 48 hours), and include at least one net new reference.

Methodology: We analyze published edits that add new content and exclude edits reverted within 48 hours.

Test group: editing sessions where Reference Check was shown at least once during the editing session.

Control group: published edits identified as eligible but not shown Reference Check.

Population definition (KPI #1b): a published edit is included if is_new_content == 1 and was_reverted != 1 (i.e., not reverted within 48 hours).

Outcome definition (KPI #1b): an edit is counted (outcome=1) if it includes at least one net new reference (in this notebook: was_reference_included == 1 when available).
Important: population / comparability note

This notebook's primary KPI #1b reporting uses shown vs eligible-not-shown analysis groups:

Test = new-content edits where Reference Check was shown at least once.
Control = new-content edits tagged eligible but not shown.

This is an exposure-style (per-protocol) estimate: it answers "what is the effect when Reference Check is actually shown?"

For comparability with prior reports (e.g., 2024), we also report an availability / intent-to-treat (ITT) version for KPI #1b: Test vs Control assignment buckets, among all published new-content edits (not restricted to shown/eligible). The denominator is all new-content edits, and the outcome is 1 only if the edit both includes a new reference and is not reverted within 48 hours.
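The two population constructions above can be sketched in base R. The column names match those used elsewhere in this notebook; the toy rows themselves are made up for illustration:

```r
# Toy rows standing in for reference_check_save_data (values are made up)
edits <- data.frame(
  was_reference_check_shown    = c(1, 0, 0, 0),
  was_reference_check_eligible = c(1, 1, 1, 0),
  is_new_content               = c(1, 1, 1, 1),
  was_reverted                 = c(0, 0, 0, 1),
  was_reference_included       = c(1, 0, 1, 0)
)

# Per-protocol (this notebook's primary KPI #1b): shown vs eligible-not-shown,
# restricted to constructive new-content edits
pp <- subset(edits, is_new_content == 1 & was_reverted != 1)
pp$group <- ifelse(pp$was_reference_check_shown == 1, "test",
                   ifelse(pp$was_reference_check_eligible == 1, "control", NA))
pp <- subset(pp, !is.na(group))

# ITT (availability): all new-content edits, split by assignment bucket;
# outcome requires a new reference AND no revert within 48 hours
itt <- edits[edits$is_new_content == 1, ]
itt$outcome <- as.integer(itt$was_reference_included == 1 & itt$was_reverted != 1)
```

Note how the revert filter moves: in the per-protocol view it restricts the denominator, while in the ITT view it becomes part of the outcome.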
Why KPI #1b? KPI #1 counts edits that either include a new reference or explicitly acknowledge why a citation was not added. KPI #1b removes the acknowledgement component to isolate the effect on references actually added.
Results: When Reference Check was shown, editors were much more likely to add a new reference. The effect is large and statistically significant in the pooled adjusted model (adjusted for platform). How big is the change:

Desktop: editors were ~2.2× more likely to add a new reference (30.7% → 68.2%).
Mobile web: editors were ~17.5× more likely to add a new reference (2.8% → 48.9%).

The increase in references added is substantial on both platforms. Across both adjusted models and simpler comparisons, the evidence is clear: Reference Check materially increases the likelihood that editors add new references.

Note: KPI #1 and KPI #1b above compare edits where Reference Check was shown with edits that were eligible but not shown (exposure-style, per-protocol). For KPI #1b, the population is further limited to constructive new-content edits that were not reverted within 48 hours.

In one sentence: among constructive new-content edits (not reverted within 48 hours), edits where Reference Check was shown were ~2.2× more likely on desktop (30.7% → 68.2%) and ~17.5× more likely on mobile web (2.8% → 48.9%) to include at least one net new reference than eligible edits where Reference Check was not shown.

KPI #1b (shown/eligible; reference added on constructive new-content edits):

Overall: ↑ +38.7 pp (26.5% → 65.2%), +146.0% relative (roughly 2.46×).
Desktop: ↑ +37.5 pp (30.7% → 68.2%), +122.1% relative (roughly 2.2×).
Mobile web: ↑ +46.1 pp (2.8% → 48.9%), +1,646.4% relative (~17.5×).
Evidence: glm (Table 1b) OR = 5.56 (95% CI 4.65–6.66), p < 0.001.
Relax (relative lift): Bayesian = +1.23 (95% CrI 1.00–1.46, P(test better) = 1.00); frequentist = +1.46 (95% CI 1.21–1.71), p < 0.001.
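The multipliers quoted above follow directly from the per-platform rates (hard-coded here from the KPI #1b results):

```r
# Reproduce the "~2.2x" and "~17.5x" multipliers from the KPI #1b rates
desktop <- c(control = 0.307, test = 0.682)
mobile  <- c(control = 0.028, test = 0.489)

round(desktop[["test"]] / desktop[["control"]], 1)        # 2.2
round(mobile[["test"]]  / mobile[["control"]],  1)        # 17.5
round((mobile[["test"]] - mobile[["control"]]) * 100, 1)  # 46.1 pp
```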
Code

# KPI1b bar (new reference included), by test_group
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>% filter(is.na(was_reverted) | was_reverted != 1)
  }
  # KPI1b uses reference-included only (no acknowledgement)
  ref_col <- if ("was_reference_included" %in% names(kpi1b_df)) {
    "was_reference_included"
  } else {
    pick_first(c("was_reference_included", "reference_added", "has_reference_added", "has_reference"), kpi1b_df)
  }
  if (!is.null(ref_col) && all(c("test_group") %in% names(kpi1b_df))) {
    kpi1b_bar <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1))) %>%
      group_by(test_group) %>%
      summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop") %>%
      mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
      ggplot(aes(x = test_group, y = rate, fill = test_group)) +
      geom_col() +
      geom_text(aes(label = label), vjust = -0.2, size = 3) +  # vjust/size values partly lost in extraction; typical values shown
      scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = 0.12)) +
      scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
      labs(
        title = "KPI1b: New reference included",
        x = "Test group",
        y = "Percent of constructive new-content edits"
      ) +
      pc_theme() +
      guides(fill = "none")
    print(kpi1b_bar)
  } else {
    message("KPI1b plot: required columns missing (reference flag or test_group)")
  }
} else {
  message("KPI1b plot: data not loaded")
}
Chart note (definition of Rate / denominator)

KPI 1b (new reference included)

Rate = mean(ref_included), where ref_included is a 0/1 flag (1 = at least one net new reference included).

Denominator = constructive new-content edit rows in reference_check_save_data where is_new_content == 1 and was_reverted != 1 (not reverted within 48 hours), within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).
Code
# KPI #1b tables (platform control vs test and deltas)
# Updated per methodology:
# - test = new-content edits where RC was shown at least once
# - control = new-content edits eligible-but-not-shown
# - outcome: reference included only (no acknowledgement)
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  ref_col <- if ("was_reference_included" %in% names(kpi1b_df)) {
    "was_reference_included"
  } else {
    pick_first(c(
      "was_reference_included", "reference_added", "has_reference_added", "has_reference"
    ), kpi1b_df)
  }
  needed_cols <- c("test_group", "platform")
  missing <- setdiff(c(needed_cols, if (is.null(ref_col)) character() else ref_col), names(kpi1b_df))
  if (is.null(ref_col)) {
    message("KPI #1b tables: reference-included flag not found in data")
  } else if (length(missing) > 0) {
    message("KPI #1b tables: missing columns: ", paste(missing, collapse = ", "))
  } else {
    kpi1b_df <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))
    # Overall (control vs test) + change vs control
    kpi1b_overall_rates <- make_rate_table(kpi1b_df, "ref_included", group_cols = "test_group") %>%
      mutate(scope = "Overall")
    kpi1b_overall_rel <- make_rel_change_dim(kpi1b_overall_rates, dim_col = "scope")
    render_rate_rel(
      kpi1b_overall_rates,
      kpi1b_overall_rel,
      "KPI #1b: new reference included (overall)",
      "KPI #1b: change vs control (overall)",
      c(test_group = "Test group", scope = "Scope", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control)."
    )
    # By platform (control vs test) + change vs control
    kpi1b_rates <- kpi1b_df %>% make_rate_table("ref_included")
    kpi1b_rel <- make_rel_change(kpi1b_rates)
    render_rate_rel(
      kpi1b_rates, kpi1b_rel,
      "KPI #1b: new reference included (by platform)",
      "KPI #1b: change vs control (by platform)",
      c(test_group = "Test group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform)."
    )
    # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
    if ("experience_level_group" %in% names(kpi1b_df)) {
      kpi1b_exp_df <- kpi1b_df %>%
        filter(!is.na(experience_level_group), experience_level_group %in% c(
          "Unregistered", "Newcomer", "Junior Contributor"
        ))
      kpi1b_exp_rates <- make_rate_table(kpi1b_exp_df, "ref_included", group_cols = c("test_group", "experience_level_group"))
      kpi1b_exp_rel <- make_rel_change_dim(kpi1b_exp_rates, dim_col = "experience_level_group")
      render_rate_rel(
        kpi1b_exp_rates, kpi1b_exp_rel,
        "KPI #1b: new reference included (by user experience)",
        "KPI #1b: change vs control (by user experience)",
        c(test_group = "Test group", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_rate = "Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
      )
    } else {
      message("KPI #1b user experience tables: experience_level_group not available in reference_check_save_data")
    }
  }
} else {
  message("KPI #1b tables: data not loaded")
}
KPI #1b: new reference included (overall)

| Test group | Rate | Count (edits) | Scope |
|---|---|---|---|
| control | 26.5% | 1196 | Overall |
| test | 65.2% | 1195 | Overall |

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control).

KPI #1b: change vs control (overall)

| Scope | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Overall | 26.5% | 65.2% | 38.7 | 145.9% | 1196 | 1195 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
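The delta columns follow directly from the two rates; a quick Python check (illustrative only, using the rounded rates from the table above):

```python
# Recompute the "change vs control" columns from the displayed rates.
control_rate, test_rate = 0.265, 0.652

abs_diff_pp = (test_rate - control_rate) * 100          # absolute difference in percentage points
rel_change = (test_rate - control_rate) / control_rate  # relative change vs control

print(round(abs_diff_pp, 1))       # 38.7
print(round(rel_change * 100, 1))  # ~146.0 (table shows 145.9% from unrounded rates)
```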
KPI #1b: new reference included (by platform)

| Test group | Platform | Rate | Count (edits) |
|---|---|---|---|
| control | desktop | 30.7% | 1015 |
| control | mobile-web | 2.8% | 181 |
| test | desktop | 68.2% | 1009 |
| test | mobile-web | 48.9% | 186 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).

KPI #1b: change vs control (by platform)

| Platform | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| desktop | 30.7% | 68.2% | 37.4 | 121.8% | 1015 | 1009 |
| mobile-web | 2.8% | 48.9% | 46.2 | 1671.1% | 181 | 186 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

KPI #1b: new reference included (by user experience)

| Test group | User experience | Rate | Count (edits) |
|---|---|---|---|
| control | Unregistered | 13.8% | 160 |
| control | Newcomer | 31.6% | 136 |
| control | Junior Contributor | 28.0% | 900 |
| test | Unregistered | 63.9% | 183 |
| test | Newcomer | 69.2% | 143 |
| test | Junior Contributor | 64.8% | 869 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means at least one net new reference included (per edit). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).

KPI #1b: change vs control (by user experience)

| User experience | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Unregistered | 13.8% | 63.9% | 50.2 | 365.0% | 160 | 183 |
| Newcomer | 31.6% | 69.2% | 37.6 | 119.0% | 136 | 143 |
| Junior Contributor | 28.0% | 64.8% | 36.8 | 131.4% | 900 | 869 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Code
# KPI #1b by checks shown (bucketed) + platform/user_status slice
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data)) {
  kpi1b_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(kpi1b_df)) {
    kpi1b_df <- kpi1b_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  ref_col <- if ("was_reference_included" %in% names(kpi1b_df)) {
    "was_reference_included"
  } else {
    pick_first(c(
      "was_reference_included", "reference_added", "has_reference_added", "has_reference"
    ), kpi1b_df)
  }
  if (!is.null(ref_col) && all(c("test_group", "n_checks_shown") %in% names(kpi1b_df))) {
    kpi1b_df <- kpi1b_df %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))
    df_kpi1b_checks <- kpi1b_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )) %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      df_kpi1b_checks,
      "KPI #1b by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (edits)"),
      note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
    )
    # KPI #1b by platform and user experience (Unregistered / Newcomer / Junior Contributor)
    if (all(c("platform", "experience_level_group") %in% names(kpi1b_df))) {
      kpi1b_exp_slices <- kpi1b_df %>%
        filter(!is.na(experience_level_group), experience_level_group %in% c(
          "Unregistered", "Newcomer", "Junior Contributor"
        )) %>%
        group_by(test_group, platform, experience_level_group) %>%
        summarise(rate = mean(ref_included, na.rm = TRUE), n = n(), .groups = "drop")
      render_slice(
        kpi1b_exp_slices,
        "KPI #1b: new reference included by platform and user experience",
        c(test_group = "Test group", platform = "Platform", experience_level_group = "User experience", rate = "Rate", n = "Count (edits)"),
        note_text = "Rate = mean(0/1 outcome) where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
      )
    } else {
      message("KPI #1b user experience slice: required columns missing in reference_check_save_data")
    }
  } else {
    message("KPI #1b by checks: required columns missing in reference_check_save_data")
  }
} else {
  message("KPI #1b by checks: data not loaded")
}
KPI #1b by checks shown (test group only)

| Checks shown | Rate | Count (edits) |
|---|---|---|
| 1 | 62.1% | 832 |
| 2 | 66.9% | 166 |
| 3+ | 76.6% | 197 |

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.

KPI #1b: new reference included by platform and user experience

| Test group | Platform | User experience | Rate | Count (edits) |
|---|---|---|---|---|
| control | desktop | Unregistered | 18.6% | 118 |
| control | desktop | Newcomer | 36.1% | 119 |
| control | desktop | Junior Contributor | 31.7% | 778 |
| control | mobile-web | Unregistered | 0.0% | <50 |
| control | mobile-web | Newcomer | 0.0% | <50 |
| control | mobile-web | Junior Contributor | 4.1% | 122 |
| test | desktop | Unregistered | 67.2% | 137 |
| test | desktop | Newcomer | 72.4% | 127 |
| test | desktop | Junior Contributor | 67.7% | 745 |
| test | mobile-web | Unregistered | 54.3% | <50 |
| test | mobile-web | Newcomer | 43.8% | <50 |
| test | mobile-web | Junior Contributor | 47.6% | 124 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means reference included (no acknowledgement). Denominator = constructive new-content edits (rows) in `reference_check_save_data` where `is_new_content == 1` and `was_reverted != 1` (not reverted within 48 hours), within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience).
4.2.2.1 Confirming the impact of Reference Check
Code
# KPI #1b model (reference included only)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data)) {
  df_kpi1b <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  # KPI #1b population: constructive new-content edits only (exclude reverted within 48h)
  if ("was_reverted" %in% names(df_kpi1b)) {
    df_kpi1b <- df_kpi1b %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  ref_col <- if ("was_reference_included" %in% names(df_kpi1b)) {
    "was_reference_included"
  } else {
    pick_first(c(
      "was_reference_included", "reference_added", "has_reference_added", "has_reference"
    ), df_kpi1b)
  }
  if (!is.null(ref_col) && all(c("test_group", "platform") %in% names(df_kpi1b))) {
    df_kpi1b <- df_kpi1b %>%
      mutate(ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)))
    tryCatch({
      m_kpi1b <- glm(ref_included ~ test_group + platform, data = df_kpi1b, family = binomial())
      render_binom_model(
        m_kpi1b,
        "Table 1b. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among constructive new-content edits.",
        note_text = "Outcome=1 means at least one net new reference included on a constructive new-content edit (not reverted within 48h). Population is restricted to shown test vs eligible-not-shown control. Adjusted for platform. OR>1 indicates higher odds of the outcome."
      )
    }, error = function(e) {
      message("KPI #1b model error: ", e$message)
    })
    # KPI #1b Bayesian lift (relax) — reference included only
    kpi1b_relax_df <- df_kpi1b %>%
      transmute(
        outcome = ref_included,
        variation = dplyr::case_when(
          test_group == "control" ~ "control",
          test_group == "test" ~ "treatment",
          TRUE ~ as.character(test_group)
        )
      )
    render_relax(kpi1b_relax_df, "KPI #1b", metric_type = "proportion", better = "higher")
  } else {
    message("KPI #1b model: required columns missing")
  }
} else {
  message("KPI #1b model: data not loaded")
}
Table 1b. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among constructive new-content edits.

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 0.414 | 0.362 | 0.471 | 0.067 | <0.001 |
| test_grouptest | 5.555 | 4.645 | 6.659 | 0.092 | <0.001 |
| platformmobile-web | 0.301 | 0.229 | 0.392 | 0.137 | <0.001 |

Table note: Outcome=1 means at least one net new reference included on a constructive new-content edit (not reverted within 48h). Population is restricted to shown test vs eligible-not-shown control. Adjusted for platform. OR>1 indicates higher odds of the outcome.
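As a sanity check, the fitted ORs in Table 1b imply group rates close to the observed ones. The sketch below (Python, illustrative only) converts the intercept and `test_grouptest` ORs back to predicted rates for the desktop baseline; a main-effects model will not reproduce subgroup rates exactly.

```python
# Convert Table 1b odds ratios back to predicted probabilities:
# odds = OR_intercept * OR_test^test * OR_mobile^mobile; rate = odds / (1 + odds)
or_intercept, or_test, or_mobile = 0.414, 5.555, 0.301

def rate(test=0, mobile=0):
    odds = or_intercept * (or_test ** test) * (or_mobile ** mobile)
    return odds / (1 + odds)

print(round(rate(test=0), 3))  # desktop control ~0.293 (observed 30.7%)
print(round(rate(test=1), 3))  # desktop test ~0.697 (observed 68.2%)
```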
KPI #1b
Relative lift ((Treatment − Control) / Control)

| Analysis | Point estimate | Chance to Win (P(Treatment better)) | p-value | 95% interval |
|---|---|---|---|---|
| Bayesian | 1.231 | 1.000 | — | 0.998 to 1.464 (CrI) |
| Frequentist | 1.459 | — | 0.000 | 1.206 to 1.713 (CI) |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 100.0% (computed as Chance to Win).
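relax is an internal tool whose implementation is not shown here; as an illustration of how a "Chance to Win" can be computed for a proportion metric, here is a minimal beta-binomial Monte Carlo sketch. The Beta(1, 1) priors are an assumption (not necessarily relax's actual priors), and the success counts are back-computed from the 26.5% / 65.2% rates and group sizes.

```python
# Posterior "Chance to Win" via beta-binomial simulation (uniform priors assumed).
import random

random.seed(42)
ctrl_success, ctrl_n = 317, 1196   # ~26.5% of 1196
test_success, test_n = 779, 1195   # ~65.2% of 1195

draws = 10_000
wins = sum(
    random.betavariate(1 + test_success, 1 + test_n - test_success)
    > random.betavariate(1 + ctrl_success, 1 + ctrl_n - ctrl_success)
    for _ in range(draws)
)
print(wins / draws)  # ~1.0: treatment is practically certain to be better
```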
4.2.2.2 KPI #1b (availability / ITT)
This section reports KPI #1b using assignment buckets (control vs test) among all published new-content edits (not restricted to shown/eligible). This matches the 2024-style definition, where the denominator is all new-content edits and the outcome is 1 only if the edit both (a) includes a new reference and (b) is not reverted within 48 hours.
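The composite ITT outcome can be sketched on toy rows (Python, illustrative only; the column names mirror the report's, the rows are invented):

```python
# ITT outcome: 1 only if the edit both adds a reference AND survives 48 hours.
edits = [
    {"ref_included": 1, "was_reverted": 0},  # counts
    {"ref_included": 1, "was_reverted": 1},  # reverted -> 0
    {"ref_included": 0, "was_reverted": 0},  # no reference -> 0
]

outcomes = [int(e["ref_included"] == 1 and e["was_reverted"] != 1) for e in edits]
print(outcomes)  # [1, 0, 0]
```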
Results:
We also report KPI #1b by test vs. control assignment, regardless of whether Reference Check was shown (intent-to-treat). This includes all published new-content edits in the target population and is directly comparable to the 2024 Reference Check study data. Under this more conservative view, Reference Check still shows a benefit: edits in the test group were more likely to be constructive new-content edits that included a net new reference.

How big is the change (ITT):
- Overall: 56.3% → 68.3% (+12.1 pp, +21.5% relative)
- Desktop: 60.5% → 70.6% (+10.2 pp, +16.8% relative)
- Mobile web: editors were ~2.2× more likely to add a reference (22.0% → 47.8%)
Even under the conservative ITT lens (not restricted to shown/eligible), edits in the test group were more likely to be constructive new-content edits that included a net new reference, especially on mobile-web; this lift is statistically significant in the adjusted ITT model.
Note: In the 2024 Reference Check report: “Users [i.e., based on an edit-level (edit session) comparison—not unique-user aggregation] were 2.2 times more likely to publish a new content edit that included a reference and was constructive (not reverted within 48 hours) when reference check was shown to eligible edits.” “On mobile, new content edits by contributors were 4.2 times more likely to include a reference and not be reverted when reference check was shown to eligible edits.”
KPI #1b (Availability / ITT; 2024-comparable): ↑ +12.1 pp (56.3% → 68.3%), +21.5% relative.
Evidence: glm (Table 1c) OR = 1.69 (95% CI 1.54–1.86), p < 0.001.
Relax (ITT, relative lift): +0.215 (Bayesian 95% CrI 0.172–0.255; Frequentist 95% CI 0.173–0.256; p < 0.001).
Code
# KPI #1b (availability / ITT; 2024-comparable): constructive reference-including edits
# - Uses assignment buckets (control vs test) among all new-content edits
# - Denominator: all published new-content edits in the target population (<=100 edits or unregistered)
# - Outcome: 1 only if (reference included) AND (not reverted within 48h)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted") %in% names(reference_check_save_data))) {
  df_itt <- reference_check_save_data %>%
    renorm_buckets() %>%
    filter(is_new_content == 1)
  # Match target population used elsewhere: unregistered OR <= 100 edits
  if (all(c("user_status", "user_edit_count") %in% names(df_itt))) {
    df_itt <- df_itt %>%
      filter(user_status == "unregistered" | is.na(user_edit_count) | user_edit_count <= 100)
  }
  # Outcome flag for reference included
  ref_col <- if ("was_reference_included" %in% names(df_itt)) {
    "was_reference_included"
  } else {
    pick_first(c(
      "was_reference_included", "reference_added", "has_reference_added", "has_reference"
    ), df_itt)
  }
  if (is.null(ref_col)) {
    message("KPI #1b (ITT): reference-included flag not found in data")
  } else {
    df_itt <- df_itt %>%
      mutate(
        ref_included = ifelse(is.na(.data[[ref_col]]), 0L, as.integer(.data[[ref_col]] == 1)),
        unreverted_48h = ifelse(is.na(was_reverted), NA_integer_, as.integer(was_reverted != 1)),
        outcome = ifelse(is.na(unreverted_48h), NA_integer_,
                         as.integer((ref_included == 1) & (unreverted_48h == 1))),
        test_group = factor(test_group, levels = c("control", "test"))
      )
    # If edits are not 1:1 with editing_session, dedupe to per-edit first
    if ("editing_session" %in% names(df_itt)) {
      df_itt <- df_itt %>%
        group_by(editing_session) %>%
        summarise(
          test_group = dplyr::first(test_group),
          platform = dplyr::first(if ("platform" %in% names(df_itt)) platform else NA),
          outcome = if (all(is.na(outcome))) {
            NA_real_
          } else {
            base::max(outcome, na.rm = TRUE)
          },
          .groups = "drop"
        )
    }
    # Persist the final ITT per-edit frame for the chart below
    df_kpi1b_itt <- df_itt
    # 1) Overall ITT rate + change vs control
    itt_overall_rates <- make_rate_table(df_itt, "outcome", group_cols = "test_group") %>%
      mutate(scope = "Overall")
    itt_overall_rel <- make_rel_change_dim(itt_overall_rates, dim_col = "scope")
    render_rate_rel(
      itt_overall_rates,
      itt_overall_rel,
      "Constructive new content edits that include a new reference and are not reverted",
      "Constructive new content edits that include a new reference and are not reverted: change vs control",
      c(test_group = "Experiment group", scope = "Scope", rate = "Rate", n = "Count (edits)"),
      note_rate = paste(
        "Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit).",
        "Denominator = all published new-content edits within each assignment bucket (control vs test)",
        "in the target population (unregistered OR <=100 edits)."
      )
    )
    # 2) By platform ITT rate + change vs control
    if ("platform" %in% names(df_itt)) {
      df_itt <- df_itt %>% mutate(platform = factor(platform))
      itt_rates <- make_rate_table(df_itt, "outcome", group_cols = c("test_group", "platform"))
      itt_rel <- make_rel_change(itt_rates)
      render_rate_rel(
        itt_rates,
        itt_rel,
        "Constructive new content edits that include a new reference and are not reverted (by platform)",
        "Constructive new content edits that include a new reference and are not reverted: change vs control (by platform)",
        c(test_group = "Experiment group", platform = "Platform", rate = "Rate", n = "Count (edits)"),
        note_rate = paste(
          "Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit).",
          "Denominator = all published new-content edits within each (assignment bucket × platform)",
          "in the target population (unregistered OR <=100 edits)."
        )
      )
    }
  }
} else {
  message("KPI #1b (ITT): data not loaded or required columns missing")
}
Constructive new content edits that include a new reference and are not reverted

| Experiment group | Rate | Count (edits) | Scope |
|---|---|---|---|
| control | 56.3% | 4074 | Overall |
| test | 68.3% | 4132 | Overall |

Table note: Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit). Denominator = all published new-content edits within each assignment bucket (control vs test) in the target population (unregistered OR <=100 edits).

Constructive new content edits that include a new reference and are not reverted: change vs control

| Scope | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Overall | 56.3% | 68.3% | 12.1 | 21.5% | 4074 | 4132 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.

Constructive new content edits that include a new reference and are not reverted (by platform)

| Experiment group | Platform | Rate | Count (edits) |
|---|---|---|---|
| control | desktop | 60.5% | 3629 |
| control | mobile-web | 22.0% | 445 |
| test | desktop | 70.6% | 3718 |
| test | mobile-web | 47.8% | 414 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means (new reference included) AND (not reverted within 48h) (per edit). Denominator = all published new-content edits within each (assignment bucket × platform) in the target population (unregistered OR <=100 edits).

Constructive new content edits that include a new reference and are not reverted: change vs control (by platform)

| Platform | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| desktop | 60.5% | 70.6% | 10.2 | 16.8% | 3629 | 3718 |
| mobile-web | 22.0% | 47.8% | 25.8 | 117.2% | 445 | 414 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Code
# KPI #1b (availability / ITT) bar chart — 2024-comparable
# Constructive new content edits that include a new reference and are not reverted
if (exists("df_kpi1b_itt") && !is.null(df_kpi1b_itt) &&
    all(c("test_group", "outcome") %in% names(df_kpi1b_itt))) {
  kpi1b_itt_bar_df <- df_kpi1b_itt %>%
    filter(!is.na(outcome)) %>%
    group_by(test_group) %>%
    summarise(rate = mean(outcome, na.rm = TRUE), n = dplyr::n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1))
  kpi1b_itt_bar_df %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col(width = 0.9) +
    geom_text(aes(label = label), vjust = 1.2, color = "white", size = 4) +  # size value lost in export; 4 assumed
    scale_y_continuous(labels = scales::percent) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "steelblue2")) +
    labs(
      title = "Constructive new content edits that include a new reference",
      x = "Experiment Group",
      y = "Percent of new content edits"
    ) +
    pc_theme() +
    guides(fill = "none")
} else {
  message("KPI #1b (ITT) bar: df_kpi1b_itt not available; run the ITT section above")
}
Chart note (definition of Rate / denominator)
- Rate = mean(0/1 outcome) where outcome=1 means the edit both included a new reference and was not reverted within 48 hours.
- Denominator = all published new-content edits in the target population (unregistered users or users with 100 or fewer edits), within each assignment bucket (control vs test).
- Interpretation: this is the 2024-comparable “availability / ITT” view; it is not restricted to shown/eligible edits.
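The "~2.2×" mobile-web claim quoted earlier is simply the ratio of the two ITT rates; a quick check (Python, using the rounded rates from the by-platform table):

```python
# Ratio of test to control ITT rates on mobile web.
ctrl_mobile, test_mobile = 0.220, 0.478
print(round(test_mobile / ctrl_mobile, 2))  # 2.17, i.e. roughly 2.2x
```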
4.2.2.3 Confirming the impact of Reference Check
Code
# KPI #1b (availability / ITT) confirmation
# Uses the ITT per-edit frame created above (df_kpi1b_itt)
if (exists("df_kpi1b_itt") && !is.null(df_kpi1b_itt) &&
    all(c("test_group", "outcome") %in% names(df_kpi1b_itt))) {
  df_itt_stats <- df_kpi1b_itt
  # 1) Adjusted logistic regression (glm)
  if ("platform" %in% names(df_itt_stats)) {
    tryCatch({
      m_itt <- glm(outcome ~ test_group + platform, data = df_itt_stats, family = binomial())
      render_binom_model(
        m_itt,
        "Table 1c. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among new-content edits (availability / ITT).",
        note_text = "Outcome=1 means the edit included a new reference AND was not reverted within 48 hours (per edit). Denominator includes all new-content edits in the target population (unregistered OR <=100 edits) and is not restricted to shown/eligible. Adjusted for platform. OR>1 indicates higher odds of the outcome."
      )
    }, error = function(e) {
      message("KPI #1b (ITT) model error: ", e$message)
    })
  } else {
    message("KPI #1b (ITT) model: platform not available")
  }
  # 2) Bayesian/Frequentist lift (relax)
  itt_relax_df <- df_itt_stats %>%
    transmute(
      outcome = outcome,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  render_relax(itt_relax_df, "KPI #1b (availability / ITT)", metric_type = "proportion", better = "higher")
} else {
  message("KPI #1b (ITT) confirmation: df_kpi1b_itt not available; run the ITT section above")
}
Table 1c. Adjusted odds ratios (ORs) from multivariable logistic regression for KPI #1b outcome among new-content edits (availability / ITT).

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 1.477 | 1.385 | 1.576 | 0.033 | <0.001 |
| test_grouptest | 1.693 | 1.544 | 1.856 | 0.047 | <0.001 |
| platformmobile-web | 0.273 | 0.235 | 0.317 | 0.077 | <0.001 |

Table note: Outcome=1 means the edit included a new reference AND was not reverted within 48 hours (per edit). Denominator includes all new-content edits in the target population (unregistered OR <=100 edits) and is not restricted to shown/eligible. Adjusted for platform. OR>1 indicates higher odds of the outcome.
KPI #1b (availability / ITT)
Relative lift ((Treatment − Control) / Control)

| Analysis | Point estimate | Chance to Win (P(Treatment better)) | p-value | 95% interval |
|---|---|---|---|---|
| Bayesian | 0.214 | 1.000 | — | 0.172 to 0.255 (CrI) |
| Frequentist | 0.215 | — | 0.000 | 0.173 to 0.256 (CI) |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 100.0% (computed as Chance to Win).
4.2.3 KPI #2 Constructive edits
Metric: Proportion of published edits that add new content (T333714) and are constructive, defined as not reverted within 48 hours.
Methodology: This metric is computed on edits where `is_new_content == 1`. Constructive is defined as the edit not being reverted within 48 hours of publication.
Test group: The test group includes new-content edits where Reference Check was shown at least once during the editing session.
Control group: The control group includes new-content edits identified as eligible but not shown Reference Check.
Results:
Edits shown Reference Check were more likely to be constructive, especially on mobile-web. On desktop, constructive edits increased modestly from 75.5% to 77.9% (+3.2% relative lift), but this difference is not statistically significant in the regression model. Constructive outcomes trend higher when Reference Check is shown on both platforms, with a much larger improvement on mobile-web. While the most conservative adjusted model (the overall `test_grouptest` term together with the `platform` adjustment) cannot fully rule out chance at this sample size, simpler and relax-based comparisons point in the same direction: the test group performs better. Although we cannot definitively conclude that mobile improvements are larger than desktop improvements, the results consistently suggest this pattern. When we additionally account for whether an edit added a new reference, the mobile-web advantage becomes smaller and no longer statistically clear, indicating that part of the benefit may come from increased reference inclusion. Overall, the evidence suggests Reference Check improves constructive editing outcomes, especially on mobile web.
Constructive Edits (Not Reverted Within 48 Hours), KPI #2 Mobile-Web Only:
On mobile-web, Reference Check meaningfully increases the likelihood that an edit is constructive: editors were 18.2% more likely to make a constructive edit (56.4% → 66.7%). The within-platform adjusted contrast is significant, indicating Reference Check improves constructive edit outcomes for mobile-web editors in particular. Because mobile-web is a subgroup slice, we treat this as strong within-platform evidence that aligns with the broader pattern in the unadjusted comparisons. In the conditional model (which also adjusts for reference inclusion), the mobile-web contrast is not statistically significant, consistent with the interpretation that some of the effect operates through increased reference inclusion.
In short, the results are consistent with a path where: Reference Check → references added → edits are constructive.
KPI #2 (Constructive = not reverted within 48h, new-content edits):
Overall: ↑ +4.1 pp (71.8% → 75.9%), +5.7% relative.
Desktop: ↑ +2.4 pp (75.5% → 77.9%), +3.2% relative.
Mobile web: ↑ +10.3 pp (56.4% → 66.7%), +18.2% relative (within-platform contrast statistically significant).
Evidence: glm (Table 2) overall across-platform treatment term not significant (p=0.146);
mobile-web contrast significant (Table 2A: OR = 1.55, p = 0.010).
Conditional model: mobile-web contrast not significant when adjusting for reference inclusion (Table 2E: OR = 1.16, p = 0.390).
Relax (relative lift): overall +0.057 (p = 0.010); mobile-web-only +0.182 (p = 0.018).
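As a rough cross-check of the overall KPI #2 comparison, a classical two-proportion z-test on the counts from the revert-status table below lands close to the relax p-value. This is an illustrative Python sketch, not the report's method; relax's exact procedure may differ.

```python
# Two-proportion z-test on KPI #2 counts (not_reverted / all new-content edits).
import math

x1, n1 = 1196, 1196 + 469   # control
x2, n2 = 1195, 1195 + 379   # test
p1, p2 = x1 / n1, x2 / n2

pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal-approximation two-sided p

print(round(z, 2), round(p_two_sided, 3))  # z ~ 2.65, p ~ 0.008 (relax reports 0.010)
```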
Code
# KPI2 bar (constructive = not reverted within 48h), by test_group
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  kpi2_bar <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 1, 0L, 1L)) %>%  # literals lost in export; reconstruction assumed
    group_by(test_group) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    geom_text(aes(label = label), vjust = -0.2, size = 4) +  # size value lost in export; 4 assumed
    scale_y_continuous(labels = scales::percent_format(), expand = expansion(mult = 0.12)) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(
      title = "KPI2: Constructive (not reverted within 48h)",
      x = "Test group",
      y = "Percent of new-content edits"
    ) +
    pc_theme() +
    guides(fill = "none")
  print(kpi2_bar)
} else {
  message("KPI2 plot: required columns missing in reference_check_save_data")
}
Chart note (definition of Rate / denominator)
KPI 2 (constructive):
- Rate = mean(`constructive`) where `constructive` is 1 if `was_reverted == 0` (not reverted within 48h) and 0 otherwise.
- Denominator = new-content edit rows in `reference_check_save_data` where `is_new_content == 1`, within the analysis groups (test = RC shown at least once; control = eligible-but-not-shown).
Code
# 2) Constructive edits (revert within 48h) among new content
# Align with KPI #2 population (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in% names(reference_check_save_data))) {
  constructive_summary <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted_flag = ifelse(was_reverted == 1, "reverted", "not_reverted")) %>%
    count(test_group, reverted_flag) %>%
    group_by(test_group) %>%
    mutate(pct = n / sum(n)) %>%
    arrange(test_group, desc(n))
  constructive_summary <- renorm_buckets(constructive_summary)
  render_pct_table(
    constructive_summary,
    "Constructive (48h unreverted) among new content",
    c(test_group = "Test group", reverted_flag = "Revert status", n = "Count (edits)", pct = "Percent of new-content edits"),
    note_text = "Percent of new-content edits = share of new-content edits within each analysis group (shown test vs eligible-not-shown control). `Revert status` is derived from `was_reverted` (within 48h)."
  )
} else {
  message("Constructive summary: required columns missing in reference_check_save_data")
}
Constructive (48h unreverted) among new content

| Test group | Revert status | Count (edits) | Percent of new-content edits |
|---|---|---|---|
| control | not_reverted | 1196 | 71.8% |
| control | reverted | 469 | 28.2% |
| test | not_reverted | 1195 | 75.9% |
| test | reverted | 379 | 24.1% |

Table note: Percent of new-content edits = share of new-content edits within each analysis group (shown test vs eligible-not-shown control). `Revert status` is derived from `was_reverted` (within 48h).
Code
```r
# KPI #2 tables (constructive) by platform and deltas
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "platform", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 0, 1, 0))

  # Overall (control vs test) + change vs control
  kpi2_overall_rates <- make_rate_table(df_kpi2, "constructive",
                                        group_cols = "test_group") %>%
    mutate(scope = "Overall")
  kpi2_overall_rel <- make_rel_change_dim(kpi2_overall_rates, dim_col = "scope")
  render_rate_rel(
    kpi2_overall_rates, kpi2_overall_rel,
    "KPI #2: constructive (48h unreverted) overall",
    "KPI #2: change vs control (overall)",
    c(test_group = "Test group", scope = "Scope",
      rate = "Rate", n = "Count (edits)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
  )

  # By platform (control vs test) + change vs control
  kpi2_rates <- df_kpi2 %>% make_rate_table("constructive")
  kpi2_rel <- make_rel_change(kpi2_rates)
  render_rate_rel(
    kpi2_rates, kpi2_rel,
    "KPI #2: constructive (48h unreverted) by platform",
    "KPI #2: change vs control (by platform)",
    c(test_group = "Test group", platform = "Platform",
      rate = "Rate", n = "Count (edits)"),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).",
    group_col = "platform",
    within_group_order_col = "test_group",
    within_group_order = c("control", "test")
  )

  # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
  if ("experience_level_group" %in% names(df_kpi2)) {
    kpi2_exp_df <- df_kpi2 %>%
      filter(!is.na(experience_level_group),
             experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor"))
    kpi2_exp_rates <- make_rate_table(kpi2_exp_df, "constructive",
                                      group_cols = c("test_group", "experience_level_group"))
    kpi2_exp_rel <- make_rel_change_dim(kpi2_exp_rates,
                                        dim_col = "experience_level_group")
    render_rate_rel(
      kpi2_exp_rates, kpi2_exp_rel,
      "KPI #2: constructive (48h unreverted) by user experience",
      "KPI #2: change vs control (by user experience)",
      c(test_group = "Test group", experience_level_group = "User experience",
        rate = "Rate", n = "Count (edits)"),
      note_rate = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
    )
  } else {
    message("KPI #2 user experience tables: experience_level_group not available in reference_check_save_data")
  }
} else {
  message("KPI #2 tables: required columns missing in reference_check_save_data")
}
```
KPI #2: constructive (48h unreverted) overall

| Test group | Rate | Count (edits) | Scope |
|---|---|---|---|
| control | 71.8% | 1665 | Overall |
| test | 75.9% | 1574 | Overall |

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control).

KPI #2: change vs control (overall)

| Scope | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Overall | 71.8% | 75.9% | 4.1 | 5.7% | 1665 | 1574 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
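The delta columns in these tables are simple transforms of the two group rates. A minimal sketch, using the overall KPI #2 counts from the table above:

```r
# Reproduce the overall "change vs control" row from the raw counts.
ctrl_rate <- 1196 / 1665                       # control: constructive / new-content edits
test_rate <- 1195 / 1574                       # test: constructive / new-content edits
abs_diff_pp <- (test_rate - ctrl_rate) * 100   # absolute difference in percentage points
rel_change  <- (test_rate - ctrl_rate) / ctrl_rate  # relative change vs control
round(abs_diff_pp, 1)   # ≈ 4.1
round(rel_change, 3)    # ≈ 0.057, i.e. the 5.7% shown above
```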
KPI #2: constructive (48h unreverted) by platform

| Platform | Test group | Rate | Count (edits) |
|---|---|---|---|
| desktop | control | 75.5% | 1344 |
| desktop | test | 77.9% | 1295 |
| mobile-web | control | 56.4% | 321 |
| mobile-web | test | 66.7% | 279 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).

KPI #2: change vs control (by platform)

| Platform | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| desktop | 75.5% | 77.9% | 2.4 | 3.2% | 1344 | 1295 |
| mobile-web | 56.4% | 66.7% | 10.3 | 18.2% | 321 | 279 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
KPI #2: constructive (48h unreverted) by user experience

| Test group | User experience | Rate | Count (edits) |
|---|---|---|---|
| control | Unregistered | 62.7% | 255 |
| control | Newcomer | 59.1% | 230 |
| control | Junior Contributor | 76.3% | 1180 |
| test | Unregistered | 70.4% | 260 |
| test | Newcomer | 64.7% | 221 |
| test | Junior Contributor | 79.5% | 1093 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).

KPI #2: change vs control (by user experience)

| User experience | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Unregistered | 62.7% | 70.4% | 7.6 | 12.2% | 255 | 260 |
| Newcomer | 59.1% | 64.7% | 5.6 | 9.4% | 230 | 221 |
| Junior Contributor | 76.3% | 79.5% | 3.2 | 4.2% | 1180 | 1093 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Code
```r
# KPI #2 by platform and user experience
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform",
          "experience_level_group", "was_reference_check_shown",
          "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(constructive = ifelse(was_reverted == 0, 1, 0))
  kpi2_slices <- df_kpi2 %>%
    filter(!is.na(experience_level_group),
           experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor")) %>%
    group_by(test_group, platform, experience_level_group) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop")
  render_slice(
    kpi2_slices,
    "KPI #2: constructive by platform and user experience",
    c(test_group = "Test group", platform = "Platform",
      experience_level_group = "User experience",
      rate = "Rate", n = "Count (edits)"),
    note_text = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
  )
} else {
  message("KPI #2 slices: required columns missing in reference_check_save_data")
}
```
KPI #2 slices: required columns missing in reference_check_save_data
Code
```r
# KPI #2 by checks shown (bucketed)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "n_checks_shown",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_kpi2_checks <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1, test_group == "test") %>%
    mutate(
      constructive = ifelse(was_reverted == 0, 1, 0),
      checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )
    ) %>%
    group_by(checks_bucket) %>%
    summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(checks_bucket = factor(checks_bucket,
                                  levels = c("unknown", "0", "1", "2", "3+"))) %>%
    arrange(checks_bucket)
  render_slice(
    df_kpi2_checks,
    "KPI #2 by checks shown (test group only)",
    c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (edits)"),
    note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
  )
} else {
  message("KPI #2 by checks: required columns missing in reference_check_save_data")
}
```
KPI #2 by checks shown (test group only)

| Checks shown | Rate | Count (edits) |
|---|---|---|
| 1 | 76.6% | 1086 |
| 2 | 79.8% | 208 |
| 3+ | 70.4% | 280 |

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.
Code
```r
# Per-wiki sanity: constructive
# Only render when multiple wikis are present.
if (!is.null(reference_check_save_data) &&
    all(c("wiki", "test_group", "is_new_content", "was_reverted") %in%
        names(reference_check_save_data))) {
  if (dplyr::n_distinct(reference_check_save_data$wiki) <= 1) {
    message("Per-wiki constructive: skipped (single wiki)")
  } else {
    per_wiki_constructive <- reference_check_save_data %>%
      filter(is_new_content == 1) %>%
      mutate(constructive = ifelse(was_reverted == 0, 1, 0)) %>%
      group_by(wiki, test_group) %>%
      summarise(rate = mean(constructive, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      per_wiki_constructive,
      "Per-wiki constructive",
      c(wiki = "Wiki", test_group = "Test group",
        rate = "Rate", n = "Count (edits)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means not reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` for each (wiki × test group)."
    )
  }
} else {
  message("Per-wiki constructive: required columns missing in reference_check_save_data")
}
```
Per-wiki constructive: skipped (single wiki)
4.2.3.1 Confirming the impact of Reference Check

Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate. Model B is an optional conditional analysis, reported only when `editcheck-newreference` is directly observed, and estimates the treatment effect holding reference inclusion constant.
Code
```r
# KPI #2 model (constructive = not reverted within 48h)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_kpi2 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      constructive = ifelse(was_reverted == 0, 1, 0),
      test_group = factor(test_group, levels = c("control", "test")),
      platform = factor(platform)
    )
  tryCatch({
    IRdisplay::display_markdown("**Model A (total effect)**")
    # Model A (total effect): primary KPI #2 estimand
    if (dplyr::n_distinct(df_kpi2$platform) > 1) {
      m_kpi2_a <- glm(constructive ~ test_group * platform,
                      data = df_kpi2, family = binomial())
    } else {
      m_kpi2_a <- glm(constructive ~ test_group + platform,
                      data = df_kpi2, family = binomial())
    }
    render_binom_model(
      m_kpi2_a,
      "Table 2. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits.",
      note_text = paste(
        "Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate.",
        "Outcome=1 means not reverted within 48h on a new-content edit.",
        "Population is restricted to shown test vs eligible-not-shown control.",
        "Includes a test_group×platform interaction when platform has multiple levels.",
        "OR>1 indicates higher odds of the outcome.",
        sep = " "
      )
    )

    # Table 2A: mobile-web treatment vs control (Model A)
    if ("mobile-web" %in% levels(df_kpi2$platform)) {
      nd_ctrl <- data.frame(
        test_group = factor("control", levels = levels(df_kpi2$test_group)),
        platform = factor("mobile-web", levels = levels(df_kpi2$platform))
      )
      nd_test <- nd_ctrl %>%
        mutate(test_group = factor("test", levels = levels(df_kpi2$test_group)))
      mw_contrast <- tidy_glm_contrast_or(
        model = m_kpi2_a,
        newdata_control = nd_ctrl,
        newdata_test = nd_test,
        label = "Mobile-web: treatment vs control"
      )
      render_or_contrast_table(
        mw_contrast,
        "Table 2A. Mobile-web treatment vs control (constructive outcome) among new-content edits.",
        note_text = "This contrast is computed from Table 2 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value."
      )
    } else {
      message("KPI #2 Table 2A skipped: platform level 'mobile-web' not present in df_kpi2")
    }

    IRdisplay::display_markdown("**Model B (conditional; only when editcheck-newreference is observed)**")
    # Model B (conditional): only when editcheck-newreference is directly observed
    if ("was_reference_included" %in% names(df_kpi2)) {
      df_kpi2 <- df_kpi2 %>%
        mutate(was_reference_included = ifelse(is.na(was_reference_included), 0L,
                                               as.integer(was_reference_included == 1)))
      if (dplyr::n_distinct(df_kpi2$platform) > 1) {
        m_kpi2_b <- glm(constructive ~ test_group * platform + was_reference_included,
                        data = df_kpi2, family = binomial())
      } else {
        m_kpi2_b <- glm(constructive ~ test_group + platform + was_reference_included,
                        data = df_kpi2, family = binomial())
      }
      render_binom_model(
        m_kpi2_b,
        "Table 2D. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits (conditional on reference inclusion).",
        note_text = "Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed."
      )
      # Table 2E: mobile-web treatment vs control (Model B)
      if ("mobile-web" %in% levels(df_kpi2$platform)) {
        ref_mean_mw <- mean(
          df_kpi2$was_reference_included[df_kpi2$platform == "mobile-web"],
          na.rm = TRUE
        )
        nd_ctrl_b <- data.frame(
          test_group = factor("control", levels = levels(df_kpi2$test_group)),
          platform = factor("mobile-web", levels = levels(df_kpi2$platform)),
          was_reference_included = ref_mean_mw
        )
        nd_test_b <- nd_ctrl_b %>%
          mutate(test_group = factor("test", levels = levels(df_kpi2$test_group)))
        mw_contrast_b <- tidy_glm_contrast_or(
          model = m_kpi2_b,
          newdata_control = nd_ctrl_b,
          newdata_test = nd_test_b,
          label = "Mobile-web: treatment vs control (conditional)"
        )
        render_or_contrast_table(
          mw_contrast_b,
          "Table 2E. Mobile-web treatment vs control (constructive outcome) among new-content edits (conditional on reference inclusion).",
          note_text = "This contrast is computed from Table 2D (Model B), holding was_reference_included at its mobile-web mean."
        )
      }
    } else {
      message("KPI #2 Model B skipped: was_reference_included (editcheck-newreference) not available in df_kpi2")
    }
  }, error = function(e) {
    message("KPI #2 model error: ", e$message)
  })
} else {
  message("KPI #2 model: required columns missing or data not loaded")
}
```
Model A (total effect)

Table 2. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits.

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 3.085 | 2.728 | 3.498 | 0.063 | <0.001 |
| test_grouptest | 1.144 | 0.955 | 1.371 | 0.092 | 0.146 |
| platformmobile-web | 0.419 | 0.325 | 0.540 | 0.129 | <0.001 |
| test_grouptest:platformmobile-web | 1.353 | 0.927 | 1.978 | 0.193 | 0.118 |

Table note: Model A estimates the primary KPI #2 estimand: the total effect of treatment on the constructive rate. Outcome=1 means not reverted within 48h on a new-content edit. Population is restricted to shown test vs eligible-not-shown control. Includes a test_group×platform interaction when platform has multiple levels. OR>1 indicates higher odds of the outcome.

Table 2A. Mobile-web treatment vs control (constructive outcome) among new-content edits.

| Contrast | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Mobile-web: treatment vs control | 1.547 | 1.109 | 2.157 | 0.170 | 0.010 |

Table note: This contrast is computed from Table 2 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value.
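`tidy_glm_contrast_or` is a notebook helper whose internals are not shown in this report. The Wald contrast it describes can be sketched as follows; the function name and structure here are illustrative, not the actual implementation:

```r
# Sketch of a Table 2A-style contrast from a fitted binomial glm `m`:
# build design rows for the two scenarios, difference them in coefficient
# space, and convert the log-odds difference to an OR with a Wald 95% CI.
or_contrast <- function(m, nd_control, nd_test) {
  tt <- delete.response(terms(m))
  X0 <- model.matrix(tt, nd_control)
  X1 <- model.matrix(tt, nd_test)
  d   <- as.numeric(X1 - X0)                 # contrast vector over coefficients
  est <- sum(d * coef(m))                    # log-odds difference
  se  <- sqrt(drop(t(d) %*% vcov(m) %*% d))  # delta-method / Wald SE
  c(or = exp(est),
    ci_low = exp(est - 1.96 * se),
    ci_high = exp(est + 1.96 * se),
    p = 2 * pnorm(-abs(est / se)))           # two-sided Wald p-value
}
```

Calling this with single-row data frames for (control, mobile-web) and (test, mobile-web) against Model A would reproduce a row shaped like Table 2A.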
Model B (conditional; only when editcheck-newreference is observed)

Table 2D. Adjusted odds ratios (ORs) from multivariable logistic regression for constructive outcome (not reverted within 48h) among new-content edits (conditional on reference inclusion).

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 2.613 | 2.296 | 2.981 | 0.067 | <0.001 |
| test_grouptest | 0.873 | 0.718 | 1.060 | 0.099 | 0.170 |
| platformmobile-web | 0.482 | 0.374 | 0.624 | 0.131 | <0.001 |
| was_reference_included | 2.069 | 1.716 | 2.501 | 0.096 | <0.001 |
| test_grouptest:platformmobile-web | 1.332 | 0.910 | 1.954 | 0.195 | 0.142 |

Table note: Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed.

Table 2E. Mobile-web treatment vs control (constructive outcome) among new-content edits (conditional on reference inclusion).

| Contrast | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Mobile-web: treatment vs control (conditional) | 1.162 | 0.825 | 1.637 | 0.175 | 0.390 |

Table note: This contrast is computed from Table 2D (Model B), holding was_reference_included at its mobile-web mean.
Code
```r
# KPI #2 Bayesian confirmation (brms) + Bayesian lift (relax) — constructive (not reverted within 48h)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_kpi2_brm <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      constructive = ifelse(was_reverted == 0, 1, 0),
      test_group = factor(test_group, levels = c("control", "test"))
    )

  # Hierarchical Bayesian regression when user_id is present
  if (all(c("user_id", "platform", "experience_level_group") %in% names(df_kpi2_brm))) {
    df_brm <- df_kpi2_brm %>%
      mutate(
        platform = factor(platform),
        experience_level_group = droplevels(experience_level_group)
      ) %>%
      filter(!is.na(user_id), !is.na(constructive),
             !is.na(test_group), !is.na(platform))
    if (dplyr::n_distinct(df_brm$user_id) > 1 &&
        dplyr::n_distinct(df_brm$test_group) == 2) {
      if (!requireNamespace("brms", quietly = TRUE)) {
        message("KPI #2 brms: skipped (brms not available / cannot be loaded in this environment)")
      } else if (!exists("safe_brm", mode = "function")) {
        message("KPI #2 brms: skipped (safe_brm not defined; run the setup/helper cells first)")
      } else {
        priors <- c(
          brms::set_prior(prior = "std_normal()", class = "b"),
          brms::set_prior("cauchy(0, 5)", class = "sd")
        )
        fit_brm <- safe_brm(
          constructive ~ test_group + platform + experience_level_group + (1 | user_id),
          data = df_brm, prior = priors,
          # chains = 4 matches the sampler log; seed/cores/refresh values
          # were not preserved in this rendering and are placeholders here
          seed = 42, chains = 4, cores = 4, refresh = 0
        )
        if (!is.null(fit_brm)) {
          # Posterior-derived lift (probability space) + OR summary (multi-check style)
          nd_ctrl <- df_brm %>%
            mutate(test_group = factor("control", levels = c("control", "test")))
          nd_test <- df_brm %>%
            mutate(test_group = factor("test", levels = c("control", "test")))
          render_brms_confirm_table(
            fit = fit_brm,
            title = "Table 2B. Hierarchical Bayesian confirmation for constructive outcome among new-content edits.",
            coef_name = "b_test_grouptest",
            newdata_control = nd_ctrl,
            newdata_test = nd_test,
            note_text = "Posterior-derived average lift is computed as the per-draw mean of Pr(outcome|test) − Pr(outcome|control) over the observed covariate distribution (platform + experience), using population-level predictions (re_formula = NA)."
          )
        }
      }
    } else {
      message("KPI #2 brms: skipped (insufficient variation in user_id or test_group)")
    }
  } else {
    message("KPI #2 brms: skipped (missing user_id/platform/experience_level_group)")
  }

  # relax lift table (audit + uncertainty)
  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (overall)**")
  kpi2_df <- df_kpi2_brm %>%
    transmute(
      outcome = constructive,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  render_relax(kpi2_df, "KPI #2", metric_type = "proportion", better = "higher")

  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (mobile-web only)**")
  # Mobile-web only: explicit control vs treatment inference
  if ("platform" %in% names(df_kpi2_brm)) {
    df_kpi2_mw <- df_kpi2_brm %>% filter(platform == "mobile-web")
    if (nrow(df_kpi2_mw) == 0) {
      message("KPI #2 (mobile-web only): skipped (no rows after filtering platform == 'mobile-web')")
    } else if (dplyr::n_distinct(df_kpi2_mw$test_group) < 2) {
      message("KPI #2 (mobile-web only): skipped (need both control and test groups)")
    } else {
      # relax (mobile-web)
      kpi2_mw_df <- df_kpi2_mw %>%
        transmute(
          outcome = constructive,
          variation = dplyr::case_when(
            test_group == "control" ~ "control",
            test_group == "test" ~ "treatment",
            TRUE ~ as.character(test_group)
          )
        )
      render_relax(kpi2_mw_df, "KPI #2 (mobile-web only)",
                   metric_type = "proportion", better = "higher")

      # brms confirmation (mobile-web) when user_id is present
      if (all(c("user_id", "experience_level_group") %in% names(df_kpi2_mw))) {
        df_brm_mw <- df_kpi2_mw %>%
          mutate(experience_level_group = droplevels(experience_level_group)) %>%
          filter(!is.na(user_id), !is.na(constructive), !is.na(test_group))
        if (dplyr::n_distinct(df_brm_mw$user_id) > 1 &&
            dplyr::n_distinct(df_brm_mw$test_group) == 2) {
          if (!requireNamespace("brms", quietly = TRUE)) {
            message("KPI #2 brms (mobile-web): skipped (brms not available / cannot be loaded in this environment)")
          } else if (!exists("safe_brm", mode = "function")) {
            message("KPI #2 brms (mobile-web): skipped (safe_brm not defined; run the setup/helper cells first)")
          } else {
            priors <- c(
              brms::set_prior(prior = "std_normal()", class = "b"),
              brms::set_prior("cauchy(0, 5)", class = "sd")
            )
            fit_brm_mw <- safe_brm(
              constructive ~ test_group + experience_level_group + (1 | user_id),
              data = df_brm_mw, prior = priors,
              # placeholder sampler settings, as above
              seed = 42, chains = 4, cores = 4, refresh = 0
            )
            if (!is.null(fit_brm_mw)) {
              nd_ctrl <- df_brm_mw %>%
                mutate(test_group = factor("control", levels = c("control", "test")))
              nd_test <- df_brm_mw %>%
                mutate(test_group = factor("test", levels = c("control", "test")))
              render_brms_confirm_table(
                fit = fit_brm_mw,
                title = "Table 2C. Hierarchical Bayesian confirmation for constructive outcome among mobile-web new-content edits.",
                coef_name = "b_test_grouptest",
                newdata_control = nd_ctrl,
                newdata_test = nd_test,
                note_text = "Mobile-web-only brms model adjusts for experience group (platform is constant and omitted). Posterior-derived average lift is computed as the per-draw mean of Pr(outcome|test) − Pr(outcome|control) over the observed covariate distribution (experience), using population-level predictions (re_formula = NA)."
              )
            }
          }
        } else {
          message("KPI #2 brms (mobile-web): skipped (insufficient variation in user_id or test_group)")
        }
      } else {
        message("KPI #2 brms (mobile-web): skipped (missing user_id/experience_level_group)")
      }
    }
  }
} else {
  message("KPI #2 relax: required columns missing or data not loaded")
}
```
Start sampling
Running MCMC with 4 parallel chains...

Chain 3 finished in 23.6 seconds.
Chain 1 finished in 24.0 seconds.
Chain 4 finished in 24.0 seconds.
Chain 2 finished in 24.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 24.0 seconds.
Total execution time: 24.4 seconds.
Loading required package: rstan

Error: package or namespace load failed for ‘rstan’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so':
/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so: undefined symbol: _ZN3tbb8internal26task_scheduler_observer_v37observeEb

brms fit skipped (backend=cmdstanr): unable to find required package ‘rstan’
Continuing with glm + relax outputs.
To run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain.
Bayesian Analysis / Frequentist Analysis (overall)

KPI #2 — Relative lift ((Treatment − Control) / Control)

| Analysis | Point Estimate | Chance to Win / p-value | 95% Lower | 95% Upper |
|---|---|---|---|---|
| Bayesian (95% CrI) | 0.057 | Chance to Win = P(Treatment better) = 0.995 | 0.013 | 0.100 |
| Frequentist (95% CI) | 0.057 | p-value = 0.010 | 0.014 | 0.100 |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.5% (computed as Chance to Win).
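The "Chance to Win" quantity can be sketched with independent Beta posteriors over the two group proportions. This is a simplification (relax's internals are not shown here, and a flat Beta(1, 1) prior is assumed); the counts are the overall KPI #2 numbers:

```r
# Posterior draws for each group's constructive rate under Beta(1, 1) priors.
set.seed(1)
draws_ctrl <- rbeta(1e5, 1 + 1196, 1 + 469)   # control: 1196 of 1665 constructive
draws_test <- rbeta(1e5, 1 + 1195, 1 + 379)   # test: 1195 of 1574 constructive
p_win <- mean(draws_test > draws_ctrl)        # "Chance to Win" ≈ 0.995
rel_lift <- (draws_test - draws_ctrl) / draws_ctrl
cri <- quantile(rel_lift, c(0.025, 0.975))    # ≈ the 95% CrI (0.013, 0.100)
```

With these counts the simple conjugate sketch lands very close to the relax output above, which is a useful sanity check on the tooling.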
Bayesian Analysis / Frequentist Analysis (mobile-web only)

KPI #2 (mobile-web only) — Relative lift ((Treatment − Control) / Control)

| Analysis | Point Estimate | Chance to Win / p-value | 95% Lower | 95% Upper |
|---|---|---|---|---|
| Bayesian (95% CrI) | 0.171 | Chance to Win = P(Treatment better) = 0.989 | 0.026 | 0.317 |
| Frequentist (95% CI) | 0.182 | p-value = 0.018 | 0.032 | 0.333 |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 98.9% (computed as Chance to Win).
Start sampling
Running MCMC with 4 parallel chains...

Chain 1 finished in 3.0 seconds.
Chain 4 finished in 3.1 seconds.
Chain 2 finished in 3.3 seconds.
Chain 3 finished in 3.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 3.3 seconds.
Total execution time: 3.9 seconds.
Loading required package: rstan

Error: package or namespace load failed for ‘rstan’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so':
/srv/home/iflorez/.conda/envs/2025-11-25T19.45.47_iflorez/lib/R/library/rstan/libs/rstan.so: undefined symbol: _ZN3tbb8internal26task_scheduler_observer_v37observeEb

brms fit skipped (backend=cmdstanr): unable to find required package ‘rstan’
Continuing with glm + relax outputs.
To run brms reliably, prefer cmdstanr with CmdStan installed and a stable R toolchain.
4.2.4 Guardrail #1 Content quality (reverts)

Metric: Proportion of published new-content edits that are reverted within 48 hours.

Methodology: We review revert rates for all published new-content edits and compare the test and control groups.

Test group: Published new-content edits where Reference Check was shown at least once during the editing session.

Control group: Published new-content edits identified as eligible but not shown Reference Check.

Additional analysis: We include a breakdown of revert rates for published edits with a reference added and published edits without a reference added.

Note: The population and methodology mirror KPI2 in Multi Check.
Results: Edits shown Reference Check were less likely to be reverted within 48 hours (−14.5% relative: 28.2% control vs 24.1% test). How big is the change:
- Desktop: revert rates declined 9.8% relative (24.5% → 22.1%).
- Mobile-web: revert rates declined 23.6% relative (43.6% → 33.3%).

Overall, edits were less likely to be reverted when editors were shown Reference Check, and the decrease was most pronounced on mobile web, where the within-platform comparison shows a clear reduction in reverts. Although the point estimates suggest a larger reduction on mobile web than on desktop, the platform interaction term is not statistically significant, so we treat "mobile improves more than desktop" as suggestive rather than established. In the overall adjusted regression the across-platform treatment term is likewise not significant; however, the within-mobile-web adjusted contrast is statistically significant (Table 3A), and both the overall and mobile-web-only relax analyses show a statistically significant reduction in reverts. Importantly, edits that include a new reference were much less likely to be reverted, reinforcing the proposed quality mechanism behind the feature.
Note: In the 2024 Reference Check report, the new-content edit revert rate decreased by 8.6% when Reference Check was available. While the feature introduced some nonconstructive new-content edits with a reference (a 5 percentage point (pp) increase), there was a higher proportion of constructive new-content edits with a reference added (a 23.4 pp increase).
Guardrail #1 (Revert within 48h; lower is better):
Overall: ↓ −4.1 pp (28.2% → 24.1%), −14.5% relative.
Desktop: ↓ −2.4 pp (24.5% → 22.1%), −9.8% relative.
Mobile-web: ↓ −10.3 pp (43.6% → 33.3%), −23.6% relative.
Evidence: glm (Table 3): the overall across-platform treatment term is not significant (p=0.146); the mobile-web contrast is significant (Table 3A, p=0.010).
Relax (relative lift): overall −0.141 (p=0.004); mobile-web-only −0.236 (p=0.004).
Code
```r
# Revert rate bar (guardrail quality) by test group
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "was_reverted", "is_new_content",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  revert_plot <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted = ifelse(was_reverted == 1, 1, 0)) %>%
    group_by(test_group) %>%
    summarise(rate = mean(reverted, na.rm = TRUE), n = n(), .groups = "drop") %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    # vjust/size values reconstructed; exact values were lost in rendering
    geom_text(aes(label = label), vjust = -0.2, size = 4) +
    scale_y_continuous(labels = scales::percent_format(),
                       expand = expansion(mult = 0.12)) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(title = "Revert rate by test group",
         x = "Test group",
         y = "Percent reverted (48h)") +
    pc_theme() +
    guides(fill = "none")
  print(revert_plot)
} else {
  message("Revert plot: required columns missing in reference_check_save_data")
}
```
Chart note (definition of Rate / denominator)

Revert rate by test group: Rate = mean(`reverted`), where `reverted` is 1 if `was_reverted == 1` and 0 otherwise. Denominator = rows (edits) in `reference_check_save_data` for each test group.
Code
# Guardrail #1 tables (revert rate) by platform and deltas
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "platform", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  gr1_df <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(reverted = ifelse(was_reverted == 1, 1L, 0L))

  # Overall (control vs test) + change vs control
  gr1_overall_rates <- make_rate_table(gr1_df, "reverted", group_cols = c("test_group")) %>%
    mutate(scope = "Overall")
  gr1_overall_rel <- make_rel_change_dim(gr1_overall_rates, dim_col = "scope")
  render_rate_rel(
    gr1_overall_rates,
    gr1_overall_rel,
    "Guardrail #1: revert rate (48h) overall",
    "Guardrail #1: change vs control (overall)",
    c(
      test_group = "Test group",
      scope = "Scope",
      rate = "Revert rate",
      n = "Count (edits)"
    ),
    note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control)."
  )

  # By platform (control vs test) + change vs control
  guardrail1_rates <- gr1_df %>% make_rate_table("reverted")
  guardrail1_rel <- make_rel_change(guardrail1_rates)
  render_rate_rel(
    guardrail1_rates, guardrail1_rel,
    "Guardrail #1: revert rate (48h) by platform",
    "Guardrail #1: change vs control (by platform)",
    c(
      test_group = "Test group",
      platform = "Platform",
      rate = "Revert rate",
      n = "Count (edits)"
    ),
    note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform)."
  )

  # Guardrail #1 by checks shown (bucketed)
  if ("n_checks_shown" %in% names(gr1_df)) {
    gr1_checks <- gr1_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      )) %>%
      group_by(checks_bucket) %>%
      summarise(
        rate = mean(reverted, na.rm = TRUE),
        n = n(),
        .groups = "drop"
      ) %>%
      mutate(checks_bucket = factor(checks_bucket,
                                    levels = c("unknown", "0", "1", "2", "3+"))) %>%
      arrange(checks_bucket)
    render_slice(
      gr1_checks,
      "Guardrail #1 by checks shown (test group only)",
      c(
        checks_bucket = "Checks shown",
        rate = "Revert rate",
        n = "Count (edits)"
      ),
      note_text = "Revert rate = mean(0/1 outcome) in the test group only where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream."
    )
  } else {
    message("Guardrail #1 by checks: required columns missing in reference_check_save_data")
  }

  # User experience breakdown (Unregistered / Newcomer / Junior Contributor)
  if ("experience_level_group" %in% names(gr1_df)) {
    gr1_exp_df <- gr1_df %>%
      filter(!is.na(experience_level_group),
             experience_level_group %in% c("Unregistered", "Newcomer", "Junior Contributor"))
    gr1_exp_rates <- make_rate_table(gr1_exp_df, "reverted",
                                     group_cols = c("test_group", "experience_level_group"))
    gr1_exp_rel <- make_rel_change_dim(gr1_exp_rates, dim_col = "experience_level_group")
    render_rate_rel(
      gr1_exp_rates, gr1_exp_rel,
      "Guardrail #1: revert rate (48h) by user experience",
      "Guardrail #1: change vs control (by user experience)",
      c(
        test_group = "Test group",
        experience_level_group = "User experience",
        rate = "Revert rate",
        n = "Count (edits)"
      ),
      note_rate = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience)."
    )

    # Guardrail #1 by platform and user experience
    gr1_exp_slices <- gr1_exp_df %>%
      group_by(test_group, platform, experience_level_group) %>%
      summarise(
        rate = mean(reverted, na.rm = TRUE),
        n = n(),
        .groups = "drop"
      )
    render_slice(
      gr1_exp_slices,
      "Guardrail #1: revert rate (48h) by platform and user experience",
      c(
        test_group = "Test group",
        platform = "Platform",
        experience_level_group = "User experience",
        rate = "Revert rate",
        n = "Count (edits)"
      ),
      note_text = "Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience)."
    )
  } else {
    message("Guardrail #1 user experience tables: experience_level_group not available in reference_check_save_data")
  }
} else {
  message("Guardrail #1 tables: required columns missing in reference_check_save_data")
}
Guardrail #1: revert rate (48h) overall

| Test group | Revert rate | Count (edits) | Scope |
|---|---|---|---|
| control | 28.2% | 1665 | Overall |
| test | 24.1% | 1574 | Overall |

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control).
Guardrail #1: change vs control (overall)

| Scope | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Overall | 28.2% | 24.1% | -4.1 | -14.5% | 1665 | 1574 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Guardrail #1: revert rate (48h) by platform

| Test group | Platform | Revert rate | Count (edits) |
|---|---|---|---|
| control | desktop | 24.5% | 1344 |
| control | mobile-web | 43.6% | 321 |
| test | desktop | 22.1% | 1295 |
| test | mobile-web | 33.3% | 279 |

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform).
Guardrail #1: change vs control (by platform)

| Platform | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| desktop | 24.5% | 22.1% | -2.4 | -9.8% | 1344 | 1295 |
| mobile-web | 43.6% | 33.3% | -10.3 | -23.6% | 321 | 279 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Guardrail #1 by checks shown (test group only)

| Checks shown | Revert rate | Count (edits) |
|---|---|---|
| 1 | 23.4% | 1086 |
| 2 | 20.2% | 208 |
| 3+ | 29.6% | 280 |

Table note: Revert rate = mean(0/1 outcome) in the test group only where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within the shown test group for each checks-shown bucket. Control is excluded because it has no comparable checks-shown event stream.
Guardrail #1: revert rate (48h) by user experience

| Test group | User experience | Revert rate | Count (edits) |
|---|---|---|---|
| control | Unregistered | 37.3% | 255 |
| control | Newcomer | 40.9% | 230 |
| control | Junior Contributor | 23.7% | 1180 |
| test | Unregistered | 29.6% | 260 |
| test | Newcomer | 35.3% | 221 |
| test | Junior Contributor | 20.5% | 1093 |

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × user experience).
Guardrail #1: change vs control (by user experience)

| User experience | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Unregistered | 37.3% | 29.6% | -7.6 | -20.5% | 255 | 260 |
| Newcomer | 40.9% | 35.3% | -5.6 | -13.6% | 230 | 221 |
| Junior Contributor | 23.7% | 20.5% | -3.2 | -13.6% | 1180 | 1093 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Guardrail #1: revert rate (48h) by platform and user experience

| Test group | Platform | User experience | Revert rate | Count (edits) |
|---|---|---|---|---|
| control | desktop | Unregistered | 34.4% | 180 |
| control | desktop | Newcomer | 36.4% | 187 |
| control | desktop | Junior Contributor | 20.4% | 977 |
| control | mobile-web | Unregistered | 44.0% | 75 |
| control | mobile-web | Newcomer | 60.5% | <50 |
| control | mobile-web | Junior Contributor | 39.9% | 203 |
| test | desktop | Unregistered | 29.0% | 193 |
| test | desktop | Newcomer | 33.5% | 191 |
| test | desktop | Junior Contributor | 18.2% | 911 |
| test | mobile-web | Unregistered | 31.3% | 67 |
| test | mobile-web | Newcomer | 46.7% | <50 |
| test | mobile-web | Junior Contributor | 31.9% | 182 |

Table note: Revert rate = mean(0/1 outcome) where outcome=1 means reverted within 48h (per edit). Denominator = new-content edits (rows) in `reference_check_save_data` within each analysis group (shown test vs eligible-not-shown control) for each (test group × platform × user experience).
Code
# Guardrail #1 model (revert rate)
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted", "platform",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_g1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      reverted = ifelse(was_reverted == 1, 1L, 0L),
      test_group = factor(test_group, levels = c("control", "test")),
      platform = factor(platform)
    )
  if (all(c("test_group", "platform", "reverted") %in% names(df_g1))) {
    tryCatch({
      # Model A (total effect): primary revert-rate estimand
      if (dplyr::n_distinct(df_g1$platform) > 1) {
        m_g1_a <- glm(reverted ~ test_group * platform,
                      data = df_g1, family = binomial())
      } else {
        m_g1_a <- glm(reverted ~ test_group + platform,
                      data = df_g1, family = binomial())
      }
      render_binom_model(
        m_g1_a,
        "Table 3. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits.",
        note_text = paste(
          "Model A estimates the total effect of treatment on revert rate.",
          "Outcome=1 means reverted within 48h on a new-content edit.",
          "Population is restricted to shown test vs eligible-not-shown control.",
          "Includes a test_group×platform interaction when platform has multiple levels.",
          "OR>1 indicates higher odds of revert.",
          sep = " "
        )
      )

      # Table 3A: mobile-web treatment vs control (Model A)
      if ("mobile-web" %in% levels(df_g1$platform)) {
        nd_ctrl <- data.frame(
          test_group = factor("control", levels = levels(df_g1$test_group)),
          platform = factor("mobile-web", levels = levels(df_g1$platform))
        )
        nd_test <- nd_ctrl %>%
          mutate(test_group = factor("test", levels = levels(df_g1$test_group)))
        mw_contrast <- tidy_glm_contrast_or(
          model = m_g1_a,
          newdata_control = nd_ctrl,
          newdata_test = nd_test,
          label = "Mobile-web: treatment vs control"
        )
        render_or_contrast_table(
          mw_contrast,
          "Table 3A. Mobile-web treatment vs control (revert rate within 48h) among new-content edits.",
          note_text = "This contrast is computed from Table 3 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value."
        )
      } else {
        message("Guardrail #1 Table 3A skipped: platform level 'mobile-web' not present in df_g1")
      }

      # Model B (conditional): only when editcheck-newreference is directly observed
      if ("was_reference_included" %in% names(df_g1)) {
        df_g1 <- df_g1 %>%
          mutate(was_reference_included = ifelse(
            is.na(was_reference_included),
            0L,
            as.integer(was_reference_included == 1)
          ))
        if (dplyr::n_distinct(df_g1$platform) > 1) {
          m_g1_b <- glm(reverted ~ test_group * platform + was_reference_included,
                        data = df_g1, family = binomial())
        } else {
          m_g1_b <- glm(reverted ~ test_group + platform + was_reference_included,
                        data = df_g1, family = binomial())
        }
        render_binom_model(
          m_g1_b,
          "Table 3B. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits (conditional on reference inclusion).",
          note_text = "Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed."
        )

        # Table 3C: mobile-web treatment vs control (Model B)
        if ("mobile-web" %in% levels(df_g1$platform)) {
          ref_mean_mw <- mean(df_g1$was_reference_included[df_g1$platform == "mobile-web"],
                              na.rm = TRUE)
          nd_ctrl_b <- data.frame(
            test_group = factor("control", levels = levels(df_g1$test_group)),
            platform = factor("mobile-web", levels = levels(df_g1$platform)),
            was_reference_included = ref_mean_mw
          )
          nd_test_b <- nd_ctrl_b %>%
            mutate(test_group = factor("test", levels = levels(df_g1$test_group)))
          mw_contrast_b <- tidy_glm_contrast_or(
            model = m_g1_b,
            newdata_control = nd_ctrl_b,
            newdata_test = nd_test_b,
            label = "Mobile-web: treatment vs control (conditional)"
          )
          render_or_contrast_table(
            mw_contrast_b,
            "Table 3C. Mobile-web treatment vs control (revert rate within 48h) among new-content edits (conditional on reference inclusion).",
            note_text = "This contrast is computed from Table 3B (Model B), holding was_reference_included at its mobile-web mean."
          )
        }
      } else {
        message("Guardrail #1 Model B skipped: was_reference_included (editcheck-newreference) not available in df_g1")
      }
    },
    error = function(e) {
      message("Guardrail #1 model error: ", e$message)
    })
  } else {
    message("Guardrail #1 model: required columns missing after aliasing")
  }
} else {
  message("Guardrail #1 model: required columns missing or data not loaded")
}
Table 3. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits.

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 0.324 | 0.286 | 0.367 | 0.063 | <0.001 |
| test_grouptest | 0.874 | 0.730 | 1.048 | 0.092 | 0.146 |
| platformmobile-web | 2.386 | 1.851 | 3.073 | 0.129 | <0.001 |
| test_grouptest:platformmobile-web | 0.739 | 0.506 | 1.078 | 0.193 | 0.118 |

Table note: Model A estimates the total effect of treatment on revert rate. Outcome=1 means reverted within 48h on a new-content edit. Population is restricted to shown test vs eligible-not-shown control. Includes a test_group×platform interaction when platform has multiple levels. OR>1 indicates higher odds of revert.
Table 3A. Mobile-web treatment vs control (revert rate within 48h) among new-content edits.

| Contrast | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Mobile-web: treatment vs control | 0.646 | 0.464 | 0.902 | 0.170 | 0.010 |

Table note: This contrast is computed from Table 3 (Model A) as the log-odds difference between (test, mobile-web) and (control, mobile-web), converted to an OR with a Wald 95% CI and two-sided p-value.
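The report computes this contrast with its `tidy_glm_contrast_or` helper, whose internals are not shown here. As a rough sketch of the standard Wald computation on the linear-predictor scale (the function name and argument names below are illustrative stand-ins, not the helper's actual API):

```r
# Illustrative sketch: OR contrast between two covariate profiles from a
# fitted glm, via the difference of linear predictors (log-odds scale).
contrast_or <- function(model, nd_control, nd_test) {
  tt  <- stats::delete.response(stats::terms(model))
  X_c <- stats::model.matrix(tt, nd_control, xlev = model$xlevels)
  X_t <- stats::model.matrix(tt, nd_test, xlev = model$xlevels)
  d   <- X_t - X_c                                    # contrast vector (log-odds)
  est <- drop(d %*% stats::coef(model))               # log-odds difference
  se  <- sqrt(drop(d %*% stats::vcov(model) %*% t(d)))
  c(or = exp(est),                                    # odds ratio
    ci_low = exp(est - 1.96 * se),                    # Wald 95% CI
    ci_high = exp(est + 1.96 * se),
    p = 2 * stats::pnorm(-abs(est / se)))             # two-sided p-value
}
```

For the Table 3A contrast, the two profiles differ only in `test_group`, so the contrast vector picks up the `test_grouptest` coefficient plus the `test_grouptest:platformmobile-web` interaction.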
Table 3B. Adjusted odds ratios (ORs) from multivariable logistic regression for reverted within 48h among new-content edits (conditional on reference inclusion).

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 0.383 | 0.335 | 0.435 | 0.067 | <0.001 |
| test_grouptest | 1.146 | 0.943 | 1.393 | 0.099 | 0.170 |
| platformmobile-web | 2.073 | 1.604 | 2.676 | 0.131 | <0.001 |
| was_reference_included | 0.483 | 0.400 | 0.583 | 0.096 | <0.001 |
| test_grouptest:platformmobile-web | 0.751 | 0.512 | 1.099 | 0.195 | 0.142 |

Table note: Model B is an optional conditional analysis that estimates the treatment effect holding reference inclusion constant, and is reported only when editcheck-newreference is directly observed.
Table 3C. Mobile-web treatment vs control (revert rate within 48h) among new-content edits (conditional on reference inclusion).

| Contrast | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Mobile-web: treatment vs control (conditional) | 0.861 | 0.611 | 1.212 | 0.175 | 0.390 |

Table note: This contrast is computed from Table 3B (Model B), holding was_reference_included at its mobile-web mean.
Code
# Guardrail #1 Bayesian lift (relax) — revert rate among new content
# Updated per methodology (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_g1_relax <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1) %>%
    mutate(
      reverted = ifelse(was_reverted == 1, 1L, 0L),
      test_group = factor(test_group, levels = c("control", "test"))
    )

  # Overall: control vs treatment
  g1_df <- df_g1_relax %>%
    transmute(
      outcome = reverted,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (overall)**")
  render_relax(g1_df, "Guardrail #1", metric_type = "proportion", better = "lower")

  IRdisplay::display_markdown("**Bayesian Analysis / Frequentist Analysis (mobile-web only)**")
  if ("platform" %in% names(df_g1_relax)) {
    df_g1_mw <- df_g1_relax %>% filter(platform == "mobile-web")
    if (nrow(df_g1_mw) == 0) {
      message("Guardrail #1 (mobile-web only): skipped (no rows after filtering platform == 'mobile-web')")
    } else if (dplyr::n_distinct(df_g1_mw$test_group) < 2) {
      message("Guardrail #1 (mobile-web only): skipped (need both control and test groups)")
    } else {
      g1_mw_df <- df_g1_mw %>%
        transmute(
          outcome = reverted,
          variation = dplyr::case_when(
            test_group == "control" ~ "control",
            test_group == "test" ~ "treatment",
            TRUE ~ as.character(test_group)
          )
        )
      render_relax(g1_mw_df, "Guardrail #1 (mobile-web only)",
                   metric_type = "proportion", better = "lower")
    }
  }
} else {
  message("Guardrail #1 relax: required columns missing or data not loaded")
}
Bayesian Analysis / Frequentist Analysis (overall)

Guardrail #1 — Relative lift ((Treatment − Control) / Control)

| Analysis | Point Estimate | Chance to Win | P(Treatment better) | p-value | 95% Lower | 95% Upper |
|---|---|---|---|---|---|---|
| Bayesian (CrI) | −0.141 | 0.002 | 0.998 | — | −0.239 | −0.043 |
| Frequentist (CI) | −0.145 | — | — | 0.004 | −0.245 | −0.046 |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.8% (computed as 1 − Chance to Win).
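The relax tool's internals are not shown in this notebook. For intuition, the overall lift can be approximated with a simple beta-binomial model using the counts from the prop-test table below (469/1665 reverted in control, 379/1574 in test) and flat priors; this is a sketch of the general approach, not the exact relax computation:

```r
# Approximate posterior for the relative lift in revert rate under
# independent Beta(1, 1) priors; counts come from the report's tables.
set.seed(42)
p_ctrl <- rbeta(1e5, 1 + 469, 1 + 1665 - 469)  # posterior of control revert rate
p_test <- rbeta(1e5, 1 + 379, 1 + 1574 - 379)  # posterior of test revert rate
lift   <- (p_test - p_ctrl) / p_ctrl           # relative lift vs control
mean(lift)                                     # point estimate, near the reported -0.141
quantile(lift, c(0.025, 0.975))                # 95% credible interval
mean(p_test < p_ctrl)                          # P(treatment better), since lower is better
```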
Bayesian Analysis / Frequentist Analysis (mobile-web only)

Guardrail #1 (mobile-web only) — Relative lift ((Treatment − Control) / Control)

| Analysis | Point Estimate | Chance to Win | P(Treatment better) | p-value | 95% Lower | 95% Upper |
|---|---|---|---|---|---|---|
| Bayesian (CrI) | −0.220 | 0.002 | 0.998 | — | −0.373 | −0.067 |
| Frequentist (CI) | −0.236 | — | — | 0.004 | −0.395 | −0.077 |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 99.8% (computed as 1 − Chance to Win).
Code
# Quick proportion tests for Guardrail #1 (revert) and Guardrail #2 (completion)
# Note: these are lightweight audit checks; primary inference is via regression + relax.
# Guardrail #1: revert rate among new-content edits (shown test vs eligible-not-shown control)
if (!is.null(reference_check_save_data) &&
    all(c("test_group", "is_new_content", "was_reverted",
          "was_reference_check_shown", "was_reference_check_eligible") %in%
        names(reference_check_save_data))) {
  df_g1 <- reference_check_save_data %>%
    make_rc_ab_group_published() %>%
    filter(is_new_content == 1)
  prop_df <- df_g1 %>%
    group_by(test_group) %>%
    summarise(
      success = sum(was_reverted == 1, na.rm = TRUE),
      total = n(),
      .groups = "drop"
    )
  ctrl <- prop_df %>% filter(test_group == "control")
  tst <- prop_df %>% filter(test_group == "test")
  if (nrow(ctrl) == 1 && nrow(tst) == 1) {
    render_prop_test(ctrl$success, ctrl$total, tst$success, tst$total,
                     "Prop test (Guardrail #1, revert rate)")
  }
}
Prop test (Guardrail #1, revert rate)

| Group | Success | Total | Rate |
|---|---|---|---|
| control | 469 | 1665 | 28.2% |
| test | 379 | 1574 | 24.1% |

Prop test (Guardrail #1, revert rate) (prop.test)

| Metric | Value |
|---|---|
| p_value | 0.00916 |
| statistic | 6.79 |
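The audit check above can be reproduced directly from the reported counts with base R's `prop.test`:

```r
# Two-sample proportion test on the reported revert counts:
# control 469/1665 vs test 379/1574 (chi-squared test with continuity correction).
prop.test(x = c(469, 379), n = c(1665, 1574))
```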
4.2.5 Guardrail #2: Edit completion
Metric: Proportion of edits that reach the point where Reference Check was shown (or would have been shown) and are successfully published, defined as `event.action = "saveSuccess"`.
Eligibility: Eligible editing sessions are those where a user clicks publish, defined as `event.action = "saveIntent"`, and successfully publishes the edit, defined as `event.action = "saveSuccess"`.
Test group constraint: In the test group, analysis is limited to edits where Reference Check was shown at least once.
Methodology: We review the proportion of edits by newcomers, junior contributors with fewer than 100 edits, and unregistered users that reach saveIntent and successfully publish. Analysis is limited to edits that are not reverted within 48 hours.
Test group: The test group includes edits where Reference Check was shown.
Control group: The control group includes all edits that reach saveIntent. The control group cannot be limited to eligible-but-not-shown edits because eligibility is only tagged on published edits.
Note: Similar to Guardrail #1 in Reference Check 2024.
Results:
We did not observe a drastic decrease in edit completion rate. Reference Check slightly reduces the likelihood that an edit is completed (−4.8% relative: 88.3% control vs. 84.1% test), and this effect is statistically significant. By platform:
Desktop: completion decreased 6.8% relative (94.0% → 87.6%).
Mobile web: completion decreased 6.3% relative (74.1% → 69.4%).
This guardrail shows a real and statistically meaningful decrease in completion. Reference Check introduces measurable friction that lowers completion rates, but this trade-off coincides with higher-quality outcomes: more references added, fewer reverts, and more constructive edits on mobile web.
Note: In the 2024 Reference Check report, there was a 10% decrease in edit completion rate for edits where Reference Check was shown compared to the control group. The decrease was larger on mobile than on desktop: mobile edit completion dropped −24.3% (−13.5 pp) while desktop dropped only −3.1% (−2.3 pp). Note that the completion rates reported in the 2024 report include saved edits that were reverted.
Guardrail #2 (Completion = saveIntent → saveSuccess):
Overall: ↓ −4.2 pp (88.3% → 84.1%), −4.8% relative.
Desktop: ↓ −6.4 pp (94.0% → 87.6%), −6.8% relative.
Mobile web: ↓ −4.7 pp (74.1% → 69.4%), −6.3% relative.
Evidence: glm (Table 4) OR = 0.58 (95% CI 0.49–0.67), p < 0.001.
Relax (relative lift): −0.048 (p < 0.001).
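The percentage-point and relative figures above follow the same arithmetic as the "change vs control" tables:

```r
# Delta arithmetic behind the Guardrail #2 summary (overall completion rates).
ctrl_rate <- 0.883                                  # control completion rate
test_rate <- 0.841                                  # test (shown) completion rate
abs_diff_pp <- (test_rate - ctrl_rate) * 100        # -4.2 percentage points
rel_change  <- (test_rate - ctrl_rate) / ctrl_rate  # about -4.8% relative
```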
Code
# Completion (saveSuccess) bar by test group
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  completion_plot <- ec_df %>%
    group_by(test_group) %>%
    summarise(
      rate = mean(saved_edit, na.rm = TRUE),
      n = n(),
      .groups = "drop"
    ) %>%
    mutate(label = scales::percent(rate, accuracy = 0.1)) %>%
    ggplot(aes(x = test_group, y = rate, fill = test_group)) +
    geom_col() +
    geom_text(aes(label = label), vjust = -0.2, size = 4) +
    scale_y_continuous(labels = scales::percent_format(),
                       expand = expansion(mult = 0.12)) +
    scale_fill_manual(values = c("control" = "#999999", "test" = "dodgerblue4")) +
    labs(
      title = "Completion (saveSuccess) by test group",
      x = "Test group",
      y = "Percent saveSuccess"
    ) +
    pc_theme() +
    guides(fill = "none")
  (completion_plot)
} else {
  message("Completion plot: required columns missing in edit_completion_rate_data")
}
Chart note (definition of Rate / denominator): Completion (saveSuccess) by experiment group. Rate = mean(`saved_edit`), where `saved_edit` is a 0/1 outcome (1 = saveSuccess). Denominator = rows (events) in `edit_completion_rate_data` within each experiment group after the Guardrail #2 filters (shown-only in test; focus population). Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available in `edit_completion_rate_data`.
Code
# 3) Edit completion rate (saveIntent -> saveSuccess)
# Updated to match Guardrail #2 methodology:
#   - test rows are shown-only; control includes all saveIntent rows
#   - focus population: Newcomer / Junior Contributor / Unregistered
#   - exclude edits reverted within 48 hours when available
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  completion_summary <- ec_df %>%
    mutate(
      was_reference_check_shown = ifelse(was_reference_check_shown == 1, "shown", "not_shown"),
      saved_edit = ifelse(saved_edit == 1, "saved", "not_saved")
    ) %>%
    count(test_group, was_reference_check_shown, saved_edit) %>%
    group_by(test_group, was_reference_check_shown) %>%
    mutate(pct = n / sum(n)) %>%
    arrange(test_group, was_reference_check_shown, desc(n))
  completion_summary <- renorm_buckets(completion_summary)
  render_pct_table(
    completion_summary,
    "Edit completion (saveIntent → saveSuccess)",
    c(
      test_group = "Test group",
      was_reference_check_shown = "Reference Check shown",
      saved_edit = "Outcome",
      n = "Count (events)",
      pct = "Percent of events"
    ),
    note_text = "Percent of events = share of rows (events) in `edit_completion_rate_data` within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population). This table is a breakdown of events into `saved` vs `not_saved` (saveSuccess vs not saveSuccess). Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )
} else {
  message("Completion summary: required columns missing in edit_completion_rate_data")
}
Edit completion (saveIntent → saveSuccess)

| Group | Outcome | Count (events) | Percent of events |
|---|---|---|---|
| control - not_shown | saved | 58283 | 88.3% |
| control - not_shown | not_saved | 7686 | 11.7% |
| test - shown | saved | 1223 | 84.1% |
| test - shown | not_saved | 231 | 15.9% |

Table note: Percent of events = share of rows (events) in `edit_completion_rate_data` within each (test group × Reference Check shown) after the Guardrail #2 filters (shown-only in test; focus population). This table is a breakdown of events into `saved` vs `not_saved` (saveSuccess vs not saveSuccess). Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available.
Code
# Completion tables by platform and deltas (Guardrail #2)
# Updated per methodology:
#   - test group is limited to rows where RC was shown at least once
#   - control group includes all rows reaching saveIntent (as represented in edit_completion_rate_data)
#   - population focus: Newcomer / Junior (<=100 edits) / Unregistered; limit to unreverted within 48h when available
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "platform", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }

  # Overall (control vs test) + change vs control
  completion_overall_rates <- make_rate_table(ec_df, "saved_edit",
                                              group_cols = c("test_group")) %>%
    mutate(scope = "Overall")
  completion_overall_rel <- make_rel_change_dim(completion_overall_rates, dim_col = "scope")
  render_rate_rel(
    completion_overall_rates,
    completion_overall_rel,
    "Completion (saveSuccess) overall",
    "Completion: change vs control (overall)",
    c(
      test_group = "Test group",
      scope = "Scope",
      rate = "Rate",
      n = "Count (events)"
    ),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` within each experiment group after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )

  # By platform (control vs test) + change vs control
  completion_rates <- ec_df %>% make_rate_table("saved_edit")
  completion_rel <- make_rel_change(completion_rates)
  render_rate_rel(
    completion_rates, completion_rel,
    "Completion (saveSuccess) by platform",
    "Completion: change vs control (by platform)",
    c(
      test_group = "Test group",
      platform = "Platform",
      rate = "Rate",
      n = "Count (events)"
    ),
    note_rate = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` for each (test group × platform) after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available."
  )
} else {
  message("Completion tables: required columns missing in edit_completion_rate_data")
}
Completion (saveSuccess) overall

| Test group | Rate | Count (events) | Scope |
|---|---|---|---|
| control | 88.3% | 65969 | Overall |
| test | 84.1% | 1454 | Overall |

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` within each experiment group after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available.
Completion: change vs control (overall)

| Scope | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| Overall | 88.3% | 84.1% | -4.2 | -4.8% | 65969 | 1454 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Completion (saveSuccess) by platform

| Test group | Platform | Rate | Count (events) |
|---|---|---|---|
| control | desktop | 94.0% | 47267 |
| control | mobile-web | 74.1% | 18702 |
| test | desktop | 87.6% | 1176 |
| test | mobile-web | 69.4% | 278 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test group is shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered. Denominator = rows in `edit_completion_rate_data` for each (test group × platform) after these filters. Note: these completion results exclude edits reverted within 48 hours when the `was_reverted` flag is available.
Completion: change vs control (by platform)

| Platform | Control rate | Test rate | Absolute difference (pp) | Relative change vs control | N (control) | N (test) |
|---|---|---|---|---|---|---|
| desktop | 94.0% | 87.6% | -6.4 | -6.8% | 47267 | 1176 |
| mobile-web | 74.1% | 69.4% | -4.7 | -6.3% | 18702 | 278 |

Table note: Absolute difference (pp) = (test − control) × 100. Relative change = (test − control) / control.
Code
# Guardrail #2 model (completion = saveSuccess vs saveIntent)
# Updated per methodology (shown-only in test; control includes all saveIntent rows; focus population)
if (!is.null(edit_completion_rate_data)) {
  df_g2 <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group() %>%
    mutate(completed = saved_edit)
  if ("experience_level_group" %in% names(df_g2)) {
    df_g2 <- df_g2 %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(df_g2)) {
    df_g2 <- df_g2 %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  if (all(c("test_group", "platform", "completed") %in% names(df_g2)) &&
      ("experience_level_group" %in% names(df_g2))) {
    tryCatch({
      # Include was_reference_check_shown only if it varies after filtering (avoid perfect collinearity)
      f_g2 <- completed ~ test_group + platform + experience_level_group
      if ("was_reference_check_shown" %in% names(df_g2) &&
          length(unique(df_g2$was_reference_check_shown)) > 1) {
        f_g2 <- completed ~ test_group + platform + experience_level_group +
          was_reference_check_shown
      }
      m_g2 <- glm(f_g2, data = df_g2, family = binomial())
      render_binom_model(
        m_g2,
        "Table 4. Adjusted odds ratios (ORs) from multivariable logistic regression for saveSuccess among saveIntent events.",
        note_text = "Outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Adjusted for platform and experience group (and for Reference Check shown when it varies). OR>1 indicates higher odds of the outcome."
      )
    },
    error = function(e) {
      message("Guardrail #2 model error: ", e$message)
    })
  } else {
    message("Guardrail #2 model: required columns missing or data not loaded")
  }
} else {
  message("Guardrail #2 model: data not loaded")
}
Table 4. Adjusted odds ratios (ORs) from multivariable logistic regression for saveSuccess among saveIntent events.

| Term | OR | CI low | CI high | SE | p-value |
|---|---|---|---|---|---|
| Intercept | 6.242 | 5.903 | 6.603 | 0.029 | <0.001 |
| test_grouptest | 0.575 | 0.494 | 0.671 | 0.078 | <0.001 |
| platformmobile-web | 0.212 | 0.201 | 0.222 | 0.026 | <0.001 |
| experience_level_groupNewcomer | 1.243 | 1.146 | 1.349 | 0.042 | <0.001 |
| experience_level_groupJunior Contributor | 3.495 | 3.301 | 3.700 | 0.029 | <0.001 |
| was_reference_check_shown | NA | NA | NA | NA | NA |

Table note: Outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Adjusted for platform and experience group (and for Reference Check shown when it varies). OR>1 indicates higher odds of the outcome.
Code
# Completion by platform and user_status
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data)) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  if (all(c("test_group", "saved_edit", "platform", "user_status") %in% names(ec_df))) {
    completion_slices <- ec_df %>%
      group_by(test_group, platform, user_status) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      completion_slices,
      "Completion by platform and user status",
      c(test_group = "Test group", platform = "Platform",
        user_status = "User status", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Denominator = rows in `edit_completion_rate_data` for each (test group × platform × user status) after these filters."
    )
    # Completion by experience group (explicitly matches methodology)
    if ("experience_level_group" %in% names(ec_df)) {
      completion_exp <- ec_df %>%
        group_by(test_group, experience_level_group) %>%
        summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
      render_slice(
        completion_exp,
        "Completion by experience group",
        c(test_group = "Test group", experience_level_group = "Experience group",
          rate = "Rate", n = "Count (events)"),
        note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` for each (test group × experience group) after the same filters used above."
      )
    }
  } else {
    message("Completion slices: required columns missing in edit_completion_rate_data")
  }
} else {
  message("Completion slices: required columns missing in edit_completion_rate_data")
}

# Completion by number of checks shown (bucketed)
if (!is.null(edit_completion_rate_data)) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  if (all(c("test_group", "saved_edit", "n_checks_shown") %in% names(ec_df))) {
    ec_df <- ec_df %>%
      mutate(checks_bucket = case_when(
        is.na(n_checks_shown) ~ "unknown",
        n_checks_shown == 0 ~ "0",
        n_checks_shown == 1 ~ "1",
        n_checks_shown == 2 ~ "2",
        n_checks_shown >= 3 ~ "3+"
      ))
    completion_by_checks <- ec_df %>%
      # Checks-shown buckets come from RC shown events; we report this slice for the test group only.
      filter(test_group == "test") %>%
      group_by(checks_bucket) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      completion_by_checks,
      "Completion by checks shown (test group only)",
      c(checks_bucket = "Checks shown", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) in the test group only where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` within the shown test group for each checks-shown bucket after the same filters used above. Control is excluded because it has no comparable checks-shown event stream."
    )
  } else {
    message("Completion by checks: required columns missing in edit_completion_rate_data")
  }
} else {
  message("Completion by checks: required columns missing in edit_completion_rate_data")
}
Completion by platform and user status

| Test group | Platform | User status | Rate | Count (events) |
|---|---|---|---|---|
| control | desktop | registered | 96.1% | 41697 |
| control | desktop | unregistered | 78.0% | 5570 |
| control | mobile-web | registered | 76.3% | 14291 |
| control | mobile-web | unregistered | 66.8% | 4411 |
| test | desktop | registered | 89.1% | 1001 |
| test | desktop | unregistered | 78.9% | 175 |
| test | mobile-web | registered | 68.9% | 212 |
| test | mobile-web | unregistered | 71.2% | 66 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Population is restricted to Newcomer/Junior/Unregistered and to unreverted rows when available. Denominator = rows in `edit_completion_rate_data` for each (test group × platform × user status) after these filters.
Completion by experience group

| Test group | Experience group | Rate | Count (events) |
|---|---|---|---|
| control | Unregistered | 73.0% | 9981 |
| control | Newcomer | 79.4% | 5615 |
| control | Junior Contributor | 92.4% | 50373 |
| test | Unregistered | 76.8% | 241 |
| test | Newcomer | 83.4% | 175 |
| test | Junior Contributor | 85.9% | 1038 |

Table note: Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` for each (test group × experience group) after the same filters used above.
Completion by checks shown (test group only)

| Checks shown | Rate | Count (events) |
|---|---|---|
| 1 | 86.3% | 985 |
| 2 | 84.7% | 202 |
| 3+ | 75.7% | 267 |

Table note: Rate = mean(0/1 outcome) in the test group only where outcome=1 means saveSuccess (per event). Denominator = rows in `edit_completion_rate_data` within the shown test group for each checks-shown bucket after the same filters used above. Control is excluded because it has no comparable checks-shown event stream.
Code
# Per-wiki sanity: completion
# Updated per methodology (shown-only in test; focus population; unreverted when available)
# Only render when multiple wikis are present.
if (!is.null(edit_completion_rate_data) &&
    all(c("wiki", "test_group", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  if (dplyr::n_distinct(edit_completion_rate_data$wiki) <= 1) {
    message("Per-wiki completion: skipped (single wiki)")
  } else {
    ec_df <- edit_completion_rate_data %>%
      make_rc_ab_group_completion() %>%
      add_experience_group()
    if ("experience_level_group" %in% names(ec_df)) {
      ec_df <- ec_df %>%
        filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
    }
    if ("was_reverted" %in% names(ec_df)) {
      ec_df <- ec_df %>%
        filter(is.na(was_reverted) | was_reverted != 1)
    }
    per_wiki_completion <- ec_df %>%
      group_by(wiki, test_group) %>%
      summarise(rate = mean(saved_edit, na.rm = TRUE), n = n(), .groups = "drop")
    render_slice(
      per_wiki_completion,
      "Per-wiki completion",
      c(wiki = "Wiki", test_group = "Test group", rate = "Rate", n = "Count (events)"),
      note_text = "Rate = mean(0/1 outcome) where outcome=1 means saveSuccess (per event). Test rows are shown-only; control includes all saveIntent rows. Denominator = rows in `edit_completion_rate_data` for each (wiki × test group) after these filters."
    )
  }
} else {
  message("Per-wiki completion: required columns missing in edit_completion_rate_data")
}
Per-wiki completion: skipped (single wiki)
Code
# Dismissal significance: prop.test by platform and by user_status
# (Normalize to control/test for readability.)
if (!is.null(reference_check_rejects_data) &&
    all(c("test_group", "platform", "user_status", "reject_reason", "editing_session") %in%
        names(reference_check_rejects_data))) {
  dismiss_df <- reference_check_rejects_data %>%
    renorm_buckets() %>%
    filter(reject_reason %in% c(
      "edit-check-feedback-reason-common-knowledge",
      "edit-check-feedback-reason-irrelevant",
      "edit-check-feedback-reason-uncertain",
      "edit-check-feedback-reason-other"
    ))
  base_df <- reference_check_rejects_data %>%
    renorm_buckets()
  # By platform
  plat_sessions <- base_df %>%
    group_by(test_group, platform) %>%
    summarise(total_sessions = n_distinct(editing_session), .groups = "drop")
  plat_dismiss <- dismiss_df %>%
    group_by(test_group, platform) %>%
    summarise(dismiss_sessions = n_distinct(editing_session), .groups = "drop")
  plat_join <- plat_sessions %>%
    left_join(plat_dismiss, by = c("test_group", "platform")) %>%
    mutate(dismiss_sessions = coalesce(dismiss_sessions, 0L))
  if (all(c("control", "test") %in% plat_join$test_group)) {
    ctrl <- plat_join %>% filter(test_group == "control")
    tst  <- plat_join %>% filter(test_group == "test")
    for (p in intersect(ctrl$platform, tst$platform)) {
      c_row <- ctrl %>% filter(platform == p)
      t_row <- tst %>% filter(platform == p)
      if (nrow(c_row) == 1 && nrow(t_row) == 1) {
        cat("\nProp test (Dismissal rate) platform =", p, "\n")
        print(prop.test(
          c(t_row$dismiss_sessions, c_row$dismiss_sessions),
          c(t_row$total_sessions, c_row$total_sessions)
        ))
      }
    }
  }
  # By user_status
  us_sessions <- base_df %>%
    group_by(test_group, user_status) %>%
    summarise(total_sessions = n_distinct(editing_session), .groups = "drop")
  us_dismiss <- dismiss_df %>%
    group_by(test_group, user_status) %>%
    summarise(dismiss_sessions = n_distinct(editing_session), .groups = "drop")
  us_join <- us_sessions %>%
    left_join(us_dismiss, by = c("test_group", "user_status")) %>%
    mutate(dismiss_sessions = coalesce(dismiss_sessions, 0L))
  if (all(c("control", "test") %in% us_join$test_group)) {
    ctrl2 <- us_join %>% filter(test_group == "control")
    tst2  <- us_join %>% filter(test_group == "test")
    for (u in intersect(ctrl2$user_status, tst2$user_status)) {
      c_row <- ctrl2 %>% filter(user_status == u)
      t_row <- tst2 %>% filter(user_status == u)
      if (nrow(c_row) == 1 && nrow(t_row) == 1) {
        cat("\nProp test (Dismissal rate) user_status =", u, "\n")
        print(prop.test(
          c(t_row$dismiss_sessions, c_row$dismiss_sessions),
          c(t_row$total_sessions, c_row$total_sessions)
        ))
      }
    }
  }
} else {
  message("Dismissal prop tests: required columns missing in reference_check_rejects_data")
}
Code
# Guardrail #2 Bayesian lift (relax) — completion (saveSuccess vs saveIntent)
# Updated per methodology (shown-only in test; focus population; unreverted when available)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  guardrail2_df <- ec_df %>%
    transmute(
      outcome = saved_edit,
      variation = dplyr::case_when(
        test_group == "control" ~ "control",
        test_group == "test" ~ "treatment",
        TRUE ~ as.character(test_group)
      )
    )
  render_relax(guardrail2_df, "Guardrail #2",
               metric_type = "proportion", better = "higher")
} else {
  message("Guardrail #2 relax: required columns missing or data not loaded")
}
Guardrail #2

Relative lift ((Treatment − Control) / Control)

Bayesian Analysis

| Point Estimate | Chance to Win | P(Treatment better) | 95% CrI Lower | 95% CrI Upper |
|---|---|---|---|---|
| −0.048 | 0.000 | 0.000 | −0.069 | −0.026 |

Frequentist Analysis

| Point Estimate | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| −0.048 | 0.000 | −0.069 | −0.027 |

Interpretation: Based on relax, the posterior probability that treatment is better than control is 0.0% (computed as Chance to Win).
Code
# Guardrail #2: completion rate (test shown-only; control all saveIntent rows)
if (!is.null(edit_completion_rate_data) &&
    all(c("test_group", "saved_edit", "was_reference_check_shown") %in%
        names(edit_completion_rate_data))) {
  ec_df <- edit_completion_rate_data %>%
    make_rc_ab_group_completion() %>%
    add_experience_group()
  if ("experience_level_group" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(experience_level_group %in% c("Newcomer", "Junior Contributor", "Unregistered"))
  }
  if ("was_reverted" %in% names(ec_df)) {
    ec_df <- ec_df %>%
      filter(is.na(was_reverted) | was_reverted != 1)
  }
  prop_df2 <- ec_df %>%
    group_by(test_group) %>%
    summarise(success = sum(saved_edit == 1, na.rm = TRUE), total = n(), .groups = "drop")
  ctrl <- prop_df2 %>% filter(test_group == "control")
  tst  <- prop_df2 %>% filter(test_group == "test")
  if (nrow(ctrl) == 1 && nrow(tst) == 1) {
    render_prop_test(ctrl$success, ctrl$total, tst$success, tst$total,
                     "Prop test (Guardrail #2, completion)")
  }
}
Prop test (Guardrail #2, completion)

| Group | Success | Total | Rate |
|---|---|---|---|
| control | 58283 | 65969 | 88.3% |
| test | 1223 | 1454 | 84.1% |

Prop test (Guardrail #2, completion) (prop.test)

| Metric | Value |
|---|---|
| p_value | 8.56e-07 |
| statistic | 24.2 |
Additional metrics

Retention

Chart note (definition of Rate / denominator). Retention (7–14d) by platform and test group: Rate = mean(`returned`), where `returned` is a 0/1 per-user flag (1 = returned in the window, 0 = did not). Denominator = users (rows) in `constructive_retention_data` for each (test group × platform).

Dismissal

Chart note (definition of Rate / denominator). Dismissal reason charts (overall / by platform / by user_status): the plotted percent is the share of dismissals selecting each reason within the slice shown. Denominator = dismissal rows in `reference_check_rejects_data` (filtered to the four valid reasons) for that slice.
Reference
How to read this report
Focus: Reference Check A/B test KPIs (reference added, or acknowledgement of why a citation was not added; constructive edits), plus guardrails (revert rate, completion, dismissals, retention).
Dimensions: group (control, test), platform (mobile web, desktop), user status (registered, unregistered), and checks-shown buckets. (This report is enwiki-only.)
Statistical meaningfulness: Primary inference uses multivariable logistic regression (glm; binomial). Statistical meaningfulness was defined a priori as two-sided p < 0.05. As a robustness check, we fit a relax Bayesian model; effects with posterior probability >95% of a non-null association were considered corroborated.
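The inference pattern above can be sketched as follows (a minimal sketch, assuming a data frame `df` with the outcome and covariate columns used elsewhere in this report; the notebook's own `render_binom_model` table helper is not shown):

```r
# Fit the adjusted logistic regression for a 0/1 outcome.
m <- glm(saved_edit ~ test_group + platform + experience_level_group,
         data = df, family = binomial())

# Exponentiate coefficients to get odds ratios; Wald 95% CIs via confint.default().
or_table <- exp(cbind(OR = coef(m), confint.default(m)))
print(round(or_table, 3))  # OR > 1 indicates higher odds of the outcome
```

`confint.default()` gives fast Wald intervals; profile-likelihood intervals via `confint()` are an alternative when cell counts are small.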
General Methodology
In this A/B test, users in the test group are shown Reference Check when attempting an edit that meets the
requirements
for the check to be shown in VisualEditor. The control group gets the default editing experience, with no Reference Check shown.
We collected A/B test events logged between 8 November 2025 and 8 December 2025 on English Wikipedia.
We relied on events logged in EditAttemptStep, VisualEditorFeatureUse, and change tags recorded in the revision tags table.
Published edits eligible for Reference Check are identified by the `editcheck-references` revision tag.
For filtering to new-content edits we use `editcheck-newcontent`.
To identify edits where Reference Check was shown we use VisualEditorFeatureUse events: `event.feature` = `editCheck-addReference` with `event.action` = `check-shown-presave`; `action-reject` = editor dismissed Reference Check; `edit-check-feedback-reason-*` = reason for dismissal.
For calculating Edit Completion Rate we assume that all edits reaching saveIntent are eligible.
For calculating Revert Rate, published edits eligible for Reference Check are identified by the `editcheck-references` revision tag.
See the
instrumentation spec
for more details.
Data was limited to mobile web and desktop edits completed in the main (article) namespace using VisualEditor on English Wikipedia. We also limited to edits completed by unregistered users and users with 100 or fewer edits, as those are the users who would be shown Reference Check under the default
config settings
.
For each metric, we reviewed the following dimensions: experiment group (test and control), platform (mobile web or desktop), and user experience and status. We also reviewed some indicators, such as edit completion rate, by the number of checks shown within a single editing session.
Note: For the by-user-experience analysis, we split newer editors into three experience-level groups: (1) unregistered; (2) newcomer, a registered user making their first edit on Wikipedia; and (3) junior contributor, a registered contributor with >0 and ≤100 edits (i.e., 1–100).
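A minimal sketch of that three-way split, assuming `user_status` and `user_edit_count` reflect the editor's state at the time of the edit (the notebook's actual helper is `add_experience_group()`, not shown here):

```r
library(dplyr)

df <- df %>%
  mutate(experience_level_group = case_when(
    user_status == "unregistered" ~ "Unregistered",
    user_edit_count == 0          ~ "Newcomer",            # registered, making first edit
    user_edit_count <= 100        ~ "Junior Contributor"   # registered, 1-100 edits
  ))
```

Because `case_when()` evaluates conditions in order, registered editors with more than 100 edits fall through to `NA` and drop out of the focus population.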
Data & Methods
Data collection: `collect_enwiki_refcheck_ab_test_data.ipynb`
Styling: paste-check-aligned plot/table defaults; method-note callout CSS included.
Models: logistic regression (glm) + Bayesian lift (relax) for inference/uncertainty.
Definitions
New-content edit: An edit where `is_new_content == 1` in the dataset (i.e., edits tagged/flagged as new-content in instrumentation). As indicated at https://www.mediawiki.org/wiki/Edit_check/Tags: "Tag applied to all edits in which new content is added," where "new content" in this context is defined by the conditions set out in T324730 and now codified in editcheck/modules/init.js.
Constructive: A new-content edit that was not reverted within 48 hours. In code: `constructive = 1` when `was_reverted != 1`.
Returned (retention): A per-user 0/1 flag `returned`, where 1 = the user made at least one subsequent saveSuccess 7–14 days after their first eligible edit, and 0 otherwise.
Dismissal rate: Numerator = count of dismissal events (filtered to the 4 valid reasons). Denominator = distinct `editing_session` count in the same slice (e.g., by test group × platform).
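The dismissal-rate definition can be computed in one pass (a sketch assuming the `reference_check_rejects_data` columns listed later in this report; the reason filter matches the four valid RC reasons):

```r
library(dplyr)

valid_reasons <- c("edit-check-feedback-reason-common-knowledge",
                   "edit-check-feedback-reason-irrelevant",
                   "edit-check-feedback-reason-uncertain",
                   "edit-check-feedback-reason-other")

dismissal_rate <- reference_check_rejects_data %>%
  group_by(test_group, platform) %>%
  summarise(
    total_sessions   = n_distinct(editing_session),
    dismiss_sessions = n_distinct(editing_session[reject_reason %in% valid_reasons]),
    rate             = dismiss_sessions / total_sessions,
    .groups = "drop"
  )
```

Subsetting `editing_session` inside `n_distinct()` keeps the numerator and denominator on the same grouped slice, so no join is needed.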
Reference tags & instrumentation
Revision tags (from measurement plan):
`editcheck-references` (Reference Check eligible)
`editcheck-references-shown` (Reference Check shown; we treat this as a secondary/audit signal)
`editcheck-newcontent` (new content edit)
`editcheck-newreference` (net new reference added)
`mw-reverted` (reverted)
Dataset column mapping used in this notebook:
`is_new_content == 1` ↔︎ `editcheck-newcontent`
`was_reference_check_eligible == 1` ↔︎ `editcheck-references`
`was_reference_check_shown == 1` ↔︎ VisualEditorFeatureUse (VEFU) `event.feature = editCheck-addReference` with `event.action = check-shown-presave`
`was_reference_included == 1` ↔︎ `editcheck-newreference`
`was_reverted == 1` ↔︎ `mw-reverted` within 48 hours (per the collection definition)
Key events used for engagement / dismissal breakdowns (feature `editCheck-addReference` unless noted):
`check-shown-presave` (RC shown), `action-accept`, `action-reject`
`edit-check-feedback-shown` (survey), `edit-check-feedback-reason-*` with valid RC reasons: other, uncertain, common-knowledge, irrelevant
`editCheckDialog` `window-open-from-check-[moment]` (sidebar open; used as an auxiliary signal when needed)
Note: `relevant-paste`, `ignored-paste-*`, and `check-learn-more` are Paste Check events; we keep them in the reference list for cross-notebook consistency, but we do not use them for Reference Check metrics.
Control nuance from multi-check 2024: control had RC available; we use treatment-only comparisons there.
Column names (from collection notebook)
Core identifiers: `wiki`, `test_group`, `user_id`, `user_status` (registered vs unregistered), `user_edit_count`, `experience_level_group` (Unregistered / Newcomer / Junior Contributor), `editing_session` (EditAttemptStep editing_session_id), `platform`.
Reference/acknowledgement flags (acknowledgement = the editor stated why a citation was not added; pick-first): `has_reference_or_acknowledgement`, `added_reference_or_acknowledgement`, `has_reference`, `reference_added`, `has_reference_added`, `was_reference_included`.
Retention flags (pick-first): `retained_7_14d`, `retained_14d`, `retained`, `returned` (retention flag: qualifying return edit in the 7–14d window after first RC shown/eligible; if absent, use `retention_flag_candidates`).
Revert flag: `was_reverted` (48h window from collection queries).
New content flag: `is_new_content`.
Shown/eligible flags: `was_reference_check_shown`, `n_checks_shown`, `was_reference_check_eligible`, `reference_check_shown`.
Completion / outcome: `saved_edit`, `event_action` (expect `saveIntent`, `saveSuccess`).
Dismissals: `was_reference_check_rejected`, `n_rejects`, `reject_reason`.
Retention aggregates: `return_editors` (returned users in window), `editors` (total users in grouping; retention denominator).
Retention note
For this A/B test, we use the standard return-editor retention window: the editor returns 7–14 days after first being shown Reference Check.
The code looks for a retention flag in `retention_flag_candidates` (e.g., `retained_7_14d`, `retained_14d`, etc.).
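That pick-first lookup can be sketched as below (assuming `constructive_retention_data` holds one row per user; which flag column exists varies by run):

```r
library(dplyr)

retention_flag_candidates <- c("retained_7_14d", "retained_14d", "retained", "returned")

# Take the first candidate column actually present in the data.
flag_col <- intersect(retention_flag_candidates, names(constructive_retention_data))[1]

if (!is.na(flag_col)) {
  retention <- constructive_retention_data %>%
    group_by(test_group, platform) %>%
    summarise(rate = mean(.data[[flag_col]], na.rm = TRUE),
              n = n(), .groups = "drop")
} else {
  message("Retention: no retention flag column found")
}
```

`intersect()` preserves the order of the candidate vector, so the priority listed above is what decides which flag wins.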
Related previous reports
Previous Reference Check Reports:
Multi_check_ab_test_report
Multi Check Indicators Ticket
Multi Check Leading Indicators Report
Reference Check AB Test
Previous Check Reports:
Paste_check_leading_indicators Gitlab
Paste_check_leading_indicators
Paste Check Ticket
Methodology references: multi-check (2024), paste check (2025), edit-check references (2023). Note: in multi-check, the control also had Reference Check enabled; we referenced treatment-group-only comparisons from that report and avoided direct control deltas from that work for this study.