⚓ T301433 Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
Page Menu
Phabricator
Create Task
Maniphest
T301433
Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
Closed, Resolved
Public
PRODUCTION ERROR
Actions
Edit Task
Edit Related Tasks...
Create Subtask
Edit Parent Tasks
Edit Subtasks
Merge Duplicates In
Close As Duplicate
Edit Related Objects...
Edit Commits
Edit Mocks
Mute Notifications
Protect as security issue
Assigned To
Dbrant
Authored By
jeena
Feb 10 2022, 12:40 AM
2022-02-10 00:40:08 (UTC+0)
Tags
Wikimedia-production-error
(Feb 2022)
Wikipedia-Android-App-Backlog (Android Release - FY2023-24)
(Ready for release)
Patch-For-Review
MW-1.38-notes (1.38.0-wmf.22; 2022-02-14)
Referenced Files
F34947422: image.png
Feb 10 2022, 11:17 AM
2022-02-10 11:17:54 (UTC+0)
F34947178: Screenshot 2022-02-10 at 04.34.11.png
Feb 10 2022, 4:59 AM
2022-02-10 04:59:11 (UTC+0)
Subscribers
aaron
Aklapper
AntiCompositeNumber
cooltey
Dbrant
jeena
Krinkle
View All 14 Subscribers
Description
Error
mwversion:
1.38.0-wmf.21
reqId:
758cb992-d0ea-4cab-9eba-8ba30d48f191
Find reqId in Logstash
normalized_message
[{reqId}] {exception_url} Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
exception.trace
from /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1055)
#0 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1342): Wikimedia\Rdbms\Database->assertIsWritablePrimary()
#1 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(1293): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#2 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/Database.php(2636): Wikimedia\Rdbms\Database->query(string, string, integer)
#3 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(69): Wikimedia\Rdbms\Database->update(string, array, string, string)
#4 /srv/mediawiki/php-1.38.0-wmf.21/includes/libs/rdbms/database/DBConnRef.php(381): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(600): Wikimedia\Rdbms\DBConnRef->update(string, array, array, string)
#6 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(301): MediaWiki\Deferred\LinksUpdate\LinksUpdate->updateLinksTimestamp()
#7 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/LinksUpdate/LinksUpdate.php(251): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doIncrementalUpdate()
#8 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/DeferredUpdates.php(536): MediaWiki\Deferred\LinksUpdate\LinksUpdate->doUpdate()
#9 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/RefreshSecondaryDataUpdate.php(103): DeferredUpdates::attemptUpdate(MediaWiki\Deferred\LinksUpdate\LinksUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#10 /srv/mediawiki/php-1.38.0-wmf.21/includes/deferred/DeferredUpdates.php(536): RefreshSecondaryDataUpdate->doUpdate()
#11 /srv/mediawiki/php-1.38.0-wmf.21/includes/Storage/DerivedPageDataUpdater.php(1778): DeferredUpdates::attemptUpdate(RefreshSecondaryDataUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.38.0-wmf.21/includes/page/WikiPage.php(2211): MediaWiki\Storage\DerivedPageDataUpdater->doSecondaryDataUpdates(array)
#13 /srv/mediawiki/php-1.38.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(210): WikiPage->doSecondaryDataUpdates(array)
#14 /srv/mediawiki/php-1.38.0-wmf.21/includes/jobqueue/jobs/RefreshLinksJob.php(137): RefreshLinksJob->runForTitle(Title)
#15 /srv/mediawiki/php-1.38.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(79): RefreshLinksJob->run()
#16 /srv/mediawiki/rpc/RunSingleJob.php(76): MediaWiki\Extension\EventBus\JobExecutor->execute(array)
#17 {main}
Impact
more than 300 errors in the last 30 min
Notes
Details
Request URL
Related Changes in Gerrit:
Subject
Repo
Branch
Lines +/-
LinksTable: Cast all array keys to string
mediawiki/core
master
+84
-13
WikiPage: Cast the category values to string in updateCategoryCounts
mediawiki/core
master
+3
-0
WikiPage: Cast the category values to string in updateCategoryCounts
mediawiki/core
wmf/1.38.0-wmf.21
+3
-0
Short circut updating stats when the page is not reviewable
mediawiki/extensions/FlaggedRevs
wmf/1.38.0-wmf.20
+3
-0
Short circut updating stats when the page is not reviewable
mediawiki/extensions/FlaggedRevs
wmf/1.38.0-wmf.21
+3
-0
Short circut updating stats when the page is not reviewable
mediawiki/extensions/FlaggedRevs
master
+3
-0
Customize query in gerrit
Related Objects
Search...
Task Graph
Mentions
Status
Subtype
Assigned
Task
Resolved
Release
jeena
T300197
1.38.0-wmf.21 deployment blockers
Resolved
PRODUCTION ERROR
Dbrant
T301433
Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases.
Mentioned In
T221795: Refactor Category::refreshCounts logic to a job and simplify
T301742: Increase log level of rowsAffected > 1000 on database writes in transaction profiler to warning or error
T301506: Resetting an option to default value with api's action=options triggers unneeded delete in UserOptionsManager
T300197: 1.38.0-wmf.21 deployment blockers
Mentioned Here
T301742: Increase log level of rowsAffected > 1000 on database writes in transaction profiler to warning or error
T301506: Resetting an option to default value with api's action=options triggers unneeded delete in UserOptionsManager
T301310: CommonsMetadata extension is triggering a duplicate parse in commons
T252899: Add push notification support to Echo
Event Timeline
jeena
created this task.
Feb 10 2022, 12:40 AM
2022-02-10 00:40:08 (UTC+0)
Restricted Application
added a subscriber:
Aklapper
View Herald Transcript
Feb 10 2022, 12:40 AM
2022-02-10 00:40:08 (UTC+0)
jeena
updated the task description.
(Show Details)
Feb 10 2022, 12:40 AM
2022-02-10 00:40:39 (UTC+0)
jeena
added a parent task:
T300197: 1.38.0-wmf.21 deployment blockers
jeena
mentioned this in
T300197: 1.38.0-wmf.21 deployment blockers
Feb 10 2022, 12:45 AM
2022-02-10 00:45:15 (UTC+0)
Ladsgroup
subscribed.
Feb 10 2022, 1:02 AM
2022-02-10 01:02:52 (UTC+0)
Comment Actions
It seems enwiki has emitting read-only error quite a lot (
) but that's wmf.20 so not related to this (probably some of the maint work or increase in read has caused it).
For commons, it seems it's also going read-only for a bit as well. Strangely the next one is elwiktionary. Something is probably causing issues in wmf.21 and it needs inspection before moving forward.
AntiCompositeNumber
subscribed.
Feb 10 2022, 1:06 AM
2022-02-10 01:06:35 (UTC+0)
Comment Actions
Was noted in #wikimedia-operations yesterday, before the train rolled to group0. I've been anecdotally seeing more than usual since last week (first one in my bot's logs was Feb 3).
[16:56:30]

hmm.. DBReadOnlyError's happening more than usual. 708 in the last 15 minutes.
[16:56:38]

FIXME:
Figure out what "usual" is
[16:57:10]

Fading away now
[16:57:12]

approximately never
[16:57:27]

hehe. I see them all the time
...
[20:17:41]

!log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.21 refs T300197
AntiCompositeNumber
added a subscriber:
dancy
Feb 10 2022, 1:07 AM
2022-02-10 01:07:07 (UTC+0)
Ladsgroup
added a comment.
Feb 10 2022, 1:15 AM
2022-02-10 01:15:45 (UTC+0)
Comment Actions
for enwiki read-only (slightly unrelated). Something is writing a lot there, rows written hitting 50k/s is just too much:
I'm running actor migration there but it takes a two second break after writing 200 rows, it can't be it. I will dig into binlogs.
For other wikis (wmf.21), I'm not sure what's going on.
Ladsgroup
added a subscriber:
tstarling
Feb 10 2022, 2:27 AM
2022-02-10 02:27:37 (UTC+0)
Comment Actions
It seems CategoryMembership changes are adding a lot of pressure atm:
This probably can be moved to onTransactionorIdle (if not already) or moved to a job?
Ladsgroup
added a comment.
Feb 10 2022, 2:41 AM
2022-02-10 02:41:37 (UTC+0)
Comment Actions
For the s1 issue: If you look at
and toggle url, almost all are
/w/api.php?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options
. Is that something with DIP?
Ladsgroup
added a comment.
Feb 10 2022, 2:53 AM
2022-02-10 02:53:51 (UTC+0)
Comment Actions
binlog of s1 has a LOT of this:
DELETE
/* MediaWiki\User\UserOptionsManager::saveOptionsInternal */
FROM
`user_properties`
WHERE
up_user
AND
up_property
IN
'echo-subscriptions-push-edit-user-talk'
'echo-subscriptions-push-login-fail'
'echo-subscriptions-push-mention'
'echo-subscriptions-push-thank-you-edit'
'echo-subscriptions-push-reverted'
'echo-subscriptions-push-edit-thank'
'echo-cross-wiki-notifications'
And flaggedrevs is making a lot of writes there as well which is weird. I investigate that. But someone need to check the user options flood.
gerritbot
added a comment.
Feb 10 2022, 3:19 AM
2022-02-10 03:19:23 (UTC+0)
Comment Actions
Change 761512 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/extensions/FlaggedRevs@master] Short circut updating stats when the page is not reviewable
gerritbot
added a project:
Patch-For-Review
Feb 10 2022, 3:19 AM
2022-02-10 03:19:24 (UTC+0)
gerritbot
added a comment.
Feb 10 2022, 4:26 AM
2022-02-10 04:26:20 (UTC+0)
Comment Actions
Change 761512
merged
by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Short circut updating stats when the page is not reviewable
gerritbot
added a comment.
Feb 10 2022, 4:30 AM
2022-02-10 04:30:06 (UTC+0)
Comment Actions
Change 761408 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.21] Short circut updating stats when the page is not reviewable
gerritbot
added a comment.
Feb 10 2022, 4:30 AM
2022-02-10 04:30:52 (UTC+0)
Comment Actions
Change 761409 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.20] Short circut updating stats when the page is not reviewable
Krinkle
added subscribers:
cooltey
Dbrant
Krinkle
Edited
Feb 10 2022, 4:59 AM
2022-02-10 04:59:11 (UTC+0)
Comment Actions
In
T301433#7699669
@Ladsgroup
wrote:
For the s1 issue: If you look at
and toggle url, almost all are
/w/api.php?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options
. Is that something with DIP?
As of writing, 100% of these are debug-level messages of the form:
Transaction spent {time_ms}ms in writes, under the 3s limit
. This seems to simply be recording every single database change, no matter what, so long as it took non-zero milliseconds of time, and remained
under
the warning limit. (
Source code
). These probably don't need to be enabled in production, but maybe I'm missing something.
Having said that, it is odd that these are exclusively from
POST /w/api.php?..&errorsuselocal=1&action=options
, because I don't see anything in the Rdbms logging code that indicates a reason to not be logging e.g. all edits, logins and other just normal db changes, so there must be something special about these that I'm missing, but they're
not slow
I checked
Codesearch for errorsuselocal
and found none that concide with action=options. I checked Global Search tool for on-wiki use of errorsuselocal, and none there either.
I checked
Logstash: mediawiki dashboard
with the query
channel:DBPerformance AND normalized_message:"Transaction spent"
over the last 30 days. And then added to the query
url:"action=options"
, which yielded the same results. So in fact 100% of these are from this weird API query. All
3 million a day
and for at least 90 days, across different wikis.
I then checked
api.log
on
mwlog1002
to get some samples, and noticed they are all exactly the same:
change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1
. Although on different wikis and from different user names and IP addresses. These appear related to Echo, and match the DB binlog sample that
@Ladsgroup
posted above, which confirms that that was related and representative of the same cause:
/srv/mw-log/api.log
krinkle@mwlog1002:/srv/mw-log$ ack POST api.log | fgrep errorsuselocal=1 | fgrep action=options | head
mw1362 zhwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2
mw1412 arwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2
..
I then tapped into the raw webrequest stream to capture a few samples, hoping to find a representative referer in order to understand what might be causing it, given that a codesearch led nowhere. Perhaps the code is obfuscated or coming from something third-party.
I got
no referer
, however there was a telling user-agent instead:
kafkacat webrequest_text
krinkle at stat1004.eqiad.wmnet in ~
$ kafkacat -C -b
'kafka-jumbo1005.eqiad.wmnet'
-o
'-100000000'
-t webrequest_text
fgrep errorsuselocal
fgrep
'"POST"'
head
"http_method"
"POST"
"uri_host"
"en.wikipedia.org"
"uri_path"
"/w/api.php"
"uri_query"
"?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options"
***,
"user_agent"
"WikipediaApp/2.7.50392-r-2022-01-24 (Android ***"
***
$ kafkacat -C -b
'kafka-jumbo1005.eqiad.wmnet'
-o
'-100000000'
-t webrequest_text
fgrep errorsuselocal
fgrep
'"POST"'
head
jq -r .user_agent
WikipediaApp/2.7.50392-r-2022-01-24
Android
11
***
WikipediaApp/2.7.50382-r-2021-10-25
Android
11
***
WikipediaApp/2.7.50392-r-2022-01-24
Android
11
***
WikipediaApp/2.7.50392-r-2022-01-24
Android
***
I then checked
GitHub search errorsuselocal org:wikimedia
and noticed one result that wasn't in our Codesearch:
apps-android-wikipedia:/dataclient/Service.kt
The relevant query appears to be in
apps-android-wikipedia:/push/WikipediaFirebaseMessagingService.kt
\cc
@cooltey
@Dbrant
Can you check whether there is something wrong here that may explain why the Wikipedia for Android app is sending 3 million identical uncached API action=options DB write queries a day, almost on par with the number of page views it serves? (~7 million a day per
stats
). It seems like there is likely a bug causing the app to be repeatedly setting or changing user preferences that either are or should be the default. In particular,
change apps-android-wikipedia#2780
looks like it may be a culprit.
My vague memory of the push notification architecture is that there is a dedicated API endpoint for the apps to send a token, which the service then uses to send messages via the intermediate gateways (
wikitech
mw.org
T252899
). I'm not aware of it using the core databases or user preference system as part of its hot code path.
ReleaseTaggerBot
added a project:
MW-1.38-notes (1.38.0-wmf.22; 2022-02-14)
Feb 10 2022, 5:00 AM
2022-02-10 05:00:20 (UTC+0)
gerritbot
added a comment.
Feb 10 2022, 10:51 AM
2022-02-10 10:51:48 (UTC+0)
Comment Actions
Change 761408
merged
by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.21] Short circut updating stats when the page is not reviewable
Stashbot
added a comment.
Feb 10 2022, 10:58 AM
2022-02-10 10:58:38 (UTC+0)
Comment Actions
Mentioned in SAL (#wikimedia-operations)
[2022-02-10T10:58:37Z] Synchronized php-1.38.0-wmf.21/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761408|Short circut updating stats when the page is not reviewable (
T301433
)]] (duration: 00m 50s)
gerritbot
added a comment.
Feb 10 2022, 10:58 AM
2022-02-10 10:58:38 (UTC+0)
Comment Actions
Change 761409
merged
by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@wmf/1.38.0-wmf.20] Short circut updating stats when the page is not reviewable
ReleaseTaggerBot
edited projects, added
MW-1.38-notes (1.38.0-wmf.21; 2022-02-07)
; removed
MW-1.38-notes (1.38.0-wmf.22; 2022-02-14)
Feb 10 2022, 11:00 AM
2022-02-10 11:00:19 (UTC+0)
Stashbot
added a comment.
Feb 10 2022, 11:01 AM
2022-02-10 11:01:17 (UTC+0)
Comment Actions
Mentioned in SAL (#wikimedia-operations)
[2022-02-10T11:01:16Z] Synchronized php-1.38.0-wmf.20/extensions/FlaggedRevs/backend/FlaggedRevs.php: Backport: [[gerrit:761409|Short circut updating stats when the page is not reviewable (
T301433
)]] (duration: 00m 49s)
Ladsgroup
added a comment.
Edited
Feb 10 2022, 11:02 AM
2022-02-10 11:02:41 (UTC+0)
Comment Actions
Before and after deployment of FR fix:
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | wc -l
1516
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | grep -a -i FlaggedRevs | wc -l
190
After:
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | wc -l
1536
root@db2112:/srv/sqldata# tail db2112-bin.004905 --lines 5000 | grep -a enwiki | grep -a -i FlaggedRevs | wc -l
Note that to get the actual number of FR queries, you need to multiply it by three because the other two don't have FlaggedRevs in them, they have "Flaggable" instead.
Ladsgroup
added a comment.
Feb 10 2022, 11:10 AM
2022-02-10 11:10:12 (UTC+0)
Comment Actions
Thank you for the investigation of the other issue
@Krinkle
Awesome work!
Maintenance_bot
removed a project:
Patch-For-Review
Feb 10 2022, 11:10 AM
2022-02-10 11:10:17 (UTC+0)
Ladsgroup
added a comment.
Feb 10 2022, 11:17 AM
2022-02-10 11:17:54 (UTC+0)
Comment Actions
Ladsgroup
added a comment.
Feb 10 2022, 1:25 PM
2022-02-10 13:25:12 (UTC+0)
Comment Actions
To recap:
We do have several code paths putting pressure on primary causing it to go read-only from time to time. The first and the second just are big floods of fast write queries, the third is not massive but it's really slow.
The iOS issue:
T301433#7699716
FlaggedRevs (fixed)
CategoryMembership
We also have code paths that are putting pressure on replicas (which because of pt-heartbeat can come back and cause issues for primary)
Duplicate parses of commons:
T301310: CommonsMetadata extension is triggering a duplicate parse in commons
When it goes read-only for that specific request, it triggers a lot of exceptions (in the linked req-id, we have 40 errors) but it's still in one request so the error numbers is not very representative
I think it's okay to move forward with the train because it's not new code paths but this issue in itself is UBN and must be fixed ASAP.
Dbrant
added a comment.
Feb 10 2022, 1:33 PM
2022-02-10 13:33:41 (UTC+0)
Comment Actions
Thanks for the heads up! I do indeed see some code paths that would lead to this API call being made unnecessarily. We'll make an update to the app and get it released asap.
Pull request:
Dbrant
claimed this task.
Feb 10 2022, 1:49 PM
2022-02-10 13:49:09 (UTC+0)
Dbrant
triaged this task as
High
priority.
Dbrant
added a project:
Wikipedia-Android-App-Backlog
Ladsgroup
added subscribers:
Marostegui
LSobanski
Feb 10 2022, 2:01 PM
2022-02-10 14:01:39 (UTC+0)
dancy
unsubscribed.
Feb 10 2022, 4:07 PM
2022-02-10 16:07:31 (UTC+0)
thcipriani
raised the priority of this task from
High
to
Unbreak Now!
Feb 10 2022, 4:22 PM
2022-02-10 16:22:18 (UTC+0)
thcipriani
subscribed.
Comment Actions
(UBN! since it's a train blocker)
Zabe
subscribed.
Feb 10 2022, 4:24 PM
2022-02-10 16:24:43 (UTC+0)
LGoto
edited projects, added
Wikipedia-Android-App-Backlog (Android Release - FY2023-24)
; removed
Wikipedia-Android-App-Backlog
Feb 10 2022, 5:06 PM
2022-02-10 17:06:00 (UTC+0)
LGoto
moved this task from
Epics in Progress
to
Code Review
on the
Wikipedia-Android-App-Backlog (Android Release - FY2023-24)
board.
Dbrant
moved this task from
Code Review
to
Ready for release
on the
Wikipedia-Android-App-Backlog (Android Release - FY2023-24)
board.
Feb 10 2022, 6:14 PM
2022-02-10 18:14:30 (UTC+0)
Comment Actions
We've deployed an updated version of the app, however it will take a little while to be approved in the Play Store, and propagate to our install base. (I would estimate a few hours for approval, and then ~24 hours for an appreciable fraction of the install base to get the update and see a reduction in these API calls.) Sorry for any inconvenience.
Krinkle
added a comment.
Feb 10 2022, 6:15 PM
2022-02-10 18:15:26 (UTC+0)
Comment Actions
Thanks for the quick action
@Dbrant
, much appreciated.
jeena
added a comment.
Feb 10 2022, 7:29 PM
2022-02-10 19:29:40 (UTC+0)
Comment Actions
In
T301433#7701819
@Dbrant
wrote:
We've deployed an updated version of the app, however it will take a little while to be approved in the Play Store, and propagate to our install base. (I would estimate a few hours for approval, and then ~24 hours for an appreciable fraction of the install base to get the update and see a reduction in these API calls.) Sorry for any inconvenience.
I'm taking this to mean that we can proceed to deploy wmf.21 to all wikis today. Please let me know if there are any objections.
Umherirrender
mentioned this in
T301506: Resetting an option to default value with api's action=options triggers unneeded delete in UserOptionsManager
Feb 10 2022, 7:42 PM
2022-02-10 19:42:28 (UTC+0)
Umherirrender
subscribed.
Feb 10 2022, 7:46 PM
2022-02-10 19:46:57 (UTC+0)
Comment Actions
In
T301433#7699716
@Krinkle
wrote:
In
T301433#7699669
@Ladsgroup
wrote:
For the s1 issue: If you look at
and toggle url, almost all are
/w/api.php?format=json&formatversion=2&errorformat=html&errorsuselocal=1&action=options
. Is that something with DIP?
As of writing, 100% of these are debug-level messages of the form:
Transaction spent {time_ms}ms in writes, under the 3s limit
. This seems to simply be recording every single database change, no matter what, so long as it took non-zero milliseconds of time, and remained
under
the warning limit. (
Source code
). These probably don't need to be enabled in production, but maybe I'm missing something.
Having said that, it is odd that these are exclusively from
POST /w/api.php?..&errorsuselocal=1&action=options
, because I don't see anything in the Rdbms logging code that indicates a reason to not be logging e.g. all edits, logins and other just normal db changes, so there must be something special about these that I'm missing, but they're
not slow
I checked
Codesearch for errorsuselocal
and found none that concide with action=options. I checked Global Search tool for on-wiki use of errorsuselocal, and none there either.
I checked
Logstash: mediawiki dashboard
with the query
channel:DBPerformance AND normalized_message:"Transaction spent"
over the last 30 days. And then added to the query
url:"action=options"
, which yielded the same results. So in fact 100% of these are from this weird API query. All
3 million a day
and for at least 90 days, across different wikis.
I then checked
api.log
on
mwlog1002
to get some samples, and noticed they are all exactly the same:
change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1
. Although on different wikis and from different user names and IP addresses. These appear related to Echo, and match the DB binlog sample that
@Ladsgroup
posted above, which confirms that that was related and representative of the same cause:
/srv/mw-log/api.log
krinkle@mwlog1002:/srv/mw-log$ ack POST api.log | fgrep errorsuselocal=1 | fgrep action=options | head
mw1362 zhwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2
mw1412 arwiki 1.38.0-wmf.20 api INFO: API POST ******* ******* action=options format=json errorformat=html errorsuselocal=1 change=echo-subscriptions-push-edit-user-talk%3D1%7Cecho-subscriptions-push-login-fail%3D1%7Cecho-subscriptions-push-mention%3D1%7Cecho-subscriptions-push-thank-you-edit%3D1%7Cecho-subscriptions-push-reverted%3D1%7Cecho-subscriptions-push-edit-thank%3D1%7Cecho-cross-wiki-notifications%3D1 token=[redacted] formatversion=2
..
There is no need to trigger deletes if the values are already the default values, see
T301506
Ladsgroup
added a comment.
Feb 11 2022, 5:21 PM
2022-02-11 17:21:31 (UTC+0)
Comment Actions
The commons read-only hasn't improved yet. I think this has something to do the links update refactor:
Krinkle
moved this task from
Untriaged
to
Feb 2022
on the
Wikimedia-production-error
board.
Feb 11 2022, 6:50 PM
2022-02-11 18:50:58 (UTC+0)
Ladsgroup
added a comment.
Feb 12 2022, 11:22 PM
2022-02-12 23:22:13 (UTC+0)
Comment Actions
@Marostegui
can it be semi-sync? I found these:
| | wikiuser | 10.64.16.151:xxxx | commonswiki | Query | 0 | Waiting for semi-sync ACK from slave | UPDATE /* UserEditCountUpdate::doUpdate */ `user` SET user_editcount=user_editcount+1 WHERE user_i | 0.000 |
That would explain sudden read-only errors we see. Maybe we need to tune the parameters.
Marostegui
added a comment.
Edited
Feb 13 2022, 5:53 AM
2022-02-13 05:53:29 (UTC+0)
Comment Actions
That's probably a consequence of the replication lag we are seeing often on the replicas.
Keep in mind that semi-sync config hasn't been changed in years (and we haven't had a master swap in months for commons).
Do you have a timestamp for that event? We can try to see if it happens at the same time one of the hosts had replication lag.
On Friday I saw pretty often 10 seconds lag in lots of slaves, which I would assume comes from a burst of writes on the master, but I wasn't able to identify what kind of writes they were.
gerritbot
added a comment.
Feb 13 2022, 6:52 PM
2022-02-13 18:52:11 (UTC+0)
Comment Actions
Change 762128 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@master] WikiPage: Cast the category values to string in updateCategoryCounts
gerritbot
added a project:
Patch-For-Review
Feb 13 2022, 6:52 PM
2022-02-13 18:52:12 (UTC+0)
Ladsgroup
added a subscriber:
aaron
Feb 13 2022, 6:57 PM
2022-02-13 18:57:34 (UTC+0)
Comment Actions
I found what's causing this issue, the bad news is that it's been causing data corruption.
This happens when a category that is only number is being added or removed. e.g. In this case I found, it was Category:1941. You can see that the query becomes this
and
UPDATE /* WikiPage::updateCategoryCounts */ `category` SET cat_pages = cat_pages - 1,cat_files = cat_files - 1 WHERE cat_title = 1941
Mysql interprets cat_title = 1941 (not cat_title = '1941') as any category starting with 1941 meaning updating 2000 other categories as well. This is because addQuote() doesn't cast into string if the value is an int.
Why int is being sent to WikiPage::updateCategoryCounts(), looking at the usages, all are coming from
CategoryLinksTable
class which has had a refactor recently. So cc
@tstarling
and
@aaron
for finding the underlying problem.
I tested this patch in mwdebug and confirm that it fixes the issue.
gerritbot
added a comment.
Feb 13 2022, 7:00 PM
2022-02-13 19:00:15 (UTC+0)
Comment Actions
Change 761755 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):
[mediawiki/core@wmf/1.38.0-wmf.21] WikiPage: Cast the category values to string in updateCategoryCounts
Ladsgroup
added a comment.
Feb 13 2022, 7:09 PM
2022-02-13 19:09:11 (UTC+0)
Comment Actions
Other cases happening because of this.
gerritbot
added a comment.
Feb 13 2022, 7:16 PM
2022-02-13 19:16:56 (UTC+0)
Comment Actions
Change 761755
merged
by jenkins-bot:
[mediawiki/core@wmf/1.38.0-wmf.21] WikiPage: Cast the category values to string in updateCategoryCounts
gerritbot
added a comment.
Feb 13 2022, 7:18 PM
2022-02-13 19:18:23 (UTC+0)
Comment Actions
Change 762128
merged
by jenkins-bot:
[mediawiki/core@master] WikiPage: Cast the category values to string in updateCategoryCounts
Umherirrender
added a comment.
Feb 13 2022, 7:20 PM
2022-02-13 19:20:35 (UTC+0)
Comment Actions
There was some
strval
before the refactor from
, but no tests for it.
The new patch set is deeper in the stack and close before the update happen, that sounds better.
Stashbot
added a comment.
Feb 13 2022, 7:26 PM
2022-02-13 19:26:37 (UTC+0)
Comment Actions
Mentioned in SAL (#wikimedia-operations)
[2022-02-13T19:26:37Z] Synchronized php-1.38.0-wmf.21/includes/page/WikiPage.php: Backport: [[gerrit:761755|WikiPage: Cast the category values to string in updateCategoryCounts (
T301433
)]] (duration: 00m 49s)
Ladsgroup
added a comment.
Feb 13 2022, 7:37 PM
2022-02-13 19:37:12 (UTC+0)
Comment Actions
Further work:
Add regression test
Find the underlying cause and possibly check for similar issues in other places
Increase the level of rowsAffected > 1000 warnings to error so it would show up in trains.
ReleaseTaggerBot
edited projects, added
MW-1.38-notes (1.38.0-wmf.22; 2022-02-14)
; removed
MW-1.38-notes (1.38.0-wmf.21; 2022-02-07)
Feb 13 2022, 8:00 PM
2022-02-13 20:00:17 (UTC+0)
tstarling
added a comment.
Edited
Feb 14 2022, 3:59 AM
2022-02-14 03:59:55 (UTC+0)
Comment Actions
The underlying cause is PHP's array key storage model, which doesn't preserve types but attempts to reconstruct them on access:
php > $categories = [];
php > $categories['1941'] = true;
php > foreach ( $categories as $name => $value ) var_dump( $name );
int(1941)
ParserOutput::getCategories() returns the category names in the array keys, so every caller has to cast to string to recover the correct type.
I have a CategoryLinksTable patch which redundantly fixes the bug, which I will upload as soon as I figure out how to write a test for it.
gerritbot
added a comment.
Feb 14 2022, 4:59 AM
2022-02-14 04:59:48 (UTC+0)
Comment Actions
Change 762141 had a related patch set uploaded (by Tim Starling; author: Tim Starling):
[mediawiki/core@master] LinksTable: Cast all array keys to string
gerritbot
added a comment.
Feb 14 2022, 11:09 PM
2022-02-14 23:09:14 (UTC+0)
Comment Actions
Change 762141
merged
by jenkins-bot:
[mediawiki/core@master] LinksTable: Cast all array keys to string
Ladsgroup
closed this task as
Resolved
Feb 14 2022, 11:14 PM
2022-02-14 23:14:09 (UTC+0)
Ladsgroup
mentioned this in
T301742: Increase log level of rowsAffected > 1000 on database writes in transaction profiler to warning or error
Comment Actions
We haven't seen any read-only errors since the deploy. I filed
T301742: Increase log level of rowsAffected > 1000 on database writes in transaction profiler to warning or error
and the rest seems done now.
Krinkle
mentioned this in
T221795: Refactor Category::refreshCounts logic to a job and simplify
Mar 22 2025, 9:28 AM
2025-03-22 09:28:46 (UTC+0)
Log In to Comment
Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct.
Wikimedia Foundation
Code of Conduct
Disclaimer
CC-BY-SA
GPL
Credits