⚓ T387130 CVE-2025-32699: Potential javascript injection attack enabled by Unicode normalization in Action API
Status: Closed, Resolved · Public · Security
Assigned To: cscott
Authored By: zoe · 2025-02-24 15:30 (UTC)
Tags
Security-Team (In Progress)
Security
MediaWiki-Action-API (Unsorted)
Vuln-XSS
Vuln-Inject (Tracked)
SecTeam-Processed (Completed)
Content-Transform-Team (Work In Progress) (Backlog)
Essential-Work
MW-Interfaces-Team (Incoming (Needs Triage))
Patch-For-Review
Referenced Files
F59025694: 02-T387130-2.patch · 2025-04-09 14:30 (UTC)
F59025660: 02-T387130.patch · 2025-04-09 14:30 (UTC)
F58631089: REL1_39-0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch · 2025-03-07 03:39 (UTC)
F58626453: REL1_42-0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch · 2025-03-06 22:02 (UTC)
F58625014: REL1_43-0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch · 2025-03-06 20:18 (UTC)
F58513634: 0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch · 2025-02-27 18:32 (UTC)
F58513114: 0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch · 2025-02-27 17:09 (UTC)
F58511482: 0001-Avoid-attacks-in-API-related-to-unicode-NFC-and-deco.patch · 2025-02-27 11:07 (UTC)
(17 files total)
Subscribers: ABreault-WMF, acooper, Aklapper, Bawolff, cscott, dchan, DLynch (20 total)
Description
TL;DR
The MediaWiki Action API converts output to Unicode Normalization Form C. Unfortunately, for HTML strings this is unsafe, because the sequence '>' + U+0338 gets replaced by U+226F, breaking the tag end and potentially allowing injection attacks.
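The core behaviour can be checked directly in any JavaScript console. A minimal sketch (the string literal here is illustrative, not actual API output):

```javascript
// '>' (U+003E) followed by U+0338 composes to '≯' (U+226F) under NFC,
// destroying the bracket that closes the tag.
const raw = '<p id="mwAg">\u0338 onmouseover="alert(42)" >content</p>';
const nfc = raw.normalize( 'NFC' );
console.log( nfc.charCodeAt( 12 ).toString( 16 ) ); // '226f' -- the tag no longer closes here
console.log( nfc.length === raw.length - 1 );       // true -- two code units became one
```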
Steps to reproduce

1. Visit any wiki page, such that mw.Api() has loaded
2. Open the console
3. Perform any mw.Api call that generates HTML, such that you can make the first character inside a tag be U+0338 COMBINING LONG SOLIDUS OVERLAY. In other words, you want the API to send you some HTML that contains '>' followed by U+0338.

For example, call { action: 'visualeditor', paction: 'parsefragment' } as follows:
    const COMBINING_LONG_SOLIDUS = '\u0338';
    new mw.Api().post( {
        action: 'visualeditor',
        paction: 'parsefragment',
        page: 'Test',
        wikitext: COMBINING_LONG_SOLIDUS + ' onmouseover="alert(42)" >content'
    } ).done( ( data ) => {
        const content = data.visualeditor.content;
        document.body.innerHTML = content;
        console.log( 'Content:', content );
    } ).fail( ( err ) => console.error( err ) );
This can also be reproduced without JavaScript – note the missing '>' after id="mwAg":

    curl -s -d action=visualeditor -d paction=parsefragment -d page=Test \
      -d wikitext=$'\u0338 onmouseover="alert(42)" >content ' \
      -d format=json https://en.wikipedia.org/w/api.php \
      | jq -r .visualeditor.content | hexdump -C

    00000000  3c 70 20 69 64 3d 22 6d 77 41 67 22 e2 89 af 20  |<p id="mwAg"... |
    00000010  6f 6e 6d 6f 75 73 65 6f 76 65 72 3d 22 61 6c 65  |onmouseover="ale|
    00000020  72 74 28 34 32 29 22 20 3e 63 6f 6e 74 65 6e 74  |rt(42)" >content|
    00000030  20 3c 2f 70 3e 0a                                | </p>.|
    00000036
Compare this with a normal slash in the input:

    curl -s -d action=visualeditor -d paction=parsefragment -d page=Test \
      -d wikitext='/ onmouseover="alert(42)" >content ' \
      -d format=json https://en.wikipedia.org/w/api.php \
      | jq -r .visualeditor.content | hexdump -C

    00000000  3c 70 20 69 64 3d 22 6d 77 41 67 22 3e 2f 20 6f  |<p id="mwAg">/ o|
    00000010  6e 6d 6f 75 73 65 6f 76 65 72 3d 22 61 6c 65 72  |nmouseover="aler|
    00000020  74 28 34 32 29 22 20 3e 63 6f 6e 74 65 6e 74 20  |t(42)" >content |
    00000030  3c 2f 70 3e 0a                                   |</p>.|
    00000035
Actual behaviour

The sequence '>' + U+0338 gets replaced with the combined character U+226F ≯ NOT GREATER THAN. This is due to applying Unicode Normalization Form C. But (surprisingly!) that breaks the HTML tag, which potentially allows a JavaScript injection attack, for instance:

    content === '<p id="mwAg"\u226f onmouseover="alert(42)" >content </p>'

Expected behaviour

The HTML arrives with the sequence '>' + U+0338 intact:

    content === '<p id="mwAg">\u0338 onmouseover="alert(42)" >content </p>'

Note that this is a regular '>' symbol closing the HTML tag, followed by a U+0338 ◌̸ COMBINING LONG SOLIDUS OVERLAY, such that the contents of the <p> tag are "\u0338 onmouseover="alert(42)" >content".
Debugging note

Both expected and actual output may look similar or identical when rendered in the console:

    content   # actual
    content   # expected

The best way to see for sure what's there is to escape non-ASCII characters with a function:

    function showUnicode( text ) {
        return text.replace(
            /[^\x00-\x7F]/g,
            ( ch ) => '\\u' + ch.charCodeAt( 0 ).toString( 16 ).padStart( 4, '0' )
        );
    }
    console.log( showUnicode( content ) );
    // <p id="mwAg"\u226f onmouseover="alert(42)" >content </p>   (actual)
    // <p id="mwAg">\u0338 onmouseover="alert(42)" >content </p>  (expected)
Cause

The cause was identified by @dchan in the related ticket. The function MediaWiki::Api::ApiResult::validateValue is not just catching invalid UTF-8. It is also applying Unicode Normalization Form C. Unfortunately, as we have seen, it is unsafe to do this on HTML strings if they might contain '>' + U+0338. The call chain is:

    MediaWiki::Api::ApiResult::addValue
    → MediaWiki::Api::ApiResult::validateValue
    → MediaWiki::Language::normalize
    → UtfNormal::Validator::cleanUp
    → normalizer_normalize( $string, Normalizer::FORM_C )
Details

Risk Rating: High
Author Affiliation: WMF Product

Related Changes in Gerrit (Subject · Repo · Branch · Lines +/-):

Replace isolated combining characters · utfnormal · master · +134 -14
SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization · mediawiki/core · master · +116 -9
SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization · mediawiki/core · REL1_42 · +115 -10
SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization · mediawiki/core · REL1_43 · +116 -9
SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization · mediawiki/core · REL1_39 · +47 -10
Update wikimedia/parsoid to 0.16.5 · mediawiki/vendor · REL1_39 · +81 -64
Update wikimedia/parsoid to 0.19.2 · mediawiki/vendor · REL1_42 · +37 -47
Update wikimedia/parsoid to 0.20.2 · mediawiki/vendor · REL1_43 · +32 -30
Related Objects

T382316 (Resolved): Release MediaWiki 1.39.12/1.42.6/1.43.1
Restricted Task (Resolved)
T387130 (Resolved, Security, assigned to cscott): CVE-2025-32699: Potential javascript injection attack enabled by Unicode normalization in Action API
Mentioned In
T382756: Error When Editing Pages with Specific Unicode Character in Visual Editor
Mentioned Here
T382316: Release MediaWiki 1.39.12/1.42.6/1.43.1
T354361: HtmlHelper::modifyElements(…, $html5format = false) incorrectly encodes HTML entities
T363764: Refactor dependency injection (DI) in OutputTransform stages
T343994: OutputPage::setPageTitle() should not accept Message objects, introduce OutputPage::setPageTitleMsg()
T266140: HTML entity replaced by the Unicode character in an edit
T346197: Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at end of string (3 byte sequence)
T17261: Trimmed multibyte characters result in invalid XML
T18262: File metadata containing invalid characters produce bad-formed XML
T324431: Parsoid: displaytitle HTML now appearing in
Event Timeline
There are a very large number of changes, so older changes are hidden.
ihurbain added a comment · 2025-02-27 09:29 (UTC)
@cscott I think you haven't attached the OutputTransform patches correctly; here I can only see their file names.
dchan added a comment · 2025-02-27 10:26 (UTC)
In T387130#10585557, @cscott wrote:

> Here is a patch for 1b from above (T387130#10584304). This patches wikimedia/utfnormal to prefix isolated combining characters, using the unicode database [...]
> 0004-Replace-isolated-combining-characters.patch (134 KB)
>
>     $canPrecedeCombining = [ ... ];
>     [...]
>     public function testU0338() {
>         $text = "\u{0338}<\u{0338}>\u{0338}";
>         $expect = "\u{25CC}\u{0338}\u{226E}\u{226F}";
>         $this->assertEquals( bin2hex( $expect ), bin2hex( Validator::cleanUp( $text ) ) );
>     }
Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.

Also that way I don't have to fix anything in ApiVisualEditor.php 😁
Then the correct test would be:

    $text = "\u{0338}<\u{0338}>\u{0338}";
    $expect = "\u{25CC}\u{0338}<\u{25CC}\u{0338}>\u{25CC}\u{0338}";
Bawolff added a comment (edited) · 2025-02-27 10:28 (UTC)
Yes, I agree that would be better. What I don't know is why we chose to normalize HTML (or wikitext) to NFC there. Is that a step we actually need?
I looked it up. It was to fix T17261 and T18262 (r45749). That said, it does seem like it only really needs valid Unicode, not normalized Unicode.
ihurbain added a comment (edited) · 2025-02-27 10:58 (UTC)
Nagging at the back of my mind: could there be a potential attack vector around source ranges? We have things like T346197 (and related "bad UTF-8" issues) that hint that we occasionally try to access character ranges that were not the ones we actually wanted; there might be a way to craft things around these?
I think we might need a patch on PHPUtils::safeSubstr in Parsoid to avoid returning a string starting with \u0338? (which we do right now):

    $ php run.php shell
    > use Wikimedia\Parsoid\Utils\PHPUtils;
    > PHPUtils::safeSubstr("\u{0338}aaa", 0, 5);
    = "̸aaa"
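The concern can be sketched in JavaScript (a hypothetical guard, not the actual Parsoid code; `\p{M}` matches all combining marks):

```javascript
// A substring that starts at a bad offset can begin with a combining mark,
// e.g. "\u0338aaa" -- exactly the string safeSubstr returns above.
// A defensive fixup would strip (or dotted-circle-prefix) leading marks:
const stripLeadingCombining = ( s ) => s.replace( /^\p{M}+/u, '' );
console.log( stripLeadingCombining( '\u0338aaa' ) ); // 'aaa'
```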
dchan added a comment · 2025-02-27 10:59 (UTC)
In T387130#10585603, @Bawolff wrote:

> The requirements are:
> - combining slash is sometimes valid, so we cant outright ban it
> - < followed by combining slash will almost always be malicious in output because most keyboards output the precomposed form and we also run NFC on all user input
> - we cant generically replace combining slash with entity at the normalization stage as we dont know if the output is html or not.

Yes, very clearly put. However, if we have Unicode data available, we could insert U+25CC '◌' inside any sequence where '>' is followed by a combining character. I think that's reasonable for a function called cleanUp.

> What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious.

That's an ingenious idea. Maybe we should do it if we end up unconvinced that our coverage is good enough?
Bawolff added a comment · 2025-02-27 11:07 (UTC)
In T387130#10585603, @Bawolff wrote:

> Thinking about this, the requirements are:
> - combining slash is sometimes valid, so we cant outright ban it
> - < followed by combining slash will almost always be malicious in output because most keyboards output the precomposed form and we also run NFC on all user input
> - we cant generically replace combining slash with entity at the normalization stage as we dont know if the output is html or not.
>
> What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious. At this point we throw an exception or maybe go back to the unnormalized string, replace all combining slash with unicode replacement and try again (its ok to be a little lossy here since we assume this code path only happens during an attack). Thoughts?
>
> If we do strip these characters, i think it is less confusing to the user to replace them with unicode replacement character than to just silently delete.

Attempt at implementing this idea, although maybe it belongs as a method of UTFValidator instead of in API. I only covered ≯ as I don't think ≮ is a security risk, but maybe it would make more sense to cover both just in case.

0001-Avoid-attacks-in-API-related-to-unicode-NFC-and-deco.patch (2 KB)
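The count-compare-and-retry idea can be sketched as follows (illustrative JavaScript; the attached patch is PHP and its details may differ):

```javascript
function safeNormalize( s ) {
	const count = ( str ) => ( str.match( /\u226f/g ) || [] ).length;
	const norm = s.normalize( 'NFC' );
	// If NFC manufactured new precomposed '≯' signs, assume an attack:
	// replace every combining solidus with U+FFFD and normalize again.
	if ( count( norm ) !== count( s ) ) {
		return s.replace( /\u0338/g, '\uFFFD' ).normalize( 'NFC' );
	}
	return norm;
}
console.log( safeNormalize( '>\u0338 x' ).startsWith( '>\uFFFD' ) ); // true
console.log( safeNormalize( 'cafe\u0301' ) ); // 'café' -- benign input unaffected
```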
zoe added a comment · 2025-02-27 13:10 (UTC)
> What if on any output normalization (so excluding normalization of user input in WebRequest), we first count the number of precomposed "not greater than" signs, normalize, and then count again. If the number changes we know an attack is happening since we assume at this point the decomposed not greater than is always malicious.

Headline: I think this would work.

We are relying on the properties of the characters, not the NFC algorithm itself. If we can find something that composes with U+003E '>' and which comes earlier in the canonical ordering algorithm than U+0338 ◌̸, then we could create a string which normalises with fewer copies of ≯ and bypass such a check.

I'm having trouble finding details of the normalization algorithm, but experimentally we can gain confidence:

    [ ...new Array( 0xffff ) ]
        .map( ( _, i ) => ( '\u226f' + String.fromCodePoint( i ) ).normalize( 'NFC' ) )
        .filter( ( s ) => s.codePointAt( 0 ) !== 0x226f )
    // []
dchan added a comment · 2025-02-27 14:05 (UTC)
In T387130#10586700, @zoe wrote:

> We are relying on the properties of the characters, not the NFC algorithm itself. If we can find something that composes with U+003E '>' and which comes earlier in the canonical ordering algorithm than U+0338 ◌̸, then we could create a string which normalises with fewer copies of ≯ and bypass such a check.
>
> I'm having trouble finding details of the normalization algorithm, but experimentally we can gain confidence:
>
>     [ ...new Array( 0xffff ) ]
>         .map( ( _, i ) => ( '\u226f' + String.fromCodePoint( i ) ).normalize( 'NFC' ) )
>         .filter( ( s ) => s.codePointAt( 0 ) !== 0x226f )
>     // []

Oh yes, that's an important thing to consider. You're right that we're safe: the 4th field of UnicodeData.txt shows U+0338 has Canonical Combining Class 1, which is the lowest possible for a combining character. Therefore no combining character can be moved before U+0338 by the Canonical Ordering Algorithm.

    0338;COMBINING LONG SOLIDUS OVERLAY;Mn;1;NSM;;;;;N;NON-SPACING LONG SLASH OVERLAY;;;;
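The combining-class argument can also be probed from a console: because U+0338 has the minimum combining class (1), canonical ordering moves it in front of other marks rather than the reverse. A quick check, assuming an ICU-backed `String.prototype.normalize`:

```javascript
// U+0301 COMBINING ACUTE ACCENT has ccc=230, U+0338 has ccc=1, so canonical
// ordering sorts U+0338 first and it still reaches the '>' and composes:
const out = '>\u0301\u0338'.normalize( 'NFC' );
console.log( out.codePointAt( 0 ).toString( 16 ) ); // '226f'
```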
cscott added a comment · 2025-02-27 15:35 (UTC)
In T387130#10586404, @dchan wrote:

> In T387130#10585557, @cscott wrote:
>
>> Here is a patch for 1b from above (T387130#10584304). This patches wikimedia/utfnormal to prefix isolated combining characters, using the unicode database [...]
>> 0004-Replace-isolated-combining-characters.patch (134 KB)
>>
>>     $canPrecedeCombining = [ ... ];
>>     [...]
>>     public function testU0338() {
>>         $text = "\u{0338}<\u{0338}>\u{0338}";
>>         $expect = "\u{25CC}\u{0338}\u{226E}\u{226F}";
>>         $this->assertEquals( bin2hex( $expect ), bin2hex( Validator::cleanUp( $text ) ) );
>>     }
>
> Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.

This code applies *after* NFC normalization has been done. So '<' and '>' will never appear as a preceding character. That's not the point of this code -- the point of this code is to eliminate "hanging" combining characters that are then time bombs that cause trouble when they are pasted inside an HTML tag.
cscott added a comment (edited) · 2025-02-27 17:09 (UTC)
I am (currently) proposing the following two patches:

To mediawiki-core (updated to include Tidy/Html/Message/OutputTransform):
0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch (15 KB)

To Parsoid (unchanged from T387130#10584604):
0001-Entity-escape-U-0338-where-needed-to-make-HTML-outpu.patch (8 KB)

The Parsoid patch will have to be deployed to prod as a patch to the vendor directory.

All other patches are additional hardenings which are nice-to-have but shouldn't be on the critical path here. In particular, I think hardening UtfNormal\Validator::cleanUp() is worthwhile, but it isn't necessary to stop the injection attack. I have a number of other patches in my tree which shift code to using the Html::* helper classes instead of string concatenation, which is again helpful but unnecessary given a final postprocessing pass to entity-escape all remaining U+0338, which is what the above implements.

I also considered a patch to OutputPage::output() to do a final fail-safe against U+0338, but that also doesn't seem strictly necessary, as all the attack vectors go through the Action API (which is where NFC normalization is performed), not direct HTML output (which is what OutputPage::output() does).
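The final postprocessing pass described above, entity-escaping any remaining U+0338 in emitted HTML, can be sketched like this (illustrative JavaScript; the actual patches are PHP):

```javascript
// Once U+0338 is an entity, NFC has nothing to compose with a preceding '>':
const escapeSolidusOverlay = ( html ) => html.replace( /\u0338/g, '&#x338;' );
const safe = escapeSolidusOverlay( '<p id="mwAg">\u0338 onmouseover="alert(42)" >content</p>' );
console.log( safe.normalize( 'NFC' ) === safe ); // true -- the tag still closes with a real '>'
```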
sbassett added a subscriber: gerritbot · 2025-02-27 17:18 (UTC)
In T387130#10588179, @cscott wrote:

> The Parsoid patch will have to be deployed to prod as a patch to the vendor directory.
> All other patches are additional hardenings which are nice-to-have but shouldn't be on the critical path here.

Ok, once we have all of the relevant patches completed and code-reviewed, we should categorize them as "this patch needs a discrete security deployment to Wikimedia production to fix the active security issues there" and "this patch is code-hardening". The latter should all be able to go through gerrit. And of course Parsoid (and similar) patches can go through gerrit, where we don't have a defined, discrete Wikimedia production security deployment process.
ssastry added a comment · 2025-02-27 17:45 (UTC)
In T387130#10588179, @cscott wrote:

> I am (currently) proposing the following two patches:
> To mediawiki-core (updated to include Tidy/Html/Message/OutputTransform):
> 0001-Ensure-emitted-HTML-is-safe-against-Unicode-NFC-norm.patch (15 KB)

My IDE shows that there is a typo in this patch in Message: "escapeCombiningChars" instead of "escapeCombiningChar". Is it easy to update the Message tests to cover the format* functions you updated in this patch?
dchan added a comment · 2025-02-27 18:00 (UTC)
In T387130#10587431, @cscott wrote:

> In T387130#10586404, @dchan wrote:
>
>> Please can we remove '<' and '>' from $canPrecedeCombining? That would render HTML invulnerable, without breaking any combining sequence except ">\x{0338}" and "<\x{0338}" — both of which have precomposed alternatives and are bad things to be floating around our ecosystem.
>
> This code applies *after* NFC normalization has been done. So '<' and '>' will never appear as a preceding character. That's not the point of this code -- the point of this code is to eliminate "hanging" combining characters that are then time bombs that cause trouble when they are pasted inside an HTML tag.

Oh sorry, I missed that. Then shouldn't we fix U+0338 *before* NFC normalization?

    $string = self::prependIsolatedCombining( $string );
    $norm = normalizer_normalize( $string, Normalizer::FORM_C );
    ...
    // i.e. instead of:
    return self::prependIsolatedCombining( self::NFC( $string ) );
    // do:
    return self::NFC( self::prependIsolatedCombining( $string ) );
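The ordering point can be sketched with a toy `prependIsolatedCombining` (a hypothetical simplification: the real patch consults the full Unicode database, not just `<`/`>` followed by U+0338):

```javascript
// Insert U+25CC between '<' or '>' and a following combining solidus.
const prependIsolated = ( s ) => s.replace( /([<>])\u0338/g, '$1\u25CC\u0338' );

// After NFC it is too late: '>'+U+0338 has already become '≯'.
console.log( prependIsolated( '>\u0338'.normalize( 'NFC' ) ) ); // '≯'
// Before NFC the dotted circle blocks composition:
console.log( prependIsolated( '>\u0338' ).normalize( 'NFC' ) ); // '>◌̸'
```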
I think we need it, because otherwise MediaWiki::Request::WebRequest::getValues breaks valid HTML / wikitext. For example, right now (even with the above patches), if I use VisualEditor to save a source page containing "FOO\x{0338}BAR" then MediaWiki::Request::WebRequest::getValues mangles it into "FOO