CSS Speech Module Level 1
Editor’s Draft, 22 December 2025
Feedback: CSSWG Issues Repository
Editors:
Léonie Watson (TetraLogical)
Elika J. Etemad / fantasai (Apple)
Former Editors:
Daniel Weck (DAISY Consortium)
Claudio Santambrogio (Opera Software)
Dave Raggett (W3C / Canon)
Copyright © World Wide Web Consortium (W3C). W3C liability, trademark, and permissive document license rules apply.
Abstract
The Speech module defines aural CSS properties that enable authors to declaratively control the rendering of documents via speech synthesis and optional audio cues. This standard was developed in cooperation with the Voice Browser Activity.
CSS is a language for describing the rendering of structured documents (such as HTML and XML) on screen, on paper, etc.
Status of this document
This is a public copy of the editors’ draft.
It is provided for discussion only and may change at any moment.
Its publication here does not imply endorsement of its contents by W3C.
Don’t cite this document other than as work in progress.
Please send feedback by filing issues in GitHub (preferred), including the spec code “css-speech” in the title, like this: “[css-speech] …summary of comment…”. All issues and comments are archived. Alternately, feedback can be sent to the (archived) public mailing list www-style@w3.org.
This document is governed by the 18 August 2025 W3C Process Document.
The following features are at-risk, and may be dropped during the CR period: voice-balance, voice-duration, voice-pitch, voice-range, and voice-stress.
“At-risk” is a W3C Process term-of-art, and does not necessarily imply that the feature is in danger of being dropped or delayed. It means that the WG believes the feature may have difficulty being interoperably implemented in a timely manner, and marking it as such allows the WG to drop the feature if necessary when transitioning to the Proposed Rec stage, without having to publish a new Candidate Rec without the feature first.
1. Introduction, design goals
This section is non-normative.
The aural presentation of information
is commonly used by people who are
blind, visually-impaired, or otherwise print-disabled.
For instance,
“screen readers” allow users to interact with visual interfaces
that would otherwise be inaccessible to them.
There are also circumstances in which listening to content (as opposed to reading it) is preferred, or sometimes even required, irrespective of a person’s physical ability to access information. For instance: playing an e-book whilst driving a vehicle, learning how to manipulate industrial and medical devices, interacting with home entertainment systems, or teaching young children how to read.
The CSS properties defined in this Speech module
enable authors to declaratively control the presentation of a document
in the aural dimension.
The aural rendering of a document combines speech synthesis
(also known as “TTS”, the acronym for “Text to Speech”)
and auditory icons
(which are referred-to as “audio cues” in this specification).
The CSS Speech properties provide the ability
to control speech pitch and rate, sound levels, TTS voices, etc.
These stylesheet properties can be used together
with visual properties (mixed media),
or as a complete aural alternative to a visual presentation.
2. Background information, CSS 2.1
This section is non-normative.
The CSS Speech module is a re-work of the informative CSS2.1 Aural appendix, within which the aural media type was described, but also deprecated (in favor of the speech media type, which has now also been deprecated). Although the [CSS2] specification reserved the speech media type, it didn’t actually define the corresponding properties.
The Speech module describes the CSS properties that apply to speech output,
and defines a new “box” model specifically for the aural dimension.
Content creators can include CSS properties for user agents with text-to-speech synthesis capabilities for any media type, though generally they will only make sense for all and screen. These styles are simply ignored by user agents that do not support the Speech module.
3. Relationship with SSML
This section is non-normative.
Some of the features in this specification are conceptually similar to
functionality described in the Speech Synthesis Markup Language (SSML) Version 1.1
[SSML].
However, the specificities of the CSS model mean
that compatibility with SSML in terms of syntax and/or semantics
is only partially achievable.
The definition of each property in the Speech module
includes informative statements, wherever necessary,
to clarify their relationship with similar functionality from SSML.
3.1. Value Definitions
This specification follows the CSS property definition conventions from [CSS2] using the value definition syntax from [CSS-VALUES-3]. Value types not defined in this specification are defined in CSS Values & Units [CSS-VALUES-3]. Combination with other CSS modules may expand the definitions of these value types.
In addition to the property-specific values listed in their definitions,
all properties defined in this specification
also accept the
CSS-wide keywords
as their property value.
For readability they have not been repeated explicitly.
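For instance, the CSS-wide keywords can be combined with any Speech property; the selectors below are purely illustrative:

```css
/* Explicitly inherit the parent's volume level */
blockquote { voice-volume: inherit; }

/* Reset the balance to its initial value (center) */
.reset { voice-balance: initial; }
```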
4. Example
This example shows how authors can tell the speech synthesizer to speak HTML headings
with a voice called "paul",
using "moderate" emphasis (which is more than normal)
and how to insert an audio cue (pre-recorded audio clip located at the given URL)
before the start of TTS rendering for each heading.
In a stereo-capable sound system,
paragraphs marked with the CSS class
heidi
are rendered on the left audio channel (and with a female voice, etc.),
whilst the class
peter
corresponds to the right channel (and to a male voice, etc.).
The volume level of text spans marked with the class
special
is lower than normal,
and a prosodic boundary is created
by introducing a strong pause after it is spoken
(note how the
span
inherits the voice-family from its parent paragraph).
h1, h2, h3, h4, h5, h6 {
  voice-family: paul;
  voice-stress: moderate;
  cue-before: url(../audio/ping.wav);
  voice-volume: medium 6dB;
}
p.heidi {
  voice-family: female;
  voice-balance: left;
  voice-pitch: high;
  voice-volume: -6dB;
}
p.peter {
  voice-family: male;
  voice-balance: right;
  voice-rate: fast;
}
span.special {
  voice-volume: soft;
  pause-after: strong;
}

<body>
...
<h1>I am Paul, and I speak headings.</h1>
<p class="heidi">Hello, I am Heidi.</p>
<p class="peter">
  <span class="special">Can you hear me ?</span>
  I am Peter.
</p>
...
</body>
5. The aural formatting model
The CSS formatting model for aural media is based on a sequence of sounds and silences that occur within a nested context similar to the visual box model, which we name the aural “box” model. The aural “canvas” consists of a two-channel (stereo) space and of a temporal dimension, within which synthetic speech and audio cues coexist. The selected element is surrounded by rest, cue, and pause properties (from the innermost to the outermost position). These can be seen as aural equivalents to padding, border, and margin, respectively. When used, the ::before and ::after pseudo-elements [CSS2] get inserted between the element’s contents and the rest. The following diagram illustrates the equivalence between properties of the visual and aural box models, applied to the selected element.
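The nesting described above can be sketched in CSS as follows; the selector and audio file URL are illustrative:

```css
/* Aural "box" around each blockquote, outermost to innermost:
   pause (like margin), cue (like border), rest (like padding). */
blockquote {
  pause-before: strong;        /* outer silence, analogous to margin */
  cue-before: url(intro.wav);  /* auditory icon, analogous to border */
  rest-before: 250ms;          /* inner silence, analogous to padding */
  rest-after: 250ms;
  cue-after: url(intro.wav);
  pause-after: strong;
}
```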
6. Mixing properties
6.1. The voice-volume property
Name: voice-volume
Value: silent | [ [ x-soft | soft | medium | loud | x-loud ] || <decibel> ]
Initial: medium
Applies to: all elements
Inherited: yes
Percentages: N/A
Computed value: silent, or a keyword value and optionally also a decibel offset (if not zero)
Canonical order: per grammar
Animation type: by computed value type
The
voice-volume
property allows authors to control
the amplitude of the audio waveform generated by the speech synthesizer,
and is also used to adjust the relative volume level of
audio cues
within the
aural box model
of the selected element.
Note: Although the functionality provided by this property is similar to the volume attribute of the prosody element from the SSML markup language [SSML], there are notable discrepancies. For example, CSS Speech volume keywords and decibel units are not mutually exclusive, due to how values are inherited and combined for selected elements.
silent
Specifies that no sound is generated (the text is read "silently").
Note:
This has the same effect as using negative infinity decibels.
Also note that there is a difference between an element whose voice-volume property has a value of silent and an element whose speak property has the value never.
With the former,
the selected element takes up the same time as if it was spoken,
including any pause before and after the element,
but no sound is generated
(and descendants within the
aural box model
of the selected element
can override the
voice-volume
value, and may therefore generate audio output).
With the latter,
the selected element is not rendered in the aural dimension
and no time is allocated for playback
(descendants within the
aural box model
of the selected element
can override the
speak
value,
and may therefore generate audio output).
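The distinction can be illustrated with the following sketch; the class names are hypothetical:

```css
/* Occupies the same speaking time, but produces no sound;
   descendants could still override voice-volume and be heard. */
.redacted { voice-volume: silent; }

/* Not rendered aurally at all: no time is allocated for playback. */
.skipped { speak: never; }
```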
x-soft, soft, medium, loud, x-loud
This sequence of keywords corresponds to monotonically non-decreasing volume levels, mapped to implementation-dependent values that meet the listener’s requirements with regards to perceived loudness. These audio levels are typically provided via a preference mechanism that allows users to calibrate sound options according to their auditory environment.
The keyword x-soft maps to the user’s minimum audible volume level, x-loud maps to the user’s maximum tolerable volume level, medium maps to the user’s preferred volume level, and soft and loud map to intermediary values.

<decibel>
This represents a change (positive or negative) relative to the given keyword value (see enumeration above), or to the default value for the root element, or otherwise to the inherited volume level (which may itself be a combination of a keyword value and decibel offset, in which case the decibel values are combined additively). When the inherited volume level is silent, this voice-volume resolves to silent too, regardless of the specified <decibel> value.
The <decibel> type denotes a dimension with a “dB” (decibel) unit identifier. Decibels represent the ratio of the squares of the new signal amplitude a1 and the current amplitude a0, as per the following logarithmic equation: volume(dB) = 20 × log10(a1 / a0).
Note:
-6.0dB is approximately half the amplitude of the audio signal,
and +6.0dB is approximately twice the amplitude.
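As a sketch of how authors might use decibel offsets in practice (the selectors are illustrative):

```css
/* Roughly double the amplitude relative to the keyword level:
   +6dB corresponds to approximately ×2 amplitude. */
strong { voice-volume: loud 6dB; }

/* Roughly halve the inherited amplitude: -6dB is approximately ×0.5. */
aside { voice-volume: -6dB; }
```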
Note:
Perceived loudness depends on various factors,
such as the listening environment, user preferences or physical abilities.
The effective volume variation between
x-soft
and
x-loud
represents
the dynamic range (in terms of loudness) of the audio output.
Typically, this range would be compressed in a noisy context,
i.e. the perceived loudness corresponding to
x-soft
would effectively be closer to
x-loud
than it would be in a quiet environment.
There may also be situations where both
x-soft
and
x-loud
would map to low volume levels,
such as in listening environments requiring discretion
(e.g. library, night-reading).
6.2. The voice-balance property
Name: voice-balance
Value: <number> | left | center | right | leftwards | rightwards
Initial: center
Applies to: all elements
Inherited: yes
Percentages: N/A
Computed value: the specified value resolved to a <number> between -100 and 100 (inclusive)
Canonical order: per grammar
Animation type: by computed value type
The
voice-balance
property controls the spatial distribution
of audio output across a lateral sound stage:
one extremity is on the left, the other extremity is on the right hand side,
relative to the listener’s position.
Authors can specify intermediary steps between the left and right extremities, to represent the audio separation along the resulting left-right axis.
Note: The functionality provided by this property has no match in the SSML markup language [SSML].

<number>
A <number> between -100 and 100 (inclusive). Values smaller than -100 are clamped to -100; values greater than 100 are clamped to 100. The value -100 represents the left side, and the value 100 represents the right side. The value 0 represents the center point, whereby there is no discernible audio separation between left and right sides. (In a stereo sound system, this corresponds to equal distribution of audio signals between left and right speakers.)
left
Same as -100.
center
Same as 0.
right
Same as 100.
leftwards
Moves the sound to the left by subtracting 20 from the inherited voice-balance value (and by clamping the resulting number to -100).
rightwards
Moves the sound to the right by adding 20 to the inherited voice-balance value (and by clamping the resulting number to 100).
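For example, the relative keywords combine with the inherited value; the selectors below are illustrative:

```css
p { voice-balance: 40; }              /* slightly right of center */
p em { voice-balance: rightwards; }   /* inherits 40, resolves to 60 */
p s  { voice-balance: leftwards; }    /* inherits 40, resolves to 20 */
```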
User agents can be connected to different kinds of sound systems,
featuring varying audio mixing capabilities.
The expected behavior for mono, stereo, and surround sound systems
is defined as follows:
When user agents produce audio via a mono-aural sound system
(i.e. single-speaker setup),
the
voice-balance
property has no effect.
When user agents produce audio through a stereo sound system
(e.g. two speakers, or a pair of headphones),
the left-right distribution of audio signals
can precisely match the authored values for the
voice-balance
property.
When user agents are capable of mixing audio signals through more than 2 channels
(e.g. 5-speakers surround sound system, including a dedicated center channel),
the physical distribution of audio signals
resulting from the application of the
voice-balance
property
should be performed so that the listener perceives sound
as if it was coming from a basic stereo layout.
For example, the center channel as well as the left/right speakers
may be used all together
in order to emulate the behavior of the
center
value.
Future revisions of the CSS Speech module may include support for three-dimensional audio,
which would effectively enable authors to specify “azimuth” and “elevation” values.
In the future, content authored using the current specification
may therefore be consumed by user agents which are compliant
with the version of CSS Speech that supports three-dimensional audio.
In order to prepare for this possibility,
the values enabled by the current
voice-balance
property
are designed to remain compatible with “azimuth” angles.
More precisely, the mapping between the current left-right audio axis (lateral sound stage)
and the envisioned 360 degrees plane around the listener’s position
is defined as follows:
The value 0 maps to zero degrees (center). This is in “front” of the listener, not from “behind”. The value -100 maps to -40 degrees (left). Negative angles are in the counter-clockwise direction (assuming the audio stage is seen from the top). The value 100 maps to 40 degrees (right). Positive angles are in the clockwise direction (assuming the audio stage is seen from the top). Intermediary values on the scale from -100 to 100 map to the angles between -40 and 40 degrees in a numerically linearly-proportional manner. For example, -50 maps to -20 degrees.
Note:
Sound systems can be configured by users
in such a way that it would interfere with the left-right audio distribution
specified by document authors.
Typically, the various “surround” modes available in modern sound systems
(including systems based on basic stereo speakers)
tend to greatly alter the perceived spatial arrangement of audio signals.
The illusion of a three-dimensional sound stage
is often achieved using a combination of
phase shifting, digital delay, volume control (channel mixing), and other techniques.
Some users may even configure their system to “downgrade” any rendered sound
to a single mono channel,
in which case the effect of the
voice-balance
property
would obviously not be perceivable at all.
The rendering fidelity of authored content
is therefore dependent on such user customizations,
and the
voice-balance
property merely specifies the desired end-result.
Note:
Many speech synthesizers only generate mono sound,
and therefore do not intrinsically support the
voice-balance
property.
The sound distribution along the left-right axis consequently occurs at a post-synthesis stage (when the speech-enabled user agent mixes the various audio sources authored within the document).
7. Speaking properties
7.1. The speak property
Name: speak
Value: auto | never | always
Initial: auto
Applies to: all elements
Inherited: yes
Percentages: N/A
Computed value: specified value
Canonical order: per grammar
Animation type: discrete
The
speak
property determines whether or not to render text aurally.
Note: The functionality provided by this property has no match in the SSML markup language [SSML].
auto
Resolves to a computed value of never when display is none; otherwise resolves to a computed value of auto. The used value of a computed auto is equivalent to always if visibility is visible, and to never otherwise.
Note: The none value of the display property cannot be overridden by descendants of the selected element, but the auto value of speak can, however, be overridden using either of never or always.
never
This value causes an element (including pauses, cues, rests and actual content)
to not be rendered (i.e., the element has no effect in the aural dimension).
Note: Any of the descendants of the affected element are allowed to override this value, so descendants can actually take part in the aural rendering despite using speak: never at this level. However, the pauses, cues, and rests of the ancestor element remain “deactivated” in the aural dimension, and therefore do not contribute to the collapsing of pauses or additive behavior of adjoining rests.
always
The element is rendered aurally (regardless of its display value, or the display or speak values of its ancestors).
Note: Using this value can result in the element being rendered in the aural dimension even though it would not be rendered on the visual canvas.
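For instance, the speak values can decouple aural rendering from visual rendering; the class names are hypothetical:

```css
/* Visually hidden, yet still rendered by speech output. */
.aural-only { display: none; speak: always; }

/* Visible on screen, but skipped by speech output. */
.visual-only { speak: never; }
```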
7.2. The speak-as property
Name: speak-as
Value: normal | spell-out || digits || [ literal-punctuation | no-punctuation ]
Initial: normal
Applies to: all elements
Inherited: yes
Percentages: N/A
Computed value: specified value
Canonical order: per grammar
Animation type: discrete
The
speak-as
property determines in what manner text gets rendered aurally,
based upon a predefined list of possibilities.
Note: The functionality provided by this property is conceptually similar to the say-as element from the SSML markup language [SSML] (whose possible values are described in the [SSML-SAYAS] W3C Note). Although the design goals are similar, the CSS model is limited to a basic set of pronunciation rules.
normal
Uses language-dependent pronunciation rules for rendering the element’s content.
For example, punctuation is not spoken as-is,
but instead rendered naturally as appropriate pauses.
spell-out
Spells the text one letter at a time (useful for acronyms and abbreviations).
In languages where accented characters are rare,
it is permitted to drop accents in favor of alternative unaccented spellings.
As an example, in English, the word “rôle” can also be written as “role”.
A conforming implementation would thus be able to spell-out “rôle” as “R O L E”.
digits
Speak numbers one digit at a time,
for instance, “twelve” would be spoken as “one two”,
and “31” as “three one”.
Note:
Speech synthesizers are knowledgeable about what a
number
is.
The
speak-as
property enables some level of control on how user agents render numbers,
and may be implemented as a preprocessing step
before passing the text to the actual speech synthesizer.
literal-punctuation
Punctuation such as semicolons, braces, and so on
is named aloud (i.e. spoken literally)
rather than rendered naturally as appropriate pauses.
no-punctuation
Punctuation is not rendered: neither spoken nor rendered as pauses.
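A sketch combining these values (the selectors and expected renderings in the comments are illustrative):

```css
abbr { speak-as: spell-out; }           /* spelled one letter at a time */
.serial { speak-as: digits; }           /* "31" spoken as "three one" */
code { speak-as: literal-punctuation; } /* semicolons etc. named aloud */
```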
8. Pause properties
8.1. The pause-before and pause-after properties
Name: pause-before, pause-after
Value: