Voice Extensible Markup Language (VoiceXML) Version 2.0
W3C Recommendation 16 March 2004
This Version:
Latest Version:
Previous Version:
Editors:
Scott McGlashan, Hewlett-Packard (Editor-in-Chief)
Daniel C. Burnett, Nuance Communications
Jerry Carter, Invited Expert
Peter Danielsen, Lucent (until October 2002)
Jim Ferrans, Motorola
Andrew Hunt, ScanSoft
Bruce Lucas, IBM
Brad Porter, Tellme Networks
Ken Rehor, Vocalocity
Steph Tryphonas, Tellme Networks
Please refer to the errata for this document, which may include some
normative corrections. See also translations.
Copyright © 2004 W3C (MIT, ERCIM, Keio), All Rights Reserved. W3C
liability, trademark, document use and software licensing rules
apply.
Abstract
This document specifies VoiceXML, the Voice Extensible Markup
Language. VoiceXML is designed for creating audio dialogs that
feature synthesized speech, digitized audio, recognition of
spoken and DTMF key input, recording of spoken input, telephony,
and mixed initiative conversations. Its major goal is to bring
the advantages of Web-based development and content delivery to
interactive voice response applications.
Status of this Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the
W3C technical reports index
at http://www.w3.org/TR/.
This document has been reviewed by W3C Members and other interested
parties, and it has been endorsed by the Director as a W3C
Recommendation. W3C's role in making the Recommendation is to draw
attention to the specification and to promote its widespread
deployment. This enhances the functionality and interoperability of
the Web.
This specification is part of the W3C Speech Interface Framework and
has been developed within the W3C Voice Browser Activity by
participants in the Voice Browser Working Group (W3C Members only).
The design of VoiceXML 2.0 has been widely reviewed (see the
disposition of comments) and satisfies the Working Group's technical
requirements. A list of implementations is included in the VoiceXML
2.0 implementation report, along with the associated test suite.
Comments are welcome on www-voice@w3.org (archive). See W3C mailing
list and archive usage guidelines. The W3C maintains a list of any
patent disclosures related to this work.
Conventions of this Document
In this document, the key words "must", "must not",
"required", "shall", "shall not", "should", "should not",
"recommended", "may", and "optional" are to be interpreted as
described in
[RFC2119]
and indicate requirement levels for compliant VoiceXML
implementations.
Table of Contents

Abbreviated Contents

1. Overview
2. Dialog Constructs
3. User Input
4. System Output
5. Control flow and scripting
6. Environment and Resources
Appendices

Full Contents

1. Overview
1.1 Introduction
1.2 Background
1.2.1 Architectural Model
1.2.2 Goals of VoiceXML
1.2.3 Scope of VoiceXML
1.2.4 Principles of Design
1.2.5 Implementation Platform Requirements
1.3 Concepts
1.3.1 Dialogs and Subdialogs
1.3.2 Sessions
1.3.3 Applications
1.3.4 Grammars
1.3.5 Events
1.3.6 Links
1.4 VoiceXML Elements
1.5 Document Structure and Execution
1.5.1 Execution within one Document
1.5.2 Executing a Multi-Document Application
1.5.3 Subdialogs
1.5.4 Final Processing
2. Dialog Constructs
2.1 Forms
2.1.1 Form Interpretation
2.1.2 Form Items
2.1.3 Form Item Variables and Conditions
2.1.4 Directed Forms
2.1.5 Mixed Initiative Forms
2.1.6 Form Interpretation Algorithm
2.2 Menus
2.2.1 menu element
2.2.2 choice element
2.2.3 DTMF in Menus
2.2.4 enumerate element
2.2.5 Grammar Generation
2.2.6 Interpretation Model
2.3 Form Items
2.3.1 field element
2.3.2 block element
2.3.3 initial element
2.3.4 subdialog element
2.3.5 object element
2.3.6 record element
2.3.7 transfer element
2.4 Filled
2.5 Links
3. User Input
3.1 Grammars
3.1.1 Speech Grammars
3.1.2 DTMF Grammars
3.1.3 Scope of Grammars
3.1.4 Activation of Grammars
3.1.5 Semantic Interpretation of Input
3.1.6 Mapping Semantic Interpretation Results to VoiceXML forms
4. System Output
4.1 Prompt
4.1.1 Speech Markup
4.1.2 Basic Prompts
4.1.3 Audio Prompting
4.1.4 value Element
4.1.5 Bargein
4.1.6 Prompt Selection
4.1.7 Timeout
4.1.8 Prompt Queueing and Input Collection
5. Control flow and scripting
5.1 Variables and Expressions
5.1.1 Declaring Variables
5.1.2 Variable Scopes
5.1.3 Referencing Variables
5.1.4 Standard Session Variables
5.1.5 Standard Application Variables
5.2 Event Handling
5.2.1 throw element
5.2.2 catch element
5.2.3 Shorthand Notation
5.2.4 catch Element Selection
5.2.5 Default catch elements
5.2.6 Event Types
5.3 Executable Content
5.3.1 var element
5.3.2 assign element
5.3.3 clear element
5.3.4 if, elseif, else elements
5.3.5 prompts
5.3.6 reprompt element
5.3.7 goto element
5.3.8 submit element
5.3.9 exit element
5.3.10 return element
5.3.11 disconnect element
5.3.12 script element
5.3.13 log element
6. Environment and Resources
6.1 Resource Fetching
6.1.1 Fetching
6.1.2 Caching
6.1.3 Prefetching
6.1.4 Protocols
6.2 Metadata Information
6.2.1 meta element
6.2.2 metadata element
6.3 property element
6.3.1 Platform-Specific Properties
6.3.2 Generic Speech Recognizer Properties
6.3.3 Generic DTMF Recognizer Properties
6.3.4 Prompt and Collect Properties
6.3.5 Fetching Properties
6.3.6 Miscellaneous Properties
6.4 param element
6.5 Value Designations

Appendices

Appendix A. Glossary of Terms
Appendix B. VoiceXML Document Type Definition
Appendix C. Form Interpretation Algorithm
Appendix D. Timing Properties
Appendix E. Audio File Formats
Appendix F. Conformance
Appendix G. Internationalization
Appendix H. Accessibility
Appendix I. Privacy
Appendix J. Changes from VoiceXML 1.0
Appendix K. Reusability
Appendix L. Acknowledgements
Appendix M. References
Appendix N. Media Type and File Suffix
Appendix O. VoiceXML XML Schema Definition
Appendix P. Builtin Grammar Types
1.
Overview
This document defines VoiceXML, the Voice Extensible Markup
Language. Its background, basic concepts and use are presented in
Section 1. The dialog constructs of form, menu and link, and the
mechanism (Form Interpretation Algorithm) by which they are
interpreted are then introduced in Section 2. User input using DTMF
and speech grammars is covered in Section 3, while Section 4 covers
system output using speech synthesis and recorded audio. Mechanisms
for manipulating dialog control flow, including variables, events,
and executable elements, are explained in Section 5. Environment
features such as parameters and properties as well as resource
handling are specified in Section 6. The appendices provide
additional information including the VoiceXML Schema, a detailed
specification of the Form Interpretation Algorithm, timing
properties, audio file formats, and statements relating to
conformance, internationalization, accessibility, and privacy.
The origins of VoiceXML began in 1995 as an XML-based dialog
design language intended to simplify the speech recognition
application development process within an AT&T project called
Phone Markup Language (PML). As AT&T reorganized, teams at
AT&T, Lucent and Motorola continued working on their own
PML-like languages.
In 1998, W3C hosted a conference on voice browsers. By this
time, AT&T and Lucent had different variants of their
original PML, while Motorola had developed VoxML, and IBM was
developing its own SpeechML. Many other attendees at the
conference were also developing similar languages for dialog
design; for example, such as HP's TalkML and PipeBeach's
VoiceHTML.
The VoiceXML Forum was then formed by AT&T, IBM, Lucent,
and Motorola to pool their efforts. The mission of the VoiceXML
Forum was to define a standard dialog design language that
developers could use to build conversational applications. They
chose XML as the basis for this effort because it was clear to
them that this was the direction technology was going.
In 2000, the VoiceXML Forum released VoiceXML 1.0 to the
public. Shortly thereafter, VoiceXML 1.0 was submitted to the W3C
as the basis for the creation of a new international standard.
VoiceXML 2.0 is the result of this work based on input from W3C
Member companies, other W3C Working Groups, and the public.
Developers familiar with VoiceXML 1.0 are particularly directed to
Changes from Previous Public Version, which summarizes how VoiceXML
2.0 differs from VoiceXML 1.0.
1.1
Introduction
VoiceXML is designed for creating audio dialogs that feature
synthesized speech, digitized audio, recognition of spoken and
DTMF key input, recording of spoken input, telephony, and
mixed initiative conversations. Its major goal is to bring the
advantages of Web-based development and content delivery to
interactive voice response applications.
Here are two short examples of VoiceXML. The first is the
venerable "Hello World":

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
 <form>
  <block>Hello World!</block>
 </form>
</vxml>
The top-level element is <vxml>, which is mainly a
container for
dialogs
. There are two types of dialogs:
forms
and
menus
. Forms present information and
gather input; menus offer choices of what to do next. This
example has a single form, which contains a block that
synthesizes and presents "Hello World!" to the user. Since the
form does not specify a successor dialog, the conversation
ends.
Our second example asks the user for a choice of drink and
then submits it to a server script:

<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2001/vxml
   http://www.w3.org/TR/voicexml20/vxml.xsd"
  version="2.0">
 <form>
  <field name="drink">
   <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
   <grammar src="drink.grxml" type="application/srgs+xml"/>
  </field>
  <block>
   <submit next="http://www.drink.example.com/drink2.asp"/>
  </block>
 </form>
</vxml>
A <field> is an input field. The user must provide a value for the
field before proceeding to the next element in the form. A sample
interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default
message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
1.2
Background
This section contains a high-level architectural model, whose
terminology is then used to describe the goals of VoiceXML, its
scope, its design principles, and the requirements it places on
the systems that support it.
1.2.1
Architectural Model
The architectural model assumed by this document has the
following components:
Figure 1: Architectural Model
A document server (e.g. a Web server) processes requests from a
client application, the VoiceXML Interpreter, through the VoiceXML
interpreter context. The server produces VoiceXML documents in
reply, which are
processed by the VoiceXML interpreter. The VoiceXML interpreter
context may monitor user inputs in parallel with the VoiceXML
interpreter. For example, one VoiceXML interpreter context may
always listen for a special escape phrase that takes the user to
a high-level personal assistant, and another may listen for
escape phrases that alter user preferences like volume or
text-to-speech characteristics.
The
implementation platform
is controlled by the
VoiceXML interpreter context and by the VoiceXML interpreter. For
instance, in an interactive voice response application, the
VoiceXML interpreter context may be responsible for detecting an
incoming call, acquiring the initial VoiceXML document
and answering the call, while the VoiceXML interpreter conducts
the dialog after answer. The implementation platform generates
events in response to user actions (e.g. spoken or character
input received, disconnect) and system events (e.g. timer
expiration). Some of these events are acted upon by the VoiceXML
interpreter itself, as specified by the VoiceXML document, while
others are acted upon by the VoiceXML interpreter context.
1.2.2 Goals of VoiceXML
VoiceXML's main goal is to bring the full power of Web
development and content delivery to voice response applications,
and to free the authors of such applications from low-level
programming and resource management. It enables integration of
voice services with data services using the familiar
client-server paradigm. A voice service is viewed as a sequence
of interaction dialogs between a user and an implementation
platform. The dialogs are provided by document servers, which may
be external to the implementation platform. Document servers
maintain overall service logic, perform database and legacy
system operations, and produce dialogs. A VoiceXML document
specifies each interaction dialog to be conducted by a VoiceXML
interpreter. User input affects dialog interpretation and is
collected into requests submitted to a document server. The
document server replies with another VoiceXML document to
continue the user's session with other dialogs.
VoiceXML is a markup language that:
Minimizes client/server interactions by specifying multiple
interactions per document.
Shields application authors from low-level, and
platform-specific details.
Separates user interaction code (in VoiceXML) from service
logic (e.g. CGI scripts).
Promotes service portability across implementation platforms.
VoiceXML is a common language for content providers, tool
providers, and platform providers.
Is easy to use for simple interactions, and yet provides
language features to support complex dialogs.
While VoiceXML strives to accommodate the requirements of a
majority of voice response services, services with stringent
requirements may best be served by dedicated applications that
employ a finer level of control.
1.2.3 Scope of VoiceXML
The language describes the human-machine interaction provided
by voice response systems, which includes:
Output of synthesized speech (text-to-speech).
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.
The language provides means for collecting character and/or
spoken input, assigning the input results to document-defined
request variables, and making decisions that affect the
interpretation of documents written in the language. A document
may be linked to other documents through Universal Resource
Identifiers (URIs).
1.2.4 Principles of Design
VoiceXML is an XML application [XML].
The language promotes portability of services through
abstraction of platform resources.
The language accommodates platform diversity in supported
audio file formats, speech grammar formats, and URI schemes.
While producers of platforms may support various grammar formats,
the language requires a common grammar format, namely the XML Form
of the W3C Speech Recognition Grammar Specification [SRGS], to
facilitate interoperability. Similarly, while various audio formats
for playback and recording may be supported, the audio formats
described in Appendix E must be supported.
The language supports ease of authoring for common types of
interactions.
The language has well-defined semantics that preserves the
author's intent regarding the behavior of interactions with the
user. Client heuristics are not required to determine document
element interpretation.
The language recognizes semantic interpretations from grammars
and makes this information available to the application.
The language has a control flow mechanism.
The language enables a separation of service logic from
interaction behavior.
It is not intended for intensive computation, database
operations, or legacy system operations. These are assumed to be
handled by resources outside the document interpreter, e.g. a
document server.
General service logic, state management, dialog generation,
and dialog sequencing are assumed to reside outside the document
interpreter.
The language provides ways to link documents using URIs, and
also to submit data to server scripts using URIs.
VoiceXML provides ways to identify exactly which data to
submit to the server, and which HTTP method (GET or POST) to use
in the submittal.
The language does not require document authors to explicitly
allocate and deallocate dialog resources, or deal with
concurrency. Resource allocation and concurrent threads of
control are to be handled by the implementation platform.
1.2.5 Implementation Platform Requirements
This section outlines the requirements on the
hardware/software platforms that will support a VoiceXML
interpreter.
Document acquisition.
The interpreter context is
expected to acquire documents for the VoiceXML interpreter to act
on. The "http" URI scheme must be supported. In some cases, the
document request is generated by the interpretation of a VoiceXML
document, while other requests are generated by the interpreter
context in response to events outside the scope of the language,
for example an incoming phone call. When issuing document
requests via http, the interpreter context identifies itself
using the "User-Agent" header variable with the value
"<name>/<version>", for example, "acme-browser/1.2".
Audio output.
An implementation platform must support
audio output using audio files and text-to-speech (TTS). The
platform must be able to freely sequence TTS and audio output. If
an audio output resource is not available, an error.noresource
event must be thrown. Audio files are referred to by a URI. The
language specifies a required set of audio file formats which
must be supported (see
Appendix E
); additional audio file formats may
also be supported.
Audio input.
An implementation platform is required to
detect and report character and/or spoken input simultaneously
and to control input detection interval duration with a timer
whose length is specified by a VoiceXML document. If an audio
input resource is not available, an error.noresource event must
be thrown.
It must report
characters
(for example, DTMF) entered
by a user. Platforms must support the XML form of DTMF grammars
described in the W3C Speech Recognition Grammar Specification
[SRGS]
. They should also
support the Augmented BNF (ABNF) form of DTMF grammars described
in the W3C Speech Recognition Grammar Specification
[SRGS].
It must be able to receive
speech recognition
grammar
data dynamically. It must be able to use speech grammar data in
the XML Form of the W3C Speech Recognition Grammar Specification
[SRGS]
. It should be able to
receive speech recognition grammar data in the ABNF form of the
W3C Speech Recognition Grammar Specification
[SRGS]
, and may support other formats such as
the JSpeech Grammar Format
[JSGF]
or proprietary formats. Some VoiceXML
elements contain speech grammar data; others refer to speech
grammar data through a URI. The speech recognizer must be able to
accommodate dynamic update of the spoken input for which it is
listening through either method of speech grammar data
specification.
It must be able to
record
audio received from the user.
The implementation platform must be able to make the recording
available to a
request
variable. The language specifies a
required set of recorded audio file formats which must be
supported (see
Appendix
); additional formats may also be supported.
Transfer.
The platform should be able to support making
a third party connection through a communications network, such
as the telephone.
1.3
Concepts
A VoiceXML
document
(or a set of related documents
called an
application
) forms a conversational finite state
machine. The user is always in one conversational state, or
dialog
, at a time. Each dialog determines the next dialog
to transition to.
Transitions
are specified using URIs,
which define the next document and dialog to use. If a URI does
not refer to a document, the current document is assumed. If it
does not refer to a dialog, the first dialog in the document is
assumed. Execution is terminated when a dialog does not specify
a successor, or if it has an element that explicitly exits the
conversation.
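These transition rules can be sketched with the <goto> element; the dialog id and document name below are invented for illustration:

```xml
<block>
  <!-- Jump to the dialog with id "confirm" in the current document. -->
  <goto next="#confirm"/>
</block>

<block>
  <!-- Jump to another document; since no fragment is given,
       its first dialog is entered by default. -->
  <goto next="checkout.vxml"/>
</block>
```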
1.3.1 Dialogs and Subdialogs
There are two kinds of dialogs: forms and menus. Forms define an
interaction that collects values for a set of form item variables.
Each field may specify a grammar that
defines the allowable inputs for that field. If a form-level
grammar is present, it can be used to fill several fields from
one utterance. A menu presents the user with a choice of options
and then transitions to another dialog based on that choice.
A subdialog is like a function call, in that it
provides a mechanism for invoking a new interaction, and
returning to the original form. Variable instances, grammars, and
state information are saved and are available upon returning to
the calling document. Subdialogs can be used, for example, to
create a confirmation sequence that may require a database query;
to create a set of components that may be shared among documents
in a single application; or to create a reusable library of
dialogs shared among many applications.
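A minimal sketch of the function-call analogy, with invented dialog and variable names: one form invokes a confirmation subdialog in the same document and acts on the variable it returns.

```xml
<form id="order">
  <!-- Invoke the dialog below; its return value is bound to "confirm". -->
  <subdialog name="confirm" src="#get_confirmation">
    <filled>
      <if cond="confirm.answer == 'yes'">
        <submit next="place_order.asp"/>
      </if>
    </filled>
  </subdialog>
</form>

<form id="get_confirmation">
  <field name="answer">
    <prompt>Say yes or no.</prompt>
    <grammar src="yesno.grxml" type="application/srgs+xml"/>
  </field>
  <filled>
    <!-- Return control and the collected variable to the caller. -->
    <return namelist="answer"/>
  </filled>
</form>
```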
1.3.2 Sessions
A session begins when the user starts to interact with a VoiceXML
interpreter context, continues as documents are loaded
and processed, and ends when requested by the user, a document,
or the interpreter context.
1.3.3 Applications
An
application
is a set of documents sharing the same
application root document
. Whenever the user interacts
with a document in an application, its application root document
is also loaded. The application root document remains loaded
while the user is transitioning between other documents in the
same application, and it is unloaded when the user transitions to
a document that is not in the application. While it is loaded,
the application root document's variables are available to the
other documents as application variables, and its grammars remain
active for the duration of the application, subject to the grammar
activation rules discussed in Section 3.1.4.
Figure 2 shows the transition of documents (D) in an
application that share a common application root document
(root).
Figure 2: Transitioning between documents in an application.
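The root/leaf relationship can be sketched as follows; the file names and the caller_name variable are hypothetical:

```xml
<!-- root.vxml: the application root document. Its variables are
     visible to all documents in the application. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="caller_name"/>
</vxml>

<!-- leaf.vxml: names the root via the application attribute. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="root.vxml">
  <form>
    <block>
      Welcome back <value expr="application.caller_name"/>.
    </block>
  </form>
</vxml>
```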
1.3.4 Grammars
Each dialog has one or more speech and/or DTMF
grammars
associated with it. In
machine directed
applications, each
dialog's grammars are active only when the user is in that
dialog. In
mixed initiative
applications, where the user
and the machine alternate in determining what to do next, some of
the dialogs are flagged to make their grammars
active
(i.e., listened for) even when the user is in another dialog in
the same document, or on another loaded document in the same
application. In this situation, if the user says something
matching another dialog's active grammars, execution
transitions to that other dialog, with the user's utterance
treated as if it were said in that dialog. Mixed initiative adds
flexibility and power to voice applications.
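One way a dialog's grammars are flagged as active beyond the dialog itself is the scope attribute on <form>; the grammar files and form id here are invented:

```xml
<!-- scope="document" keeps this form's grammars active while the
     user is in other dialogs of the same document. -->
<form id="weather" scope="document">
  <grammar src="weather.grxml" type="application/srgs+xml"/>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.grxml" type="application/srgs+xml"/>
  </field>
</form>
```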
1.3.5 Events
VoiceXML provides a form-filling mechanism for handling
"normal" user input. In addition, VoiceXML defines a mechanism
for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of
circumstances, such as when the user does not respond, doesn't
respond intelligibly, requests help, etc. The interpreter also
throws events if it finds a semantic error in a VoiceXML
document. Events are caught by catch elements or their syntactic
shorthand. Each element in which an event can occur may specify
catch elements. Furthermore, catch elements are also inherited
from enclosing elements "as if by copy". In this way, common
event handling behavior can be specified at any level, and it
applies to all lower levels.
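A minimal sketch of this inheritance, with invented prompts and grammar file: a document-level catch element handles nomatch and noinput events for every dialog in the document, "as if by copy".

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Inherited by the form below and any other dialogs here. -->
  <catch event="nomatch noinput">
    <prompt>Sorry, I did not catch that.</prompt>
    <reprompt/>
  </catch>
  <form>
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar src="city.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```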
1.3.6 Links
A link supports mixed initiative. It specifies a grammar that is
active whenever the user is in the scope of the link. If user input
matches the link's grammar, control transfers to the link's
destination URI. A link can be used to throw an event or go to a
destination URI.
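As an illustrative sketch (the destination URI and grammar content are invented): a link whose inline grammar matches the word "operator" and transfers control when it is spoken anywhere in the link's scope.

```xml
<link next="operator.vxml">
  <grammar xmlns="http://www.w3.org/2001/06/grammar"
           version="1.0" root="root" type="application/srgs+xml">
    <rule id="root">operator</rule>
  </grammar>
</link>
```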
1.4
VoiceXML Elements
Table 1: VoiceXML Elements
Element      Purpose                      Section
<assign>     Assign a variable a value    5.3.2