Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document specifies VoiceXML 3.0, a modular XML language for creating interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed initiative conversations, and recording and presentation of a variety of media formats including digitized audio and digitized video.
Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 17 June 2010 Sixth Public Working Draft of "Voice Extensible Markup Language (VoiceXML) 3.0". The main differences from the previous draft are described in Appendix F Major changes since the last Working Draft. A diff-marked version of this document is also available for comparison purposes.
This document is very much a work in progress. Many sections are incomplete, only stubbed out, or missing entirely. To get early feedback, the group focused on defining enough functionality, modules, and profiles to demonstrate the general framework. To complete the specification, the group expects to introduce additional functionality (for example speaker identification and verification, external eventing) and describe the existing functionality at the level of detail given for the Prompt and Field modules. We explicitly request feedback on the framework, particularly any concerns about its implementability or suitability for expected applications. By late 2010 the group expects all key capabilities to be present in the specification, with details worked out by early 2011.
Applications written as 2.1 documents can be used under a 3.0 processor using the 2.1 profile. As an example, the Implementation Report tests for 2.1 (which include the IR tests for 2.0) will be supported on a 3.0 processor. Exceptions will be clarifications and changes needed to improve interoperability.
This document is a W3C Working Draft. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group. For more information see the Voice Browser FAQ. The Working Group expects to advance this Working Draft to Recommendation status.
Comments are welcome on [email protected] (archive). See W3C mailing list and archive usage guidelines.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Terminology
2 Overview
2.1 Structure of VoiceXML 3.0
2.2 Structure of this document
2.3 How to read this document
3 Data Flow Presentation (DFP) Framework
3.1 Data
3.2 Flow
3.3 Presentation
4 Core Concepts
4.1 Syntactic and Semantic descriptions
4.2 Resources, Resource Controllers, and Events
4.2.1 Top Level Controller
4.3 Syntax
4.4 Event Model
4.4.1 Internal Events
4.4.1.1 Event Interfaces
4.4.1.1.1 Event
4.4.1.1.2 EventTarget
4.4.1.1.3 EventListener
4.4.1.2 Event Flow
4.4.1.2.1 Event Listener Registration
4.4.1.2.2 Event Listener Activation
4.4.1.3 Event Categories
4.4.2 External Events
4.5 Document Initialization and Execution
4.5.1 Initialization
4.5.1.1 DOM Processing
4.5.1.2 Preparation for Execution
4.5.2 Execution
4.5.2.1 Subdialogs
4.5.2.2 Application Root
4.5.2.3 Summary of Syntax/Semantics Interaction
4.5.3 Transition Controllers
5 Resources
5.1 Datamodel Resource
5.1.1 Data Model Resource API
5.2 Prompt Queue Resource
5.2.1 State Chart Representation
5.2.2 SCXML Representation
5.2.3 Defined Events
5.2.4 Device Events
5.2.5 Open Issue
5.3 Recognition Resources
5.3.1 Definition
5.3.2 Defined Events
5.3.3 Device Events
5.3.4 State Chart Representation
5.3.5 SCXML Representation
5.4 SIV Resource
5.5 Connection Resource
5.5.1 Definition
5.5.2 Final Processing State
5.5.3 Defined Events
5.5.4 State Chart Representation
5.5.5 SCXML Representation
5.6 Timer Resource
5.6.1 Definition
5.6.2 Defined Events
5.6.3 Device Events
5.6.4 State Chart Representation
6 Modules
6.1 Grammar Module
6.1.1 Syntax
6.1.1.1 Attributes
6.1.1.2 Content Model
6.1.2 Semantics
6.1.2.1 Definition
6.1.2.2 Defined Events
6.1.2.3 External Events
6.1.2.4 State Chart Representation
6.1.2.5 SCXML Representation
6.1.3 Events
6.1.4 Examples
6.2 Inline SRGS Grammar Module
6.2.1 Syntax
6.2.2 Semantics
6.2.2.1 Definition
6.2.2.2 Defined Events
6.2.2.3 External Events
6.2.2.4 State Chart Representation
6.2.2.5 SCXML Representation
6.2.3 Events
6.2.4 Examples
6.3 External Grammar Module
6.3.1 Syntax
6.3.1.1 Attributes
6.3.1.2 Content Model
6.3.2 Semantics
6.3.2.1 Definition
6.3.2.2 Defined Events
6.3.2.3 External Events
6.3.2.4 State Chart Representation
6.3.2.5 SCXML Representation
6.3.3 Events
6.3.4 Examples
6.4 Prompt Module
6.4.1 Syntax
6.4.1.1 Attributes
6.4.1.2 Content Model
6.4.2 Semantics
6.4.2.1 Definition
6.4.2.2 Defined Events
6.4.2.3 External Events
6.4.2.4 State Chart Representation
6.4.2.5 SCXML Representation
6.4.3 Events
6.4.4 Examples
6.5 Builtin SSML Module
6.5.1 Syntax
6.5.2 Semantics
6.5.3 Examples
6.6 Media Module
6.6.1 Syntax
6.6.1.1 Attributes
6.6.1.2 Content Model
6.6.1.2.1 Tips (informative)
6.6.2 Semantics
6.6.3 Examples
6.7 Parseq Module
6.7.1 Syntax
6.7.2 Semantics
6.7.3 Examples
6.8 Foreach Module
6.8.1 Syntax
6.8.1.1 Attributes
6.8.1.2 Content Model
6.8.2 Semantics
6.8.3 Examples
6.9 Form Module
6.9.1 Syntax
6.9.2 Semantics
6.9.2.1 Form RC
6.9.2.1.1 Definition
6.9.2.1.2 Defined Events
6.9.2.1.3 External Events
6.9.2.1.4 State Chart Representation
6.9.2.1.5 SCXML Representation
6.10 Field Module
6.10.1 Syntax
6.10.2 Semantics
6.10.2.1 Field RC
6.10.2.1.1 Definition
6.10.2.1.2 Defined Events
6.10.2.1.3 External Events
6.10.2.1.4 State Chart Representation
6.10.2.1.5 SCXML Representation
6.10.2.2 PlayandRecognize RC
6.10.2.2.1 Definition
6.10.2.2.2 Defined Events
6.10.2.2.3 External Events
6.10.2.2.4 State Chart Representation
6.10.2.2.5 SCXML Representation
6.11 Builtin Grammar Module
6.11.1 Usage of Platform Grammars
6.11.2 Platform Requirements
6.11.3 Syntax and Semantics
6.11.4 Examples
6.12 Data Access and Manipulation Module
6.12.1 Overview
6.12.2 Semantics
6.12.2.1 The scope stack
6.12.2.2 Relevance of scope stack to properties
6.12.2.3 Implicit variables
6.12.2.4 Variable resolution
6.12.2.5 Standard session variables
6.12.2.6 Standard application variables
6.12.2.7 Legal variable values and expressions
6.12.3 Syntax
6.12.3.1 Creating variables: the <var> element
6.12.3.2 Reading variables: "expr" and "cond" attributes and the <value> element
6.12.3.2.1 Inserting variable values in prompts: The <value> element
6.12.3.3 Updating variables: the <assign> and <data> elements
6.12.3.3.1 The <assign> element
6.12.3.3.2 The <data> element
6.12.3.4 Deleting variables: the <clear> element
6.12.3.5 Relevance for properties
6.12.4 Backward compatibility with VoiceXML 2.1
6.12.5 Implicit functions using XPath
6.13 External Communication Module
6.13.1 Receiving external messages within a voice application
6.13.1.1 External Message Reflection
6.13.1.2 Receiving External Messages Asynchronously
6.13.1.3 Receiving External Messages Synchronously
6.13.1.3.1 <receive>
6.13.2 Sending messages from a voice application
6.13.2.1 sendtimeout
6.14 Session Root Module
6.14.1 Syntax
6.14.2 Semantics
6.14.3 Examples
6.15 Run Time Control Module
6.15.1 <rtc>
6.15.1.1 Syntax
6.15.2 <cancelrtc>
6.15.2.1 Syntax
6.15.3 Semantics
6.15.4 Examples
6.16 SIV Module
6.16.1 SIV Core Functions
6.16.2 Syntax
6.16.3 Semantics
6.16.3.1 Definition
6.16.3.2 Defined Events
6.16.3.3 External Events
6.16.3.4 State Chart Representation
6.16.4 Events
6.16.5 Examples
6.17 Subdialog Module
6.17.1 Syntax
6.17.2 Semantics
6.17.3 Examples
6.18 Disconnect Module
6.18.1 Syntax
6.18.1.1 Attributes
6.18.1.2 Content Model
6.18.2 Semantics
6.18.2.1 Definition
6.18.2.2 Defined Events
6.18.2.3 External Events
6.18.2.4 State Chart Representation
6.18.2.5 SCXML Representation
6.18.3 Example
6.19 Play Module
6.19.1 Semantics
6.19.1.1 Definition
6.19.1.2 Defined Events
6.19.1.3 External Events
6.19.1.4 State Chart Representation
6.19.1.5 SCXML Representation
6.20 Record Module
6.20.1 Syntax
6.20.1.1 Attributes
6.20.1.2 Content Model
6.20.1.3 Data Model Variables
6.20.2 Semantics
6.20.2.1 RecordInputItem RC
6.20.2.1.1 Definition
6.20.2.1.2 Defined Events
6.20.2.1.3 External Events
6.20.2.1.4 State Chart Representation
6.20.2.1.5 SCXML Representation
6.20.2.2 Record RC
6.20.2.2.1 Definition
6.20.2.2.2 Defined Events
6.20.2.2.3 External Events
6.20.2.2.4 State Chart Representation
6.20.2.2.5 SCXML Representation
6.21 Property Module
6.21.1 Syntax
6.21.1.1 Attributes
6.21.1.2 Content Model
6.21.2 Semantics
6.21.2.1 Definition
6.21.2.2 Defined Events
6.21.2.3 External Events
6.21.2.4 State Chart Representation
6.21.2.5 SCXML Representation
6.21.3 Events
6.21.4 Examples
6.22 Transition Controller Module
6.22.1 Syntax
6.22.1.1 Attributes
6.22.1.2 Content Model
6.22.2 Semantics
6.22.2.1 Definition
6.22.2.2 Defined Events
6.22.2.3 External Events
6.22.2.4 State Chart Representation
6.22.2.5 SCXML Representation
6.22.3 Events
6.22.4 Examples
7 Profiles
7.1 Legacy Profile
7.1.1 Conformance
7.1.2 Vxml Root Module Requirements
7.1.3 Form Module Requirements
7.1.4 Field Module Requirements
7.1.5 Prompt Module Requirements
7.1.6 Grammar Module Requirements
7.1.7 Data Access and Manipulation Module Requirements
7.2 Basic Profile
7.2.1 Introduction
7.2.2 What the Basic Profile includes
7.2.2.1 SIV functions
7.2.2.2 Presentation functions
7.2.2.3 Capture functions
7.2.2.4 Other modules
7.2.3 Returned results
7.2.4 What the Basic Profile does not include
7.2.5 Examples
7.3 Maximal Profile
7.4 Enhanced Profile
7.5 Convenience Syntax (Syntactic Sugar)
8 Environment
8.1 Resource Fetching
8.1.1 Fetching
8.1.2 Caching
8.1.2.1 Controlling the Caching Policy
8.1.3 Prefetching
8.1.4 Protocols
8.2 Properties
8.2.1 Speech Recognition Properties
8.2.2 DTMF Recognition Properties
8.2.3 Prompt and Collect Properties
8.2.4 Media Properties
8.2.5 Fetch Properties
8.2.6 Miscellaneous Properties
8.3 Speech and DTMF Input Timing Properties
8.3.1 DTMF Grammars
8.3.1.1 timeout, No Input Provided
8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate
8.3.1.3 interdigittimeout, Grammar is Ready to Terminate
8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate
8.3.1.5 termchar Empty When Grammar Must Terminate
8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.8 Invalid DTMF Input
8.3.2 Speech Grammars
8.3.2.1 timeout When No Speech Provided
8.3.2.2 completetimeout With Speech Grammar Recognized
8.3.2.3 incompletetimeout with Speech Grammar Unrecognized
8.4 Value Designations
8.4.1 Integers
8.4.2 Real Numbers
8.4.3 Times
9 Integration with Other Markup Languages
9.1 Embedding of VoiceXML within SCXML
9.2 Integrating Flow Control Languages into VoiceXML
9.2.1 SCXML for Dialog Management
9.2.1.1 System-driven Dialog
9.2.1.2 User-driven Dialog
9.2.2 Graceful Degradation
9.2.3 SCXML as Basis for Recursive MVC
A Acknowledgements
B References
B.1 Normative References
B.2 Informative References
C Glossary of Terms
D VoiceXML 3.0 XML Schema
D.1 Schema for VXML Root Module
D.2 Schema for Form Module
D.3 Schema for Field Module
D.4 Schema for Prompt Module
D.5 Schema for Builtin SSML Module
D.6 Schema for Foreach Module
D.7 Schema for Data Access and Manipulation Module
D.8 Schema for Legacy Profile
E Convenience Syntax in VoiceXML 2.x
E.1 Simplified Dialog Structure
E.2 Examples
E.2.1 <menu> with <choice>
E.2.2 Equivalent <form>, <field>, <option>
E.2.3 Equivalent <form>, <field>, <grammar>
F Major changes since the last Working Draft
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate required levels for compliant VoiceXML 3.0 implementations.
Terms used in this specification are defined in Appendix C Glossary of Terms.
How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.
To accommodate both, the Voice Browser Working Group
One of the benefits of detailed semantic descriptions is improved portability within VoiceXML. Two vendors may implement the same functionality differently; however, the functionality must be consistent with the semantic meanings described in this document so that application authors are isolated from the different implementations. This increases portability among platforms that support the same syntax. Note that there are many other factors affecting portability that are outside the scope of this document (e.g. speech recognition capabilities, telephony).
This document covers the following:
The remainder of this document is structured as follows:
3 Data Flow Presentation (DFP) Framework presents the Data-Flow-Presentation Framework, its importance for the development of VoiceXML 3.0 and how VoiceXML 3.0 fits into the model.
4 Core Concepts explains the core concepts underlying the new structure for VoiceXML, including resources, resource controllers, the relationship between syntax and semantics, DOM eventing, modules and profiles.
5 Resources presents the resources defined for the language. These provide the key presentation-related functionality in the language.
6 Modules presents the modules defined for the language. Each module consists of a syntax piece (with its user-visible events), a semantics piece (with its behind-the-scenes events) and a description of how the two are connected.
7 Profiles presents two profiles. The first, the VoiceXML 2.1 profile, shows how a language similar to VoiceXML 2.1 can be created using the structure and functionality of VoiceXML 3.0. The second, the Basic profile, is a simple compilation of all of the functionality available in VoiceXML 3.0.
The Appendices provide useful references and a glossary of terms used in the specification.
For everyone: Please first read 3 Data Flow Presentation (DFP) Framework. The data-flow-presentation distinction applies not only to VoiceXML 3.0, but to many of W3C's specifications. Understanding VoiceXML's role as a presentation language is crucial context for understanding the rest of the specification.
For application authors: we recommend that you begin with syntax and only gradually explore details of the semantics as you need to understand behavioral specifics.
For VoiceXML platform developers: we recommend that you begin with the functionality and framework and only focus on syntax later.
Unlike VoiceXML 2.0/2.1, the focus in VoiceXML 3.0 is almost exclusively on the user interface portions of the language. By choice, very little work has gone into the development of data storage and manipulation or control flow capabilities. In short, VoiceXML 3.0 has been designed from the ground up as a *presentation* language, according to the definition presented in the Data Flow Presentation ([DFP]) Framework.
Although VoiceXML 3.0 is a presentation language, it also contains within it all three levels of the DFP framework (Figure 6).
The Data Flow Presentation (DFP) Framework is an instance of the Model-View-Controller paradigm, where computation and control flow are kept distinct from application data and from the way in which the application communicates with the outside world. This partitioning of an application allows for any one layer to be replaced independently of the other two. In addition, it is possible to simultaneously make use of more than one Data (Model) language, Flow (Controller), and/or Presentation (View) language.
The Data layer of VoiceXML 3.0 is responsible for maintaining all presentation-specific information in a format that is easily accessible and easily editable. Note that the data layer of VoiceXML 3.0 is very different from the backend data for an application. Presentation-specific information disappears when the VoiceXML 3.0 application terminates, while data in the backend database continues to exist after VoiceXML 3.0 terminates. Examples of presentation-specific data might include the status of the dialog in collecting certain information, which prompts have just been played, how many of various error conditions have occurred so far, and the values entered by the user until they are transmitted to the back-end database or file system.
Within VoiceXML 3.0 the Data layer is realized through a pluggable data language and a data access or manipulation language. Access to and use of the data is aligned with options available in SCXML for simpler interaction with the Flow layer (see the next section). This specification defines two specific data languages, XML and ECMAScript, and two data access and manipulation languages, E4X/DOM and XPath. Others may be defined by implementers.
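As a rough illustration of how a pluggable data language might surface to the rest of the system, the sketch below models an ECMAScript-backed data model. The class and method names (EcmaScriptDataModel, createVariable, assign) are assumptions for illustration only, not the normative Data Model Resource API; the error string mirrors the VoiceXML error-naming convention.

```javascript
// Illustrative sketch only: one possible shape for an ECMAScript-backed
// data model. Names are hypothetical, not the normative API.
class EcmaScriptDataModel {
  constructor() { this.scope = {}; }
  createVariable(name, value) {          // analogous to <var>
    this.scope[name] = value;
  }
  getValue(name) {                       // analogous to an "expr" read
    return this.scope[name];
  }
  assign(name, value) {                  // analogous to <assign>
    if (!(name in this.scope)) {
      // assigning to an undeclared variable is a semantic error
      throw new Error("error.semantic: undeclared variable " + name);
    }
    this.scope[name] = value;
  }
}
```

An XML-backed data model with an XPath access language could expose the same operations over a node tree, which is what makes the layer pluggable.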
The Flow layer of VoiceXML 3.0 is responsible for all application control flow, including business logic, dialog management, and anything else that is not strictly data or presentation. VoiceXML 3.0 provides primitives that contain the control flow needed to implement them, but all combinations between and among the elements at the syntax level are done via calls to external control flow processors. Two that are likely to be used with VoiceXML are CCXML and SCXML. Note that flow control components written outside of VoiceXML may be communicating not only with a VoiceXML processor but with an HTML browser, a video game controller, or any of a variety of other input and output components.
The Presentation layer of VoiceXML 3.0 is responsible for all interaction with the outside world, i.e., human beings and external software components. VoiceXML 3.0 *is* the Presentation layer. Designed originally for human-computer interaction, VoiceXML "presents" a dialog by accepting audio and DTMF input and producing audio and video output. All [?] of the modules defined in this document belong to the VoiceXML 3.0 presentation layer.
This document specifies the VoiceXML 3.0 language as a collection of modules. Each module is described at two levels:
The visual UML state chart diagrams are informative. They are included for ease of reading and quick understanding. The more detailed textual SCXML representations are normative.
It is important to note that this model places no requirement that a VoiceXML interpreter implement behavior exactly as described in the model. Rather, the requirement is that the behavior must be the same as if it were implemented as described; implementations are permitted to use optimizations or a different architecture behind the interpretation of the markup.
The semantic descriptions are important for reasons including the following:
The resources, resource controllers, and the events they generate are intended only to describe the semantics of VoiceXML 3 modules. Implementations are not required to use SCXML to implement VoiceXML 3 modules, nor must they create objects corresponding to resources, resource controllers, and the SCXML events they raise.
The logical SCXML events must be distinguished from the author-visible DOM events that are a mandatory part of the VoiceXML 3 language. Implementations MUST raise these DOM events and process them in the manner described in 4.4 Event Model. The interaction between actual DOM events and logical SCXML events is described in 4.5 Document Initialization and Execution, below.
Each VoiceXML 3.0 module is described using SCXML notation and optionally a UML state chart representation of its underlying behavior expressed in terms of resources and resource controllers. While the resources and resource controllers are not exposed directly in the markup, they are used to define the semantics of VoiceXML 3.0 markup elements. Figure 7 illustrates the relationship among resource controllers, resources, and media devices. The arrows represent events exchanged among components. A more concrete example is represented in Figure 8 which illustrates the Prompt Resource controller (further defined in 6.4.2 Semantics), the PromptQueue Resource, and the SSML Media Player.
In addition to the Resource Controllers associated with elements like <form>, there is a top-level controller associated with the <vxml> element that is responsible for starting processing and deciding which Resource Controller to execute next (i.e., for <form> or another interaction element). The top-level controller also holds session-level properties and is responsible for returning results to the Flow Level when script execution terminates.
The event model for VoiceXML 3.0 builds upon the DOM Level 3 Events [DOM3Events] specification. DOM Level 3 Events offer a robust set of interfaces for managing the listener registration, dispatching, propagation, and handling of events, as well as a description of how events flow through an XML tree.
The DOM 3.0 event model offers VoiceXML developers a rich set of interfaces that allow them to easily add behavior to their applications. In addition, conforming to the standard DOM event model enables authors to integrate their voice applications into next-generation multimodal or multi-namespaced frameworks such as MMI and CDF with minimal effort.
Within the VoiceXML 3.0 semantic model, the DOM Level 3 Events APIs are available to all Resource Controllers that have markup elements associated with them. Indeed, this section covers the eventing APIs as available to VoiceXML 3.0 markup elements. The following section describes how the semantic model ties in with the DOM eventing model.
All VoiceXML 3.0 markup elements implement interfaces that support the following:
The VoiceXML 3.0 Event interface extends the DOM Level 3 Event interface to support voice-specific event information. In particular, the VoiceXML 3.0 Event interface supports a count integer that stores the number of times a resource emits a particular event type. The semantic model manages the count field by incrementing its value and resetting it as described in the section that follows.
Note:
RH: should we expose the count to authors? If so, should we have a special variable like event.count or something similar?

VoiceXML 3.0 markup elements implement the DOM Level 3 EventTarget interface. This interface allows registration and removal of event listeners as well as dispatching of events.
The VoiceXML 3.0 markup elements implement the DOM Level 3 EventListener interface. This interface allows the activation of handlers associated with a particular event. When a listener is activated, the event handler execution is done in the semantic model as described in the section that follows.
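The three interfaces can be pictured with a deliberately miniature model. This sketch is illustrative only: it is not the DOM Level 3 implementation, the capture/bubble phases are omitted, and the initial count value is an assumption (the normative management of count is described above and in the section that follows).

```javascript
// Miniature, non-normative model of the three interfaces.
class V3Event {
  constructor(type) {
    this.type = type;
    this.count = 1;  // assumed initial value: times this type was emitted
  }
}

class V3EventTarget {
  constructor() { this.listeners = []; }
  // EventTarget: listeners are registered in the order encountered,
  // which gives document-order execution within the default group.
  addEventListener(type, handler) {
    this.listeners.push({ type, handler });
  }
  // EventListener activation: each matching handler runs in turn.
  dispatchEvent(event) {
    for (const l of this.listeners) {
      if (l.type === event.type) l.handler(event);
    }
  }
}
```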
[To be updated by Michael Bodell (members only) due April 1 2008]
Events propagate through markup elements as per the DOM event flow. Event listeners may be registered on any of the VoiceXML markup elements.
When processing a VoiceXML 2.0 profile, event listeners are not allowed to be registered for the capture phase, as this contradicts the as-if-by-copy event semantics of VoiceXML 2.0. If a listener is registered with the capture phase set to true in a VoiceXML 2.0 document, an error.event.illegalphase event will be dispatched onto the root document and the listener registration will be ignored (does that sound reasonable to people?).
The DOM Level 3 Event specification supports the notion of partial ordering using the event listener group; all events within a group are ordered. As such, in VoiceXML 3.0, event listeners are registered as they are encountered in the document. Furthermore, all event listeners registered on an element belong to the same default group. Both of these provisions ensure that event handlers will execute in document order.
An event listener is triggered if:
Once an event listener is triggered, the execution is handled by the semantic model as described in the section below. Event propagation blocks until it is notified by the semantic model to proceed.
The VoiceXML 3.0 specification extends the DOM 3 Event specification to support partial name matching on events. VoiceXML 3.0 creates categories of events (the list of categories needs to be specified in the VoiceXML 3.0 spec) and allows authors and the platform to register listeners for either a specific event type or for all events within a particular category or subcategory. For example, VoiceXML 3.0 may create a connection category such as:
{"http://www.example.org/2007/v3","connection"}
The spec may also declare a subcategory of connection or a specific event type that belongs to this category:
{"http://www.example.org/2007/v3","connection.disconnect"}
{"http://www.example.org/2007/v3","connection.disconnect.hangup"}
Following this declaration, the VoiceXML 3.0 Event specification uses partial name matching to associate events propagating through the DOM to listeners registered on the tree. The VoiceXML 3.0 Event specification follows the prefix matching used in VoiceXML 2.0 for associating events with their categories.
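The prefix-matching rule can be sketched as a small predicate. The function name eventMatches is hypothetical, but the behavior mirrors the VoiceXML 2.0 catch semantics the paragraph above refers to: a listener name matches an event type exactly, or as a dot-delimited prefix of it.

```javascript
// Non-normative sketch of VoiceXML 2.0-style prefix matching between a
// registered listener name and a propagating event type. A listener for
// "connection" matches "connection.disconnect.hangup", but a partial
// token like "con" does not match "connection".
function eventMatches(listenerName, eventType) {
  if (listenerName === eventType) return true;
  return eventType.startsWith(listenerName + ".");
}
```

A "*" catch-all, as suggested in the note below, would simply be a special case that matches every event type.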
Note:
It might be useful to introduce the "*" notation to specify a catch for all events irrespective of their type and/or category.

VoiceXML 3.0 interpreters may receive events from external sources, for example SCXML engines. In particular, an interpreter may receive the life cycle events specified as part of the Multimodal Architecture and Interfaces specification [MMI]. These life cycle events allow the flow component of the DFP architecture to control the presentation layer by starting and stopping the processing of markup. By handling these events, the VoiceXML interpreter acts as a 'modality component' in the multimodal architecture, while the flow component acts as an 'interaction manager'. As a result, VoiceXML 3 applications can be easily extended into multimodal applications. However, it is important to note that support for the life cycle events is required by the DFP framework in all applications, whether uni- or multimodal.
The interpreter must handle the following life cycle events automatically:
All other life cycle events and all other external events are ignored unless the External Communications Module 6.13 External Communication Module is included in the profile. If the External Communications Module is present, all other external events are passed up to the application, placed in the application event queue and then handled as specified by the developer using the functionality defined in that module.
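The routing just described can be sketched as follows. The life cycle event names come from the MMI Architecture specification; whether each is handled automatically is still an open issue in this draft (see the editorial note below), and the hasExternalCommModule flag and appQueue parameter are illustrative assumptions, not normative names.

```javascript
// Non-normative sketch of external-event routing. Event names are from
// the MMI Architecture specification; everything else is illustrative.
const LIFE_CYCLE_EVENTS = new Set([
  "PrepareRequest", "StartRequest", "PauseRequest", "ResumeRequest",
  "CancelRequest", "ClearContextRequest", "StatusRequest",
]);

function routeExternalEvent(evt, hasExternalCommModule, appQueue) {
  // Life cycle events are handled automatically by the interpreter.
  if (LIFE_CYCLE_EVENTS.has(evt.name)) return "handled-automatically";
  // Other external events reach the application only if the External
  // Communication Module is part of the profile.
  if (hasExternalCommModule) {
    appQueue.push(evt);
    return "queued";
  }
  return "ignored";
}
```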
Editorial note
Open Issue: Should ClearContextRequest be handled automatically? Should Done be sent automatically when the document is finished? Where do these response events get sent?
VoiceXML 3.0 document initialization takes place over two phases: "DOM Processing" and "Preparation for Execution". Both of these phases assume the required resources have already been created. Any errors in the initialization of the document or the creation of these resources MUST be thrown in the calling context. If that context was a VoiceXML document, then this MUST be an error.badfetch.
Note that while these phases are ordered, and the steps within the phases are ordered, this is only a logical ordering. Implementations are allowed to use a different ordering as long as they behave as if they were following the specified ordering.
The first step in initializing a VoiceXML 3.0 document (root document or child) is generating the Level-3 DOM. This task involves both checking that the document is well-formed XML and performing full schema and syntax validation to ensure proper tag/attribute relationships.
Once complete, the interpreter invokes the semantic constructor for the root <vxml> node in the DOM. In this context, the term "semantic constructor" represents whatever mechanism is used to create the Resource Controllers for a given node. No particular implementation is implied or required. The root <vxml> node constructor is responsible for invoking the constructors for all nodes in the document that have them. When it does this, it will call the semantic constructor routine passing it
Editorial note
Open Issue: we must specify the operation of the root node constructor in more detail as part of the V3 specification. Other people can define modules, but we must specify how they are assembled into a full semantic representation of the application. If there is an application root document specified, the root node constructor will have to construct its RCs as well, by calling its root node constructor. Also, this needs to happen after creation of RCs and before general semantic initialization. After the creation of the RCs is when the mapping from syntax to RCs will occur, and that's when the list would be known.
Note that the initial construction process creates the RCs but does not necessarily fully configure them. Further initialization, including in particular the creation of variables and variable scopes, will happen only when the RCs are activated at runtime (e.g. by visiting a Form). However, at this point the list of children for each element (and thus each RC) is known. For each RC this list of children will populate into the appropriate place in the RC data model before semantic initialization of the RC.
Once the RCs are constructed, they are independent of the DOM, except for the interactions specified below. However, while they are running the RCs often make use of what appears to be syntactic information. For example, the concept of 'next item' relies heavily on document order, while <goto> can take a specific syntactic label as its target. We provide for this by assuming that RCs can maintain a shadow copy of relevant syntactic information, where "shadow copy" is intended to allow a variety of implementations. In particular, platforms may make an actual copy of the information or may maintain pointers back into the DOM. The construction process may create multiple RCs for a given node. In that case, one of the RCs will be marked as the primary RC. It is the one that will be invoked when the flow of control reaches that (shadow) node.
If the document being initialized is a child of a root document, then the root document of that child must fully complete its initialization before the child can be prepared. In other words, the root document must both process its DOM and prepare for execution before child initialization proceeds.
Once in the preparation phase, static properties (i.e., those NOT a function of ECMAScript) are available for lookup. Although this isn't an explicit step, it is mentioned here as this is the first opportunity for their retrieval. Note that even if documentmaxage/stale properties were to be specified in the child document, they would not be available for retrieval when downloading the root document. Rather, these values would be taken from the system defaults or context. For example, consider the case of a first call into a system which lands on a child document called A. The default values for documentmaxage/stale would be used when fetching both this child A and the root document of the child, called A-root. Should A transition to child document B which references root B-root, the <property> values of documentmaxage/stale in A would be used to fetch B. However, the implicit fetch of B-root would use the system defaults for documentmaxage/stale.
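The lookup rule in the A/A-root/B/B-root example can be condensed into a small helper. The function and parameter names here are hypothetical, introduced only to restate the rule: an explicit document fetch uses the executing document's <property> value when one exists, while the implicit fetch of an application root (and any fetch before a document is executing) falls back to the system default.

```javascript
// Non-normative sketch of documentmaxage resolution for a fetch.
// kind: "document" for an explicit document fetch,
//       "root"     for the implicit fetch of that document's root.
// currentDocProps: the executing document's <property> values, or null
//                  on the first fetch of a session.
function maxageForFetch(kind, currentDocProps, systemDefault) {
  if (kind === "root" || currentDocProps === null ||
      currentDocProps.documentmaxage === undefined) {
    return systemDefault;
  }
  return currentDocProps.documentmaxage;
}
```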
With the ability to read <property> values comes the first opportunity to act on any prefetching directives supplied by the application. Prefetching is an optional step, and may be postponed temporarily or indefinitely. The only requirement on a conformant processor is that prefetching cannot take place before this step.
Next, document-level variables and scripts are initialized in document order. Note that conformant processors MUST NOT locally handle any semantic errors generated during this step. Such errors MUST be thrown to the calling document or context (e.g. error.badfetch), because the present document is not yet fully initialized and thus cannot reliably handle errors locally.
The final step in preparation is for the controller to select the first <form> to execute. If either the local controller is malformed or the optional URI fragment points to a non-existent <form>, an error MUST be generated in the calling document or context (e.g. error.badfetch). A conformant processor MUST NOT handle this error locally.
After initialization, the semantic control flow does a <goto> to the initial Resource Controller. Once a RC is running, it invokes Resources and other RCs by sending them events. The DOM is not involved in this process. At various points in the processing, however, an RC may decide to raise an author-visible event. It does this by creating an event targeted at a specific DOM node and sending it back to the DOM. When the DOM receives the event, it performs the standard bubble/capture cycle with the target specified in the event. In the course of the bubble/capture cycle, various event handlers may fire. Their execution is a semantic action and occurs back in the semantic 'side' of the environment. The DOM sends messages back to the appropriate semantic objects to cause this to happen. Note that this means that the DOM must store some sort of link to the appropriate RCs. The event handlers may update the data model, execute script, or raise other DOM events. When the handler finishes processing on the semantic side, it sends a notification back to the DOM so that it can resume the bubble/capture phase. (N.B. This notification is NOT a DOM event.) When the DOM finishes the bubble/capture processing of the event, it sends a notification back to the RC that raised the event so that it can continue processing.
Editorial note | |
Open Issue: Is this notification a standard semantic event? Note that RC processing must pause during the bubble/capture phase to avoid concurrency problems. |
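The bubble/capture round trip between an RC and the DOM described above can be sketched as follows. This is a minimal, non-normative Python sketch; the class names, handler signature, and log strings are illustrative assumptions, not part of the specification. The RC raises an event targeted at a DOM node, the DOM walks the capture and bubble phases (handlers execute back on the semantic side), and a completion notification, which is not itself a DOM event, is returned to the raising RC.

```python
class DomNode:
    """A shadow DOM node with an optional author-supplied event handler."""
    def __init__(self, name, parent=None, handler=None):
        self.name, self.parent, self.handler = name, parent, handler

    def path_from_root(self):
        chain, node = [], self
        while node:
            chain.append(node)
            node = node.parent
        return list(reversed(chain))  # root ... target

def dispatch(target, event, log):
    """Capture phase (root -> target), then bubble phase (target -> root)."""
    chain = target.path_from_root()
    for node in chain:                               # capture
        if node.handler:
            node.handler(event, "capture", log)       # runs on the semantic side
    for node in reversed(chain):                     # bubble
        if node.handler:
            node.handler(event, "bubble", log)
    log.append("dom:done")  # notification back to the raising RC (NOT a DOM event)

log = []
root = DomNode("vxml")
form = DomNode("form", parent=root,
               handler=lambda ev, phase, log: log.append(f"form:{phase}:{ev}"))
field = DomNode("field", parent=form)
dispatch(field, "noinput", log)
# the RC that raised 'noinput' resumes only after "dom:done" arrives
```

Note that, consistent with the open issue above, RC processing pauses until the final notification arrives, which avoids concurrency problems during the bubble/capture phase.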
A subdialog has a completely separate context from the invoking application. Thus it has a separate DOM and a separate set of RCs. However, it shares the same set of Resources, since they are global. When a subdialog is entered, the Datamodel Resource will have to create a new scope for the subdialog and hide the calling document's scopes. When the subdialog is exited, the Datamodel Resource will destroy the subdialog scope(s) and restore the calling document's scope(s).
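The scope hiding and restoring described above can be sketched as follows. This is an illustrative, non-normative Python sketch; the `ScopeStack` class and its method names are assumptions chosen for the example, not a specified API.

```python
class ScopeStack:
    """Sketch of a Datamodel Resource's scope stack across subdialog calls."""
    def __init__(self):
        self.stack = [("Global", {})]   # single "Global" scope at initialization
        self.hidden = []                # caller frames saved across subdialog calls

    def enter_subdialog(self):
        self.hidden.append(self.stack)      # hide the calling document's scopes
        self.stack = [("subdialog", {})]    # fresh scope for the subdialog

    def exit_subdialog(self):
        # destroy the subdialog scope(s), restore the caller's scope(s)
        self.stack = self.hidden.pop()

    def visible_scopes(self):
        return [name for name, _ in self.stack]

dm = ScopeStack()
dm.enter_subdialog()
assert dm.visible_scopes() == ["subdialog"]   # caller's scopes are hidden
dm.exit_subdialog()
assert dm.visible_scopes() == ["Global"]
```

Because the hidden frames form a stack, the sketch also handles subdialogs invoked from within subdialogs.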
To handle event propagation from the leaf application to the application root document, we create a Document Manager to handle all communication between the documents. This means that the DOMs of the two documents remain separate. When an event is not handled in the leaf document, the Document Manager will propagate it to the application root, where it will be targeted at the <vxml> node. Requests to fetch properties or to activate grammars will be handled by the Document Manager in a similar fashion. To handle platform- and/or language-level defaults, we will create a "super-root" document above the application root. The Document Manager will pass it events and requests that are not handled in the root document. If the root and super-root documents do not handle an event, the Document Manager will ensure that the event is thrown away.
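The propagation chain above amounts to a chain of responsibility: leaf, then application root, then super-root, with unhandled events discarded. The following non-normative Python sketch illustrates this; the class names and the sample event names are illustrative assumptions.

```python
class Document:
    """A document that handles a fixed set of event names (illustrative)."""
    def __init__(self, name, handles=()):
        self.name, self.handles = name, set(handles)

    def handle(self, event):
        return event in self.handles

class DocumentManager:
    """Routes unhandled events leaf -> root -> super-root, else discards them."""
    def __init__(self, leaf, root, super_root):
        self.chain = [leaf, root, super_root]   # DOMs stay separate; one manager

    def dispatch(self, event):
        for doc in self.chain:
            if doc.handle(event):
                return doc.name    # name of the document that handled the event
        return None                # not handled anywhere: thrown away

mgr = DocumentManager(
    Document("leaf", {"noinput"}),
    Document("root", {"error.badfetch"}),
    Document("super-root", {"connection.disconnect.hangup"}))
```

Property fetches and grammar activation requests would follow the same chain, per the paragraph above.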
There seem to be four kinds of interactions between RCs and the DOM at runtime:
Editorial note | |
Open Issue: DOM Modification. There are two possibilities: 1) we can refuse to allow the DOM to be modified (or ignore the modifications if it is) 2) we can reconstruct the relevant resource controllers when the DOM is modified. In the latter case, the straightforward approach would be: a) find the least node that is an ancestor of all the changes and that has a constructor b) call its constructor as during initialization, using the current state of the DOM and RCs as context. |
Transition controllers provide the basis for managing flow in VoiceXML applications. Resource controllers for some elements like <form> have associated transition controllers which influence how form items get selected and executed. In addition to form, there is a transition controller for each of the following higher VoiceXML scopes:
Whenever a form item or form or document finishes execution, the relevant transition controller is consulted for selecting the subsequent one for execution. To find the relevant transition controller, begin at the current resource controller and navigate along the associated VoiceXML element's parent axis until you reach a resource controller with an associated transition controller. For example, for a form item, the parent form's resource controller has the relevant transition controller.
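The lookup rule above can be sketched as a walk up the parent axis. This is an illustrative, non-normative Python sketch; the `RC` class shape and the controller names are assumptions for the example.

```python
class RC:
    """A resource controller with an optional associated transition controller."""
    def __init__(self, element, parent=None, transition_controller=None):
        self.element = element
        self.parent = parent                          # parent-axis link
        self.transition_controller = transition_controller

def relevant_transition_controller(rc):
    """Walk from the current RC along the parent axis until an RC with an
    associated transition controller is found."""
    node = rc
    while node is not None:
        if node.transition_controller is not None:
            return node.transition_controller
        node = node.parent
    return None

doc = RC("vxml", transition_controller="document-tc")
form = RC("form", parent=doc, transition_controller="form-tc")
field = RC("field", parent=form)   # form items have no TC of their own
```

For the `field` RC, the walk finds the parent form's transition controller, matching the form-item example in the text.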
When a transition controller runs to completion, control is returned to the next higher transition controller along with any results that need to be passed up. For example, when the last form item in a form is filled, the transition controller associated with the form returns control to the document-level transition controller along with the results for the filled form. Control may be returned to the parent transition controller both under these run-to-completion semantics and under error semantics.
This section describes semantic models for common VoiceXML resources. Resources have a life cycle of creation and destruction. Specific resources may specify detailed requirements on these phases. All resources must be created prior to their use by a VoiceXML interpreter.
Editorial note | |
Standard lifecycle events are expected to be defined in later versions: create event: from idle to created; destroy event: from created to idle. |
Resources are defined in terms of a state model and the events a resource processes within defined states. Events may be divided into those defined by the resource itself and those defined by other conceptual entities, which the resource receives or sends within these states. These conceptual entities include resource controllers and a 'device' which provides an implementation of the services defined by the resource.
The semantic model is specified in both UML state chart diagrams and SCXML representations. In case of ambiguity, the SCXML representation takes precedence over the UML diagrams. Note that SCXML is used here to define the states and events for resources; this definitional usage should not be confused with the use of SCXML to specify application flow (see 3.2 Flow). Furthermore, these resource events are conceptual, not DOM events: they are used to define the relationship with other conceptual entities and are not exposed at the markup level. The relationship between conceptual events and DOM events is described in XXX.
The following resources are defined: data model (5.1 Datamodel Resource), prompt queue (5.2 Prompt Queue Resource) and DTMF and ASR recognition (5.3 Recognition Resources).
[Later versions will define the following resources: recorder, SIV. Later versions may define the following resources: session recorder, ...]
Editorial note | |
Later versions of this document will clarify that different datamodels may be instanced, such as ECMAScript, XML, etc. Conformance requirements will be stated at a later stage. |
The datamodel is a repository for both user- and system-defined data and properties. To simplify variable lookup, we define the datamodel with a synchronous function-call API, rather than an asynchronous one based on events. The data model API does not assume any particular underlying representation of the data or any specific access language, thus allowing implementations to plug in different concrete data model languages.
There is a single global data model that is created when the system is first initialized. Access to data is controlled by means of scopes, which are stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "Global" is created. Thereafter scopes are explicitly created and destroyed by the data model's clients.
Editorial note | |
Resource and Resource controller description to be updated with API calls rather than events. |
Function | Arguments | Return Value | Sequencing | Description |
CreateScope | name (optional) | Success or Failure | | Creates a new scope object and pushes it on top of the scope stack. If no name is provided, the scope is anonymous and may be accessed only while it is on the top of the scope stack. A Failure status is returned if a scope already exists with the specified name. |
DeleteScope | name (optional) | Success or Failure | | Removes a scope from the scope stack. If no name is provided, the topmost scope is removed; otherwise the scope with the provided name is removed. A Failure status is returned if the stack is empty or no scope with the specified name exists. |
CreateVariable | variableName, value (optional), scopeName (optional) | Success or Failure | | Creates a variable. If scopeName is not specified, the variable is created in the topmost scope on the scope stack. If no value is provided, the variable is created with the default value specified by the underlying datamodel. A Failure status is returned if a variable of the same name already exists in the specified scope. |
DeleteVariable | variableName, scopeName (optional) | Success or Failure | | Deletes the variable with the specified name from the specified scope. If no scopeName is provided, the variable is deleted from the topmost scope on the stack. A Failure status is returned if no variable with the specified name exists in the scope. |
UpdateVariable | variableName, newValue, scopeName (optional) | Success or Failure | | Assigns a new value to the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. A Failure status is returned if the specified variable or scope cannot be found. |
ReadVariable | variableName, scopeName (optional) | value | | Returns the value of the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. An error is raised if the specified variable or scope cannot be found. |
EvaluateExpression | expr, scopeName (optional) | value | | Evaluates the specified expression and returns its value. If scopeName is not specified, the expression is evaluated in the topmost scope on the stack. An error is raised if the specified scope cannot be found. |
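The synchronous API in the table above can be illustrated with an executable sketch. This is non-normative: it assumes a simple dictionary-backed store, uses Python-style method names mirroring the table's function names, and omits EvaluateExpression (which depends on the concrete datamodel language).

```python
class Datamodel:
    """Sketch of the datamodel resource's scope stack and variable API."""
    def __init__(self):
        self.stack = [("Global", {})]    # single "Global" scope at initialization

    def _find(self, scope_name):
        if scope_name is None:
            return self.stack[-1] if self.stack else None   # default: topmost
        for frame in reversed(self.stack):
            if frame[0] == scope_name:
                return frame
        return None

    def create_scope(self, name=None):
        if name is not None and self._find(name) is not None:
            return "Failure"             # scope with this name already exists
        self.stack.append((name, {}))
        return "Success"

    def delete_scope(self, name=None):
        if not self.stack:
            return "Failure"             # stack is empty
        frame = self._find(name)
        if frame is None:
            return "Failure"             # no scope with the specified name
        self.stack = [f for f in self.stack if f is not frame]
        return "Success"

    def create_variable(self, var, value=None, scope=None):
        frame = self._find(scope)
        if frame is None or var in frame[1]:
            return "Failure"             # missing scope or duplicate variable
        frame[1][var] = value            # None stands in for the datamodel default
        return "Success"

    def update_variable(self, var, value, scope=None):
        frame = self._find(scope)
        if frame is None or var not in frame[1]:
            return "Failure"
        frame[1][var] = value
        return "Success"

    def read_variable(self, var, scope=None):
        frame = self._find(scope)
        if frame is None or var not in frame[1]:
            raise KeyError(var)          # the table specifies an error here
        return frame[1][var]
```

Per the table, scope and variable mutations report Success/Failure, while reads raise an error when the variable or scope cannot be found.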
Issue ():
Do we need event listeners on the data model, e.g., to notify when the value of a variable changes?
Resolution:
None recorded.
Here is a UML representation of the prompt queue. This state machine assumes that "queue" and "play" are separate commands and that a separate "play" will always be issued to trigger the play. When the "play" is issued, the system plays any queued prompts, up to and including the first fetch audio in the queue. Then it halts and waits for another "play" command, even if there are additional prompts or fetch audio in the queue.
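The effect of a single "play" command on the queue can be sketched as follows. This is a non-normative illustration; the dictionary shape of queue items and the helper name are assumptions for the example.

```python
def play_once(queue):
    """Remove and return the items a single 'play' command would render:
    everything up to and including the first fetch audio item."""
    played = []
    while queue:
        item = queue.pop(0)
        played.append(item)
        if item.get("fetchaudio"):
            break                 # halt after the first fetch audio in the queue
    return played

queue = [{"audio": "welcome.wav"},
         {"audio": "menu.ssml"},
         {"audio": "hold-music.wav", "fetchaudio": True},
         {"audio": "next-prompt.wav"}]
first = play_once(queue)   # plays up to and including the fetch audio
# queue still holds 'next-prompt.wav', awaiting another 'play' command
```

With no fetch audio queued, a single "play" drains the whole queue.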
Editorial note | |
Open issue: Can queued prompt commands, either audio or TTS, be left un-fetched or un-rendered until a play command is issued to the prompt resource? This may result in delays or gaps in the production of the actual audio, as the rendering or fetching may not produce playable audio fast enough to avoid inter-prompt delays. |
The prompt structure assumed here is fairly abstract. It consists of a specification of the audio along with optional parameters controlling playback (for example, speed or volume). The audio may be presented in-line, as SSML or some other markup language, or as a pointer to a file or streaming audio source. Logically, URLs are dereferenced at the time the prompt is queued, but implementations are not required to fetch the actual media until the prompt in question is sent to the player device. Note that the player device is assumed to be able to handle both recorded prompts and TTS, and to be able to interpret SSML. Platforms are free to optimize their implementations as long as they conform to the state machine specified here. In particular, platforms may prefetch audio or begin TTS processing in the background before the prompt is sent to the player device. For applications that make use of VCR controls (speed up, skip forward, etc.), actual performance may depend on whether the platform has implemented such optimizations. For example, a request to skip forward on a platform that does not prefetch prompts may result in a long delay. Such performance issues are outside the scope of this specification.
This diagram assumes that SSML mark information is delivered in the Player.Done event, and that the player returns a Player.Done event when it is sent a 'halt' event (otherwise mark information would get lost on barge-in and hangup, etc).
Note that the "FetchAudio" state is shown stubbed out for reasons of space, and is expanded in a separate diagram below the main one.
Figure X: Prompt Queue Model
Figure Y: Fetch audio Model
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="queue"/>
    <data name="markName"/>
    <data name="markTime"/>
    <data name="bargeInType"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <transition event="QueuePrompt">
      <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/prompt"/>
    </transition>
    <transition event="QueueFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
          <else/>
          <assign loc="$node[@bargeInType]" val="unbargeable"/>
        </if>
      </foreach>
      <insert pos="after" name="datamodel/data[@name='queue']/prompt" val="_eventData/audio"/>
    </transition>
    <transition event="setParameter">
      <send target="player" event="setParameter" namelist="_eventData.paramName, _eventData.newValue"/>
    </transition>
    <transition event="Cancel" target="Idle">
      <send target="player" event="halt"/>
      <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      <delete loc="datamodel/data[@name='queue']/prompt"/>
    </transition>
    <transition event="CancelFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
        </if>
      </foreach>
    </transition>
    <state id="Idle">
      <onentry>
        <assign loc="/datamodel/data[@name='markName']" val=""/>
        <assign loc="/datamodel/data[@name='markTime']" val="-1"/>
        <assign loc="/datamodel/data[@name='bargeInType']" val=""/>
      </onentry>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'false'" target="PlayingPrompt"/>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state>
    <state id="PlayingPrompt">
      <datamodel>
        <data name="currentPrompt"/>
      </datamodel>
      <onentry>
        <assign loc="/datamodel/data[@name='currentPrompt']/prompt" val="/datamodel/data[@name='queue']/prompt[1]"/>
        <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
        <if cond="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType] != /datamodel/data[@name='bargeInType']">
          <send event="BargeInChange" namelist="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
          <assign loc="/datamodel/data[@name='bargeInType']" expr="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
        </if>
      </onentry>
      <invoke targettype="player" srcexpr="/datamodel/data[@name='currentPrompt']/prompt"/>
      <finalize>
        <if cond="_eventData/MarkTime neq '-1'">
          <assign name="/datamodel/data[@name='markName']" val="_eventData/markName.text()"/>
          <assign name="/datamodel/data[@name='markTime']" val="_eventData/markTime.text()"/>
        </if>
      </finalize>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[last()] le '1'" target="Idle">
        <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      </transition>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] neq 'true'" target="PlayingPrompt"/>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state> <!-- end PlayingPrompt -->
    <state id="FetchAudio">
      <initial id="WaitFetchAudio"/>
      <transition event="player.Done" target="FetchAudioFinal"/>
      <state id="WaitFetchAudio">
        <onentry>
          <send target="self" event="fetchAudioDelay" delay="/datamodel/data[@name='queue']/prompt[1][@fetchaudiodelay]"/>
        </onentry>
        <transition event="fetchAudioDelay" target="StartFetchAudio"/>
        <transition event="cancelFetchAudio" target="FetchAudioFinal"/>
      </state>
      <state id="StartFetchAudio">
        <datamodel>
          <data name="fetchAudio"/>
        </datamodel>
        <onentry>
          <assign loc="/datamodel/data[@name='fetchAudio']" expr="/datamodel/data[@name='queue']/prompt[1]"/>
          <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
          <send target="self" event="fetchAudioMin" delay="/datamodel/data[@name='fetchAudio'][@fetchaudiominimum]"/>
          <send target="player" event="Play" namelist="/datamodel/data[@name='fetchAudio']"/>
          <if cond="/datamodel/data[@name='bargeInType'].text() ne 'fetchAudio'">
            <send event="BargeInChange" namelist="fetchAudio"/>
          </if>
        </onentry>
        <transition event="CancelFetchAudio" target="WaitFetchMinimum"/>
        <transition event="fetchAudioMin" target="WaitFetchCancel"/>
      </state>
      <state id="WaitFetchMinimum">
        <transition event="fetchAudioMin" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="WaitFetchCancel">
        <transition event="CancelFetchAudio" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="FetchAudioFinal" final="true"/> <!-- could put cleanup handling here -->
    </state> <!-- end FetchAudio -->
  </state> <!-- end Created -->
</scxml>
The prompt queue resource can be controlled by means of the following events:
Event | Source | Payload | Sequencing | Description |
queuePrompt | any | prompt (M), properties (O) | | adds prompt to queue, but does not cause it to be played |
queueFetchAudio | any | prompt (M) | | adds fetch audio to queue, removing any existing fetch audio from the queue. Does not cause it to be played. |
play | any | | | Causes any queued prompts or fetch audio to be played |
changeParameter | any | paramName, newValue | | Sets the value of paramName to newValue, which may be either an absolute or relative value. The new setting takes effect immediately, even if there is already a prompt playing. |
cancelFetchAudio | any | | | Deletes any queued fetch audio. Also cancels any fetch audio that is already playing, unless fetchAudioMin has been specified and not yet reached. |
cancel | any | | | Immediately cancels any prompt or fetch audio that is playing and clears the queue. |
The prompt queue resource returns the following events to its invoker:
Event | Target | Payload | Sequencing | Description |
prompt.Done | controller | markName (O), markTime (O) | | Indicates the prompt queue has played to completion and is now empty |
bargeintypeChange | controller | one of: unbargeable, hotword, energy, fetchAudio | | sent at start of prompt play and whenever a new prompt or fetch audio is played whose bargeinType differs from the preceding one. |
Issue ():
Do we need 'fetchAudio' as a distinct bargein type?
Resolution:
None recorded.
The prompt queue receives the following events from the underlying player:
Event | Payload | Sequencing | Description |
player.Done | | | Sent whenever a single prompt or piece of fetch audio finishes playing. |
and sends the following events to the underlying device:
Event | Payload | Sequencing | Description |
play | prompt (M) | | sent to the platform to cause a single prompt to be played. |
setParameter | paramName (M), value (O) | | sent to the platform to change the value of a playback parameter such as speed or volume. The new value may be absolute or relative. The change takes effect immediately. |
Issue ():
Differences in PromptQueue Definition: see Details (members only).
Resolution:
None recorded.
Three types of recognition resources are defined: DTMF recognition for recognition of DTMF input, ASR recognition for recognition of speech input, and SIV for speaker identification and verification. Each recognition resource is associated with a device which implements its respective recognition services. Each device represents one or more actual recognizer instances. In the case of a device implemented with multiple recognizers - for example, two different speech recognition engines - it is the responsibility of the interpreter implementation to ensure that they adhere to the semantic model defined in this section.
DTMF and ASR recognition resources and SIV resources are semantically similar. They share the same state and eventing model as well as recognition processing, timing and result handling. However, the resources differ in the following respects:
Otherwise, these resources share the same semantic model.
If a resource controller activates both DTMF and ASR recognition resources, then that resource controller is responsible for managing the resources so that only a single recognition result is produced per recognition cycle. If a resource controller activates ASR and SIV resources, it may produce multiple results, either timed so that they arrive within the same cycle or delivered independently.
The recognition resource is defined in terms of its created state: grammars (or a voice model) are added to the resource and subsequently prepared on the device, recognition with these grammars (or voice model) can be activated and suspended, and recognition results are returned.
When the recognition resource is ready to recognize (at least one active grammar and/or voice model), one or more recognition cycles may occur in sequence.
Thus a recognition resource may enter multiple recognition cycles (as required for 'hotword' recognition), while requiring that a device, even if it has multiple instantiations, only produces one set of recognition results per recognition cycle.
The recognition resource is defined in terms of a data model and state model.
The data model is composed of the following elements:
Editorial note | |
List of properties for active grammars needs to be aligned with the list of items sent in the AddGrammar event. |
The state model is composed of states corresponding to functional states: idle, preparing grammars / preparing voice model, ready to recognize, recognizing, suspended recognizing, and waiting for results.
In the idle state, the resource awaits events from resource controllers to activate grammars or a voice model for recognition on the device. The data model - activeGrammars or activeVoiceModel, properties, controller and mode - is (re-)initialized upon entry to this state: activeGrammars and activeVoiceModel are cleared, and properties and controller are set to null. If the resource receives an 'addGrammar' event, a new item is added to activeGrammars using the grammar, properties and listener data in the event payload. If the resource receives a 'prepare' event, it updates its data model with the event data: 'properties' is updated with the properties event data and 'controller' with the controller event data. Subsequent event notifications and responses are sent to the resource controller identified as the 'controller'. The recognition resource then moves into the preparing grammars (or preparing voice model) state.
In the preparing grammars state, the resource behavior depends on whether activeGrammars is empty or not. If activeGrammars is empty (i.e. no active grammars are defined for this recognition resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeGrammars is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'grammars' and 'properties' parameters. The 'grammars' value is an ordered list where each list item is a grammar's content and its properties extracted from activeGrammars. The order of grammars in the 'grammars' parameter must follow the order in the activeGrammars data model. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.
In the preparing voice models state, the resource behavior depends on whether activeVoiceModel is empty or not. If activeVoiceModel is empty (i.e. voice model is not defined for this resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeVoiceModel is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'voicemodel' and 'properties' parameters. The 'voicemodel' value is a URI to the voicemodel, and its properties are extracted from activeVoiceModel. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.
When the recognition resource is in a ready to recognize state, it may receive a 'stop' event. In this case, the resource sends a 'stop' event to the device, and returns to the idle state. If the resource receives a 'listen' event, it sends a 'listen' event to the device and moves into the recognizing state.
When the resource is in a recognizing state, it can toggle between this state and a suspended recognizing state by receiving further 'listen' and 'suspend' events. If the resource receives a 'suspend' event, then it moves into the suspended recognizing state and sends the device a 'suspend' event which causes the device to suspend recognition and delete any buffered input. No input is buffered while the device is in a suspended state. If the resource then receives a 'listen' event, it moves back into the recognizing state.
When in the recognizing state, the resource may receive an 'inputStarted' event from the device, indicating that user input has been detected. The resource then moves into a waiting for results state. The device may send an 'error' event (for example, if the maximum time has been exceeded), causing the resource to return to the idle state and send the controller an 'error' event. Alternatively, the device may send a 'recoResults' event, which contains a results parameter: a data structure representing recognition results. In the case of DTMF or ASR, the results can be in VoiceXML 2.0 or EMMA format. For SIV, the results must be in EMMA format. The structure may contain zero or more recognition results. Each result must specify the grammar (or voicemodel) associated with the recognition (using the same grammar/voicemodel name as used in the payload of the 'prepare' event), its recognition confidence and its input mode. The resource sends its controller a 'recoResults' event with event data containing the device's results parameter together with a listener parameter whose value is the listener associated with the grammar of the first result with the highest confidence (if there are no results, the listener parameter is not defined). The resource then returns to the ready to recognize state, awaiting either a 'stop' event to terminate recognition or a 'listen' event to start another recognition cycle using the same active grammars and recognition properties.
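The listener-selection rule just described (the listener associated with the grammar of the first result with the highest confidence) can be sketched as follows. This is a non-normative illustration; the result and listener structures are assumptions chosen for the example.

```python
def select_listener(results, listeners):
    """results: list of dicts with 'grammar' and 'confidence' keys;
    listeners: grammar name -> listener registered via addGrammar.
    Returns the listener for the first result with the highest confidence,
    or None when there are no results (listener parameter undefined)."""
    if not results:
        return None
    # max confidence wins; among ties, the earlier (first) result wins
    best = max(range(len(results)),
               key=lambda i: (results[i]["confidence"], -i))
    return listeners[results[best]["grammar"]]

listeners = {"yesno": "field-rc-1", "digits": "field-rc-2"}
results = [{"grammar": "digits", "confidence": 0.72},
           {"grammar": "yesno", "confidence": 0.91},
           {"grammar": "digits", "confidence": 0.91}]
```

Here the second and third results tie on confidence, so the earlier one (the 'yesno' grammar) determines the listener.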
A recognition resource is defined by the events it receives:
Event | Source | Payload | Sequencing | Description |
addGrammar | any | grammar (M), listener (M), properties (O) | | creates a grammar item composed of the grammar, listener and properties, and adds it to activeGrammars |
addVoiceModel | any | voicemodel (M), listener (M), properties (O) | | creates a voice model item composed of the voice model, listener and properties, and adds it to activeVoiceModel |
prepare | any | controller (M), properties (M) | | prepares the device for recognition using activeGrammars or activeVoiceModel and properties |
listen | any | | | initiates/resumes recognition |
suspend | any | | | suspends recognition |
stop | any | | | terminates recognition |
and the events it sends:
Editorial note | |
Need to add a prepareGrammar event (from the grammar RC). |
Event | Target | Payload | Sequencing | Description |
prepared | controller | one-of: prepared, notPrepared | | positive response to prepare (activeGrammars or activeVoiceModel prepared) |
notPrepared | controller | one-of: prepared, notPrepared | | negative response to prepare (no activeGrammars or activeVoiceModel defined) |
inputStarted | controller | | | notification that the onset of input has been detected |
inputFinished | controller | | | notification that the end of input has been detected |
partialResult | controller | results (M), listener (O) | | notification of a partial recognition result |
recoResult | controller | results (M), listener (O) | | notification of a complete recognition result, including the results structure and a listener |
error | controller | error status (M) | | notification that an error has occurred |
The resource receives from the recognition device the following events:
Event | Payload | Sequencing | Description |
prepared | | | response to prepare indicating that activeGrammars or activeVoiceModel have been successfully prepared |
inputStarted | | | notification that the onset of input has been detected |
inputFinished | | | notification that the end of input has been detected |
partialResults | results (M) | | notification of partial recognition results |
recoResults | results (M) | | notification of final recognition results |
error | error status (M) | | an error occurred |
and sends to the recognition device the following events:
Event | Payload | Sequencing | Description |
prepare | grammars (M) or voicemodel (M), properties (M) | | the recognition or SIV device is prepared with grammars and properties |
clear | | | all grammars and properties in the recognition device are to be cleared |
listen | | | recognition is to be initiated |
suspend | | | recognition is to be suspended |
stop | | | recognition is to be stopped |
The state model for an ASR recognition resource is shown in Figure 9. The DTMF resource model differs only in that the value for the mode data is 'dtmf' instead of 'voice'.
[generalize stop event returning resource to idle state ...]
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="activeGrammars"/>
    <data name="properties"/>
    <data name="controller"/>
    <data name="mode"/>
  </datamodel>
  <state id="Created">
    <initial id="idle"/>
    <state id="idle">
      <onentry>
        <foreach var="node" nodeset="datamodel/data[@name='activeGrammars']">
          <delete loc="$node"/>
        </foreach>
        <assign loc="/datamodel/data[@name='properties']" val="null"/>
        <assign loc="/datamodel/data[@name='controller']" val="null"/>
        <assign loc="/datamodel/data[@name='mode']" val="voice"/>
      </onentry>
      <transition event="AddGrammar">
        <datamodel>
          <data name="gram"/>
        </datamodel>
        <assign name="/datamodel/data[@name='gram']/grammar" expr="_eventData/grammar"/>
        <assign name="/datamodel/data[@name='gram']/properties" expr="_eventData/properties"/>
        <assign name="/datamodel/data[@name='gram']/listener" expr="_eventData/listener"/>
        <insert pos="after" name="datamodel/data[@name='activeGrammars']" val="gram"/>
      </transition>
      <transition event="prepare" target="preparingGrammars">
        <assign loc="/datamodel/data[@name='properties']" expr="_eventData/properties"/>
        <assign loc="/datamodel/data[@name='controller']" expr="_eventData/controller"/>
      </transition>
    </state> <!-- end idle -->
    <state id="preparingGrammars">
      <onentry>
        <if cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'false'">
          <send target="device" event="dev:clear"/>
          <send target="device" event="dev:prepare" namelist="/datamodel/data[@name='activeGrammars'], /datamodel/data[@name='properties']"/>
        </if>
      </onentry>
      <transition cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'true'" target="idle">
        <send target="controller" event="notPrepared"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
      <transition event="dev:prepared" target="readyToRecognize">
        <send target="controller" event="prepared"/>
      </transition>
    </state> <!-- end preparingGrammars -->
    <state id="readyToRecognize">
      <transition event="listen" target="recognizing"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state> <!-- end readyToRecognize -->
    <state id="recognizing">
      <onentry>
        <send target="device" event="dev:listen"/>
      </onentry>
      <transition event="suspend" target="suspendedRecognizing"/>
      <transition event="dev:inputStarted" target="waitingForResult"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state> <!-- end recognizing -->
    <state id="suspendedRecognizing">
      <onentry>
        <send target="device" event="dev:suspend"/>
      </onentry>
      <transition event="listen" target="recognizing"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state> <!-- end suspendedRecognizing -->
    <state id="waitingForResult">
      <onentry>
        <send target="controller" event="inputStarted"/>
      </onentry>
      <transition event="dev:inputFinished">
        <send target="controller" event="inputFinished"/>
      </transition>
      <transition event="dev:partialResult">
        <send target="controller" event="partialResult" namelist="_eventData/results, _eventData/grammar/listener"/>
      </transition>
      <transition event="dev:recoResults" target="readyToRecognize">
        <send target="controller" event="recoResult" namelist="_eventData/results, _eventData/grammar/listener"/>
      </transition>
      <transition event="dev:error" target="idle">
        <send target="controller" event="error" namelist="_eventData/errorStatus"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state> <!-- end waitingForResult -->
  </state> <!-- end Created -->
</scxml>
The working group plans to define an SIV (Speaker Identification and Verification) resource within this document. The group currently expects it to have the following characteristics:
A connection resource is an entity that establishes/relinquishes a connection between the VoiceXML interpreter context and the user. It can be used by the interpreter context to establish a connection with the user at the start of the dialog or to disconnect an existing connection either implicitly (when there are no more elements to execute in the VoiceXML application) or explicitly (when the interpreter context encounters a <disconnect> element or the user chooses to terminate the call).
Whenever an interpreter context disconnects from the user, it may enter the final processing state, as is the case when executing a <disconnect> element or when the user actively disconnects. The purpose of this state is to allow the VoiceXML application to perform any necessary final cleanup, such as submitting information to the application server. In this state the application must not enter the waiting state; it can, however, navigate from one page to another without attempting to interact with the user. Therefore, the application should not enter <field>, <record>, or <transfer> during the final processing state. The VoiceXML interpreter must exit if the VoiceXML application attempts to enter the waiting state while in the final processing state.
Aside from this restriction, execution of the VoiceXML application continues normally while in the final processing state. Thus for example the application may transition between documents while in the final processing state, and the interpreter must exit if no form item is eligible to be selected.
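As an illustrative sketch of final processing (the URI and variable names are hypothetical), an application might log call data after the caller hangs up; the catch handler runs in the final processing state, where no user interaction is permitted but <submit> is still allowed:

```xml
<!-- Hypothetical example: cleanup after the caller disconnects.
     "duration" and "callResult" are assumed application variables. -->
<catch event="connection.disconnect.hangup">
  <submit next="http://example.com/logCall" namelist="duration callResult"/>
</catch>
```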
Event | Source | Payload | Sequencing | Description |
prepare | | | | Prepare to connect the user with the interpreter context |
connect | | | | Connect the user to the interpreter context |
disconnect | any | | | Disconnect the user from the interpreter context |
Event | Target | Payload | Sequencing | Description |
prepared | | | | Connection prepared |
connected | | | | Interpreter context connected to the user |
disconnected | any | | | Interpreter context disconnected from the user |
The timer resource tracks timers for various resource controllers. A timer can be set to send a timeout event at some future time; a timer that has been set may also be canceled.
A timer resource is defined by the events it receives:
Event | Source | Payload | Sequencing | Description |
start | any | owner (M), timeout (M), handle (O) | | Starts a new timer that, after the timeout interval elapses, fires a timerExpired event to the owner. The handle is used to correlate timers when cancel is supported. |
cancel | any | owner (M), handle (O) | | Cancels any previously set timer that matches the handle and owner. If no timer was set (or the timer has already fired), cancel still succeeds: once cancel has succeeded, the owner will not receive a timerExpired event. |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
timerExpired | controller | handle (O) | timerExpired must precede any cancelSuccess | This event means the timer has fired. The controller that receives the event is the owner of the start event. If the handle was passed in to the start event then it will be passed back when the timer expires |
cancelSuccess | controller | handle (O) | timerExpired must precede any cancelSuccess | This event means that the timer in question is cancelled and no new timerExpired events may be received |
It is possible to receive both a timerExpired event and then a cancelSuccess event, as the events may have crossed paths.
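As an illustrative sketch (the "timer" target address and the datamodel variable names are assumptions; the event and payload names follow the tables above), a resource controller might set and later cancel a timeout like this:

```xml
<!-- Set a timer: owner identifies the controller to notify, timeout is
     the interval, and the optional handle correlates a later cancel. -->
<send target="timer" event="start" namelist="owner timeout handle"/>

<!-- Input arrived in time, so cancel the pending timeout.
     A timerExpired may still arrive if the events crossed paths. -->
<send target="timer" event="cancel" namelist="owner handle"/>
```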
In VoiceXML 3.0, the language is partitioned into independent modules which can be combined in various ways. In addition to the modules defined in this section, it is also possible for third parties to define their own modules (see Section XXX).
Each module is assigned a schema, which defines its syntax; one or more Resource Controllers (RCs), which define its semantics; and a "constructor" that knows how to create the RCs from the syntactic representation at initialization time. Only DOM nodes that have schemas and constructors (and hence RCs) assigned to them can be modules in VoiceXML 3.0. However, we may choose to define constructors and RCs for nodes that are not modules. Nodes that do not have constructors and RCs ultimately depend on some module for their interpretation. (Those modules are usually ancestor nodes, but we do not require this.) There can be multiple modules associated with the same VoiceXML element: they may set properties differently, add different child elements, and so on. In many cases, some of the modules will be extensions of the others, but we do not require this.
Note there is not necessarily a one-to-one relationship between semantic RCs and syntactic markup elements. It may take several RCs to implement the functionality of a single markup element.
This module describes the syntactic and semantic features of a <grammar> element which defines grammars used in ASR and DTMF recognition. Grammars defined via this module are used by other modules.
The attributes and content model of <grammar> are specified in 6.1.1 Syntax. Its semantics are specified in 6.1.2 Semantics.
Editorial note | |
Issue: Grammar processing will need to know the Base URI to resolve relative references. |
[See XXX for schema definitions].
The <grammar> element has the attributes specified in Table 16.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
mode | The only allowed values are "voice" and "dtmf" | Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. | No | The value of the document property "grammarmode" |
weight | Weights are simple positive floating point values without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or many digits. | Specifies the weight of the grammar. See vxml2: Section 3.1.1.3 | No | 1.0 |
fetchhint | One of the values "safe" or "prefetch" | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. | No | None |
fetchtimeout | Time Designation | The interval to wait for the content to be returned before throwing an error.badfetch event. | No | None |
maxage | An unsigned integer | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. | No | None |
maxstale | An unsigned integer | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. | No | None |
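As a sketch of how the attributes in Table 16 combine (the attribute values shown are illustrative, and the element content is only hinted at here, since it is defined by other modules):

```xml
<!-- A voice-mode grammar, prefetched, weighted above the default 1.0,
     with fetch properties per Table 16. -->
<grammar mode="voice" weight="1.2" fetchhint="prefetch"
         fetchtimeout="10s" maxage="60">
  <!-- Content is supplied by a child module: either an inline SRGS
       grammar (6.2) or an <externalgrammar> reference (6.3). -->
</grammar>
```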
Editorial note | |
The default value of the "grammarmode" document property (see XXXX) is "voice". |
The content model of <grammar> consists of exactly one of:
The grammar RC is the primary RC for the <grammar> element.
The grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the grammar RC first initializes its child.
Next, the language, charset, and encoding parameters are set to the values in effect at this point in the document. If the fetchhint parameter value is "prefetch", the RC sends the Prefetch event to the DTMF or ASR Recognizer resource, as appropriate (see below), with the following data: the child RC, fetchtimeout, maxage, maxstale. Finally, the RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, when the grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state,
If the child RC is an External Grammar, the grammar RC sends an 'execute' event to the child RC and waits for it to complete.
Then, the grammar RC sends an AddGrammar event to the DTMF Recognizer Resource if mode="dtmf" or to the ASR Recognizer Resource if mode="voice", with the following as event data: the child RC, the fetchhint, language, charset, and encoding parameter values, and the controller RC (e.g., link, field, or form) as the handler for recognition results.
Finally, the grammar RC sends the controller an executed event and transitions to the Ready state.
Editorial note | |
The currently-active values of the fetchhint, fetchtimeout, maxage, and maxstale properties may differ at execution time from their values at initialization, so these values should be determined by the RCs that initialize or execute the grammar RC rather than by the grammar RC itself. The document text needs to be updated to reflect this. (From Nov 2009 f2f) Initializing: validate the behavior of sending a pointer to the child RC to the ASR resource. Is this acceptable, or do we need to extract the grammar data from the child RC and then send that data? The advantage of sending the RC pointer is that it makes clear what kind of grammar info it is: inline SRGS or external reference. Execute issues:
Editor will write new section 4.5 "Other" and subsections 4.5.1 "property/attribute resolution" and 4.5.2 "language resolution". Depending on the text, we may need to update the semantics to refer to section 4.5.2 when describing how xml:lang is used. |
The Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Adds the grammar to the appropriate Recognition Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The external events sent and received by the Grammar RC are those defined in this table:
Event | Source | Target | Description |
addGrammar | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Adds grammar to list of currently active grammars |
Prefetch | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Requests that the grammar be fetched/compiled in advance, if possible |
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="child"/>
    <data id="content"/> <!-- not used -->
    <data id="properties"/>
    <data id="mode"/>
    <data id="fetchhint"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry> <!-- ISSUE: Is this needed? -->
        <assign location="$controller" expr="null"/>
        <assign location="$child" expr="null"/>
        <assign location="$content" expr="null"/>
        <assign location="$properties/weight" expr="1.0"/>
        <assign location="$properties/fetchtimeout"/>
        <assign location="$properties/maxage"/>
        <assign location="$properties/maxstale"/>
        <assign location="$properties/charset"/>
        <assign location="$properties/encoding"/>
        <assign location="$properties/language"/>
        <assign location="$mode"/>
        <assign location="$fetchhint"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$child" expr="_eventData/child"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="Initializing">
      <onentry> <!-- ISSUE: complete the initialization -->
        <assign location="$properties/weight" expr="1.0"/>
        <assign location="$properties/fetchtimeout" expr="InitDefault();"/>
        <assign location="$properties/maxage" expr="InitDefault();"/>
        <assign location="$properties/maxstale" expr="InitDefault();"/>
        <assign location="$properties/charset" expr="InitDefault();"/>
        <assign location="$properties/encoding" expr="InitDefault();"/>
        <assign location="$properties/language" expr="InitDefault();"/>
        <assign location="$mode" expr="InitDefault();"/>
        <assign location="$fetchhint" expr="InitDefault();"/>
        <send target="$child/controller" event="initialize" namelist="$child"/>
      </onentry>
      <transition event="Initializing.error" target="Idle">
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" target="Ready">
        <if cond="$fetchhint eq 'prefetch'">
          <if cond="$mode eq 'voice'">
            <send target="ASRRecognizer" event="Prefetch"
                  namelist="$child $properties/fetchtimeout $properties/maxage $properties/maxstale"/>
          <else/>
            <send target="DTMFRecognizer" event="Prefetch"
                  namelist="$child $properties/fetchtimeout $properties/maxage $properties/maxstale"/>
          </if>
        </if>
        <send target="controller" event="initialized"/>
      </transition>
    </state> <!-- end Initializing -->
    <state id="Ready">
      <transition event="execute" target="Executing"/>
    </state> <!-- end Ready -->
    <state id="Executing">
      <onentry> <!-- ISSUE: Initialization function to be completed -->
        <assign location="$properties/fetchtimeout" expr="InitDefault();"/>
        <assign location="$properties/maxage" expr="InitDefault();"/>
        <assign location="$properties/maxstale" expr="InitDefault();"/>
        <!-- ISSUE: Add condition if $child is externalgrammar element -->
        <if cond="???">
          <send target="$child/controller" event="execute" namelist="$child"/>
        <else/>
          <!-- ISSUE: Missing in send: controller RC (e.g. link, field, etc.) as the handler for recognition results -->
          <if cond="$mode eq 'voice'">
            <send target="ASRRecognizer" event="AddGrammar"
                  namelist="$child $fetchhint $properties/language $properties/charset $properties/encoding"/>
          <else/>
            <send target="DTMFRecognizer" event="AddGrammar"
                  namelist="$child $fetchhint $properties/language $properties/charset $properties/encoding"/>
          </if>
          <send target="controller" event="executed"/>
        </if>
      </onentry>
      <transition event="Executing.done">
        <!-- ISSUE: Missing in send: controller RC (e.g. link, field, etc.) as the handler for recognition results -->
        <if cond="$mode eq 'voice'">
          <send target="ASRRecognizer" event="AddGrammar"
                namelist="$child $fetchhint $properties/language $properties/charset $properties/encoding"/>
        <else/>
          <send target="DTMFRecognizer" event="AddGrammar"
                namelist="$child $fetchhint $properties/language $properties/charset $properties/encoding"/>
        </if>
        <send target="controller" event="executed"/>
      </transition>
    </state> <!-- end Executing -->
  </state> <!-- end Created -->
</scxml>
The events in this table may be raised during initialization and execution of the <grammar> element.
Event | Description | State |
---|---|---|
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Note that additional errors may occur when the grammar is fetched or added by the ASR or DTMF resource. Please check there for details.
This module describes the syntactic and semantic features of inline SRGS grammars used in ASR and DTMF recognition.
Editorial note | |
Issue: Do we need to support inline ABNF SRGS?: |
The attributes and content model of Inline SRGS grammars are specified in 6.2.1 Syntax. Its semantics are specified in 6.2.2 Semantics.
[See XXX for schema definitions].
The syntax of the Inline SRGS Grammar Module is precisely all of the XML markup for a legal stand-alone XML form grammar as described in SRGS ([SRGS]), minus the XML Prolog. Note that both elements and attributes must be in the SRGS namespace (http://www.w3.org/2001/06/grammar).
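As a hypothetical sketch (the wrapping <grammar> element and the use of a namespace prefix to distinguish SRGS elements from VoiceXML elements are assumptions for illustration), an inline XML-form SRGS grammar without an XML prolog might look like:

```xml
<!-- Hypothetical example: inline SRGS content inside <grammar>;
     all SRGS elements are in the http://www.w3.org/2001/06/grammar
     namespace, as required. -->
<grammar mode="voice">
  <g:grammar xmlns:g="http://www.w3.org/2001/06/grammar"
             version="1.0" root="drink">
    <g:rule id="drink">
      <g:one-of>
        <g:item>coffee</g:item>
        <g:item>tea</g:item>
      </g:one-of>
    </g:rule>
  </g:grammar>
</grammar>
```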
The Inline SRGS grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
Editorial note | |
Should the contents of the grammar parameter be parsed rather than the raw document text? For example, should it be the DOM representation of the grammar, or just the XML Info set, or what? |
The grammar RC's state model consists of the following states: Idle, Initializing, and Ready. Unlike most of the other modules, this module is primarily a data model for storing a grammar. The module itself has no execution semantics.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the syntactic contents of the grammar are saved into the grammar parameter. The RC sends the controller an 'initialized' event and transitions to the Ready state.
The Inline SRGS Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
This module describes the syntactic and semantic features of an <externalgrammar> element which defines external grammars used in ASR and DTMF recognition.
Editorial note | |
The name of this element is still under discussion. |
The attributes and content model of <externalgrammar> are specified in 6.3.1 Syntax. Its semantics are specified in 6.3.2 Semantics.
[See XXX for schema definitions].
The <externalgrammar> element has the attributes specified in Table 23.
Name | Type | Description | Required | Default Value | |||||||||||||||||||||
---|---|---|---|---|
src | anyURI | The URI specifying the location of the grammar and optionally a rulename within that grammar, if it is external. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS] but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. | No | ||||||||||||||||||||||
srcexpr | A data model expression | Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. | No | ||||||||||||||||||||||
type | string | The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media type via the type attribute. When the content represented by a URI is available in multiple data formats, a VoiceXML platform may use the preferred media type to influence which of the formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media type to order the preferences in the negotiation. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types: the declared media type is the asserted value for the resource, and the actual media type is the true format of its content. The actual media type should be the same as the declared media type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. The declared media type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media type is authoritative. Three special cases may arise: the declared media type may not be supported by the processor, in which case the platform throws error.unsupported.format; the declared media type may be supported but the actual media type may not match, in which case the platform throws error.badfetch; finally, there may be no declared media type, in which case the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616], section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. | No | None |
Editorial note | |
Error messages for "type" attribute need to be updated. |
See 6.3.1.2 Content Model for restrictions on occurrence of src and srcexpr attributes.
The value of the src attribute is a URI specifying the location of the grammar with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on a grammar element in VoiceXML:
The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 3.
The <externalgrammar> element has the following co-occurrence constraints:
Editorial note | |
Editor: please remove the "otherwise, an error.badfetch ..." from the above and all other co-occurrence text and write general text somewhere describing what happens when a co-occurrence constraint is violated. |
The External Grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The External Grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, when the External Grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, if the srcexpr variable is set it is evaluated against the data model as a data model expression, and the value is placed into the src variable; if srcexpr cannot be evaluated, an error.semantic event is thrown. Otherwise, the RC sends an 'executed' event to the controller RC and transitions into the Ready state.
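As a hypothetical sketch of the srcexpr semantics described above (the URI, the `lang` data model variable, and the expression syntax are illustrative assumptions), the grammar URI can be chosen dynamically at execution time:

```xml
<!-- Hypothetical example: srcexpr is evaluated against the data model
     during execution, and its value is placed into src; if evaluation
     fails, error.semantic is thrown. -->
<externalgrammar type="application/srgs+xml"
                 srcexpr="'http://example.com/grammars/' + lang + '.grxml'"/>
```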
The External Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Evaluates srcexpr and populates the src variable |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The events that may be raised during initialization and execution of the <externalgrammar> element are those defined in Table 27 below.
Event | Description | State |
---|---|---|
error.semantic | indicates that there was an error in the evaluation of the srcexpr attribute. | execution |
This module defines the syntactic and semantic features of a <prompt> element which controls media output. The content model of this element is empty: content is defined in other modules which extend this element's content model (for example 6.5 Builtin SSML Module, 6.6 Media Module and 6.7 Parseq Module).
The attributes and content model of <prompt> are specified in 6.4.1 Syntax. Its semantics are specified in 6.4.2 Semantics, including how the final prompt content is determined and how the prompt is queued for playback using the PromptQueue Resource (5.2 Prompt Queue Resource).
[See XXX for schema definitions].
The <prompt> element has the attributes specified in Table 28.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
bargein | boolean | Controls whether the prompt can be interrupted. | No | bargein property |
bargeintype | string | On prompts that can be interrupted, determines the type of bargein, either 'speech' or 'hotword'. | No | bargeintype property |
cond | data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the prompt to be played. | No | true |
count | positive integer | A number indicating the repetition count, allowing a prompt to be activated or not depending on the current repetition count. | No | 1 |
timeout | Time Designation | The time to wait for user input. | No | timeout property |
xml:lang | string | The language identifier for the prompt. | No | document's "xml:lang" attribute |
xml:base | string | Declares the base URI from which relative URIs in the prompt are resolved. | No | document's "xml:base" attribute |
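As a sketch of how the attributes in Table 28 combine (the attribute values are illustrative; the text content shown is a placeholder, since the content model of <prompt> itself is empty and actual content is supplied by extending modules such as 6.5 Builtin SSML Module):

```xml
<!-- A prompt played only on the second attempt, with hotword bargein
     and a 10 s input timeout. -->
<prompt count="2" bargein="true" bargeintype="hotword"
        timeout="10s" xml:lang="en-US">
  Please say the name of the person you are calling.
</prompt>
```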
The prompt RC is the primary RC for the <prompt> element.
The prompt RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The prompt RC's state model consists of the following states: Idle, Initializing, Ready, FormReady, and Executing. The initial state is the Idle state.
While in the Idle state, the prompt RC may receive an 'initialize' event, whose controller event data is used to update the data model. The prompt RC then transitions into Initializing state.
In the Initializing state, the prompt RC initializes its children: this is modeled as a separate RC (see XXX). The children may return an error for initialization. If a child sends an error, then the prompt RC returns an error. When all children are initialized, the prompt RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, the prompt RC can receive a 'checkStatus' event to check whether this prompt is eligible for execution or not. The value of the cond parameter in its data model is checked against the data model resource: the status is true if the value of the cond parameter evaluates to true. The status, together with its count data, is sent in a 'checkedStatus' event to the controller RC. The controller RC then determines if the prompt is selected for execution ([vxml20: 4.1.6], see PromptSelectionRC, Section XXX). The prompt RC will then transition to the FormReady state. If the prompt RC receives an 'execute' event and the cond parameter evaluates to true, it transitions to the Executing state; if the cond parameter evaluates to false, it will send the controller the executed event and stay in the Ready state.
In the FormReady State, if the prompt RC receives a 'checkStatus' event, it will again check the cond parameter and send the 'checkedStatus' event to the controller RC as in the Ready State. In this state, if the RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, the prompt RC sends an evaluate event to its children. Each child returns either an error, or content (which may include parameters) for playback. If a child sends an error, then the prompt RC returns an error. Once evaluation is complete, the RC sends a queuePrompt event to the Prompt Queue Resource with the <prompt> parameters (bargein, bargeintype, timeout) with event data consisting of the list of content returned by its children. The prompt RC then sends the controller an executed event and transitions to the Ready state.
Editorial note | |
SSML validation issue: what if evaluation results in a non-valid structure? |
The Prompt RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
checkStatus | controller | | causes evaluation of the cond parameter against the data model |
execute | controller | | causes the evaluation of its content and conversion to a format suitable for queueing on the PromptQueue Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
checkedStatus | controller | status (M), count (M) | response to checkStatus event with count parameter and status indicating evaluation of cond parameter |
executed | controller | | response to execute event indicating that it has been successfully executed |
Table 31 shows the events exchanged between the prompt RC and the resources and other RCs which define those events.
Event | Source | Target | Description |
evaluate | PromptRC | DataModel | used to evaluate the cond parameter (see XXX) |
queuePrompt | PromptRC | PromptQueue | adds prompt content and properties to the Prompt Queue (see XXX) |
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="content"/>
    <data id="properties"/>
    <data id="count"/>
    <data id="cond"/>
    <data id="xml:lang"/>
    <data id="xml:base"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$children" expr="null"/>
        <assign location="$content" expr="null"/>
        <assign location="$properties/bargein" expr="true"/>
        <assign location="$properties/bargeintype" expr="speech"/>
        <assign location="$properties/timeout" expr="5s"/>
        <assign location="$count" expr="1"/>
        <assign location="$cond" expr="true"/>
        <assign location="$xml:lang" expr=""/>
        <assign location="$xml:base" expr=""/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$children" expr="_eventData/children"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign location="$childcounter" expr="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize" namelist="$child/child"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error" target="Idle">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state> <!-- end Initializing -->
    <state id="Ready">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus" target="FormReady">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus" namelist="$status, $count"/>
      </transition>
      <transition event="execute" cond="checkStatus() eq 'true'" target="Executing"/>
      <transition event="execute" cond="checkStatus() eq 'false'">
        <send target="controller" event="executed"/>
      </transition>
    </state> <!-- end Ready -->
    <state id="FormReady">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus" namelist="$status, $count"/>
      </transition>
      <transition event="execute" target="Executing"/>
    </state> <!-- end FormReady -->
    <state id="Executing">
      <datamodel>
        <data id="prompt"/>
      </datamodel>
      <onentry>
        <assign location="$counter" expr="0"/>
        <assign location="$child_return" expr="null"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="evaluateChild"/>
        </foreach>
      </onentry>
      <transition event="Executing.done">
        <assign location="$counter" expr="$counter + 1"/>
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
      </transition>
      <transition event="Executing.error" target="Idle">
        <send target="controller" event="Executing.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Executing.done" cond="$counter eq $children.size()-1" target="Ready">
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
        <send target="PromptQueue" event="queuePrompt" namelist="$prompt, $properties"/>
        <send target="controller" event="executed"/>
      </transition>
    </state> <!-- end Executing -->
  </state> <!-- end Created -->
</scxml>
The events in Table 32 may be raised during initialization and execution of the <prompt> element.
Event | Description | State |
---|---|---|
error.unsupported.language | indicates that an unsupported language was encountered. The unsupported language is indicated in the event message variable. | execution |
error.unsupported.element | indicates that the element within the <prompt> element is not supported | initialization |
error.badfetch | indicates that the prompt content is malformed ... | initialization, execution |
error.noresource | indicates that a Prompt Queue resource is not available for rendering the prompt content. | execution |
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Editorial note | |
The relationship between the user visible events defined in the above table, and semantic event model has yet to be defined. Can we really determine whether errors are raised in initialization (syntax) or execution (evaluation) states? How does this fit in with errors returned when prompts are played in PromptQueue player implementation? ACTION: Clarify which specific cases are affected by 'error.badfetch' ambiguity re. initialization versus execution states. Clarify that error.semantic doesn't apply to evaluation of src/expr with <audio> (e.g. fallback). Clarify that errors are recorded? (vxml21??) Should media control properties (e.g. clipBegin, speed, etc) of <media> be also available on <prompt>? We should clarify where the error.badfetch gets thrown. For instance, if we are loading a document with malformed prompt elements, the error.badfetch may get thrown back to the calling document. If we are throwing error.badfetch during execution, then it will be thrown back to the malformed document itself? |
This module describes the syntactic and semantic features of SSML elements built into VoiceXML.
This module is designed to extend the content model of the <prompt> element defined in 6.4 Prompt Module.
The attributes and content model of SSML elements are specified in 6.5.1 Syntax. Their semantics are specified in 6.5.2 Semantics, including how elements are evaluated to yield final content for playback.
[See XXX for schema definitions].
This module defines an SSML ([SSML]) Conforming Speech Synthesis Markup Language Fragment where:
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
fetchtimeout | | See fetchtimeout definition | No | fetchtimeout property |
fetchhint | | See fetchhint definition | No | audiofetchhint property |
maxage | | See maxage definition | No | audiomaxage property |
maxstale | | See maxstale definition | No | audiomaxstale property |
expr | | A data model expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded (see Record Module) or evaluate to the URI of an audio resource to fetch. | No | undefined |
Exactly one of "src" or "expr" attributes must be specified; otherwise, an error.badfetch event is thrown.
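For instance, both of the following are valid (a non-normative sketch; the URI and variable name are illustrative):

```xml
<!-- static URI -->
<audio src="welcome.wav">Welcome</audio>
<!-- data model expression evaluating to a URI -->
<audio expr="greetingUri">Welcome</audio>
```

Specifying both attributes on the same element, or neither, would throw error.badfetch.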
Editorial note | |
SSML 1.1 required for fetching attributes like fetchtimeout? Or profile dependent? Support for 'say-as' extension to SSML 1.0? Support for <enumerate>? Note that profiles specify which media formats are required |
When the RC receives an evaluate event, its children are evaluated in order to return an SSML Conforming Stand-Alone Speech Synthesis Markup Language Document which can be processed by a Conforming Speech Synthesis Markup Language Processor.
Evaluation comprises:
Editorial note | |
We may want to refine the description that the output of evaluation is an SSML Document. One rationale is that we don't want to prohibit that SSML extensions are lost during evaluation. The output may be another Fragment rather than a Document. Clarify exact nature of <audio> expr value for skipping - undefined vs. null? Need to specify further error cases Do these elements have RCs? They are in the VoiceXML namespace but are just enhanced SSML elements. Need to clarify unsupported languages and external (e.g. MRCP) SSML processors. |
In this example
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
evaluation returns a sequence of content for each item in <foreach> with <audio> and <value> elements.
Assume that the array consists of 2 items, where the items' audio properties evaluate to 'one.wav' and 'two.wav' respectively, and their tts properties evaluate to 'one' and 'two' respectively. Evaluation of <foreach> is equivalent to the following:
<prompt>
  <audio expr="'one.wav'"><value expr="'one'"/></audio>
  <break time="300ms"/>
  <audio expr="'two.wav'"><value expr="'two'"/></audio>
  <break time="300ms"/>
</prompt>
further evaluation of the <audio> and <value> elements results in
<prompt>
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</prompt>
and finally the prompt content is converted into a stand-alone SSML document (assuming the <prompt>'s xml:lang attribute evaluates to 'en'):
<speak version="1.0" xml:lang="en" xmlns="http://www.w3.org/2001/10/synthesis">
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</speak>
This content is queued and played using the PromptQueue: each audio URI, or fallback content, is played, followed by a 300 millisecond break.
The media module defines the syntax and semantics of the <media> element.
The module is designed to extend the content model of <prompt> in the prompt module (6.4 Prompt Module).
The <media> element can be seen as an enhanced and generalized version of the VoiceXML <audio> element. It is enhanced in that it provides additional attributes describing the type of media, conditional selection, and control over playback. It is a generalization of the <audio> element in that it permits media other than audio to be played; for example, media formats which contain audio and video tracks.
[See XXX for schema definitions].
The <media> element has the attributes specified in Table 34.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
src | | The URI specifying the location of the media source. | No | None |
srcexpr | | A data model expression which evaluates to a URI indicating the location of the media resource. | No | undefined |
cond | | A data model expression that must evaluate to true after conversion to boolean in order for the media to be played. | No | true |
type | | The preferred media type of the output resource. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types: the declared media-type is the asserted value for the resource, and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'audio/x-wav' or 'video/3gpp' resource). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative. Three special cases may arise. | No | undefined |
clipBegin | Time Designation | offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. | No | 0s |
clipEnd | Time Designation | offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. | No | None |
repeatDur | Time Designation | total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. | No | None |
repeatCount | positive Real number | number of iterations of media to render. A fractional value describes a portion of the rendered media. | No | 1 |
soundLevel | signed ("+" or "-") CSS2 numbers immediately followed by "dB" | Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10 (a1 / a0) A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. | No | +0.0dB |
speed | x% (where x is a positive real value) | the speed at which to play the referenced media, relative to the original speed. The speed is set to the requested percentage of the speed of the original media. For audio, a change in the speed will change the rate at which recorded samples are played back and this will affect the pitch. | No | 100% |
outputmodes | space separated list of media types | Determines the modes used for media output. See 8.2.4 Media Properties for further details. | No | outputmodes property |
See occurrence constraints for restrictions on occurrence of src and srcexpr attributes.
Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration where
Note that not all SMIL 2.1 Timing features are supported.
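As an illustrative, non-normative sketch of these calculations (the resource URI is arbitrary): with clipBegin="2s" and clipEnd="7s", the clipped simple duration is 5 seconds; repeatCount="2" then yields an active duration of 10 seconds. Adding repeatDur="8s" would instead cap the active duration at 8 seconds, truncating the second iteration.

```xml
<!-- clipped duration: 7s - 2s = 5s; two iterations: active duration 10s -->
<media type="audio/x-wav" src="http://www.example.com/resource.wav"
       clipBegin="2s" clipEnd="7s" repeatCount="2"/>
```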
Editorial note | |
Use SMIL 3.0 or SMIL 2.1 reference? should trimming and media attributes also be defined in <prompt>? do we need expr values for type, clipBegin, clipEnd, repeatDur, repeatCount, etc? (Perhaps add implied expr for every attribute?) when is a property evaluation error thrown? Add fetchtimeout, fetchhint, maxage and maxstale attributes Major attribute candidate: errormode (flexible error handling which controls whether errors are thrown or fallback is used). Other candidate attributes: id/idref (use case?) |
The <media> element content model consists of:
The <media> element has the following co-occurrence constraints:
Note that the type attribute does not affect inline content. The handling of inline XML content is in accordance with the namespace of the root element (such as SSML <speak>, SMIL <smil>, and so forth). CDATA, or mixed content with VoiceXML <foreach> or <value> elements, must be treated as an SSML Fragment and evaluated as described in 6.6.2 Semantics.
Editorial note | |
Permit other types of inline content apart from SSML? Are child <property> elements necessary? Alternative: extended <prompt> so that <property> children are allowed? |
Developers should be aware that there may be performance implications when using <media> depending on which attributes are specified, the media itself, its transport and processing.
Since operations such as trimming, soundLevel, and speed modification are applied to the media, the SSML processor must begin generating output audio before these operations are applied. If the clipBegin attribute is specified, this may require SSML generation of audio prior to clipBegin, depending on the implementation. This may lead to a gap between execution of the <media> element and the start of playback.
If the media is fetched with the HTTP protocol and the clipBegin attribute is specified then, unless the resource is cached locally, the part of the media resource before clipBegin will still be fetched from the origin server. This may result in a gap between the execution of the <media> element and playback actually beginning.
Note also that if <media> uses the RTSP protocol, and the VoiceXML platform supports this protocol, then the clipBegin attribute value may be mapped to the RTSP Range header field, thereby reducing the gap between element execution and the onset of playback.
When a media RC receives an evaluate event, the following operations are performed:
The resulting media resource is returned together with resolved media operation properties (clipBegin, clipEnd, soundLevel, speed, outputmodes).
Editorial note | |
Semantics needs to address a mixed content model; e.g. CDATA and XML elements as children of the root. Do we require 'application/ssml+xml' type with SSML and CDATA content? Need to clarify where resource fetching takes place in the semantic model. Eg. in prompt initializing or executing state? or in prompt queue? This approach assumes the prompt queue applies media processing operations. Intended to fit with the VCR/RTC approach. What about streaming cases? Allow streams to be returned? Specify how errors are addressed. |
Playback of external audio media resource.
<media type="audio/x-wav" src="http://www.example.com/resource.wav"/>
Application of media operations to an audio resource. The soundLevel setting of '+6.0dB' plays the media at approximately twice its current signal amplitude, and the speed is reduced to 50%.
<media type="audio/x-wav" soundLevel="+6.0dB" speed="50%" src="http://www.example.com/resource.wav"/>
Playback of 3GPP media resource.
<media type="video/3gpp" src="http://www.example.com/resource.3gp"/>
Playback of 3GPP media resource with the speed doubled and playback ending after 5 seconds.
<media type="video/3gpp" clipEnd="5s" speed="200%" src="http://www.example.com/resource.3gp"/>
Playback of external SSML document.
<media type="application/ssml+xml" src="http://www.example.com/resource.ssml"/>
Inline CDATA content with a <value> element
<media> Ich bin ein Berliner, said <value expr="speaker"/> </media>
which is syntactically equivalent to
<media>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner, said <value expr="speaker"/>
  </speak>
</media>
Inline SSML content to which gain and clipping operations are applied.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
</media>
Inline SSML with audio media fallback.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
  <media type="audio/x-wav" src="ichbineinberliner.wav"/>
</media>
This module defines the syntax and semantics of <par> and <seq> elements. The <par> element specifies playback of media in parallel, while <seq> specifies playback in sequence.
The module is designed to extend the content model of the <prompt> element (6.4 Prompt Module).
This module is dependent upon the media module (6.6 Media Module).
With connections which support multiple media streams, it is possible to play back multiple media types simultaneously. For media container formats like 3GPP, audio and video media can be generated simultaneously from the same media resource.
There are established use cases for simultaneous playback of multiple media which are specified in separate resources:
The intention is to provide support for basic use cases where audio or TTS output from one resource can be complemented with output from another resource, as permitted by the connection and platform capabilities.
The <par> element is derived from SMIL <par> element, a time container for parallel output of media resources. Media elements (or containers) within a <par> element are played back in parallel.
Editorial note | |
SMIL reference should be added in B References. SMIL is Synchronized Multimedia Integration Language (SMIL). Reference to SMIL 1.0 (or later) Specification. |
The <par> element has the attributes specified in Table 35.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
endsync | | Indicates when the element is considered complete. 'first' indicates that the element is complete when any media (or container) child reports that it is complete; 'last' indicates it is complete when all media children are complete. | No | last |
The content model of <par> consists of:
The <seq> element is derived from the SMIL <seq> element, a time container for sequential output of media resources. Media elements within a <seq> element are played back in sequence.
No attributes are defined for <seq>.
The content model of <seq> consists of:
Editorial note | |
Issue: how should parallel playback interact with the PromptQueue resource? The simplest assumption would be that if this module is supported, then prompt queue needs to be able to handle parallel playback. For example when bargein event happens during the parallel execution, the synchronization between both prompt and for example video play should be handled. This information should be explained in the prompt queue resource section. |
This module requires a PromptQueue resource which supports playback of parallel and sequential media. The following defines its playback completion, termination, and error handling.
Completion of playback of the <par> element is determined according to the value of its endsync attribute. For instance, assume a <par> element containing <media> (or <seq>) elements A and B, and that B finishes before A. If endsync has the value first, then completion is reported upon B's completion. If endsync has the value last, then completion is reported upon A's completion.
Completion of playback of the <seq> element occurs when the last <media> is complete.
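The endsync completion behavior described above can be sketched as follows (non-normative; the URIs are illustrative). Assuming the commentary audio is shorter than the avatar video, this <par> element reports completion as soon as the audio finishes:

```xml
<par endsync="first">
  <media type="audio/x-wav" src="commentary.wav"/>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
```

With endsync="last" (the default), completion would instead be reported only when the video also finishes.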
If the <par> element playback is terminated, then playback of its <media> and <seq> children is terminated. Likewise, if the <seq> element playback is terminated, then playback of its (active) <media> elements is terminated.
If mark information is provided by <media> elements (for example with SSML), then the mark information associated with the last element played in sequence or parallel is exposed as described in XXX.
Editorial note | |
Open issue: Clarify interaction with VCR media control model(s). <reposition> approach would require that <par> and <seq> need to be able to restart from a specific position indicated by the markname/time of a <media> element contained within them. RTC approach would require that for <par>, media operations are applied in parallel. |
Error handling policy is inherited from the element of which the <par> or <seq> element is a child.
For instance if the policy is to ignore errors, then the following applies:
If the policy is to terminate playback and report the error, then any error causes immediate termination of any playback and the error is reported.
If execution of the <par> and <seq> elements requires media capabilities which are not supported by the platform or the connection, or there is an error fetching or playing any <media> element within <par> or <seq>, then error handling follows the defined policy.
A video avatar with audio commentary. Note the use of the outputmodes attribute of <media> to ensure that only video is played.
<par>
  <media type="audio/x-wav" src="commentary.wav"/>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
A video avatar with a sequence of audio and TTS commentary.
<par>
  <seq>
    <media type="audio/x-wav" src="intro.wav"/>
    <media type="application/ssml+xml" src="commentary.ssml"/>
  </seq>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
This module describes the syntactic and semantic features of the <foreach> element.
This module is designed to extend the content model of an element in another module. For example, SSML elements in the 6.5 Builtin SSML Module, the <prompt> element defined in 6.4 Prompt Module, etc.
The attributes and content model of the element are specified in 6.8.1 Syntax. Its semantics are specified in 6.8.2 Semantics.
[See XXX for schema definitions].
The <foreach> element has the attributes specified in Table 36.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
array | | A data model expression that must evaluate to an array; otherwise, an error.semantic event is thrown. Note that the <foreach> element operates on a shallow copy of the array specified by the array attribute. | Yes | |
item | | A data model variable that stores each array item upon each iteration of the loop. A new variable will be declared if it is not already defined within the parent's scope. | Yes | |
Both "array" and "item" must be specified; otherwise, an error.badfetch event is thrown.
The iteration process starts from an index of 0 and increments by one to an index of array_name.length - 1, where array_name is the name of the shallow copied array operated on by the <foreach> element. For each index, a shallow copy or reference to the corresponding array element is assigned to the item variable (i.e. <foreach> assignment is equivalent to item = array_name[index] in ECMAScript); the assigned value could be undefined for a sparse array. Undefined array items are ignored.
VoiceXML 3.0 does not provide break functionality to interrupt a <foreach>.
Editorial note | |
Clarify that array items which evaluate to ECMAScript undefined are ignored? |
When the RC receives an evaluate event, the RC loops through the array to produce an evaluated content for each item in the array.
Editorial note | |
These examples may be moved to the respective profile section later. |
The vxml21 profile defines the content model for the <foreach> element so that it may appear in executable content and within <prompt> elements.
Within executable content, except within a <prompt>, the <foreach> element may contain any elements of executable content; this introduces basic looping functionality by which executable content may be repeated for each element of an array.
When <foreach> appears within a <prompt> element as part of Builtin SSML content, it may contain only those elements valid within <enumerate> (i.e. the same elements allowed within <prompt> less <meta>, <metadata>, and <lexicon>); this allows for sophisticated concatenation of prompts.
In this example using Builtin SSML, each item in the array has an audio property with a URI value, and a tts property with SSML content. The element loops through the array, playing the audio URI or the SSML content as fallback, with a 300 millisecond break between each iteration.
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
In the mediaserver profile, <foreach> may occur within <prompt> elements and has a content model of zero or more <media> elements.
Play each media resource in the array.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.audio"/>
</foreach>
Play each media resource in the array, with inline SSML as fallback.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.wav">
    <media type="application/ssml+xml">
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
        <value expr="item.tts"/>
        <break time="300ms"/>
      </speak>
    </media>
  </media>
</foreach>
Forms are the key component of VoiceXML documents. A form contains:
id | The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">. |
---|---|
scope | The default scope of the form's grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, given a form in a non-root document with the default scope "dialog" and a form grammar with the scope "document", that grammar is active during any dialog in the same document. |
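The precedence rule for grammar scope can be sketched as follows (a non-normative example; the form id, grammar URI, and field name are illustrative). The form's default scope is "dialog", but the form-level grammar declares scope "document" and therefore remains active during other dialogs in the same document:

```xml
<form id="weather" scope="dialog">
  <!-- scope on the grammar takes precedence over the form's default scope -->
  <grammar src="weather.grxml" scope="document"/>
  <field name="city">
    <prompt>Which city?</prompt>
  </field>
</form>
```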
The Form RC is the primary RC for the <form> element.
The Form RC interacts with resource controllers of other modules so as to provide the behavior of the VoiceXML 2.1/2.0 <form> tag. Input and control form items are modeled as resource controllers: for example, the <field> RC (6.10.2.1 Field RC) of the Field Module.
The behavior of the Form RC follows the VoiceXML FIA, although some aspects are not modeled directly in this RC: external transition handling is not part of the form RC; input items use separate RCs to manage coordination between media resources, while recognition results can be received directly by form, field or other RCs.
[This initial version does not address all aspects of FIA behavior; for example, event handling, error handling and external transitions are not covered.]
The form RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The form RC's state model consists of the following states: Idle, Initializing, Ready, SelectingItem, PreparingItem, PreparingFormGrammars, PreparingOtherGrammars, Executing, Active, ProcessingFormResult, Evaluating and Exit.
In the Idle state, the form RC can receive an 'initialize' event whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a dialog scope in the Datamodel Resource and then initializes its children: this is modeled as a separate RC. When all children are initialized, the RC sends an 'initialized' event to its controller and transitions to the Ready state.
In the Ready state, the form RC sets its active status to false. It can receive one of two events: 'prepareGrammars' or 'execute'. 'prepareGrammars' indicates that another form is active, but this form's form-level grammars may be activated; an 'execute' event indicates that this form is active. If the RC receives a 'prepareGrammars' event, it transitions to the PreparingFormGrammars state. If the RC receives an 'execute' event, it sets its active data to true and transitions to the SelectingItem state.
In the SelectingItem state, the RC determines which form item to select as the active item. This is defined by a FormItemSelection RC which iterates over the children, sending each a 'checkStatus' event. If a child returns a true status (indicating that it is ready for execution), the activeItem is set to this child RC and the RC transitions to the PreparingItem state. If no child returns this status, then the form is complete and the RC transitions to the Exit state.
In the PreparingItem state, the activeItem is sent a 'prepare' event causing it to prepare itself; for example, the field RC prepares its prompts and grammars for execution. When the activeItem returns a 'prepared' event, the event data indicates whether the item is modal or not. If the item is modal, then the form RC transitions to the Executing state. If the item is not modal (other grammars can be activated), then the form RC transitions to the PreparingFormGrammars state.
In the PreparingFormGrammars state, the RC prepares form-level grammars. This is defined by a separate RC which iterates through and executes grammar children. When this is complete, the RC transitions to the Active state if the form is not active (active data), and transitions to the PreparingOtherGrammars if the form is active.
In the PreparingOtherGrammars state, the RC sends a 'prepareGrammars' event to its controller RC (which in turn sends the event to appropriate form, document and application level RCs with grammars). When it receives a 'prepared' event from its controller, the RC transitions to the Executing state.
In the Executing state, the form RC sends an 'execute' event to the active form item. If the form item is a field, then this will cause prompts to be played and recognition to take place. The RC then transitions to the Active state awaiting a result.
In the Active state, the RC re-initializes the justFilled data to a new array and waits for a recognition result (as an active or non-active form), or for a signal from its selected form item that it has received the recognition result. Recognition results are divided into two types: form item level results, received and processed by the form item; and form level results, which are received by the form RC that caused the grammar to be added. If a 'recoResult' event is received by the form RC, the RC transitions into the ProcessingFormResult state. If the active form item receives the recognition result (and has updated itself locally), then the form RC receives a 'formItemResult' event, adds the active item to the justFilled array, and transitions into the Evaluating state.
In the ProcessingFormResult state, the recognition result is processed by iterating through the form item children, obtaining their name and slotname, and then attempting to match the slotname to the results. If the match is successful, the name variable in the data model result is updated with the value from the recognition result and the child is added to the justFilled data array. When this process is complete, the form RC transitions to the Evaluating state.
In the Evaluating state, the form RC iterates through its children and, if a child is a member of the justFilled array, sends an 'evaluate' event to the form item RC, causing the appropriate filled RCs to be executed. If the child is a filled RC, then it is executed if appropriate. When evaluation is complete, the form RC transitions to the SelectingItem state so that the next form item can be selected for execution.
Event | Source | Payload | Description |
initialize | any | controller(M) | Update the data model |
prepareGrammars | controller | | Another form is active, but the current form's form-level grammars may be activated. |
execute | controller | | Current form is active |
Event | Source | Payload | Description |
initialized | controller | | Notification that initialization is complete |
prepareGrammars | controller | | Sent to prepare grammars to appropriate form, document and application level RCs. |
execute | controller | | Notification of complete recognition result from the field RC. |
The following table shows the events sent and received by the form RC to resources and other RCs which define the events.
Event | Source | Target | Description |
---|---|---|---|
checkStatus | FormRC | FormItem RC | Check if ready for execution. |
createScope | FormRC | DataModel | Creates a scope. |
destroyScope | FormRC | DataModel | Destroys a scope. |
evaluate | FormRC | FormItem RC | Process form item being filled. |
execute | FormRC | FormItem RCs | Start execution. |
prepare | FormRC | FormItem RC | Initiates preparation needed before execution. |
formItemResult | FormItemRC | FormRC | Results received by the form item. |
prepared | FormItemRC | FormRC | Indicates that preparation is complete. |
recoResult | PlayAndRecognize RC | FormRC | Results filled at the form level, not the form item level. |
Editorial note
Note that the chart for SelectingItem:FormItemSelection is missing. It will be defined later.
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="activeItem"/>
    <data id="active"/>
    <data id="previousItem"/>
    <data id="nextItem"/>
    <data id="recoResult"/>
    <data id="name"/>
    <data id="JustFilled"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign loc="$controller" val="null"/>
        <assign loc="$children" val="null"/>
        <assign loc="$activeItem" val="null"/>
        <assign loc="$active" val="false"/>
        <assign loc="$previousItem" val="null"/>
        <assign loc="$nextItem" val="null"/>
        <assign loc="$recoResult" val="null"/>
        <assign loc="$name" val="null"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign loc="$controller" expr="_eventData/controller"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign loc="$childcounter" val="0"/>
        <send target="datamodel" event="createScope" namelist="dialog"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize" namelist="$child/child"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state> <!-- end Initializing -->
    <state id="Ready">
      <onentry>
        <assign loc="$active" val="false"/>
      </onentry>
      <transition event="execute" target="SelectingItem:FormItemSelection">
        <assign loc="$active" val="true"/>
      </transition>
      <transition event="prepareGrammars" target="PreparingFormGrammars"/>
    </state> <!-- end Ready -->
    <state id="SelectingItem:FormItemSelection">
      <onentry>
        <send target="FormItemSelection" event="checkStatus" namelist="$children"/>
      </onentry>
      <transition event="SelectedFormItem.done" cond="$activeItem eq 'null'" target="Exit"/>
      <transition event="SelectedFormItem.done" cond="$activeItem neq 'null'" target="PreparingItem"/>
    </state> <!-- end SelectingItem:FormItemSelection -->
    <state id="PreparingItem">
      <onentry>
        <send target="$activeItem" event="prepare"/>
      </onentry>
      <transition event="prepared" cond="_eventData/modal eq 'true'" target="Executing"/>
      <transition event="prepared" cond="_eventData/modal eq 'false'" target="PreparingFormGrammars"/>
    </state> <!-- end PreparingItem -->
    <state id="Exit">
      <onentry>
        <send target="datamodel" event="destroyScope" namelist="dialog"/>
        <send target="parent" event="done"/>
      </onentry>
    </state> <!-- end Exit -->
    <state id="PreparingFormGrammars">
      <transition event="PrepareFormGrammars.done" cond="$active eq 'true'" target="PreparingOtherGrammars"/>
      <transition event="PrepareFormGrammars.done" cond="$active eq 'false'" target="PreparingOtherGrammars">
        <send target="controller" event="prepared"/>
      </transition>
    </state> <!-- end PreparingFormGrammars -->
    <state id="PreparingOtherGrammars">
      <onentry>
        <send target="controller" event="prepareGrammars"/>
      </onentry>
      <transition event="PrepareOtherGrammars.done" target="Executing"/>
    </state> <!-- end PreparingOtherGrammars -->
    <state id="Executing">
      <onentry>
        <send target="$activeItem" event="execute"/>
      </onentry>
      <transition event="Executing.done" target="Active"/>
    </state> <!-- end Executing -->
    <state id="Active">
      <onentry>
        <assign loc="$JustFilled" expr="new Array()"/>
      </onentry>
      <transition event="fieldResult" target="Evaluating">
        <insert pos="after" name="$JustFilled" val="$activeItem"/>
      </transition>
      <transition event="PlayAndRecognize:RecogResult" target="ProcessingFormResult"/>
    </state> <!-- end Active -->
    <state id="ProcessingFormResult">
      <onentry>
        <foreach var="child" array="$children">
          <if cond="$child.slotname eq _eventData/RecogResult/slotname">
            <assign loc="$name" expr="_eventData/RecogResult/name"/>
            <insert pos="after" name="$JustFilled" val="$child"/>
          </if>
        </foreach>
        <!-- signal that all children have been checked -->
        <raise event="ProcessingFormResult.done"/>
      </onentry>
      <transition event="ProcessingFormResult.done" target="Evaluating"/>
    </state> <!-- end ProcessingFormResult -->
    <state id="Evaluating">
      <onentry>
        <send target="$activeItem" event="evaluate"/>
      </onentry>
      <transition event="Evaluating.done" target="SelectingItem:FormItemSelection"/>
    </state> <!-- end Evaluating -->
  </state> <!-- end Created -->
</scxml>
name | The form item variable in the dialog scope that will hold the result. The name must be unique among form items in the form. If the name is not unique, then a badfetch error is thrown when the document is fetched. The name must conform to the variable naming conventions in (TODO). |
---|---|
expr | The initial value of the form item variable; the default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. |
cond | An expression that must evaluate to true, after conversion to boolean, for the form item to be visited. If the attribute is not specified, the form item may also be visited. |
type | The type of field, i.e. the name of a builtin grammar type (6.11 Builtin Grammar Module). Note that platform support for builtin grammar types is optional. If the specified builtin type is not supported by the platform, an error.unsupported.builtin event is thrown. |
slot | The name of the grammar slot used to populate the variable (if it is absent, it defaults to the variable name). This attribute is useful when the grammar format being used has a mechanism for returning sets of slot/value pairs and the slot names differ from the form item variable names. |
modal | If this is false (the default), all active grammars are turned on while collecting this field. If this is true, then only the field's grammars are enabled: all others are temporarily disabled. |
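As an illustration of these attributes, consider the following non-normative sketch in VoiceXML 2.x syntax (the grammar URI, prompt text and field names are invented for the example):

```xml
<form id="pizza">
  <!-- 'size' is the form item variable; because modal="true", only this
       field's grammars are active while it is being collected -->
  <field name="size" slot="pizza_size" modal="true">
    <prompt>What size pizza would you like?</prompt>
    <grammar src="sizes.grxml" type="application/srgs+xml"/>
  </field>
  <!-- cond: visited only once 'size' has been filled -->
  <field name="confirm" type="boolean" cond="size != undefined">
    <prompt>You asked for a <value expr="size"/> pizza. Is that right?</prompt>
  </field>
</form>
```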
The semantics of field elements are defined using the following resource controllers: Field (6.10.2.1 Field RC), PlayandRecognize (6.10.2.2 PlayandRecognize RC), ...
The Field Resource Controller is the primary RC for the field element.
The field RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The field RC's state model consists of the following states: Idle, Initializing, Ready, Preparing, Prepared, Executing and Evaluating.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a variable in the Datamodel Resource: the variable name corresponds to the name in the RC's data model, and the variable value is set to the value of the RC's data model expr, if this is defined. The field RC then initializes its children: this is modeled as a separate RC (see XXX). When all children are initialized, the RC transitions to the Ready state.
In the Ready state, the field RC can receive a 'checkStatus' event to check whether it can be executed. The values of name and cond in its data model are checked: the status is true if the name variable is undefined and the value of cond evaluates to true. The status is returned in a 'checkedStatus' event sent back to the controller RC. If the RC receives a 'prepare' event, it updates includePrompts in its data model using the event data, and transitions to the Preparing state.
In the Preparing state, the field prepares its prompts and grammars. Prompts are prepared only if the includePrompts data is true; otherwise, prompts within the field are not prepared (e.g. field prompts aren't queued following a <reprompt>). Preparation of prompts is modeled as a separate RC (see XXX), as is preparation of grammars (see YYY). These RCs are summarized below.
Prompts are prepared by iterating through the children array. In the iteration, each prompt RC child is sent a 'checkStatus' event. If the prompt child returns true (its cond parameter evaluates to true), then it is added to a 'correct count' list together with its count. Once the iteration is complete, the RC determines the highest count on the 'correct count' list, i.e. the highest count among those on the list that are less than or equal to the current count value. All children on the 'correct count' list whose count is not this highest count are removed. The RC then iterates through the 'correct count' list and sends an 'execute' event to each remaining prompt RC, causing it to be queued on the PromptQueue Resource.
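The count-selection step above can be sketched as follows. This is a non-normative illustration; the function name and the dict representation of prompt children are assumptions.

```python
def select_prompts(prompt_children, current_count):
    """Pick the prompt children to queue for the current visit.

    prompt_children: list of dicts with a boolean 'cond' (the result of
                     the checkStatus exchange) and an integer 'count'.
    current_count: the field's current prompt counter.
    """
    # the 'correct count' list: cond is true and count does not exceed
    # the current counter
    eligible = [p for p in prompt_children
                if p["cond"] and p["count"] <= current_count]
    if not eligible:
        return []
    # keep only the prompts with the highest eligible count
    highest = max(p["count"] for p in eligible)
    return [p for p in eligible if p["count"] == highest]
```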
Grammars are prepared by recursing through the children array and sending each grammar RC child an 'execute' event. The grammar RC then, if appropriate, sends an 'addGrammar' event to the DTMF or ASR Recognizer Resource, where the grammar itself, its properties, and the field RC (as the handler for recognition results) are sent.
When prompts and grammars have been prepared, the prompt counter is incremented and the field RC sends a 'prepared' event to its controller, with event data indicating its modal status, and then transitions into the Prepared state.
In the Prepared state, the field RC may receive an 'execute' event from its controller. The RC sends an 'execute' event to the PlayAndRecognize RC (6.10.2.2 PlayandRecognize RC), causing any queued prompts to be played and recognition to be initiated. In the event data, the controller is set to this RC, and other data is derived from data model properties. The RC transitions to the Executing state.
In the Executing state, the PlayAndRecognize RC must send recoResults (or error events: noinput, nomatch, error.semantic) to the field RC.
If the field RC receives the recoResults, then it updates its name variable in the Datamodel Resource. The field RC then sends a 'fieldResult' event to its controller indicating that a field result has been received and processed.
If the recoResult is received by the field RC's controller, then the field receives an 'evaluate' event which causes it to transition to the Evaluating state.
In the Evaluating state, the field RC iterates through its children, executing each filled RC: this is modeled by a separate RC (see XXX). When evaluation is complete, the RC sends an 'evaluated' event to its controller and transitions to the Ready state.
The Field RC is defined to receive the following events:
Event | Source | Payload | Description |
---|---|---|---|
initialize | any | controller (M) | |
checkStatus | controller | | |
prepare | controller | includePrompts (M) | |
execute | controller | | |
evaluate | controller | | |
and the events it sends:
Event | Target | Payload | Description |
---|---|---|---|
initialized | controller | | |
checkedStatus | controller | | |
prepared | controller | | |
fieldResult | controller | | |
evaluated | controller | | |
Table 44 shows the events sent and received by the field RC to resources and other RCs which define the events.
Event | Source | Target | Description |
---|---|---|---|
create | FieldRC | DataModel | |
assign | FieldRC | DataModel | |
execute | FieldRC | PlayandRecognizeRC | |
recoResult | PlayandRecognizeRC | FieldRC | |
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="counter"/>
    <data id="recoResult"/>
    <data id="cond"/>
    <data id="name"/>
    <data id="expr"/>
    <data id="modal"/>
    <data id="includePrompts"/>
    <data id="status"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" val="null"/>
        <assign location="$children" expr="new Array()"/>
        <assign location="$counter" val="1"/>
        <assign location="$recoResult" val="null"/>
        <assign location="$cond" val="null"/>
        <assign location="$expr" val="null"/>
        <assign location="$modal" val="false"/>
        <assign location="$includePrompts" val="true"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <if cond="$expr neq 'null'">
          <send target="datamodel" event="assign" namelist="$name, $expr"/>
          <else/>
          <send target="datamodel" event="create" namelist="$name"/>
        </if>
        <assign location="$childcounter" val="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state> <!-- end Initializing -->
    <state id="Ready">
      <transition event="checkStatus">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus" namelist="_eventData/status"/>
      </transition>
      <transition event="prepare" target="Preparing">
        <assign location="$includePrompts" expr="_eventData/includePrompts"/>
      </transition>
    </state> <!-- end Ready -->
    <state id="Preparing">
      <onentry>
        <if cond="$includePrompts eq 'true'">
          <send target="Prompts RC" event="initialize"/>
        </if>
        <send target="Grammars RC" event="initialize"/>
      </onentry>
      <transition event="preparing.done" target="Prepared">
        <send target="controller" event="prepared" namelist="modal"/>
      </transition>
    </state> <!-- end Preparing -->
    <state id="Prepared">
      <transition event="execute" target="Executing">
        <send target="PlayAndRecognize" event="execute" namelist="self, inputmodes"/>
      </transition>
    </state> <!-- end Prepared -->
    <state id="Executing">
      <datamodel>
        <data id="value"/>
      </datamodel>
      <transition event="playAndReco:recoResult">
        <assign location="$value" expr="processResults($name, slot, _eventdata/result)"/>
        <send target="datamodel" event="assign" namelist="$name, $value"/>
        <send target="parent" event="fieldResult"/>
      </transition>
      <!-- the controller triggers filled processing with an 'evaluate' event -->
      <transition event="evaluate" target="Evaluating"/>
    </state> <!-- end Executing -->
    <state id="Evaluating">
      <onentry>
        <send target="filled RC" event="executeFilleds"/>
      </onentry>
      <transition event="evaluating.done" target="Ready">
        <send target="controller" event="evaluated"/>
      </transition>
    </state> <!-- end Evaluating -->
  </state> <!-- end Created -->
</scxml>
The PlayandRecognize RC coordinates media input with Recognizer resources and media output with the PromptQueue Resource.
The following use cases are covered:
Editorial note
Open issue: should we remove the possibility for alternating speech and hotword bargein modes within the recognition cycle?
The PlayandRecognize RC coordinates media input with recognition resources and media output with the PromptQueue Resource on behalf of a form item.
This RC activates prompt queue playback, activates recognition resources, manages bargein behavior and handles results from recognition resources.
The RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The RC's state model consists of the following states: idle, prepare recognition resources, start playing, playing prompts with bargein, playing prompts without bargein, start recognizing with timer, waiting for input, waiting for speech result and update results. The complexity of this model is partially a consequence of supporting the relationship between hotword bargein and recognition result processing.
While in the idle state, the RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller, inputmodes, inputtimeout, dtmfProps, asrProps and maxnbest. The RC then transitions to the prepare recognition resources state.
In the prepare recognition resources state, the RC sends 'prepare' events to the ASR and DTMF recognition resources. Both events specify this RC as the controller parameter, while the properties parameter differs. In this state, the RC can receive 'prepared' or 'notPrepared' events from either recognition resource. If neither resource returns a 'prepared' event, then activeGrammars is false (i.e. there is no active DTMF or speech grammar) and the RC sends an 'error.semantic' event to the controller and exits. If at least one resource returns a 'prepared' event, then the RC moves into the start playing state.
The start playing state begins by sending the PromptQueue resource a 'play' event. The PromptQueue responds with a 'playDone' event if there are no prompts in the prompt queue; as a result, this RC moves into the start recognizing with timer state. If there is at least one prompt in the queue, the PromptQueue sends this RC a 'playStarted' event whose data contains the bargein and bargeintype values for the first prompt, and the input timeout value for the last prompt in the queue. The data model is updated with this information.
Editorial note
Open issue: the PromptQueue Resource doesn't currently have a playStarted event. If we don't add a playStarted event, is there a better way to get the bargein, bargeintype, and timeout information from the prompts in the PromptQueue?
Interaction with the recognizer during prompt playback is determined by the data model's bargein value. If bargein is true, then this RC transitions to the playing with bargein state. If bargein is false, the RC transitions to the playing without bargein state.
Editorial note
Open issue: the "bargeinChange" event as a one-way notification could pose a problem, as it takes finite time for the recognizer to suspend or resume. This might work better if the PromptQueue Resource waited for a "bargeinChangeAck" event (or similar) from the PlayandRecognize RC before starting the next play. The PlayandRecognize RC would send the "bargeinChangeAck" event after it completed the suspend or resume action on the recognizer.
In the playing without bargein state, recognition is suspended if it has previously been activated (the recoActive parameter of the data model tracks this). Suspending recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is suspended; if 'voice' is in inputmodes, then ASR recognition is suspended. In this state, the PromptQueue can report to this RC changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event with the value 'hotword' or 'speech' causes the data model parameter 'bargein' to be set to 'true' and the 'bargeintype' parameter to be updated with the event data value. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
In the playing with bargein state, recognition is activated if it has not previously been activated (determined by the recoActive parameter in the data model). Activating recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. In this state, the PromptQueue can report changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event whose event data value is not 'unbargeable' causes the data model 'bargeintype' parameter to be updated with the event data ('hotword' or 'speech'), while a 'bargeintypeChange' event whose event data value is 'unbargeable' causes the data model 'bargein' parameter to be set to false and the RC to transition to the playing without bargein state. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative.
Editorial note
Further details on recognition processing to be added in later versions. The recoResults data parameter is updated with the recognition results (truncated to maxnbest). A speech result is positive iff there is at least one result whose confidence level is equal to or greater than the recognition confidence level; otherwise the result is negative. DTMF results are always positive. The recoListener data parameter is defined as the listener associated with the best result if the result is positive.
If positive, the RC sends the PromptQueue a 'halt' event, and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
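The positive/negative determination sketched in the editorial note above can be illustrated as follows. This is a non-normative sketch; the function name, the dict representation of results, and the parameter names are assumptions.

```python
def classify_reco_result(results, mode, confidence_level, maxnbest):
    """Truncate results to maxnbest and decide positive vs. negative.

    results: list of dicts with a 'confidence' in [0.0, 1.0], best first.
    mode: 'voice' or 'dtmf'.
    confidence_level: the recognition confidence property.

    Returns (truncated_results, is_positive).
    """
    truncated = results[:maxnbest]
    if mode == "dtmf":
        # DTMF results are always positive
        return truncated, True
    # a speech result is positive iff at least one kept result meets
    # the confidence level
    positive = any(r["confidence"] >= confidence_level for r in truncated)
    return truncated, positive
```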
In the start recognizing with timer state, an input timer is activated with the value of the inputtimeout data parameter and, if recognition is not already active (determined by the recoActive data parameter), recognition is activated. Recognition activation is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. The RC then transitions into the waiting for input state.
In the waiting for input state, the RC waits for user input. If it receives a 'timerExpired' event, then the RC sends a 'stop' event to all recognition resources, sends a 'noinput' event to its controller and exits. Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative. If positive, the RC cancels the timer and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
In the waiting for speech result state, the RC waits for a 'recoResult' event whose data is used to update the recoResult data parameter and to set the recoListener data parameter if the recognition result is positive. The RC then transitions to the update results state.
In the update results state, the RC sends 'assign' events to the data model resource, so that the lastresult object in application scope is updated with recognition results as well as markname and marktime information. If the recoListener data parameter is defined, then the RC sends a 'recoResult' event to the recognition listener RC; otherwise, it sends 'nomatch' event to its controller. The RC then exits.
Editorial note
Open issue: what is the behavior if one recognition resource sends 'inputStarted' but the other sends 'recoResults'? Are there race conditions between recognizers returning results? (This problem is inherent in the presence of two recognizers. For the sake of clear semantics, we could restrict only one recognizer to respond with 'inputStarted' and 'recoResults'; the other recognizer is always 'stopped'. A better choice might be to have only one recognizer that handles both DTMF and speech, since semantically both recognizers are very similar.)
The PlayandRecognize RC is defined to receive the following events:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
execute | any | controller (M), inputmodes (O), inputtimeout (O), dtmfProps (M), recoProps (M), maxnbest (O) | | |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
recoResult | any | results (M) | one-of: nomatch, noinput, error.*, recoResult | |
nomatch | controller | | one-of: nomatch, noinput, error.*, recoResult | |
noinput | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.semantic | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.badfetch.grammar | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.noresource | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.builtin | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.format | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.language | controller | | one-of: nomatch, noinput, error.*, recoResult | |
The events in Table 47 are sent by the PlayandRecognize RC to resources which define the events.
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
play | PromptQueue | | | |
halt | PromptQueue | | | |
prepare | Recognizer | | | |
listen | Recognizer | | | |
suspend | Recognizer | | | |
stop | Recognizer | | | |
The events in Table 48 are received by this RC. Their definition is provided by the sending component.
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
playStarted | PromptQueue | bargein (O), bargeintype (O), inputtimeout (O) | pq:play notification | |
playDone | PromptQueue | markname (O), marktime (O) | pq:play response | |
bargeinChange | PromptQueue | bargein (M) | | |
bargeintypeChange | PromptQueue | bargeintype (M) | | |
prepared | Recognizer | | prepare positive response | |
notPrepared | Recognizer | | prepare negative response | |
inputStarted | Recognizer | | | |
recoResult | Recognizer | results (M), listener (O) | | |
The main states for the PlayandRecognize RC are shown in Figure 13.
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="bargein"/>
    <data id="bargeintype"/>
    <data id="inputmodes"/>
    <data id="inputtimeout"/>
    <data id="dtmfProps"/>
    <data id="asrProps"/>
    <data id="maxnbest"/>
    <data id="recoActive"/>
    <data id="markname"/>
    <data id="marktime"/>
    <data id="recoResult"/>
    <data id="recoListener"/>
    <data id="activeGrammars"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" val="null"/>
        <assign location="$bargein" val="true"/>
        <assign location="$bargeintype" val="speech"/>
        <assign location="$inputmodes" val="voice"/>
        <assign location="$inputtimeout" val="0s"/>
        <assign location="$dtmfProps" val="null"/>
        <assign location="$asrProps" val="null"/>
        <assign location="$maxnbest" val="1"/>
        <assign location="$recoActive" val="false"/>
        <assign location="$markname" val="null"/>
        <assign location="$marktime" val="0"/>
        <assign location="$recoResult" val="null"/>
        <assign location="$recoListener" val="null"/>
        <assign location="$activeGrammars" val="false"/>
      </onentry>
      <transition event="execute" target="PrepareRecognitionResources">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$inputmodes" expr="_eventData/modes"/>
        <assign location="$inputtimeout" expr="_eventData/timeout"/>
        <assign location="$dtmfProps" expr="_eventData/dtmfProps"/>
        <assign location="$asrProps" expr="_eventData/asrProps"/>
        <assign location="$maxnbest" expr="_eventData/maxnbest"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="PrepareRecognitionResources">
      <transition target="StartPlaying" cond="$activeGrammars eq 'true'"/>
      <transition target="Exit" cond="$activeGrammars eq 'false'">
        <send target="controller" event="error.semantic"/>
      </transition>
    </state> <!-- end PrepareRecognitionResources -->
    <state id="StartPlaying">
      <onentry>
        <send target="PromptQueue" event="pq:play"/>
      </onentry>
      <transition event="pq:playStarted" cond="$bargein eq 'true'" target="PlayingWithBargein">
        <assign location="$bargein" expr="_eventdata/bargein"/>
      </transition>
      <transition event="pq:playStarted" cond="$bargein eq 'false'" target="PlayingWithoutBargein">
        <assign location="$bargein" expr="_eventdata/bargein"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer"/>
    </state> <!-- end StartPlaying -->
    <state id="PlayingWithoutBargein">
      <onentry>
        <if cond="$recoActive eq 'true'">
          <if cond="in('dtmf',$inputmodes)">
            <send target="DTMFRecognizer" event="rec:suspend"/>
          </if>
          <if cond="in('voice',$inputmodes)">
            <send target="ASRRecognizer" event="rec:suspend"/>
          </if>
        </if>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventdata/value neq 'unbargeable'" target="PlayingWithBargein">
        <assign location="$bargein" val="true"/>
        <assign location="$bargeintype" expr="_eventdata/value"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
    </state> <!-- end PlayingWithoutBargein -->
    <state id="PlayingWithBargein">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <onentry>
        <if cond="in('dtmf',$inputmodes)">
          <send target="DTMFRecognizer" event="rec:listen"/>
        </if>
        <if cond="in('voice',$inputmodes)">
          <send target="ASRRecognizer" event="rec:listen"/>
        </if>
        <assign location="$recoActive" val="true"/>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventdata/value neq 'unbargeable'">
        <assign location="$bargeintype" expr="_eventdata/value"/>
      </transition>
      <transition event="bargeintypeChange" cond="_eventdata/value eq 'unbargeable'" target="PlayingWithoutBargein">
        <assign location="$bargein" val="false"/>
      </transition>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processRecoResult()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
      <transition event="positiveRecoResult" target="UpdateResults">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'" target="WaitingForSpeechResult">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
    </state> <!-- end PlayingWithBargein -->
    <state id="StartRecognizingWithTimer">
      <onentry>
        <send target="Timer" event="start" namelist="$inputtimeout"/>
        <if cond="$recoActive eq 'false'">
          <if cond="in('dtmf',$inputmodes)">
            <send target="DTMFRecognizer" event="rec:listen"/>
          </if>
          <if cond="in('voice',$inputmodes)">
            <send target="ASRRecognizer" event="rec:listen"/>
          </if>
          <assign location="$recoActive" val="true"/>
        </if>
      </onentry>
      <transition target="WaitingForInput"/>
    </state> <!-- end StartRecognizingWithTimer -->
    <state id="WaitingForInput">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="positiveRecoResult" target="UpdateResults">
        <send target="Timer" event="cancel"/>
      </transition>
      <transition event="timerExpired" target="Exit">
        <send target="Recognizer" event="rec:stop"/>
        <send target="controller" event="noinput"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'" target="WaitingForSpeechResult">
        <send target="Timer" event="cancel"/>
      </transition>
    </state> <!-- end WaitingForInput -->
    <state id="WaitingForSpeechResult">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <!-- TBD: the original diagram seems to put the event in the wrong place -->
      <transition event="rec:recoResult" target="UpdateResults">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
    </state> <!-- end WaitingForSpeechResult -->
    <state id="UpdateResults">
      <onentry>
        <send target="datamodel" event="assign" namelist="application, lastresult$, recoResults"/>
        <if cond="$recoListener neq 'null'">
          <send target="recoListener" event="recoResult" namelist="recoResults"/>
          <else/>
          <send target="controller" event="nomatch"/>
        </if>
      </onentry>
      <transition target="Exit"/>
    </state> <!-- end UpdateResults -->
    <final id="Exit"/>
  </state> <!-- end Created -->
</scxml>
VoiceXML developers are commonly required to sketch out an application for the purpose of a demo or other proof of concept. In such cases, it is convenient to use placeholder grammars for frequent dialogs like collecting a date, asking a yes/no question, etc. Builtin grammars (provided by the platform) are designed to serve this purpose.
Once the prototyping phase is complete, however, it is good practice to replace the builtin grammar references with developer-written grammars. There are several reasons for this recommendation:
Builtin grammars may be specified in one of two ways:
Each builtin type has a convention for the format of the value returned. These are independent of language and of the implementation. The return type for builtin fields is a string except for the boolean field type. To access the actual recognition result, the author can reference the <field> shadow variable "name$.utterance". Alternatively, the developer can access application.lastresult$, where application.lastresult$.interpretation has the same string value as application.lastresult$.utterance.
Type | Description |
boolean | Inputs include affirmative and negative phrases appropriate to the current language. DTMF 1 is affirmative and 2 is negative. The result is ECMAScript true for affirmative or false for negative. The value will be submitted as the string "true" or the string "false". If the field value is subsequently used in <say-as> with the interpret-as value "vxml:boolean", it will be spoken as an affirmative or negative phrase appropriate to the current language. |
date | Valid spoken inputs include phrases that specify a date, including a month, day, and year. DTMF inputs are: four digits for the year, followed by two digits for the month, and two digits for the day. The result is a fixed-length date string with format yyyymmdd, e.g. "20000704". If the year is not specified, yyyy is returned as "????"; if the month is not specified mm is returned as "??"; and if the day is not specified dd is returned as "??". If the value is subsequently used in <say-as> with the interpret-as value "vxml:date", it will be spoken as a date phrase appropriate to the current language. |
digits | Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the result is subsequently used in <say-as> with the interpret-as value "vxml:digits", it will be spoken as a sequence of digits appropriate to the current language. A user can say for example "two one two seven", but not "twenty one hundred and twenty-seven". A platform may support constructs such as "two double-five eight". |
currency | Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the "*" key will act as the decimal point. The result is a string with the format UUUmm.nn, where UUU is the three character currency indicator according to ISO standard 4217 [ISO4217], or mm.nn if the currency is not spoken by the user or if the currency cannot be reliably determined (e.g. "dollar" and "peso" are ambiguous). If the field is subsequently used in <say-as> with the interpret-as value "vxml:currency", it will be spoken as a currency amount appropriate to the current language. |
number | Valid spoken inputs include phrases that specify numbers, such as "one hundred twenty-three", or "five point three". Valid DTMF input includes positive numbers entered using digits and "*" to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (".") and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret as an octal number). If the field is subsequently used in <say-as> with the interpret-as value "vxml:number", it will be spoken as a number appropriate to the current language. |
phone | Valid spoken inputs include phrases that specify a phone number. DTMF asterisk "*" represents "x". The result is a string containing a telephone number consisting of a string of digits and optionally containing the character "x" to indicate a phone number with an extension. For North America, a result could be "8005551234x789". If the field is subsequently used in <say-as> with the interpret-as value "vxml:phone", it will be spoken as a phone number appropriate to the current language. |
time | Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five character string in the format hhmmx, where x is one of "a" for AM, "p" for PM, "h" to indicate a time specified using 24 hour clock, or "?" to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input, the result will always end with "h" or "?". If the field is subsequently used in <say-as> with the interpret-as value "vxml:time", it will be spoken as a time appropriate to the current language. |
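Because each builtin type returns a fixed-format string, results are straightforward to post-process in application script. The following ECMAScript sketch (the helper names are illustrative, not part of this specification) unpacks the date and time formats described above, including the "?" placeholders for unspecified components:

```javascript
// Unpack a builtin "date" result ("yyyymmdd"; unspecified parts are "?").
// Helper names are illustrative only, not defined by the specification.
function parseBuiltinDate(s) {
  return {
    year:  s.substring(0, 4) === "????" ? undefined : s.substring(0, 4),
    month: s.substring(4, 6) === "??"   ? undefined : s.substring(4, 6),
    day:   s.substring(6, 8) === "??"   ? undefined : s.substring(6, 8)
  };
}

// Unpack a builtin "time" result ("hhmmx", where x is "a", "p", "h", or "?").
function parseBuiltinTime(s) {
  return {
    hours: s.substring(0, 2),
    minutes: s.substring(2, 4),
    meridiem: s.charAt(4)
  };
}
```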
Both the "boolean" and "digits" types can be parameterized as follows:
digits?minlength=n | A string of at least n digits. Applicable to speech and DTMF grammars. If minlength conflicts with either the length or maxlength attributes then an error.badfetch event is thrown. |
digits?maxlength=n | A string of at most n digits. Applicable to speech and DTMF grammars. If maxlength conflicts with either the length or minlength attributes then an error.badfetch event is thrown. |
digits?length=n | A string of exactly n digits. Applicable to speech and DTMF grammars. If length conflicts with either the minlength or maxlength attributes then an error.badfetch event is thrown. |
boolean?y=d | A grammar that treats the keypress d as an affirmative answer. Applicable only to the DTMF grammar. |
boolean?n=d | A grammar that treats the keypress d as a negative answer. Applicable only to the DTMF grammar. |
Note that more than one parameter may be specified separated by the ";" character. This is illustrated in the last example below.
A <field> element with a builtin grammar type. In this example, the boolean type indicates that inputs are various forms of true and false. The value actually put into the field is either true or false. The field would be read out using the appropriate affirmative or negative response in prompts.
<field name="lo_fat_meal" type="boolean">
  <prompt>Do you want a low fat meal on this flight?</prompt>
  <help>Low fat means less than 10 grams of fat, and under 250 calories.</help>
  <filled>
    <prompt>
      I heard <emphasis><say-as interpret-as="vxml:boolean">
      <value expr="lo_fat_meal"/></say-as></emphasis>.
    </prompt>
  </filled>
</field>
In the next example, digits indicates that input will be spoken or keyed digits. The result is stored as a string, and rendered as digits using the <say-as> with "vxml:digits" as the value for the interpret-as attribute, i.e., "one-two-three", not "one hundred twenty-three". The <filled> action tests the field to see if it has 12 digits. If not, the user hears the error message.
<field name="ticket_num" type="digits">
  <prompt>Read the 12 digit number from your ticket.</prompt>
  <help>The 12 digit number is to the lower left.</help>
  <filled>
    <if cond="ticket_num.length != 12">
      <prompt>Sorry, I didn't hear exactly 12 digits.</prompt>
      <assign name="ticket_num" expr="undefined"/>
    <else/>
      <prompt>I heard <say-as interpret-as="vxml:digits">
        <value expr="ticket_num"/></say-as></prompt>
    </if>
  </filled>
</field>
The builtin boolean grammar and builtin digits grammar can be parameterized. This is done by explicitly referring to builtin grammars using a platform-specific builtin URI scheme and using a URI-style query syntax of the form type?param=value in the src attribute of a <grammar> element, or in the type attribute of a <field>. In this example, the <grammar> parameterizes the builtin DTMF grammar, the first <field> parameterizes the builtin DTMF grammar (the speech grammar will be activated as normal) and the second <field> parameterizes both builtin DTMF and speech grammars. Parameters which are undefined for a given grammar type will be ignored; for example, "builtin:grammar/boolean?y=7".
<grammar src="builtin:dtmf/boolean?y=7;n=9"/>

<field type="boolean?y=7;n=9">
  <prompt>
    If this is correct say yes or press seven, if not, say no or press nine.
  </prompt>
</field>

<field type="digits?minlength=3;maxlength=5">
  <prompt>Please enter your passcode</prompt>
</field>
Information in the Data layer must be easily accessible and easily editable throughout the VoiceXML 3.0 document. The Data Access and Manipulation Module describes the necessary mechanics by which application developers can express such interactions with the Data layer. Implementers must augment the data access and manipulation languages supported to provide the capabilities described in this section.
The remainder of this Section covers the semantics of the Data Access and Manipulation Module in Section 2.2 and the corresponding syntax in Section 2.3. Backward compatibility with VoiceXML 2.1 is discussed in Section 2.4.
The semantics of Data Access and Manipulation can be described in terms of the various scopes in VoiceXML 3.0, the relevance to platform properties, the corresponding implicit variables that platforms must support, the variable resolution mechanism, standard session and application variables and the set of legal data values and expressions.
Access to data is controlled by means of scopes, which are conceptually stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "session" is created. Thereafter scopes are explicitly created and destroyed by the data model resource's clients as necessary. Likewise, during the lifetime of each scope, data is added, read, updated and deleted by the data model resource's clients as necessary.
Implementation note: The API is defined in 5.1.1 Data Model Resource API.
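The scope-stack behavior described above can be modeled in a few lines of ECMAScript. This is an illustrative sketch only; the names (ScopeStack, createScope, and so on) are ours and do not belong to the normative Data Model Resource API of 5.1.1:

```javascript
// Illustrative model of the data layer's scope stack; not the normative API.
function ScopeStack() {
  // At initialization time a single scope named "session" is created.
  this.stack = [{ name: "session", vars: {} }];
}
// Scopes are explicitly created and destroyed by the resource's clients.
ScopeStack.prototype.createScope = function (name) {
  this.stack.push({ name: name, vars: {} });
};
ScopeStack.prototype.destroyScope = function () {
  this.stack.pop();
};
// Data is accessed within a particular scope, which may be named
// explicitly but defaults to the top scope on the stack.
ScopeStack.prototype.assign = function (name, value, scopeName) {
  var scope;
  if (scopeName === undefined) {
    scope = this.stack[this.stack.length - 1];
  } else {
    scope = this.stack.filter(function (s) { return s.name === scopeName; })[0];
  }
  if (!scope) throw new Error("error.semantic"); // nonexistent scope
  scope.vars[name] = value;
};
```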
At any given point in time, based on the VoiceXML document structure and the execution state, the stack may contain the following scopes whose semantics are described in VoiceXML 3.0 as follows (bottom to top):
session | These are read-only variables that pertain to an entire user session. They are declared and set by the interpreter context. New session variables cannot be declared by VoiceXML documents. |
---|---|
application | These are declared with <var> and <script> elements that are children of the application root document's <vxml> element. They are initialized when the application root document is loaded. They exist while the application root document is loaded, and are visible to the root document and any other loaded application leaf document. Note that while executing inside the application root document, document.x is equivalent to application.x. |
document | These variables are declared with <var> and <script> elements that are children of the document's <vxml> element. They are initialized when the document is loaded. They exist while the document is loaded. They are visible only within that document, unless the document is an application root, in which case the variables are visible by leaf documents through the application scope only. |
dialog | Each dialog (<form> or <menu>) has a dialog scope that exists while the user is visiting that dialog, and which is visible to the elements of that dialog. Dialog scope contains the following variables: variables declared by <var> and <script> child elements of <form>, form item variables, and form item shadow variables. The child <var> and <script> elements of <form> are initialized when the form is first visited, as opposed to <var> elements inside executable content which are initialized when the executable content is executed. |
(anonymous) | Each <block>, <filled>, and <catch> element defines a new anonymous scope to contain variables declared in that element. |
Properties are discussed in detail in 8.2 Properties. Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Thus, access to properties is also controlled by means of the same scope stack that is used by named variables.
VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables. This is described in the two sections below.
VoiceXML 3.0 provides several implicit variables in the data access and manipulation language to unambiguously identify the various scopes in the scope stack. Whenever the corresponding scopes are available, they can be referenced under specific names, which are always the same regardless of the location in the VoiceXML document. Additionally, an implicit variable "properties$" is available in each scope which points to the defined properties for that scope.
session | This implicit variable refers to the session scope. |
---|---|
application | This implicit variable refers to the application scope. |
document | This implicit variable refers to the document scope. |
dialog | This implicit variable refers to the dialog scope. |
properties$ | This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an ECMAScript object with multiple ECMAScript properties as necessary where each ECMAScript property has the name of an existing platform property in that scope and value corresponding to the value of the platform property. |
Note that in some data access expression languages (such as XPath), it may be necessary to expose the semantics of implicit variables as expression language functions instead of variables.
Also note that there is no implicit variable corresponding to the anonymous scope since it is not necessary given the variable resolution mechanism described in the next section. Where scope qualifiers are functions, a function to identify the anonymous scope may be necessary.
Finally, the use of the "properties$" implicit variable in VoiceXML 3.0 means that the variable "properties$" is now reserved in all scopes with the semantics described above.
This section describes how named variables are resolved in VoiceXML 3.0. Named variables in expressions may be scope-qualified (using implicit variables) or scope-unqualified.
Some examples of scope-qualified variables that may occur in expressions are listed in the table below.
Expression | Result |
application.hello | The value of the "hello" named variable in the application scope. |
dialog.retries | The value of the "retries" named variable in the dialog scope. |
dialog.properties$.bargein | The value of the "bargein" platform property defined at the current "dialog" scope. |
The above table assumes that all the named variables used in the expressions exist. If any of the named variables do not exist, an error.semantic will result.
In cases where the named variables are unqualified, i.e. there is no implicit variable indicating the scope in use, the following variable resolution mechanism is used:
The steps corresponding to any scopes that do not exist at the time of expression evaluation are ignored. The resolution mechanism begins with the closest enclosing scope in the given document structure.
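The resolution mechanism amounts to a search from the top of the scope stack downward. A minimal ECMAScript sketch (our own helper, assuming scopes are represented as objects with name and vars properties, bottom of the stack first):

```javascript
// Resolve an unqualified variable name by searching scopes from the
// closest enclosing scope (top of the stack) down to "session" (bottom).
// `stack` is an array of { name, vars } objects, bottom first.
function resolve(stack, name) {
  for (var i = stack.length - 1; i >= 0; i--) {
    if (name in stack[i].vars) return stack[i].vars[name];
  }
  throw new Error("error.semantic"); // undeclared variable
}
```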
The following standard variables are available in the session scope:
The following standard variables are available in the application scope:
Interpretations are sorted by confidence score, from highest to lowest. Interpretations with the same confidence score are further sorted according to the precedence relationship among the grammars producing the interpretations. Different elements in application.lastresult$ will always differ in their utterance, interpretation, or both.
The number of application.lastresult$ elements is guaranteed to be greater than or equal to one and less than or equal to the system property "maxnbest". If no results have been generated by the system, then "application.lastresult$" shall be ECMAScript undefined.
Additionally, application.lastresult$ itself contains the properties confidence, utterance, inputmode, and interpretation corresponding to those of the 0th element in the ECMAScript array.
All of the shadow variables described above are set immediately after any recognition. In this context, a <nomatch> event counts as a recognition, and causes the value of "application.lastresult$" to be set, though the values stored in application.lastresult$ are platform dependent. In addition, the existing values of field variables are not affected by a <nomatch>. In contrast, a <noinput> event does not change the value of "application.lastresult$". After the value of "application.lastresult$" is set, the value persists (unless it is modified by the application) until the browser enters the next waiting state, when it is set to undefined. Similarly, when an application root document is loaded, this variable is set to the value undefined. The variable application.lastresult$ and all of its components are writeable and can be modified by the application.
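The ordering guarantees for application.lastresult$ can be expressed as a comparator. In this sketch, grammar precedence is modeled as a numeric rank on each result (an assumption for illustration; the specification does not prescribe such a field):

```javascript
// Sort n-best interpretations as application.lastresult$ requires:
// primarily by confidence score, highest first; ties broken by grammar
// precedence, modeled here as a numeric grammarRank field (lower rank
// means higher precedence) -- an illustrative assumption.
function sortResults(results) {
  return results.slice().sort(function (a, b) {
    if (b.confidence !== a.confidence) return b.confidence - a.confidence;
    return a.grammarRank - b.grammarRank;
  });
}
```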
Any data language available on a VoiceXML 3.0 platform must specify the structure of the underlying data model. For example, with XPath, the variable values (and hence, the constituents of the data model) are XML trees. Such a specification of the data model implicitly defines a set of "legal variable values", namely the objects that can be part of such a data model.
Similarly, any data access and manipulation language available on a VoiceXML 3.0 platform must specify the complete set of valid value expressions via the expression language syntax.
The syntax of the Data Access and Manipulation Module is described in terms of full support for CRUD operations (Create, Read, Update, Delete) on the Data layer in sections 2.3.1 through 2.3.4. The relevance of this syntax for properties is described in section 2.3.5.
The declaration of named variables is done using the <var> element. It can occur in executable content or as a child of <form> or <vxml>.
If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or <catch> element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.
If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase.
If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.
Attributes of <var>
name | The name of the variable that will hold the result. This attribute must not specify a scope-qualified variable (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The default scope in which the variable is defined is determined from the position in the document at which the element is declared. |
---|---|
expr | The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the default value determined by the data access expression language in use if they are not given initial values (for example, with ECMAScript the initial value is undefined). |
scope | The scope within which the named variable must be created (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown. |
The addition of the "scope" attribute in VoiceXML 3.0 adds more flexibility for the creation of variables, and allows creation to be decoupled from document location of the <var> element, if desired by the application.
Children of <var>
The children of the <var> element represent an in-line specification of the value of the variable.
If the "expr" attribute is present, then the element must not have any children. Thus "expr" and in-line children are mutually exclusive for the <var> element.
<var> examples
This section is informative.
<var name="phone" expr="'6305551212'"/>
<var name="y" expr="document.z+1"/>
<var name="foo" scope="application" expr="dialog.bar * 2"/>
<var name="itinerary">
  <root xmlns="">
    <flight>SW123</flight>
    <origin>JFK</origin>
    <depart>2009-01-01T14:32:00</depart>
    <destination>SFO</destination>
    <arrive>2009-01-01T18:14:00</arrive>
  </root>
</var>
The above examples have the following result, in order:
<root xmlns="">
  <flight>SW123</flight>
  <origin>JFK</origin>
  <depart>2009-01-01T14:32:00</depart>
  <destination>SFO</destination>
  <arrive>2009-01-01T18:14:00</arrive>
</root>
Translating to the Data Model Resource API
Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.
The above examples result in the following Data Model Resource API calls, in order:
<root xmlns="">
  <flight>SW123</flight>
  <origin>JFK</origin>
  <depart>2009-01-01T14:32:00</depart>
  <destination>SFO</destination>
  <arrive>2009-01-01T18:14:00</arrive>
</root>
The values of the named variables in the existing scopes in the scope stack are available for introspection and for further computation. These values can be read wherever expressions can be specified in the VoiceXML 3.0 document. Important examples include the "expr" and "cond" attributes on various syntactic elements. The "expr" attribute values are legal expressions as defined by the syntax of the data access and manipulation language (see Section 2.2.7 for details). The "cond" attribute values function as predicates, and in addition to being expressions, must evaluate to a boolean value.
The <value> element is used to insert the value of an expression into a prompt. 6.4 Prompt Module specifies prompts in detail.
Attributes of <value>
expr | The expression to render. See Section 2.2.7 for legal values of expressions. |
---|---|
scope | The scope within which the named variables in the expression are resolved (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown. |
<value> examples
<value expr="application.duration + dialog.duration"/>
<value expr="foo * bar"/>
<value expr="foo + bar + application.baz" scope="document"/>
The above examples render the following, in order:
Translating to the Data Model Resource API
Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.
The above examples result in the following Data Model Resource API calls:
The <assign> element assigns a value to a variable.
It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>. Attempting to assign to an undeclared variable causes an error.semantic event to be thrown.
Note that when an ECMAScript object, say "obj", has been properly initialized then its properties, for instance "obj.prop1", can be assigned without explicit declaration (in fact, an attempt to declare ECMAScript object properties such as "obj.prop1" would result in an error.semantic event being thrown).
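In plain ECMAScript terms, the note above corresponds to the following (shown for illustration only):

```javascript
// The object itself must be declared first (as a <var> would do) ...
var obj = {};
// ... but its properties may then be assigned without any declaration
// (as <assign name="obj.prop1" .../> would do).
obj.prop1 = 42;
```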
Attributes of <assign>
name | The name of the variable being assigned to. The corresponding variable must have been previously declared otherwise an error.semantic event is thrown. By default, the scope in which the variable is resolved is the closest enclosing scope of the currently active element. To remove ambiguity, the variable name may be prefixed with a scope name. |
---|---|
expr | The expression evaluating to the new value of the variable (optional). |
Children of <assign>
The children of the <assign> element represent an in-line specification of the new value of the variable.
If the "expr" attribute is present, then the element must not have any children. Thus "expr" and in-line children are mutually exclusive for the <assign> element.
<assign> examples
This section is informative.
<assign name="phone" expr="'6305551212'"/>
<assign name="y" expr="document.z+1"/>
<assign name="application.foo" expr="dialog.bar * 2"/>
<assign name="itinerary">
  <root xmlns="">
    <flight>SW123</flight>
    <origin>JFK</origin>
    <depart>2009-01-01T14:32:00</depart>
    <destination>SFO</destination>
    <arrive>2009-01-01T18:14:00</arrive>
  </root>
</assign>
The above examples have the following result, in order:
<root xmlns="">
  <flight>SW123</flight>
  <origin>JFK</origin>
  <depart>2009-01-01T14:32:00</depart>
  <destination>SFO</destination>
  <arrive>2009-01-01T18:14:00</arrive>
</root>
Translating to the Data Model Resource API
Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.
The above examples result in the following Data Model Resource API calls, in order:
<root xmlns="">
  <flight>SW123</flight>
  <origin>JFK</origin>
  <depart>2009-01-01T14:32:00</depart>
  <destination>SFO</destination>
  <arrive>2009-01-01T18:14:00</arrive>
</root>
The <data> element allows a VoiceXML application to fetch an in-line specification of a new value for a named variable from a document server without transitioning to a new VoiceXML document. The data fetched is bound to the named variable.
Attributes of <data>
src | The URI specifying the location of the in-line data specification to retrieve (optional). This specification depends on the data language in use for the VoiceXML document (XML, JSON). |
---|---|
name | The name of the variable that the data fetched will be bound to. |
scope | The scope within which the named variable to bind the data is found (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown. |
srcexpr | Like src, except that the URI is dynamically determined by evaluating the given expression when the data needs to be fetched (optional). If srcexpr cannot be evaluated, an error.semantic event is thrown. |
method | The request method: get (the default) or post (optional). |
namelist | The list of variables to submit (optional). By default, no variables are submitted. If a namelist is supplied, it may contain individual variable references which are submitted with the same qualification used in the namelist. Declared VoiceXML variables can be referenced. |
enctype | The media encoding type of the submitted document (optional). The default is application/x-www-form-urlencoded. Interpreters must also support multipart/form-data [RFC2388] and may support additional encoding types. |
fetchaudio | See Section 6.1 of [VXML2] (optional). This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. |
fetchhint | See Section 6.1 of [VXML2] (optional). This defaults to the datafetchhint property described in Section 2.3.3.2.3. |
fetchtimeout | See Section 6.1 of [VXML2] (optional). This defaults to the fetchtimeout property described in Section 6.3.5 of [VXML2]. |
maxage | See Section 6.1 of [VXML2] (optional). This defaults to the datamaxage property described in Section 2.3.3.2.3. |
maxstale | See Section 6.1 of [VXML2] (optional). This defaults to the datamaxstale property described in Section 2.3.3.2.3. |
Exactly one of "src" or "srcexpr" must be specified; otherwise, an error.badfetch event is thrown. If the content cannot be retrieved, the interpreter throws an error as specified for fetch failures in Section 5.2.6 of [VXML2].
If the value of the src or srcexpr attribute includes a fragment identifier, the processing of that fragment identifier is platform-specific.
Platforms should support parsing XML data into a DOM. If an implementation does not support DOM, the name attribute must not be set, and any retrieved content must be ignored by the interpreter. If the name attribute is present, these implementations will throw error.unsupported.data.name.
If the name attribute is present, and the returned document is XML as identified by [RFC3023], the VoiceXML interpreter must expose the retrieved content via a read-only subset of the DOM as specified in Appendix D of [VXML2.1]. An interpreter may support additional data formats by recognizing additional media types. If an interpreter receives a document in a data format that it does not understand, or the data is not well-formed as defined by the specification of that format, the interpreter throws error.badfetch. If the media type of the retrieved content is one of those defined in [RFC3023] but the content is not well-formed XML, the interpreter throws error.badfetch.
If use of the DOM causes an uncaught DOMException to be thrown, the VoiceXML interpreter throws error.semantic.
Before exposing the data in an XML document referenced by the <data> element via the DOM, the interpreter should check that the referring document is allowed to access the data. If access is denied the interpreter must throw error.noauthorization.
Note: One strategy commonly implemented in voice browsers to control access to data is the "access-control" processing instruction described in the WG Note: Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0 [DATA_AUTH].
Like the <var> element, the <data> element can occur in executable content or as a child of <form> or <vxml>. In addition, it shares the same default scoping rules as the <var> element. If a <data> element has the same name as a variable already declared in the same scope, the variable is assigned a reference to the new value exposed by the <data> element.
As with the <submit> element, when variable data is submitted to the server, its value is first converted into a string. If the variable is a DOM Object, it is serialized as the corresponding XML. If the variable is an ECMAScript Object, the mechanism by which it is submitted is not currently defined. If a <data> element's namelist contains a variable which references recorded audio but the element does not specify an enctype of multipart/form-data [RFC2388], the behavior is not specified. Attempting to URL-encode large quantities of data is discouraged.
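For simple scalar variables, the default application/x-www-form-urlencoded submission can be sketched as follows (an illustrative helper of our own; the constrained DOM and ECMAScript Object cases described above are deliberately out of scope here):

```javascript
// Build an application/x-www-form-urlencoded body from a namelist of
// simple scalar variables; each value is converted to a string first.
function encodeNamelist(vars) {
  return Object.keys(vars).map(function (k) {
    return encodeURIComponent(k) + "=" + encodeURIComponent(String(vars[k]));
  }).join("&");
}
```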
<data> example
The example discussed in this section uses XML as the data language and fetches the following XML document using the <data> element:
<?xml version="1.0" encoding="UTF-8"?>
<quote xmlns="http://www.example.org">
  <ticker>F</ticker>
  <name>Ford Motor Company</name>
  <change>0.10</change>
  <last>3.00</last>
</quote>
The above stock quote is retrieved in one dialog, the document element is cached in a variable at document scope and used to playback the quote in another dialog. The data access and manipulation language in the example is XPath 2.0 [XPATH20].
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1"
      xmlns:ex="http://www.example.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
        http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.xsd">
  <var name="quote"/>
  <var name="tickers">
    <tickers xmlns="">
      <ford>f</ford>
      <!-- etc., the dialog below hardcodes ford -->
    </tickers>
  </var>
  <form id="get_quote">
    <block>
      <data name="quote" scope="document"
            srcexpr="'http://www.example.org/getquote?ticker=' + document('tickers')/ford"/>
      <goto next="#play_quote"/>
    </block>
  </form>
  <form id="play_quote">
    <block>
      <var name="name" expr="document('quote')/ex:name"/>
      <var name="change" expr="document('quote')/ex:change"/>
      <var name="last" expr="document('quote')/ex:last"/>
      <var name="dollars" expr="fn:floor(last)"/>
      <var name="cents" expr="fn:substring(last,fn:string-length(last)-1)"/>
      <!-- play the company name -->
      <audio expr="document('tickers')/ford + '.wav'"><value expr="name"/></audio>
      <!-- play 'unchanged', 'up', or 'down' based on zero, positive, or negative change -->
      <if cond="change = 0">
        <audio src="unchanged_at.wav"/>
      <else/>
        <if cond="change > 0">
          <audio src="up.wav"/>
        <else/>
          <!-- negative -->
          <audio src="down.wav"/>
        </if>
        <audio src="by.wav"/>
        <!-- play change in value as a positive number -->
        <audio expr="fn:abs(change) + '.wav'"><value expr="fn:abs(change)"/></audio>
        <audio src="to.wav"/>
      </if>
      <!-- play the current price per share -->
      <audio expr="dollars + '.wav'"><value expr="dollars"/></audio>
      <if cond="cents > 0">
        <audio src="point.wav"/>
        <audio expr="cents + '.wav'"><value expr="cents"/></audio>
      </if>
    </block>
  </form>
</vxml>
Translating to the Data Model Resource API
Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.
The single <data> usage in the above example results in the following behavior and Data Model Resource API calls:
At the time of <data> execution, the variable with name "quote" is updated in the document scope using the in-line specification for the new value retrieved from the URI expression 'http://www.example.org/getquote?ticker=' + document('tickers')/ford, which evaluates to http://www.example.org/getquote?ticker=f
<quote xmlns="http://www.example.org">
  <ticker>F</ticker>
  <name>Ford Motor Company</name>
  <change>0.10</change>
  <last>3.00</last>
</quote>
<data> Fetching Properties
These properties pertain to documents fetched by the <data> element.
| Property | Description |
|---|---|
| datafetchhint | Tells the platform whether or not data documents may be pre-fetched. The value is either prefetch (the default) or safe. |
| datamaxage | Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific. |
| datamaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached data documents. The default is platform-specific. |
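As a non-normative illustration, these fetching properties can be set with the <property> element in the usual way. The values below (a safe fetch hint and a 60-second maximum age) are arbitrary choices for the sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Illustrative only: property values are arbitrary example choices -->
  <property name="datafetchhint" value="safe"/>
  <property name="datamaxage" value="60"/>
  <var name="quote"/>
  <form id="get_quote">
    <block>
      <!-- This fetch honors the datafetchhint/datamaxage settings above -->
      <data name="quote" scope="document"
            src="http://www.example.org/getquote?ticker=f"/>
    </block>
  </form>
</vxml>
```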
The <clear> element resets one or more variables, including form items.
For each specified variable name, the variable is resolved relative to the current scope by default (to remove ambiguity, each variable name in the namelist may be prefixed with a scope name). Once a declared variable has been identified, its value is assigned the default initial value defined by the data access expression language in use (for example, when using ECMAScript, the variables are reset to the undefined value). In addition, if the variable name corresponds to a form item, then the form item's prompt counter and event counter are reset.
Attributes of <clear>
| Attribute | Description |
|---|---|
| namelist | The list of variables to be reset; this can include variable names other than form items. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown. When not specified, all form items in the current form are cleared. |
| scope | The scope within which the named variables must be resolved (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown. |
<clear> examples
This section is informative.
<clear namelist="city state zip"/>
<clear namelist="application.foo dialog.bar baz"/>
<clear namelist="alpha beta application.gamma" scope="document"/>
<clear/>
<clear scope="dialog"/>
The above examples have the following result, in order:
Translating to the Data Model Resource API
Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.
The above examples result in the following Data Model Resource API calls, in order:
ResetVariable()?
While this section uses the DeleteVariable() method, a ResetVariable() method that better aligns with <clear> semantics should be considered for addition to the 5.1.1 Data Model Resource API. This will allow resetting variable values to the initial in-line specification when such is present, for instance.
Resolution:
None recorded.
Platform properties are discussed in detail in 8.2 Properties. VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables as illustrated in section 2.3.2. However, properties cannot be created, updated or deleted using any of the syntax described in this module. The <property> element syntax must be used for such operations.
VoiceXML 3.0 adds some new features to data access and manipulation but does not change any existing behavior. Thus, this module is backwards compatible with VoiceXML 2.1.
Likewise, the VoiceXML 2.1 profile in VoiceXML 3.0 is not required to support any of the new features added in this module. In particular, the following features may be excluded by implementors supporting the VoiceXML 2.1 profile.
| Feature | Description |
|---|---|
| properties$ implicit variable | The VoiceXML 2.1 profile does not require the properties$ implicit variable to be supported in any scope. |
| "scope" attribute | The optional "scope" attribute is not required to be supported in the VoiceXML 2.1 profile for the <var>, <value>, <assign>, <data> and <clear> elements. |
| <var> and <assign> children | The VoiceXML 2.1 profile does not require support for in-line specification of the initial value using children of the <var> element, or of the new value using children of the <assign> element. |
Implicit variables described in section 2.2.3 to qualify scope fit naturally into the syntax of certain data access and manipulation languages (such as ECMAScript), but are less elegant when incorporated into the syntax of others, such as XPath. VoiceXML 3.0 permits the use of functions rather than variables to address this. The following table illustrates how scope qualifiers are exposed as XPath functions.
| Name | Description |
|---|---|
| session() | This single String argument function retrieves the value of the variable named in the argument from the session scope. |
| application() | This single String argument function retrieves the value of the variable named in the argument from the application scope. |
| document() | This single String argument function retrieves the value of the variable named in the argument from the document scope. |
| dialog() | This single String argument function retrieves the value of the variable named in the argument from the dialog scope. |
| anonymous() | This single String argument function retrieves the value of the variable named in the argument from the anonymous scope. |
| properties$ | This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an XML tree with a <properties> root element and multiple children as necessary, where each child element has the name of an existing platform property in that scope and body content corresponding to the value of the platform property. CDATA sections are used if necessary. |
The following table shows how these qualifier functions are used, and the examples are XPath variants of the examples illustrated in Table 53.
| Usage | Result |
|---|---|
| application('hello') | The value of the "hello" named variable in the application scope. |
| dialog('retries') | The value of the "retries" named variable in the dialog scope. |
| dialog('properties$')/bargein | The value of the "bargein" platform property defined at the current "dialog" scope. |
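For instance, in a profile using XPath as the data access language, a prompt might read variables and properties through these qualifier functions. This is a hypothetical sketch: the variable name "username" is assumed to have been declared earlier in the session scope.

```xml
<!-- Hypothetical sketch: assumes a variable named "username" was
     previously declared in the session scope -->
<prompt>
  Welcome back, <value expr="session('username')"/>.
  <!-- Reads the "bargein" platform property at the current dialog scope -->
  Barge-in is set to <value expr="dialog('properties$')/bargein"/>.
</prompt>
```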
This module supports the sending and receiving of external messages by a voice application by introducing the <send> and <receive> elements into VoiceXML. The application developer chooses to send and receive external messages synchronously or asynchronously. When sending a message, the developer chooses whether or not it should represent a named event. The developer also chooses whether or not to include a payload. These choices can be made statically or dynamically at run-time.
Note that this section covers only the receipt of messages that the interpreter itself does not handle; in other words, application-level events. Some events, such as lifecycle events for creating or destroying sessions, are not targeted at the application author but are instead handled by the browser itself. The complete list of these interpreter-level events is TBD but might include events such as "create session", "pause", "resume", or "disconnect".
Although this section handles many of the easy and moderately difficult cases, for certain very complicated cases it may be appropriate to put a gatekeeper filter between the VXML interpreter and the external events, to allow only certain events to interrupt the processing of the VXML document. For example, if an "operator" event should interrupt the VXML document only when its data variable holds a certain value, or if the "operator" event is wanted but not the "caller" event, then a filter might be appropriate. SCXML is one method suitable for providing these types of more advanced filters.
Because external messages can arrive at any time, they can be disruptive to a voice application. A voice application developer decides whether these messages are delivered to the application synchronously or asynchronously using the "externalevents.enable" property. The property can be set to one of the following values:
| Value | Description |
|---|---|
| true | External messages are delivered asynchronously as VoiceXML events. |
| false | External messages are delivered synchronously. This is the default. |
When external messages are delivered synchronously, an application developer decides whether these messages are preserved or discarded by setting the "externalevents.queue" property. The property can be set to one of the following values:
| Value | Description |
|---|---|
| true | External messages are queued. |
| false | An external message that is not delivered as a VoiceXML event is discarded. This is the default. |
If "externalevents.enable" is set to true and an external message arrives, the external message is reflected to the application in the application.lastmessage$ variable. application.lastmessage$ is an ECMAScript object with the following properties:
| Property | Description |
|---|---|
| contenttype | The media type of the external message. |
| event | The event name, if any, or ECMAScript undefined if no event name was included in the external message. |
| content | The content of the message, if any, or ECMAScript undefined. If the Content-Type of the message is one of the media types described in [RFC 3023], the VoiceXML interpreter must expose the retrieved content via a read-only subset of the DOM as described in [VXML21]. An interpreter may support additional data formats by recognizing additional media types. If an interpreter receives an external message with a payload in a data format that it does not understand, or the payload is not well-formed as defined by the specification of that format, the interpreter throws "error.badfetch". |
If no external messages have been received, application.lastmessage$ is ECMAScript undefined. Only the last received message is available. To preserve a message for future reference during the lifetime of the application, the application developer can copy the data to an application-scoped variable.
To receive an external message asynchronously, an application defines an "externalmessage" event handler. The event handler must be declared within the appropriate scope since the user-defined <catch> handler is selected using the algorithm described in section 5.2.4 of [VXML2].
If the payload of an external message includes an event name, the name is appended to the name of the event that is thrown to the application separated by a dot (e.g. "externalmessage.ready"). This allows applications to handle external messages using different event handlers.
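For example, an application could route messages carrying different event names to distinct handlers, falling back to a generic handler placed later in document order. The event names "ready" and "cancel" below are hypothetical:

```xml
<!-- Hypothetical event names for illustration -->
<catch event="externalmessage.ready">
  <log>sender signalled ready</log>
</catch>
<catch event="externalmessage.cancel">
  <log>sender signalled cancel</log>
</catch>
<catch event="externalmessage">
  <!-- placed last so the more specific handlers above are selected
       first within this scope; handles any remaining externalmessage
       events, including unnamed ones -->
  <log>received an external message</log>
</catch>
```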
Asynchronous external messages are processed in the same manner that a disconnect event is handled in VXML2.
Events are dispatched to the application serially. Since the interpreter only reflects the data associated with a single external message at a time, it is the application's responsibility to manage the data associated with each external message once that message has been delivered.
The following example demonstrates asynchronous receipt of an external message. The catch handler copies the reflected external message into an array at application scope.
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.enable" value="true"/>
  <var name="myMessages" expr="new Array()"/>
  <catch event="externalmessage">
    <var name="lm" expr="application.lastmessage$"/>
    <if cond="lm.contenttype == 'text/xml' || lm.contenttype == 'application/xml'">
      <log>received XML with root document element
        <value expr="lm.content.documentElement.nodeName"/>
      </log>
    <elseif cond="typeof lm.content == 'string'"/>
      <log>received <value expr="lm.content"/></log>
    <else/>
      <log>received unknown external message type
        <value expr="typeof lm.content"/>
      </log>
    </if>
    <script>
      myMessages.push({'content' : lm.content, 'ctype' : lm.contenttype});
    </script>
  </catch>
  <form>
    <field name="num" type="digits">
      <prompt>pick a number any number</prompt>
      <catch event="noinput nomatch">
        sorry. didn't get that. <reprompt/>
      </catch>
      <filled>
        you said <value expr="num"/>
        <clear/>
      </filled>
    </field>
  </form>
</vxml>
To receive an external message synchronously set the "externalevents.enable" property to false and the "externalevents.queue" property to true, and use the <receive> element to pull messages off the queue. <receive> blocks until an external message is received or the timeout specified by the maxtime attribute is exceeded.
To support receipt of external messages within a voice application, use the <receive> element. <receive> is allowed wherever executable content is allowed in [VXML21], for example a <block> element.
<receive> supports the following attributes:
| Name | Description | Required | Default |
|---|---|---|---|
| fetchaudio | See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. | No | N/A |
| fetchaudioexpr | An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| maxtime | A W3C time specifier indicating the maximum amount of time the interpreter waits to receive an external message. If the timeout is exceeded, the interpreter throws "error.badfetch". A value of "none" indicates the interpreter blocks indefinitely. | No | 0s |
| maxtimeexpr | An ECMAScript expression evaluating to the maxtime value. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | 0s |
Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.
Only one of maxtime and maxtimeexpr can be specified or "error.badfetch" is thrown.
When present, the attributes fetchaudioexpr and maxtimeexpr are evaluated when the <receive> is executed.
The following example demonstrates synchronously receiving an external message. In this example, the interpreter blocks for up to 15 seconds waiting for an external message to arrive. If no external message is received during that interval, the interpreter throws "error.badfetch". If a message is received, the interpreter proceeds by executing the <log> element.
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.queue" value="true"/>
  <form>
    <catch event="error.badfetch">
      <log>timed out waiting for external message</log>
    </catch>
    <block>
      Hold on ...
      <receive maxtime="15s"
               fetchaudio="http://www.example.com/audio/fetching.wav"/>
      <log>got <value expr="application.lastmessage$.content"/></log>
    </block>
  </form>
</vxml>
To send a message from a VoiceXML application to a remote endpoint, use the <send> element. <send> is allowed within executable content. Implementations must support the following attributes:
| Name | Description | Required | Default |
|---|---|---|---|
| async | A boolean indicating whether the message is sent asynchronously. If false, the interpreter blocks until the final response to the transaction created by sending the external message is received, or until a timeout occurs. | No | true |
| asyncexpr | An ECMAScript expression evaluating to the value of the async attribute. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| body | A string representing the data to be sent in the body of the message. | No | N/A |
| bodyexpr | An ECMAScript expression evaluating to the body of the message to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| contenttype | A string indicating the media type of the body being sent, if any. The set of content types may be limited by the underlying platform. If an unsupported media type is specified, the interpreter throws "error.badfetch.<protocol>.400". The interpreter is not required to inspect the data specified in the body to validate that it conforms to the specified media type. | No | text/plain |
| contenttypeexpr | An ECMAScript expression evaluating to the media type of the body. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| event | The name of the event to send. The value is a string which includes only alphanumeric characters and the "." (dot) character. The first character must be a letter. If the value is invalid, then an "error.badfetch" event is thrown. | No | N/A |
| eventexpr | An ECMAScript expression evaluating to the name of the event to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| fetchaudio | See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. | No | N/A |
| fetchaudioexpr | An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| namelist | A list of zero or more whitespace-separated variable names to send. By default, no variables are submitted. Values for these variables are evaluated when the <send> element is executed. Only declared variables can be referenced; otherwise, "error.semantic" is thrown. Variables must be submitted to the server with the same qualification used in the namelist. When an ECMAScript variable is submitted to the server, its value must first be converted into a string before being sent. If the variable is an ECMAScript object, the mechanism by which it is submitted is platform-specific. Instead of submitting an ECMAScript object directly, the application developer can explicitly submit the individual properties of the object (e.g. "date.month date.year"). | No | N/A |
| target | Specifies the URI to which the event is sent. If the attribute is not specified, the event is sent to the component which invoked the VoiceXML session. | No | Invoking component |
| targetexpr | An ECMAScript expression evaluating to the target URI. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
| timeout | See 6.13.2.1 sendtimeout. This defaults to the sendtimeout property. | No | N/A |
| timeoutexpr | An ECMAScript expression evaluating to the timeout interval for a synchronous <send>. If evaluation of the expression fails, the interpreter throws "error.semantic". | No | N/A |
Only one of async and asyncexpr can be specified or "error.badfetch" is thrown.
Only one of event or eventexpr can be specified or "error.badfetch" is thrown.
Exactly one of body, bodyexpr, namelist, event, or eventexpr must be specified or "error.badfetch" is thrown.
Only one of contenttype and contenttypeexpr can be specified or "error.badfetch" is thrown.
Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.
Only one of target and targetexpr can be specified or "error.badfetch" is thrown.
Only one of timeout and timeoutexpr can be specified or "error.badfetch" is thrown.
When present, the attributes asyncexpr, bodyexpr, contenttypeexpr, eventexpr, fetchaudioexpr, targetexpr, and timeoutexpr are evaluated when the <send> is executed.
If a synchronous <send> succeeds, execution proceeds according to the Form Interpretation Algorithm. If the <send> times out, the interpreter throws "error.badfetch" to the application. If the interpreter encounters an error upon sending the external message, the interpreter throws "error.badfetch.<protocol>.<status_code>" to the application. If no status code is available, the interpreter throws "error.badfetch.<protocol>".
The following example demonstrates the use of <send> synchronously:
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="user_id" type="digits">
      <prompt>please type your five digit i d</prompt>
      <filled>
        <send async="false"
              bodyexpr="'&lt;userinfo&gt;&lt;id&gt;' + user_id + '&lt;/id&gt;&lt;/userinfo&gt;'"
              contenttype="text/xml"/>
        <goto next="mainmenu.vxml"/>
      </filled>
    </field>
  </form>
</vxml>
Upon executing an asynchronous <send>, the interpreter continues execution of the voice application immediately and disregards the disposition of the message that was sent.
The following example demonstrates the use of <send> asynchronously:
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <var name="tasktarget" expr="'http://www.example.com/taskman.pl'"/>
    <var name="taskname" expr="'cc'"/>
    <var name="taskstate"/>
    <block>
      <assign name="taskstate" expr="'start'"/>
      <send async="true" targetexpr="tasktarget" namelist="taskname taskstate"/>
    </block>
    <field name="ccnum"/>
    <field name="expdate"/>
    <block>
      <assign name="taskstate" expr="'end'"/>
      <send async="true" targetexpr="tasktarget" namelist="taskname taskstate"/>
    </block>
  </form>
</vxml>
The sendtimeout property controls the interval to wait for a synchronous <send> to return before an "error.badfetch" event is thrown. The value is a Time Designation as specified in Section 6.5 of [VXML2]. If not specified, the value is derived from the innermost sendtimeout property.
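A minimal sketch of tuning this timeout; the 10-second value and target URI are arbitrary example choices:

```xml
<!-- Illustrative only: timeout value and target URI are example values -->
<property name="sendtimeout" value="10s"/>
<block>
  <send async="false" body="ping" target="http://www.example.com/monitor"/>
  <!-- if no final response arrives within 10 seconds,
       error.badfetch is thrown -->
</block>
```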
The session root module allows a VXML document to persist across a VXML session (i.e., across transitions from one application to another), similar to the way an application root document allows a VXML document to persist across VXML document transitions.
The syntax of the session root module defines the addition of two attributes on the root <vxml> element. These two new attributes are summarized below.
| Name | Type | Description | Required | Default Value |
|---|---|---|---|---|
| session | URI | URI location of the document to be loaded as the session root | No | N/A |
| requiresession | Boolean | Whether encountering a session attribute that conflicts with an already loaded session root is an error that fails the document | No | false |
The session attribute is an optional attribute on the <vxml> tag. It is a URI reference, just like the application attribute (with the same URI resolution). If a VXML session has not yet encountered a document with a session root, then upon encountering the first VXML document that has a session attribute, the session root document is loaded and parsed just as a normal VXML document would load and parse an application root document.

If a VXML session has already loaded a different session root, then the behavior when a subsequent session attribute is encountered is controlled by the requiresession attribute. If requiresession is true, then encountering a session attribute with a different URI than the already loaded session root is an error, and an error.badfetch is generated. If requiresession is false, then the new session attribute is ignored and the old one is used. The requiresession attribute defaults to false if not present.

The behavior of the session root is the same as the behavior of the application root, except that while executing in the session root the VXML browser is allowed to write to the ECMAScript session scope, and variables declared as children of the <vxml> tag thus become session scope variables. In particular, in VXML 2.0 section 5.1.2, where the variable scopes are discussed, the text for application in table 40 is also appropriate for session (new text: "These are declared with <var> and <script> elements that are children of the session root document's <vxml> element. They are initialized when the session root document is loaded. They exist while the session document is loaded, and are visible to the session root document, the application root document, and any other loaded application leaf document.").
The session root document is then loaded and active in the hierarchy of documents that follows the ECMAScript scope chaining (that is, a leaf document is below the application root, which is below the session root). This means that if a variable is declared in the session root and then again in some local form in the leaf document, the variable is shadowed (just as it would be by a declaration in the application root).
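A hypothetical sketch of this shadowing, assuming a session root at session.vxml that declares a variable named greeting:

```xml
<!-- session.vxml (session root): declares a session-scope variable -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <var name="greeting" expr="'hello from the session root'"/>
</vxml>

<!-- leaf.vxml: a more local declaration shadows the session one -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml"
      session="session.vxml">
  <form>
    <var name="greeting" expr="'hello from the form'"/>
    <block>
      <!-- the dialog-scope greeting shadows the session-scope one -->
      <prompt><value expr="greeting"/></prompt>
    </block>
  </form>
</vxml>
```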
This also implies that the catch selection algorithm described in VXML 2.0 section 5.2.4 would have to change to include the session root document as a potential source of catch handlers (new text: "Form an ordered list of catches consisting of all catches in the current scope and all enclosing scopes (form item, form, document, application root document, session root document, interpreter context), ordered first by scope (starting with the current scope), and then within each scope by document order."). All catch handling would otherwise remain the same; in particular, the as-if-by-copy semantics are retained, so if an event from a leaf document is handled by a catch handler from the session root, the catch handler would not execute within the context of the session root document but would instead execute as if copied into the local leaf document context.
This also implies that property lookup from section 6.3 of VXML 2.0 would have to change to say that property value lookup can also go to the session root if a more local value for the property is not found (new text: "Properties may be defined for the whole session, for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item."). This does not change the usual way properties work, where a property at a lower level overrides one at a higher level.
This also implies that document-level links in session root documents are active, which would be a change to section 2.5 of VXML 2.0 (new text: "If an application root document has a document-level link, its grammars are active no matter what document of the application is being executed. If a session root document has a document-level link, its grammars are active no matter what document of the session is being executed. If execution is in a modal form item, then link grammars at the form, document, application or session level are not active.").
As with links, the scoping of grammars from section 3.1.3 of VXML 2.0 would be changed to specify what happens when a grammar from a session root has document scope (new text: "Form grammars are by default given dialog scope, so that they are active only when the user is in the form. If they are given scope document, they are active whenever the user is in the document. If they are given scope document and the document is the application root document, then they are also active whenever the user is in another loaded document in the same application. If they are given scope document and the document is the session root document, then they are also active throughout the session."). Note that grammars active throughout the session can still be trumped by modal listen states (just as those from the application root can). Section 3.1.4 of VXML 2.0 also changes: the bulleted list of activated grammars would include the session root (new text: "grammars contained in links in its application root document or session root document, and grammars for menus and forms in its application root document or session root document which are given document scope.").
For the sake of compactness assume throughout this example that the single letters used are actually fully qualified URIs. A VXML document "A" transitions to VXML document "B" which is partially represented below:
<vxml session="C" application="D" … >
Before "B" can finish its initialization, it loads, parses, and initializes the VXML documents at both "C" and "D". While executing in "B", any grammars, properties, links, and variables included from either "C" or "D" influence execution. Document "B" then transitions to document "E", with no session attribute, partially represented below:
<vxml application="D" … >
While executing in "E", having come from "B", everything from both "C" and "D" is still active: "D" because we have not yet left the application, and "C" because we are part of the same session. Document "E" now transitions to document "F", partially represented below:
<vxml application="G" … >
Now, since we have changed applications, the application root document "D" is unloaded, and grammars, variables, properties, etc. from "D" no longer influence our execution. Document "G" defines our application root and it, along with "C" (still active since we are in the same session), now influences our execution. Document "F" now transitions to "H", partially represented below:
<vxml session="I" … >
Now, since "C" is already defined as our session root document, we cannot load document "I" and treat it as our session root. In the absence of requiresession, "I" is ignored and "H" is executed using "C" as our session document. If instead "H" looked as below:
<vxml session="I" requiresession="true" … >
then "H" would fail to load and execution would revert to document "F", where the appropriate error.badfetch for "H" would be thrown.
Run time controls (rtcs) are represented by voice or DTMF grammars that are always active, even when the interpreter is not waiting for user input (e.g., when transitioning between documents). When the grammar representing the rtc is matched, the action specified by the rtc is taken. When an rtc grammar completes recognition, it is immediately restarted, whether or not it matched the input. Other grammars, including standard recognition grammars, may be active at the same time as an rtc.
| Attribute | Description |
|---|---|
| grammar | The URI of the grammar defining the rtc. |
| priority | The priority of the rtc relative to other rtc grammars. |
| action | The action to take when the grammar matches. |
| params | One or more parameters that modify the action. |
| leadingsilence | Indicates the amount of silence that must precede the utterance. |
| trailingsilence | Indicates the amount of silence that must follow the utterance. |
Possible values for 'action' and 'params' are as follows:
<cancelrtc> can be used to cancel an rtc defined by the <rtc> element. Note that rtcs are identified by their grammars.
Both <rtc> and <cancelrtc> are scoped to the nearest enclosing control element (<item>, <form>, ...). If not within a control element, they are scoped to the document they appear in. A given rtc may be defined multiple times within an application. At any point during execution, the most narrowly scoped <rtc> or <cancelrtc> element is in effect. If an active rtc is turned off by a <cancelrtc> tag, it is reactivated when the interpreter leaves the scope of the <cancelrtc> tag (unless it comes within the scope of another <cancelrtc> tag). For example, if an <rtc> tag is scoped to a <form> and a <cancelrtc> tag is scoped to a field within the <form>, the rtc is active while the form is executing, except while the interpreter is in the field in question. Application authors may thus use the <rtc> and <cancelrtc> tags along with the scoping rules to exercise fine-grained control over the activity of rtcs.
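Since the <rtc> and <cancelrtc> syntax is only stubbed out in this draft, the following sketch merely illustrates the scoping rule above, inferring attribute usage from the attribute table; the grammar URI, action value, and params value are hypothetical:

```xml
<!-- Hypothetical sketch based on the attribute table above -->
<form id="newsreader">
  <!-- rtc active throughout the form: "louder" raises playback volume -->
  <rtc grammar="louder.grxml" action="volume" params="+10"/>
  <field name="story" type="digits">
    <prompt>which story number?</prompt>
    <!-- rtc suspended while the interpreter is in this field;
         it reactivates when the field's scope is left -->
    <cancelrtc grammar="louder.grxml"/>
  </field>
</form>
```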
Logically, rtc grammars behave as a prefilter on the speech stream, replacing any input they match with silence. Since rtc grammars operate upstream of normal recognition grammars, the recognition grammars never see input that matched an rtc grammar. Thus input that matches an rtc grammar will not trigger barge-in, since barge-in is triggered only by input matching a normal recognition grammar. Because rtc grammars apply upstream of recognition grammars, all rtc grammars have, in effect, a higher priority than any recognition grammar; the priority attribute on an rtc grammar therefore affects its priority only relative to other rtc grammars.
When processing speech or type-ahead input, rtcs again apply upstream of normal speech grammars. Remember also that rtcs may be active even when the system is not in the wait state, so they may match the input while it is being entered and other grammars are not active.
The platform executes the 'volume', 'speed' and 'skip' actions immediately, as soon as the rtc grammar matches. The platform also executes the 'cancel' and 'goto' actions automatically, but only once the interpreter is in an event-processing state. Thus the platform may complete the processing of non-interruptible tasks before it processes the 'cancel' or 'goto' actions.
Speaker biometric resources supported in VoiceXML 3.0 provide three types of functions:
The following figure shows an overview of the flow of information in SIV processing:
The SIV engine computes a match based on one or more utterances from the user, a voicemodel or reference voice model, thresholds and other configuration parameters. The results are presented to VoiceXML in EMMA 1.0 format.
Verification is the process by which a user's utterances are matched against a pre-computed reference voice model. The next figure details this process.
Verification decisions are based upon a wide variety of criteria, and applications must choose how to evaluate trade-offs between application-specific factors:
[Figure: possible verification decisions are Good Enough, More Data Needed, and Abort]
The core functions of SIV processing are divided into creating a voice model and enrolling a user in the system (Enrollment) and using a pre-existing voice model (either Verification or Identification). Additionally, some engines support adaptation of an existing voice model based on new user utterances. An SIV dialog may consist of a sequence of one or more SIV dialog turns.
The SIV resource is defined in terms of a data model and state model. The data model is composed of the following elements:
An SIV Resource manages SIV devices during a processing cycle:
[Schema definition TBD]
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
mode | enroll, verify, identify | Defines the mode of SIV function | Yes | None. If no mode is provided, throw error.badfetch |
type | Text-independent, Text-dependent, [other] | Type of speech technology used by the SIV engine | No | None [Should V3 support "other" types?] |
identity | URI | Claim of identity passed to the SIV engine to select a voice model. | Yes when mode="verify"; must not be supplied for mode="identify". If mode="enroll" and an identity URI is provided, a voice model will be created at the URI specified; otherwise, the form item variable will contain the URI of the created voice model. | If mode="verify" and no identity is supplied, throw error.badfetch. |
fetchhint | One of the values "safe" or "prefetch" | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. | No | None |
fetchtimeout | Time Designation | The interval to wait for the content to be returned before throwing an error.badfetch event. | No | None |
maxage | An unsigned integer | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. | No | None |
maxstale | An unsigned integer | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. | No | None |
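The attributes above can be combined as in the following sketch. The element name <siv>, the identity URI, and the attribute values are illustrative assumptions, since the draft has not yet fixed the SIV element syntax.

```xml
<!-- Hypothetical verification request: element name and values are
     illustrative, not normative -->
<siv mode="verify"
     identity="http://example.com/voicemodels/jdoe"
     type="Text-independent"
     fetchhint="prefetch"
     fetchtimeout="10s"
     maxage="3600"/>
```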
The voicemodel RC is the primary RC for the <voicemodel> element.
The voicemodel RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The voicemodel RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
[NB: The execution model will follow the process defined for the grammar RC.]
The voicemodel RC is defined to receive the following events:
Event | Source | Payload | Description |
---|---|---|---|
initialize | any | controller(M) | Causes the element and its children to be initialized |
execute | controller | | Causes the voice model to be added to the appropriate SIV Resource |
and the events it sends:
Event | Target | Payload | Description |
---|---|---|---|
initialized | controller | | Response to initialize event indicating that it has been successfully initialized |
executed | controller | | Response to execute event indicating that it has been successfully executed |
The external events sent and received by the voicemodel RC are those defined in this table:
Event | Source | Target | Description |
---|---|---|---|
addVoiceModel | voicemodelRC | SIV Resource | Activates voicemodel |
createVoiceModel | voicemodelRC | SIV Resource | Creates a new voicemodel |
adaptVoiceModel | voicemodelRC | SIV Resource | Adapts the reference voice model based on the current dialog |
prefetch | voicemodelRC | SIV Resource | Requests that the voicemodel be fetched in advance, if possible |
The events in this table may be raised during initialization and execution of the <voicemodel> element.
Event | Description | State |
---|---|---|
error.semantic | Indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
TBD
This module defines the syntactic and semantic features of a <disconnect> element. The <disconnect> element causes the interpreter context to disconnect from the user. It provides the interpreter context a way to enter into final processing state by throwing the "connection.disconnect.hangup" event. [See 5.5 Connection Resource].
Processing the <disconnect> element also causes the interpreter context to flush the prompt queue before disconnecting the interpreter context from the user and subsequently throwing the "connection.disconnect.hangup" event.
The attributes and content model of <disconnect> are specified in 6.18.1 Syntax. Its semantics are specified in 6.18.2 Semantics.
[See XXX for schema definitions].
The <disconnect> element has the attributes specified in Table 78.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
namelist | List of variable names | Variable names to be returned to the interpreter context. The precise mechanism by which these variables are made available to the interpreter context is platform specific. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown. | No | The default is to return no variables; this means that the interpreter context will receive an empty ECMAScript object. |
The disconnect RC is the primary RC for the <disconnect> element.
The disconnect RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The disconnect RC's state model consists of the following states: Idle, Initializing, Ready, Executing, and Disconnecting. The initial state is the Idle state.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the disconnect RC essentially does nothing. The RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, when the disconnect RC receives an 'execute' event it sends an 'execute' event to the Play RC (6.19 Play Module), causing any queued prompts to be played. Note that the event data passed to Play RC must have:
This RC transitions to the Executing state after sending the event request to the Play RC.
In the Executing state, when the disconnect RC receives the "playDone" event, it instructs the connection resource to disconnect the interpreter context from the user and enters into the "Disconnecting" state.
In the "Disconnecting" state, when the disconnect RC receives the "userDisconnected" event, it throws the "connection.disconnect.hangup" event to its parent, sends an 'executed' event to the controller, and returns to the Ready state.
Editorial note | |
Play RC is not yet defined. |
The Disconnect RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | Causes the element to be initialized |
execute | controller | | Causes prompts in the prompt queue to be played |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | Response to initialize event indicating that it has been successfully initialized |
executed | controller | | Response to execute event indicating that it has been successfully executed |
error.semantic | controller | | Response to an undeclared variable in namelist |
disconnectUser | Connection Resource | | Instructs interpreter context to disconnect the user |
connection.disconnect.hangup | parent | | Throw this event to the parent element |
The following table shows the events sent and received by the Disconnect RC to Resources and other RCs which define the events.
Event | Source | Target | Description |
execute | Disconnect RC | Play RC |
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="namelist"/>
    <data id="bargein"/>
    <data id="activeGrammars"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$bargein" expr="false"/>
        <assign location="$activeGrammars" expr="false"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <send event="initializing.done"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="Initializing">
      <transition event="initializing.done" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state> <!-- end Initializing -->
    <state id="Ready">
      <transition event="execute" target="Executing">
        <send target="Play" event="execute"/>
      </transition>
    </state> <!-- end Ready -->
    <state id="Executing">
      <transition event="playDone" target="Disconnecting">
        <!-- Currently no such event as "Play.done" -->
        <send target="controller" event="disconnectUser"/>
      </transition>
    </state> <!-- end Executing -->
    <state id="Disconnecting">
      <transition event="userDisconnected" target="Ready">
        <!-- A call to disconnect the interpreter context from the user -->
        <send target="parent" event="connection.disconnect.hangup"/>
        <send target="controller" event="executed"/>
      </transition>
    </state> <!-- end Disconnecting -->
  </state> <!-- end Created -->
</scxml>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <catch event="connection.disconnect.hangup">
    <goto next="finalprocessing.jsp"/>
  </catch>
  <form>
    <field name="phoneNum" type="digits">
      <prompt>Due to some technical difficulties, we cannot process your
        request now. Please enter your phone number so that we can get
        back to you later.</prompt>
      <catch event="noinput">
        Disconnecting.
        <disconnect/>
      </catch>
      <filled>
        <disconnect namelist="phoneNum"/>
      </filled>
    </field>
  </form>
</vxml>
This module defines the semantic features of a Play capability. Note that there is no XML syntax associated with this RC. It is only used by the Disconnect module (6.18 Disconnect Module).
The Play RC coordinates media output with the PromptQueue Resource (5.2 Prompt Queue Resource).
The PlayRC coordinates media output with the PromptQueue Resource for scenarios when only prompts need to be played to completion without doing any recognition.
This RC activates prompt queue playback. The Play RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The Play RC's state model consists of the following states: Idle, StartPlaying. The initial state is the Idle state.
While in the Idle state, the Play RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller. The RC then transitions to the StartPlaying state.
In the StartPlaying state, the RC sends a "play" event to the PromptQueue resource. The PromptQueue returns a "prompt.done" event when all the prompts in the queue have finished playing. The Play RC then sends a "playDone" event to its controller and exits the RC.
The Play RC is defined to receive the following events:
Event | Source | Payload | Description |
and the events it sends:
Event | Target | Payload | Description |
Table 84 shows the events sent and received by the Play RC to resources and other RCs which define the events. [Table contents TBD]
Event | Source | Target | Description |
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="bargein"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$bargein" expr="false"/>
      </onentry>
      <transition event="execute" target="StartPlaying">
        <assign location="$controller" expr="_eventData/controller"/>
        <send target="PromptQueue" event="play"
              namelist="/datamodel/data[@name='bargein']"/>
      </transition>
    </state> <!-- end Idle -->
    <state id="StartPlaying">
      <transition event="prompt.done" target="Idle">
        <send target="controller" event="playDone"/>
      </transition>
    </state> <!-- end StartPlaying -->
  </state> <!-- end Created -->
</scxml>
This module defines the syntactic and semantic features of a <record> element which collects and stores a recording from the user.
The attributes and content model of <record> are specified in 6.20.1 Syntax. Its semantics are specified in 6.20.2 Semantics.
[See XXX for schema definitions].
The <record> element has the attributes specified in Table 85.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
name | The name must conform to the variable naming conventions in (TODO). | The form item variable in the dialog scope that will hold the recording. The name must be unique among form items in the form. If the name is not unique, then an error.badfetch is thrown when the document is fetched. Note that how this variable is implemented may vary between platforms (although all platforms must support its behavior in <audio> and <submit> as described in this specification). | Yes | |
expr | TBD | The initial value of the form item variable. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. | No | ECMAScript undefined |
cond | data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the form item to be visited. | No | true |
modal | boolean | If this is true, all non-local speech and DTMF grammars are not active while making the recording. If this is false, non-local speech and DTMF grammars are active. | No | true |
beep | boolean | If true, a tone is emitted just prior to recording. | No | false |
maxtime | Time Designation | The maximum duration to record. | No | Platform-specific value |
finalsilence | Time Designation | The interval of silence that indicates end of speech. | No | Platform-specific value |
dtmfterm | boolean | If true, any DTMF keypress will be treated as a match of an active (anonymous) local DTMF grammar. | No | true |
type | Required audio file formats specified in <xref>Appendix E of VoiceXML 2</xref> (Other formats may also be supported) | The media format of the resulting recording. | No | Platform-specific (one of the required formats) |
The <record> element content model is the same as any other form input item and consists of:
The <record> module also sets the shadow variables covered in Table 86 in the data model after the recording has been made.
Variable | Description |
---|---|
name$.duration | The duration of the recording in milliseconds. |
name$.size | The size of the recording in bytes. |
name$.termchar | If the dtmfterm attribute is true, and the user terminates the recording by pressing a DTMF key, then this shadow variable is the key pressed (e.g. "#"). Otherwise it is undefined. |
name$.maxtime | Boolean, true if the recording was terminated because the maxtime duration was reached. |
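The attributes and shadow variables above can be combined as in the following VoiceXML 2.x-style fragment. The submit URI is illustrative.

```xml
<form>
  <record name="msg" beep="true" maxtime="10s"
          finalsilence="2s" dtmfterm="true" type="audio/basic">
    <prompt>At the tone, please leave your message.</prompt>
    <filled>
      <!-- Shadow variables are available once the recording is made -->
      <prompt>
        Your message lasted <value expr="msg$.duration"/> milliseconds.
      </prompt>
      <submit next="save_message.jsp" namelist="msg" method="post"
              enctype="multipart/form-data"/>
    </filled>
  </record>
</form>
```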
The semantics of the record module are defined using the following resource controllers: RecordInputItem (6.20.2.1 RecordInputItem RC), PlayandRecognize (6.10.2.2 PlayandRecognize RC), Record (6.20.2.2 Record RC).
The Record Input Item Resource Controller is the primary RC for the record element.
The record input item RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The RecordInputItem RC is defined to receive the following events:
[Table contents TBD]
Event | Source | Payload | Description | |
Event | Target | Payload | Description |
The Record RC coordinates media with the Record resource and a form item. Other functionality that may be related or simultaneous, such as prompt playing or recognition, is expected to be handled by the form item's RC. The Record RC only handles the recording.
The record RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The Record RC is defined to receive the following events:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
execute | any | controller (M), recordParams (O) | | |
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
recordDone | controller | RecordResults (M) | One-of: recordDone, noinput | |
noinput | controller | | One-of: recordDone, noinput | |
Table 92 shows the events sent by the record RC to various resources.
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
start | Timer | timeout (M) | ||
cancel | Timer | |||
start | Record Resource | recordParams (M) | | |
stop | Record Resource | maxtime (M) | | This causes the recording to stop and return the recording information, setting a Boolean indicating whether the recording was stopped because maxtime was reached |
kill | Record Resource | | | This immediately stops the record resource, throwing away any recording |
assign | Datamodel Resource | name (M), value (M) | | |
Table 93 shows the events received by the record RC. Their definition is provided by the sending component.
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
TimerExpired | Timer | |||
Started | Record Resource | | | This event means a recording has begun. This may be automatic as soon as start was sent to the resource, or there may be an optimization where this doesn't get sent until after the record resource detects speech. |
The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc. For a more complete list of individual property values see 8.2 Properties.
Properties may be defined for the session, for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the session root document provide default values for properties throughout the session; properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.
[See XXX for schema definitions].
The <property> element has the attributes specified in Table 94.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
name | The name of the property | Yes | ||
value | A static value of the property | No | (Required if "expr" not present) | |
expr | data model expression | A data model expression to be evaluated when the property in question is used. | No | (Required if "value" not present) |
The value and expr attributes are mutually exclusive. A <property> element may have only one of value and expr defined.
Whenever a particular property needs to be used, its value must be looked up. The mechanism for looking up the value of a property depends on both the syntactic structure of the authored document and the current location in execution. When a property needs to be used depends on the property in question. For instance, the speedvsaccuracy recognizer property needs to have its value looked up whenever a recognition is to take place (e.g., in a <field> element), but it does not need to be looked up at other times, such as when a generic prompt is queued in a block or when a script file is fetched. In contrast, a property like bargein needs to be looked up whenever a prompt is queued in a block, and the scripttimeout property needs to be looked up whenever a script file is loaded. For a more complete list of individual properties and when they are evaluated, see section 8 on the environment.
When a property is looked up, it must first be determined whether the property is defined at the current execution level. If it is, then the last instance of the property element in document order is the one that is evaluated and used. If no property elements for this property exist at the current level, then the next higher level XML container is checked, continuing up through the application root document and the session root document. The search stops as soon as the property is found. The chain of parents followed during a lookup can span a number of different document elements: the current level (e.g., a property defined in the field), the dialog level (the form), the document level (the vxml element), the application level (the application root), and the session level (the session root). If the property isn't found at any of these levels, then either a default defined by this specification or a platform default is used (see the platform root). For instance, the speedvsaccuracy default value is defined to be 0.5 by this specification. The completetimeout property, however, has no specification default and instead must use a platform default if it is not present.
If the platform has trouble evaluating the value of a property (e.g., an expr failure) or the value of a property is invalid (e.g., a completetimeout of "orange"), then the platform should throw error.semantic.
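The scoping and lookup rules above can be illustrated with a sketch. The property values and the session variable in the expr are illustrative assumptions.

```xml
<vxml version="3.0">
  <!-- Document-level default: applies to every dialog below -->
  <property name="timeout" value="5s"/>
  <form id="main">
    <!-- Dialog-level override: wins over the document-level value -->
    <property name="timeout" value="3s"/>
    <field name="amount" type="currency">
      <!-- Form-item level: the expr is evaluated when this field's
           recognition takes place, overriding both enclosing levels
           (session.slowNetwork is a hypothetical variable) -->
      <property name="timeout" expr="session.slowNetwork ? '10s' : '2s'"/>
      <prompt>How much would you like to transfer?</prompt>
    </field>
  </form>
</vxml>
```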
This module defines the syntactic and semantic features of a <controller> element. Transition controllers provide the basis for managing flow in VoiceXML applications. Resource controllers for some elements like <form> have associated transition controllers which influence how form items get selected and executed. In addition to form, there is a transition controller for each of the following higher VoiceXML scopes:
The above transition controllers influence the selection of the first form item resource controller to execute, and subsequent ones through the session.
The attributes and content model of <controller> are specified in 6.22.1 Syntax. Its semantics are specified in 6.22.2 Semantics.
Transition controllers serve as dialog managers for VoiceXML forms and documents and are described syntactically using the <controller> element. A VoiceXML 3.0 <form> element or <vxml> element may have at most one <controller> child, which allows mixed content.
[See XXX for schema definitions].
The <controller> element has the attributes specified in Table 95.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
id | xml:id | This is a document-unique identifier for the <controller> element. This may be used to register listeners against or dispatch events to transition controllers in the future. | No | |
type | MIME type | This is the MIME type of the transition controller description. The default transition controller description type is SCXML. The types supported depend on the VoiceXML 3.0 profile in use. | No | application/xml+scxml |
src | URI | The URI specifying the location of the controller document, if it is external. | No | |
srcexpr | data model expression | Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. | No |
The controller RC is the primary RC for the <controller> element.
Whenever a form item or form or document finishes execution, the relevant transition controller is consulted for selecting the subsequent one for execution. To find the relevant transition controller, begin at the current resource controller and navigate along the associated VoiceXML element's parent axis until you reach a resource controller with an associated transition controller. For example, for a form item, the parent form's resource controller has the relevant transition controller.
When a transition controller runs to completion, control is returned to the next higher transition controller along with any results that need to be passed up. For example, when the last form item in a form is filled, the transition controller associated with the form returns control to the document level transition controller along with the results for the filled form. Control may be returned to the parent transition controller in case of such run to completion semantics as well as error semantics.
EXAMPLE 1: If the transition controller is described using an XML type or vocabulary, the description is a direct child of the <controller> element.
<v3:vxml version="3.0" ...>
  <!-- document level transition controller, type defaults to SCXML -->
  <v3:controller id="controller" type="application/xml+scxml">
    <scxml:scxml version="1.0" ...>
      <!-- Transition controller as a state chart -->
    </scxml:scxml>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>
EXAMPLE 2: If the transition controller is described using a non-XML type or vocabulary, the description is the text content of the <controller> element. A CDATA section may be used if needed.
<v3:vxml version="3.0" ...>
  <!-- document level transition controller -->
  <v3:controller id="controller" type="text/some-controller-notation">
    <![CDATA[
      // Some text-based transition controller description
    ]]>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>
EXAMPLE 3: [Support for this convenience syntax not yet decided -- here's the tentative text: If the transition controller is described using SCXML, then a convenience syntax of placing the <scxml> root element as a direct child of the <vxml> or <form> element is supported without the need of a <controller> wrapper. Thereby, the following two variants are equivalent:]
Variant 1:

<v3:vxml version="3.0" ...>
  <v3:controller>
    <scxml:scxml version="1.0" ...>
      <!-- Transition controller as a state chart -->
    </scxml:scxml>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>

Variant 2:

<v3:vxml version="3.0" ...>
  <scxml:scxml version="1.0" ...>
    <!-- Transition controller as a state chart -->
  </scxml:scxml>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>
VoiceXML 3.0, like SMIL, is a specification that contains a variety of functional modules. Not all implementers of VoiceXML will be interested in implementing all of the functionality defined in the document. For example, an implementer may have no interest in speech or DTMF recognition but still be interested in speech output. An example might be an implementer of book reading products for the visually impaired. Also, the syntax defined for each of the VoiceXML 3.0 modules is fairly low-level, and authors familiar with a more declarative language may wish to have higher-level syntax that is easier to program in.
To address these interests while maintaining sufficiently precise behavior definition to enhance portability, we encourage the use of profiles.
A profile is an implementation of VoiceXML 3 that
This specification defines the following profiles:
It should be possible for other profiles to be created, perhaps by modifying an existing profile, combining different modules, or even adding new module functionality and syntax.
Implementers may differ in their choice of which profiles they implement. Implementers must support a designated set of modules in order to claim support for VoiceXML 3. That designated set is TBD.
[ISSUES:
The Legacy profile is included to demonstrate how profiles are defined in VoiceXML 3.0. Using existing elements from the [VOICEXML21] specification is helpful because the semantics of these elements are already well defined and well understood. Thus any changes in how they are presented result from the module and profile style of VoiceXML 3.0 and from making the precise detailed semantics more explicit and formal.
The Legacy profile also plays a transitional role as VoiceXML 3.0 as a whole is built on top of VoiceXML 2.1. VoiceXML 3.0 is a superset of VoiceXML 2.1 and includes the traditional 2.1 functionality plus some new modules. The Legacy profile is the set of modules that were always present in VoiceXML 2.1 but that weren't expressed in the specification as individual modules. This also allows a clear path for the VoiceXML application developer as the application developer will not need to learn substantial new syntax or semantics when they develop in the Legacy profile of VoiceXML 3.0.
The Legacy profile also serves as a proof of concept, demonstrating that the new modular profile method of describing the specification is in no way limiting. VoiceXML 3.0 in its entirety is neither limited nor constrained by the use of profiles, modules, and formalized semantic models: anything that was standardized in VoiceXML 2.1 can be standardized in this new format, and the Legacy profile shows that.
This profile uses the prompt module (6.4 Prompt Module) extended with the Builtin SSML module (6.5 Builtin SSML Module) and (6.8 Foreach Module).
This section defines semantics of how different modules coexist with each other to simulate the behavior of VoiceXML 2.1. It outlines all the required modules and any additions/deletions from each of these modules to make it conform to this profile. It also talks about the interaction amongst various modules so that behavior similar to that in VoiceXML 2.1 is achieved.
To conform with this profile, processors must implement the following modules:
Editorial note | |
The following content is missing from the Vxml 3.0 specification and needs to be defined:
|
Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.
The basic profile provides basic media capabilities including the capture and presentation of temporal media such as audio and video. The basic profile serves two different purposes:
The basic profile is a "lean and mean" collection of media functions which the application can invoke, but does not duplicate functions from the scripting or programming languages.
The basic profile is intended for single-turn prompt and collect applications.
The following enumerates the modules included in the Basic Profile.
The SIV Module (See 6.16 SIV Module) includes verification, identification, and enrollment functions.
The Builtin-SSML Module (6.5 Builtin SSML Module), the Media Module (6.6 Media Module), and the Parseq Module (6.7 Parseq Module) provide functions for presenting information to the user.
Capture functions include
The Basic Profile also includes the Data Access and Manipulation module (6.12 Data Access and Manipulation Module) for accessing local variables, parameters, returned values, etc. This module is not intended to access external databases.
Results are returned using EMMA notation, which may include confidence scores and multiple results represented in a lattice.
Editorial note | |
How does the host scripting or programming language access returned results? TBD: define the amount of ECMAScript and data access capability. |
The basic profile does NOT support the capture of key strokes, mouse movements, or joystick movements and does not present static information such as text and bit map graphics on a display. Other languages such as HTML, SVG, and XHTML can be used to perform these functions.
The basic profile does NOT support higher-level flow control constructs such as the VoiceXML 2.0 <form> element and the associated Form Interpretation Algorithm. The Basic Profile assumes that application developers specify flow control using a scripting or programming language, which controls all application-specific procedural flow and interaction with non-VoiceXML functions such as databases, application functions, and Web services.
The following VoiceXML 2.0/2.1 control elements are NOT included in the Basic Profile:
The maximal server profile represents the closure over the VoiceXML 3.0 platform feature set and is intended for applications providing feature-rich voice user interfaces. This profile provides data access and manipulation capabilities, full media capabilities, higher-level flow control constructs such as <form> and <field>, and full support for environment properties. We believe that control flow capabilities such as those provided by SCXML and CCXML are necessary to take full advantage of the features in the maximal server profile.
Specifically, the maximal server profile provides support for [... enumerate all modules in Section 6 of WD2 draft ...]
[Issue: Should support for SCXML be required when implementing the profile? Should support for CCXML be required when implementing the profile? What are the interoperability ramifications of requiring one, multiple, or no flow control languages for support of this profile?]
At a minimum, the enhanced profile should include the modules not included in the basic and legacy profiles, specifically SIV and receive. The enhanced profile may include modules which are included in the basic and legacy profiles. The enhanced profile should demonstrate the new capabilities of VoiceXML 3.0, including prompt control and alternative flow control strategies.
Profiles can provide convenience syntax to simplify authoring for that profile without decreasing portability. Convenience Syntax, as we define it here, can be implemented via a straightforward text mapping from the convenience syntax to profile code that uses only the syntax defined by the modules in the profile. Convenience syntax cannot add functionality. It only makes existing functionality easier to code.
A Convenience syntax definition must include
The existence and definition of the mapping above means that an author can write VoiceXML applications using the (presumably simpler) convenience syntax, while being assured that the code will execute *as if* the code had been replaced by the (presumably more complex but well-defined) module syntax. This allows authors to code simple cases in the convenience syntax, and make use of other VoiceXML syntax elements and attributes only as needed.
E Convenience Syntax in VoiceXML 2.x shows how the VoiceXML 2.1 <menu> and pre-defined catch handlers could be coded using other V2 notation (i.e., as convenience syntax).
[Examples TBD.]
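As an informal illustration of the kind of mapping involved (a sketch only; the grammar URI and expansion details are hypothetical, and the normative mapping belongs to the appendix material), a VoiceXML 2.x <menu> can be viewed as convenience syntax for a <form> containing a single <field> whose grammar covers the choice phrases:

```xml
<!-- Convenience syntax -->
<menu>
  <prompt> Say sports or weather. </prompt>
  <choice next="sports.vxml"> sports </choice>
  <choice next="weather.vxml"> weather </choice>
</menu>

<!-- One possible expansion into module syntax (illustrative only) -->
<form>
  <field name="menuchoice">
    <prompt> Say sports or weather. </prompt>
    <!-- choices.grxml is a hypothetical grammar covering "sports" and "weather" -->
    <grammar src="choices.grxml" type="application/srgs+xml"/>
    <filled>
      <if cond="menuchoice == 'sports'">
        <goto next="sports.vxml"/>
      <else/>
        <goto next="weather.vxml"/>
      </if>
    </filled>
  </field>
</form>
```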
A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as media files, grammars, scripts, and XML data. Each fetch of the content associated with a URI is governed by the following attributes:
fetchtimeout | The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation. If not specified, a value derived from the innermost fetchtimeout property is used. |
---|---|
fetchhint | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used. |
maxage | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used. |
maxstale | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used. |
When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.
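For example (illustrative markup; the resource URIs are hypothetical), these attributes can be set on individual resource-related elements:

```xml
<!-- Wait up to 10 seconds for the grammar; permit prefetching when the page loads. -->
<grammar src="http://example.com/cities.grxml" type="application/srgs+xml"
         fetchtimeout="10s" fetchhint="prefetch"/>

<!-- Accept a cached copy of the audio only if it is no more than 60 seconds old. -->
<audio src="http://example.com/greeting.wav" maxage="60"/>
```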
The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.
When transitioning from one dialog to another, through a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.
Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.
Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see 4.5.2.2 Application Root and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
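The following transitions illustrate these rules (document and dialog names are hypothetical):

```xml
<!-- Fragment-only reference: no fetch, no document initialization. -->
<goto next="#confirm_dialog"/>

<!-- Names a document: the document is obtained and initialized. -->
<goto next="order.vxml#confirm_dialog"/>

<!-- <submit> always results in a fetch, because query data is sent;
     the resulting document is obtained and initialized. -->
<submit next="order.vxml" namelist="item quantity"/>
```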
Elements that fetch VoiceXML documents also support the following additional attribute:
fetchaudio | The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. |
---|
The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
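For instance (hypothetical URIs), background audio can be requested while a slow document fetch completes:

```xml
<!-- Play hold music while search.vxml is retrieved; the audio is interrupted
     (subject to fetchaudiominimum) when the document arrives. If holdmusic.wav
     itself cannot be fetched, no error.badfetch is thrown. -->
<goto next="search.vxml" fetchaudio="http://example.com/holdmusic.wav"/>
```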
The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.
The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:
The "maxstale check" is:
Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.
The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).
While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.
VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).
Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.
Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
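For example (illustrative resources), an author can force a fresh copy of one resource while tolerating a slightly stale copy of another:

```xml
<!-- maxage="0": unconditionally request a fresh copy of frequently updated audio. -->
<audio src="headlines.wav" maxage="0"/>

<!-- maxstale="3600": accept a copy of this large static script even if it expired
     up to an hour ago, avoiding a fetch that would otherwise be required. -->
<script src="biglib.js" maxstale="3600"/>
```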
Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.
The expiration status of a resource must be checked on each use of the resource, and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in 8.1.2 Caching.
Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.
The following types of properties are defined: speech recognition (8.2.1 Speech Recognition Properties), DTMF recognition (8.2.2 DTMF Recognition Properties), prompt and collect (8.2.3 Prompt and Collect Properties), media (8.2.4 Media Properties), fetching (8.2.5 Fetch Properties) and miscellaneous (8.2.6 Miscellaneous Properties) properties.
Editorial note
Open issue: should the specification provide specific default values rather than platform-specific ones? Open issue: should we add a 'type' column for all properties?
The following generic speech recognition properties are defined.
Name | Description | Default |
---|---|---|
confidencelevel | The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
sensitivity | Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
speedvsaccuracy | A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
completetimeout | The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. | platform-dependent |
incompletetimeout | The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See 8.3 Speech and DTMF Input Timing Properties Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see 8.4 Value Designations). | undefined? |
maxspeechtimeout | The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see 8.4 Value Designations). | platform-dependent |
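These properties are set with the <property> element; for example (values chosen only for illustration):

```xml
<!-- Require higher confidence and finalize complete matches quickly. -->
<property name="confidencelevel" value="0.7"/>
<property name="completetimeout" value="0.5s"/>
<property name="incompletetimeout" value="1.5s"/>
<property name="maxspeechtimeout" value="30s"/>
```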
The following generic DTMF recognition properties are defined.
Name | Description | Default |
---|---|---|
interdigittimeout | The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
termtimeout | The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | 0s |
termchar | The terminating DTMF character for DTMF input recognition. See 8.3 Speech and DTMF Input Timing Properties. | # |
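For example (values chosen only for illustration), a digit-collection dialog might set these properties as follows:

```xml
<!-- Collect digits terminated by "#", waiting at most 3 seconds between digits
     and at most 2 seconds for the optional terminating character. -->
<property name="interdigittimeout" value="3s"/>
<property name="termchar" value="#"/>
<property name="termtimeout" value="2s"/>
```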
The following properties are defined to apply to the fundamental platform prompt and collect cycle.
Name | Description | Default |
---|---|---|
bargein | The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. | true |
bargeintype | Sets the type of bargein to be speech or hotword. See "Bargein type" (link TBD). | platform-specific |
timeout | The time after which a noinput event is thrown by the platform. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
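For example (illustrative values), a scope that must not be interrupted can disable bargein and lengthen the noinput timeout:

```xml
<!-- Disable bargein for this scope and throw noinput after 5 seconds of silence. -->
<property name="bargein" value="false"/>
<property name="timeout" value="5s"/>
```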
The following properties are defined to apply to output media.
Name | Description | Default |
---|---|---|
outputmodes | Determines which modes may be used for media output. The value is a space separated list of media types (see media 'type' in TBD). This property is typically used with container file formats, such as "video/3gpp", which support storage of multiple media types. For example, to play both audio and video to the remote connection, the property would be set to "audio video". To play only the video, the property is set to "video". Behavior when the value contains a media type which is not supported by the platform, the connection, or the value of the <media> element is TBD. | The default value depends on the negotiated media between the local and remote devices. It is the space separated list of media types specified in the session.connection.media array elements' type property where the associated direction property is sendrecv or recvonly. |
The following properties pertain to the fetching of new documents and resources.
Note that maxage and maxstale properties may have no default value - see 8.1.2 Caching.
Name | Description | Default |
---|---|---|
audiofetchhint | This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. | prefetch |
audiomaxage | Tells the platform the maximum acceptable age, in seconds, of cached audio resources. | platform-specific |
audiomaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. | platform-specific |
documentfetchhint | Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. | safe |
documentmaxage | Tells the platform the maximum acceptable age, in seconds, of cached documents. | platform-specific |
documentmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. | platform-specific |
grammarfetchhint | Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. | prefetch |
grammarmaxage | Tells the platform the maximum acceptable age, in seconds, of cached grammars. | platform-specific |
grammarmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. | platform-specific |
objectfetchhint | Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch, or safe. | prefetch |
objectmaxage | Tells the platform the maximum acceptable age, in seconds, of cached objects. | platform-specific |
objectmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. | platform-specific |
scriptfetchhint | Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe. | prefetch |
scriptmaxage | Tells the platform the maximum acceptable age, in seconds, of cached scripts. | platform-specific |
scriptmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. | platform-specific |
fetchaudio | The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. | undefined |
fetchaudiodelay | The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see 8.4 Value Designations). The default interval is platform-dependent, e.g. "2s". The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. | platform-specific |
fetchaudiominimum | The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see 8.4 Value Designations). The default is platform-dependent, e.g., "5s". The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. | platform-specific |
fetchtimeout | The timeout for fetches. The value is a Time Designation (see 8.4 Value Designations). | platform-specific |
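For example (illustrative values), an application might tune fetch behavior at document scope as follows:

```xml
<!-- Permit document prefetching, accept grammars expired by up to 60 seconds,
     delay fetchaudio by 2 seconds, and abandon fetches after 15 seconds. -->
<property name="documentfetchhint" value="prefetch"/>
<property name="grammarmaxstale" value="60"/>
<property name="fetchaudiodelay" value="2s"/>
<property name="fetchtimeout" value="15s"/>
```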
The following miscellaneous properties are defined.
Name | Description | Default |
---|---|---|
inputmodes | This property determines which input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. | ??? |
universals | Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see "Activation of Grammars" (link TBD)) and which generate specific events. Note that relying on platform-provided grammars is not good practice for production-grade applications (see 6.11 Builtin Grammar Module). Applications choosing to migrate from universal grammars to more robust developer-specified grammars should replace the universals <property> with one or more <link> (TODO, hyperlink) elements. Because <link>s can also generate the same events as universal grammars, and because the <catch> handlers for the universal grammars persist outside the universals <property>, the migration should be seamless. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". | none |
maxnbest | This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. | 1 |
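For example (illustrative values), a dialog could enable both input modes, a subset of platform universals, and a larger n-best list:

```xml
<!-- Allow both DTMF and voice input, enable the platform "cancel" and "help"
     universals, and keep up to 3 results in application.lastresult$. -->
<property name="inputmodes" value="dtmf voice"/>
<property name="universals" value="cancel help"/>
<property name="maxnbest" value="3"/>
```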
The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these different timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or user speech begins at the time the last prompt has finished playing.
DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in 8.2.2 DTMF Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
The timeout parameter determines when the noinput event is thrown because the user has failed to enter any DTMF. Once the first DTMF has been entered, this parameter has no further effect.
In the following diagram, the interdigittimeout determines when the nomatch event is thrown because a DTMF grammar is not yet recognized, and the user has failed to enter additional DTMF.
The example below shows the situation when a DTMF grammar could terminate, or extend by the addition of more DTMF input, and the user has elected not to provide any further input.
In the example below, the termchar is non-empty, and is entered by the user before an interdigittimeout expires, to signify that the user's DTMF input is complete; the termchar is not included as part of the recognized value.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is expected. Since termchar is empty, there is no optional terminating character permitted, thus the recognition ends and the recognized value is returned.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. If the termchar is non-empty, then the user can enter an optional termchar DTMF. If the user fails to enter this optional DTMF within termtimeout, the recognition ends and the recognized value is returned. If the termtimeout is 0s (the default), then the recognized value is returned immediately after the last DTMF allowed by the grammar, without waiting for the optional termchar. Note: the termtimeout applies only when no additional input is allowed by the grammar; otherwise, the interdigittimeout applies.
In this example, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. Since the termchar is non-empty, the user enters the optional termchar within termtimeout causing the recognized value to be returned (excluding the termchar).
While waiting for the first or additional DTMF, three different timeouts may determine when the user's input is considered complete. If no DTMF has been entered, the timeout applies; if some DTMF has been entered but additional DTMF is valid, then the interdigittimeout applies; and if no additional DTMF is legal, then the termtimeout applies. At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated.
Speech grammars use timeout, completetimeout, and incompletetimeout as described in 8.2.3 Prompt and Collect Properties and 8.2.1 Speech Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
In the example below, the timeout parameter determines when the noinput event is thrown because the user has failed to speak.
Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].
Integers are specified in decimal notation only. Integers may be preceded by a "-" or "+" to indicate the sign.
An integer consists of one or more digits "0" to "9".
This section presents some initial thoughts on how VoiceXML might be embedded within SCXML and how flow control languages such as SCXML and CCXML might be integrated into VoiceXML.
The following bank application example demonstrates how an external VoiceXML application can be invoked from an SCXML script and vice versa. The state machine and flow control are implemented in BankApp.scxml. The call starts in BankApp.scxml, which first invokes the "getAccountNum" form of BankApp.vxml to collect the account number and then queries a database for the checking and saving balances. BankApp.scxml then invokes the "playBalance" form. If this form finds that accountType is not defined, it invokes AccountType.scxml, which calls the BankApp.vxml form "getAccountType" to obtain the accountType. The "playBalance" form then plays the balance of the corresponding account and returns control to BankApp.scxml.
BankApp.scxml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml"
       xmlns:my="http://scxml.example.org/"
       version="1.0" initial="getAccountNum" profile="ecmascript">
  <state id="getAccountNum">
    <invoke targettype="vxml3" src="BankApp.vxml#getAccountNum"/>
    <transition event="vxml3.gotAccountNum" target="getBalance"/>
  </state>
  <state id="getBalance">
    <datamodel>
      <data name="method" expr="'getBalance'"/>
      <data name="accountNum" expr="_data.accountNum"/>
    </datamodel>
    <send targettype="basichttp" target="BankDB.do" namelist="method accountNum"/>
    <transition event="basichttp.gotBalance" target="playBalance"/>
  </state>
  <state id="playBalance">
    <datamodel>
      <data name="checking_balance" expr="_data.checking.balance"/>
      <data name="saving_balance" expr="_data.saving.balance"/>
    </datamodel>
    <invoke targettype="vxml3" target="BankApp.vxml#playBalance"
            namelist="checking_balance saving_balance"/>
    <transition event="vxml3.playedBalance" target="exit"/>
  </state>
  <final id="exit"/>
</scxml>
AccountType.scxml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml"
       xmlns:my="http://scxml.example.org/"
       version="1.0" initial="getAccountType" profile="ecmascript">
  <state id="getAccountType">
    <invoke targettype="vxml3" src="BankApp.vxml#getAccountType"/>
    <transition event="vxml3.gotAccountType" target="exit"/>
  </state>
  <final id="exit"/>
</scxml>
BankApp.vxml
<?xml version="1.0" encoding="UTF-8"?>
<!-- TODO: need to add final namespace, schema, etc. for vxml element. -->
<vxml version="3.0">
  <form id="getAccountNum">
    <field name="accountNum">
      <grammar src="accountNum.grxml" type="application/grammar+xml"/>
      <prompt> Please tell me your account number. </prompt>
      <filled>
        <exit namelist="accountNum"/>
      </filled>
    </field>
  </form>
  <form id="getAccountType">
    <field name="accountType">
      <grammar src="accountType.grxml" type="application/grammar+xml"/>
      <prompt> Do you want the balance on your checking or saving account? </prompt>
      <filled>
        <exit namelist="accountType"/>
      </filled>
    </field>
  </form>
  <form id="playBalance">
    <var name="checking_balance"/>
    <var name="saving_balance"/>
    <block>
      <if cond="accountType == undefined">
        <!-- Here we are trying to invoke the external scxml script. At the time
             this example is written, the syntax to do this has not yet been decided. -->
        <goto next="AccountType.scxml#getAccountType"/>
      </if>
      <if cond="accountType == 'checking'">
        <prompt> The checking account balance is <value expr="checking_balance"/>. </prompt>
      <else/>
        <prompt> The saving account balance is <value expr="saving_balance"/>. </prompt>
      </if>
    </block>
  </form>
</vxml>
State Chart XML (SCXML) could be used as the controller for managing the dialog in VoiceXML 3.0 applications. A recursive MVC technique allows SCXML controllers to be placed at session, document and form levels. Examples of resulting compound documents (containing V3 and SCXML namespaced elements) appear below for illustration. A graceful degradation / fallback approach could be used to ensure backwards compatibility with V2 applications. Note that the examples below use a new v3:scxmlform element.
[ISSUE: It has been suggested that using the existing v3:form element instead of a new v3:scxmlform element would be simpler and more elegant. Although the working group currently knows of no particular reason why the existing v3:form couldn't be used instead of a new v3:scxmlform element, the group has not yet discussed this in detail or agreed that using v3:form in this way is desirable. The group plans to discuss this and is interested in receiving public feedback on this possibility.]
Example application scenario:
Below are two flavors of this application using SCXML as the form-level controller: a system-driven and a user-driven approach. These use a similar set of fields in the form but different dialog management styles. In this simple example the VUI may appear similar in both flavors, though one is system driven and the other user driven.
Consider the following sketch of a V3 form for the system-driven approach:
<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield" target="locator"/>
      <scxml:transition event="filled.choice" cond="!choicefield" target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="boolean.grxml" type="application/srgs+xml"/>
    <v3:prompt>Welcome. Do you have the record locator for your itinerary?</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>What is the record locator for the itinerary?</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>Please say or spell your last name.</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu, omitted -->
</v3:scxmlform>
Consider the following sketch of a V3 form for the user-driven approach:
<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'locator'" target="locator"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'lastname'" target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="choice.grxml" type="application/srgs+xml"/>
    <v3:prompt>Welcome. How would you like to look up your itinerary?</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>What is the record locator for the itinerary?</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>Please say or spell your last name.</v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu, omitted -->
</v3:scxmlform>
One possibility with this approach is that, in the absence of an <scxml:scxml> child element, a <v3:scxmlform> could behave identically to a <v2:form> element, with the V2 Form Interpretation Algorithm (FIA) in charge. In the presence of an <scxml:scxml> child, the FIA would be suppressed and the more expressive SCXML controller used instead, allowing application developers to design the form VUI in a very flexible manner. In other words, the following <v3:scxmlform>:
<v3:scxmlform>
  <!-- No SCXML child -->
  <!-- Various form items etc. -->
</v3:scxmlform>
behaves as would:
<v2:form>
  <!-- Various form items etc. -->
</v2:form>
The above example illustrates a form-level SCXML controller. SCXML could perhaps also be used as a document-level controller, managing the interaction across v3:forms rather than v3:fields. To illustrate:
<v3:vxml>
  <scxml:scxml ...>
    <!-- Document-level controller managing interaction across form1, form2 and form3 -->
  </scxml:scxml>
  <v3:form id="form1">
    <!-- form1 content; might also have a form-level SCXML controller -->
  </v3:form>
  <v3:form id="form2">
    <!-- form2 content; might also have a form-level SCXML controller -->
  </v3:form>
  <v3:form id="form3">
    <!-- form3 content; might also have a form-level SCXML controller -->
  </v3:form>
</v3:vxml>
This version of VoiceXML was written with the participation of members of the W3C Voice Browser Working Group. The work of the following members has significantly facilitated the development of this specification:
The W3C Voice Browser Working Group would like to thank the W3C team, especially Kazuyuki Ashimura and Matt Womer, for their invaluable administrative and technical support.
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="TBD" targetNamespace="TBD" blockDefault="#all">
  <xsd:annotation>
    <xsd:documentation>
      This is the XML Schema driver for the Legacy Profile of the
      VoiceXML 3.0 specification. Please use this namespace for the
      Legacy Profile: "TBD: URL to schema"
    </xsd:documentation>
    <xsd:documentation source="vxml3-copyright.xsd"/>
  </xsd:annotation>
  <xsd:annotation>
    <xsd:documentation>
      This is the schema driver file for the Legacy Profile of the
      VoiceXML 3.0 specification. This schema
      + sets the namespace for the Legacy Profile of the VoiceXML 3.0 specification
      + imports external schemas (xml.xsd)
      + imports schema modules
      The Legacy Profile includes the following modules:
      * VoiceXML Root module
      * Form module
      * Field module
      * Prompt module
      * Grammar module
      * Data Access and Manipulation module
    </xsd:documentation>
  </xsd:annotation>
  <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
              schemaLocation="http://www.w3.org/2001/xml.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This import brings in the XML namespace attributes,
        which are used by various modules.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:import>
  <xsd:include schemaLocation="vxml-datatypes.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the common datatypes for VoiceXML.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml-attribs.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the common attributes for VoiceXML.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-vxmlroot.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the VoiceXML Root module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-form.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the Form module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-field.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the Field module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-prompt.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the Prompt module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-grammar.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the Grammar module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
  <xsd:include schemaLocation="vxml3-module-dataacces.xsd">
    <xsd:annotation>
      <xsd:documentation>
        This include brings in the Data Access and Manipulation module for VoiceXML 3.0.
      </xsd:documentation>
    </xsd:annotation>
  </xsd:include>
</xsd:schema>
Editorial note
The schema is incomplete. It merely imports the schemas for the various modules; it does not yet specify parent/child relationships between modules or constraints on them. These need to be defined in a future draft.
VoiceXML 2 defines shorthand notation for several fundamental capabilities. For example, some <catch> elements can be represented in a shortened form:
<noinput>I didn’t hear anything. </noinput>
is equivalent to:
<catch event="noinput"> I didn’t hear anything. </catch>
This notation could be transformed via standard text substitution tools.
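As an illustrative sketch (not part of the specification), such a substitution could also be expressed as an XSLT 1.0 stylesheet. The stylesheet below is hypothetical; it assumes the shorthand elements appear in the default (no) namespace, and it shows only the <noinput> case:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: expands the <noinput> shorthand into the
     equivalent <catch event="noinput"> form. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity transform: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Rewrite the shorthand element as its <catch> equivalent,
       preserving any attributes (e.g. count, cond) and content -->
  <xsl:template match="noinput">
    <catch event="noinput">
      <xsl:apply-templates select="@*|node()"/>
    </catch>
  </xsl:template>
</xsl:stylesheet>
```

Analogous templates could handle the other catch shorthands (<nomatch>, <error>, <help>).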
In addition to the fundamental <form> / <field> / <grammar> structure, VoiceXML 2 defines several simplified dialog specification mechanisms. These "syntactic shorthand" representations make it easier to build some types of dialogs.
For example, <menu> provides a means of allowing a user to select from a short list of items. The syntactic notation provides several shortcuts:
Another shorthand notation uses the standard <form> and <field> but simplifies the specification of a speech grammar through the <option> element. It isn’t necessary for a developer to understand SRGS in order to specify a phrase and associated semantic return value.
The following examples demonstrate the different V2 syntactic mechanisms that provide identical functionality.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" …>
  <menu dtmf="true">
    <prompt> Say one of: <enumerate/> </prompt>
    <!-- DTMF choices are automatically generated since dtmf="true"
         is set, except where explicitly specified. -->
    <choice dtmf="0" next="#operator"> operator </choice>
    <choice next="http://sports.example.com/start.vxml"> sports </choice>
    <choice next="http://weather.example.com/intro.vxml"> weather </choice>
    <choice next="http://news.example.com/news.vxml"> news </choice>
  </menu>
</vxml>
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" …>
  <form id="get_choice">
    <!-- Link emulates the 'operator' choice in the menu -->
    <link next="#operator">
      <grammar mode="voice" version="1.0" root="root">
        <rule id="root" scope="public"> operator </rule>
      </grammar>
      <grammar mode="dtmf" version="1.0" root="root2">
        <rule id="root2" scope="public"> 0 </rule>
      </grammar>
    </link>
    <field name="choice">
      <prompt> Say one of: <enumerate/> </prompt>
      <option dtmf="1" value="sports"> sports </option>
      <option dtmf="2" value="weather"> weather </option>
      <option dtmf="3" value="news"> news </option>
      <filled>
        <if cond="choice == 'sports'">
          <goto next="http://sports.example.com/start.vxml"/>
        <elseif cond="choice == 'weather'"/>
          <goto next="http://weather.example.com/intro.vxml"/>
        <elseif cond="choice == 'news'"/>
          <goto next="http://news.example.com/news.vxml"/>
        <else/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" …>
  <form id="get_choice">
    <!-- Link emulates the 'operator' choice in the menu -->
    <link next="#operator">
      <grammar mode="voice" version="1.0" root="root">
        <rule id="root" scope="public"> operator </rule>
      </grammar>
      <grammar mode="dtmf" version="1.0" root="root2">
        <rule id="root2" scope="public"> 0 </rule>
      </grammar>
    </link>
    <field name="choice">
      <prompt> Say one of sports, weather, news. </prompt>
      <grammar mode="voice" version="1.0" root="root">
        <rule id="root" scope="public">
          <one-of>
            <item> sports </item>
            <item> weather </item>
            <item> news </item>
          </one-of>
        </rule>
      </grammar>
      <grammar mode="dtmf" version="1.0" root="root2">
        <rule id="root2" scope="public">
          <one-of>
            <item> 1 </item>
            <item> 2 </item>
            <item> 3 </item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <if cond="choice == '1' || choice == 'sports'">
          <goto next="http://sports.example.com/start.vxml"/>
        <elseif cond="choice == '2' || choice == 'weather'"/>
          <goto next="http://weather.example.com/intro.vxml"/>
        <elseif cond="choice == '3' || choice == 'news'"/>
          <goto next="http://news.example.com/news.vxml"/>
        <else/>
        </if>
      </filled>
    </field>
  </form>
</vxml>