Copyright © 2004 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use rules apply.
The W3C Multimodal Interaction working group aims to develop specifications to enable access to the Web using multi-modal interaction. This document is part of a set of specifications for multi-modal systems, and provides details of an XML markup language for describing the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is a W3C Working Draft for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This specification describes markup for representing interpretations of user input (speech, keystrokes, pen input, etc.) together with annotations for confidence scores, timestamps, input medium, etc., and forms part of the proposals for the W3C Multimodal Interaction Framework. This version of EMMA is the first to include the associated XML Schema; see section 7.1.
This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process, with the intention of advancing it along the W3C Recommendation track. The authors of this document are members of the W3C Multimodal Interaction Working Group (members only).
This document was produced under the 24 January 2002 CPP as amended by the W3C Patent Policy Transition Procedure. The Working Group maintains a public list of patent disclosures relevant to this document; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with section 6 of the W3C Patent Policy.
Your feedback is welcomed. Please send comments about this document to the public mailing list: [email protected] (public archives). See W3C mailing list and archive usage guidelines.
This document presents an XML specification for EMMA, an Extensible MultiModal Annotation markup language, responding to the requirements documented in W3C Requirements for EMMA. This markup language is intended for use by systems that provide semantic interpretations for a variety of inputs, including but not necessarily limited to, speech, natural language text, GUI and ink input.
It is expected that this markup will be used primarily as a standard data interchange format between the components of a multimodal system; in particular, it will normally be automatically generated by interpretation components to represent the semantics of users' inputs, not directly authored by developers.
The language is focused on annotating the interpretation information of single and composed inputs, as opposed to (possibly identical) information that might have been collected over the course of a dialog.
The language provides a set of elements and attributes that are focused on accurately representing annotations on the input interpretations.
An EMMA document can be considered to hold three types of data:
instance data
Application-specific markup corresponding to input information which is meaningful to the consumer of an EMMA document. Instances are application-specific and built by input processors at runtime. Given that utterances may be ambiguous with respect to input values, an EMMA document may hold more than one instance.
data model
Constraints on structure and content of an instance. The data model is typically pre-established by an application, and may be implicit, that is, unspecified.
metadata
Annotations associated with the data contained in the instance. Annotation values are added by input processors at runtime.
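For orientation, here is a minimal sketch showing all three kinds of data together. It uses the emma:model and emma:confidence annotations defined later in this document; the model URI and the instance element are hypothetical:

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <!-- metadata: a confidence annotation added by the input processor -->
      <emma:interpretation id="int1" emma:confidence="0.9">
        <!-- data model: a reference to the (hypothetical) constraints document -->
        <emma:model ref="http://example.com/models/flight.xml"/>
        <!-- instance data: application-specific markup -->
        <destination>Denver</destination>
      </emma:interpretation>
    </emma:emma>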
Given the assumptions above about the nature of data represented in an EMMA document, the following general principles apply to the design of EMMA:
The annotations of EMMA should be considered 'normative' in the sense that if an EMMA component produces annotations as described in Section 3, these annotations must be represented using the EMMA syntax. The Multimodal Interaction Working Group may address in later drafts the issues of modularization and profiling, that is: which sets of annotations are to be supported by which classes of EMMA component.
The general purpose of EMMA is to represent information automatically extracted from a user's input by an interpretation component, where input is to be taken in the general sense of a meaningful user input in any modality supported by the platform. The reader should refer to the sample architecture in W3C Multimodal Interaction Framework, which shows EMMA conveying content between user input modality components and an interaction manager.
Components that generate EMMA markup include input interpretation processors such as speech recognizers, handwriting and ink recognizers, and natural language understanding engines. Components that use EMMA include components that act on the user's inputs, such as interaction managers and multimodal integration components.
Although not a primary goal of EMMA, a platform may also choose to use this general format as the basis of a general semantic result that is carried along and filled out during each stage of processing. In addition, future systems may also potentially make use of this markup to convey abstract semantic content to be rendered into natural language by a natural language generation component.
As noted above, the main components of an interpreted user input in EMMA are the instance data, an optional data model, and the metadata annotations that may be applied to that input. The realization of these components in EMMA is as follows:
An EMMA interpretation is the primary unit for holding user input as interpreted by an EMMA processor. As will be seen below, multiple interpretations of a single input are possible.
EMMA provides a simple structural syntax for the organization of interpretations and instances, and an annotative syntax derived from RDF to apply the annotation to the input data at any level.
An outline of the structural syntax of EMMA documents is as follows. A fuller definition may be found in the description of individual features in section 3.
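In schematic form (a sketch only; ellipses stand for elided content, and a document need not use every container):

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <!-- one or more containers for interpreted input -->
      <emma:one-of>                           <!-- mutually exclusive alternatives -->
        <emma:interpretation>
          ... application instance data ...
        </emma:interpretation>
        ...
      </emma:one-of>
      <emma:group> ... </emma:group>          <!-- related interpretations -->
      <emma:sequence> ... </emma:sequence>    <!-- temporally ordered interpretations -->
    </emma:emma>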
EMMA annotations may apply to interpretations and any node within the XML tree for the application-specific markup for a specific interpretation.
Here is an example of a complete EMMA document, illustrating annotations at various levels. The time annotation of EMMA is specified in 3.2.10 Timestamps.
Example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"> <emma:interpretation id="int1" emma:confidence="0.75"> <origin>Boston</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> <emma:interpretation id="int2" emma:confidence="0.68"> <origin>Austin</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> </emma:one-of> </emma:emma>
This example shows a recognition result (id="r1") with two mutually exclusive interpretations (id="int1" and id="int2"). There are three annotations: the first gives the start and end timestamps for the result, and the second and third give confidence scores for the two interpretations.
An EMMA data model expresses the constraints on the structure and content of instance data, for the purposes of validation. As such, the data model may be considered a particular kind of annotation, although, unlike other EMMA annotations, it does not pertain to a specific user input at a specific moment in time; rather, it is a static and, by definition, application-specific structure. Its specification in EMMA is optional.
Since Web applications today use different formats to specify data models, such as XML Schema, XForms, and RELAX NG, EMMA itself is agnostic to the format of the data model used.
Data model definition and reference is defined in section 3.1.
This section defines annotations in the EMMA namespace. The values are specified in terms of the data types defined by XML Schema Part 2: Datatypes [XSD].
| Annotation | emma:emma |
| --- | --- |
| Definition | The root element of an EMMA document. |
| Attributes | 1. version (required): the version of EMMA used for the interpretation(s); interpretations expressed using this specification (EMMA 1.0) should use "1.0" as the value. 2. The namespace declaration for EMMA, see below. 3. Any other namespace declarations for application-specific namespaces (optional). |
The root element of an EMMA document is named emma. It holds one or more interpretation or grouping elements, and attributes for information pertaining to EMMA itself, along with any namespaces which are declared for the entire document, and any other EMMA annotative data. The emma element and other elements and attributes defined in this specification belong to the XML namespace identified by the URI "http://www.w3.org/2003/04/emma". In the examples, the EMMA namespace is generally declared using the attribute xmlns:emma on the root emma element. EMMA processors must support the full range of ways of declaring XML namespaces as defined by the W3C Recommendation "Namespaces in XML 1.1" [XMLNS]. Application markup can be declared in an explicit application namespace, or an undefined namespace (equivalent to setting xmlns="").
For example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> .... </emma:emma>
or
<emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"> .... </emma>
| Annotation | emma:interpretation |
| --- | --- |
| Definition | An element with attribute emma:id of type xsd:ID. The element acts as a container for EMMA application instance data or for a single emma:lattice element. It can also contain optional emma:derived-from, emma:model, emma:info, and emma:endpoint-info elements. |
| Applies to | EMMA elements emma:emma, emma:group, emma:one-of, emma:sequence |
The emma:interpretation element holds a single interpretation represented in application specific markup.
Attributes: emma:id, of type xsd:ID. For example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="r1"> ... </emma:interpretation> </emma:emma>
or
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="r1"> ... </emma:interpretation> <emma:interpretation id="r2"> ... </emma:interpretation> </emma:emma>
| Annotation | emma:one-of |
| --- | --- |
| Definition | An element with attribute emma:id of type xsd:ID. The element acts as a container for one or more of the same EMMA elements drawn from the set emma:interpretation, emma:group, emma:one-of, emma:sequence. It can also contain emma:model, emma:info, and emma:endpoint-info elements. |
| Applies to | EMMA elements emma:emma, emma:group, emma:one-of, emma:sequence |
The emma:one-of element acts as a container for two or more emma:interpretation elements, and denotes that these are mutually exclusive interpretations.
Attributes: emma:id, of type xsd:ID. For example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of id="r1"> <emma:interpretation id="int1"> <origin>Boston</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> <emma:interpretation id="int2"> <origin>Austin</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> </emma:one-of> </emma:emma>
The interpretations must be sorted best-first by some measure of quality. The quality measure is emma:confidence if present; otherwise it is platform-specific.
| Annotation | emma:model |
| --- | --- |
| Definition | An element with the attribute ref of type xsd:anyURI referencing the data model; alternatively, the data model can be provided inline as the content of the emma:model element. Note that either a ref attribute or an inline data model (but not both) must be specified. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
The data model that may be used to express constraints on the structure and content of instance data is specified as one of the annotations of the instance. Specifying the data model is optional, in which case the data model can be said to be implicit. Typically the data model is pre-established by the application.
The data model is specified with the emma:model annotation defined as an element in the EMMA namespace. Note that since emma:model can be a child of any EMMA container element, it is possible for multiple data models to be referenced within a single EMMA document. For example, different alternative interpretations under an emma:one-of might have different data models.
The data model is closely related to the interpretation data, and is typically specified as an annotation on the <emma:interpretation> or <emma:one-of> elements.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="int1"> <emma:model ref="http://myserver/models/city.xml"/> <city> London </city> <country> UK </country> </emma:interpretation> </emma:emma>
The emma:model annotation can reference any element or attribute in the application instance data, as well as any EMMA container element (emma:one-of, emma:group, or emma:sequence).
| Annotation | emma:derived-from |
| --- | --- |
| Definition | An empty element with the attribute resource of type xsd:anyURI that references the interpretation from which the current interpretation is derived. |
| Applies to | emma:interpretation |
Instances of interpretations are in general derived from other instances of interpretation in a process that goes from raw data to increasingly refined representations of the input. The derivation annotation is used to link any two interpretations that are related by representing the source and the outcome of an interpretation process. For instance, a speech recognition process can return the following result in the form of raw text:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw"> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> </emma:emma>
A first interpretation process will produce:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="better"> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> </emma:interpretation> </emma:emma>
A second interpretation process, aware of the current date, will be able to produce a more refined instance, such as:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="best"> <origin>Boston</origin> <destination>Denver</destination> <date>20030315</date> </emma:interpretation> </emma:emma>
The interaction manager may need to have access to the three levels of interpretation. The emma:derived-from annotation can be used to establish a chain of derivation relationships as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw"> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> <emma:interpretation id="better"> <emma:derived-from resource="#raw" /> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> </emma:interpretation> <emma:interpretation id="best"> <emma:derived-from resource="#better" /> <origin>Boston</origin> <destination>Denver</destination> <date>20030315</date> </emma:interpretation> </emma:emma>
Section 4 provides further examples of the use of <emma:derived-from> to represent both sequential derivations like those above and composite derivations in which inputs from multiple different modalities are combined, and addresses the issue of the scope of EMMA annotations across derivations of user input.
| Annotation | emma:group |
| --- | --- |
| Definition | An element with attribute emma:id of type xsd:ID. The element acts as a container for one or more EMMA container elements emma:interpretation, emma:group, emma:one-of, emma:sequence. It can also contain emma:model, emma:info, and emma:endpoint-info elements. |
| Applies to | EMMA elements emma:emma, emma:group, emma:one-of, emma:sequence |
Introduced in section 2.1, the emma:group element is used to indicate that the contained interpretations are related in some manner. The following example shows three interpretations derived from the speech input "Move this ambulance here" and the tactile input related to two consecutive points on a map. The group is associated with timestamps defining the beginning and end of a time window used to group the events.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:group id="grp" emma:start="2003-03-26T0:00:00.15" emma:end="2003-03-36T0:00:00.515"> <emma:interpretation> <action>move</action> <object>ambulance</object> <destination>here</destination> </emma:interpretation> <emma:interpretation> <x>0.253</x> <y>0.124</y> </emma:interpretation> <emma:interpretation> <x>0.866</x> <y>0.724</y> </emma:interpretation> </emma:group> </emma:emma>
The emma:one-of and emma:group containers can be nested arbitrarily.
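For example, one of the grouped inputs may itself be ambiguous, in which case an emma:one-of can appear inside the group. A sketch of such nesting (instance content elided):

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:group id="grp2">
        <emma:one-of id="alts">
          <emma:interpretation id="int1"> ... </emma:interpretation>
          <emma:interpretation id="int2"> ... </emma:interpretation>
        </emma:one-of>
        <emma:interpretation id="int3"> ... </emma:interpretation>
      </emma:group>
    </emma:emma>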
| Annotation | emma:group-info |
| --- | --- |
| Definition | An element with the attribute ref of type xsd:anyURI referencing the grouping criteria; alternatively, the criteria can be provided inline as the content of the emma:group-info element. |
| Applies to | emma:group |
Sometimes it may be convenient to indirectly associate a given group with information, such as grouping criteria. The emma:group-info annotation can be used to associate such information with a group. In the following example, a group of two points is associated with a description of grouping criteria based upon a sliding temporal window of two seconds duration.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:ex="http://www.example.com/ns/group"> <emma:group id="grp"> <emma:group-info> <ex:mode>temporal</ex:mode> <ex:duration>2s</ex:duration> </emma:group-info> <emma:interpretation> <x>0.253</x> <y>0.124</y> </emma:interpretation> <emma:interpretation> <x>0.866</x> <y>0.724</y> </emma:interpretation> </emma:group> </emma:emma>
The emma:group-info annotation can also refer to a named grouping criterion by external reference, for instance:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:ex="http://www.example.com/ns/group"> <emma:group id="grp"> <emma:group-info ref="http://www.example.com/criterion42"/> <emma:interpretation> <x>0.253</x> <y>0.124</y> </emma:interpretation> <emma:interpretation> <x>0.866</x> <y>0.724</y> </emma:interpretation> </emma:group> </emma:emma>
| Annotation | emma:sequence |
| --- | --- |
| Definition | An element that can contain one or more of the EMMA container elements emma:interpretation, emma:group, emma:one-of, emma:sequence. It has an optional attribute emma:id of type xsd:ID. It can also contain optional emma:model, emma:info, and emma:endpoint-info elements. |
| Applies to | EMMA elements emma:emma, emma:group, emma:one-of |
Introduced in section 2.1, the emma:sequence element is used to indicate that the contained interpretations are sequential in time, as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <action>move</action> <object>this-battleship</object> <destination>here</destination> </emma:interpretation> <emma:sequence> <emma:interpretation> <x>0.253</x> <y>0.124<y> </emma:interpretation> <emma:interpretation> <x>0.866</x> <y>0.724<y> </emma:interpretation> </emma:sequence> </emma:emma>
The emma:sequence container can be combined with emma:one-of and emma:group in arbitrary nesting structures. The order of children in the content of the emma:sequence element corresponds to a sequence of interpretations. This ordering does not imply any particular definition of sequentiality. EMMA processors may therefore use the emma:sequence element to hold interpretations which are either strictly sequential in nature (e.g. the end-time of an interpretation precedes the start-time of its follower), or which overlap in some manner (e.g. the start-time of a follower interpretation precedes the end-time of its precedent). Timestamps can be used to provide fine-grained annotation of the temporal ordering of interpretations that are sequential in time.
| Annotation | emma:grammar |
| --- | --- |
| Definition | An element with the attribute href of type xsd:anyURI referencing a grammar, and the attribute id of type xsd:ID. |
| Applies to | emma:emma |
The grammar that was used to derive the EMMA result is specified with the emma:grammar annotation defined as an element in the EMMA namespace.
Example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:grammar id="gram1" href="someURI"/> <emma:grammar id="gram2" href="anotherURI"/> <emma:one-of id="r1"> <emma:interpretation id="int1" grammar-ref="gram1"> <origin>Boston</origin> </emma:interpretation> <emma:interpretation id="int2" grammar-ref="gram1"> <origin>Austin</origin> </emma:interpretation> <emma:interpretation id="int3" grammar-ref="gram2"> <command>help</command> </emma:interpretation> </emma:one-of> </emma:emma>
The emma:grammar annotation is a child of emma:emma.
In addition to providing the ability to represent N-best lists of interpretations using <emma:one-of>, EMMA also provides the capability to represent lattices of words or other symbols using the <emma:lattice> element. Lattices provide a compact representation of large lists of possible recognition results or interpretations for speech, pen, or multimodal inputs.
In addition to providing a representation for lattice output from speech recognition, another important use case for lattices is the representation of the results of gesture and handwriting recognition from a pen modality component. Lattices can also be used to compactly represent multiple possible meaning representations. Another use case for the lattice representation is that it enables the association of confidence scores and other annotations with individual words within a speech recognition result string.
Lattices can be compactly described by a list of transitions between nodes. For each transition the start and end nodes need to be defined, along with the label for the transition. Initial and final nodes also need to be indicated. The following figure provides a graphical representation of a speech recognition lattice which compactly represents eight different sequences of words.
The lattice expands to the following eight word sequences:

a. flights to boston from portland today please
b. flights to austin from portland today please
c. flights to boston from oakland today please
d. flights to austin from oakland today please
e. flights to boston from portland tomorrow
f. flights to austin from portland tomorrow
g. flights to boston from oakland tomorrow
h. flights to austin from oakland tomorrow
| Annotation | emma:lattice |
| --- | --- |
| Definition | An element which can only appear as a child of <emma:interpretation> and which encodes a lattice representation of user input. This element acts as a container for the elements <emma:arc> and <emma:node>. This element has two required attributes, emma:initial and emma:final. emma:initial has an integer value indicating the number of the initial node of the lattice. emma:final contains a space-delimited sequence of integers indicating the numbers of the final nodes in the lattice. |
| Applies to | emma:interpretation |
| Annotation | emma:arc |
| --- | --- |
| Definition | An element which only appears as a child of <emma:lattice> and which encodes a transition between two nodes in the lattice. This element has two required attributes, emma:from and emma:to, which have integer values indicating the numbers of the starting and ending nodes for the arc. The label associated with the arc in the lattice is represented in the content of <emma:arc>. |
| Applies to | emma:lattice |
| Annotation | emma:node |
| --- | --- |
| Definition | An element which only appears as a child of <emma:lattice> and which represents a node in the lattice. This element has one required attribute, emma:node-number, which indicates the number of the node in the lattice. <emma:node> specifications are not required to describe a lattice, but can be added to provide a location for annotations on nodes in the lattice. There can only be one <emma:node> specification for each numbered node in the lattice. |
| Applies to | emma:lattice |
In EMMA a lattice is represented using the element <emma:lattice>, which has attributes emma:initial and emma:final for indicating the initial and final nodes of the lattice. For the lattice above, this will be: <emma:lattice emma:initial="1" emma:final="8"/>. The nodes are numbered with integers. If there is more than one distinct final node in the lattice, they should be represented as a space-separated list in the value of the emma:final attribute, e.g. <emma:lattice emma:initial="1" emma:final="9 10 23"/>. There can only be one initial node in an EMMA lattice. Each transition in the lattice is represented as an element <emma:arc> with attributes emma:from and emma:to, which indicate the nodes where the transition starts and ends. The arc's label is represented as the content of the <emma:arc> element, and can be any well-formed character or XML content. In the example here the contents are words. Empty (epsilon) transitions in a lattice should be represented in the <emma:lattice> representation as <emma:arc> elements with no content, e.g. <emma:arc emma:from="1" emma:to="8"/>.
The example speech lattice above would be represented in EMMA markup as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <emma:lattice emma:initial="1" emma:final="8"> <emma:arc emma:from="1" emma:to="2">flights</emma:arc> <emma:arc emma:from="2" emma:to="3">to</emma:arc> <emma:arc emma:from="3" emma:to="4">boston</emma:arc> <emma:arc emma:from="3" emma:to="4">austin</emma:arc> <emma:arc emma:from="4" emma:to="5">from</emma:arc> <emma:arc emma:from="5" emma:to="6">portland</emma:arc> <emma:arc emma:from="5" emma:to="6">oakland</emma:arc> <emma:arc emma:from="6" emma:to="7">today</emma:arc> <emma:arc emma:from="7" emma:to="8">please</emma:arc> <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc> </emma:lattice> </emma:interpretation> </emma:emma>
Alternatively, if we wish to represent the same information as an N-best list using emma:one-of, we would have the more verbose representation:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of> <emma:interpretation> <text>flights to boston from portland today please</text> </emma:interpretation> <emma:interpretation> <text>flights to boston from portland tomorrow</text> </emma:interpretation> <emma:interpretation> <text>flights to austin from portland today please</text> </emma:interpretation> <emma:interpretation> <text>flights to austin from portland tomorrow</text> </emma:interpretation> <emma:interpretation> <text>flights to boston from oakland today please</text> </emma:interpretation> <emma:interpretation> <text>flights to boston from oakland tomorrow</text> </emma:interpretation> <emma:interpretation> <text>flights to austin from oakland today please</text> </emma:interpretation> <emma:interpretation> <text>flights to austin from oakland tomorrow</text> </emma:interpretation> </emma:one-of> </emma:emma>
The lattice representation avoids the need to enumerate all of the possible word sequences. Also, as detailed below, the <emma:lattice> representation enables placement of annotations on individual words in the input.
The encoding of lattice arcs as XML elements (<emma:arc>) enables arcs to be annotated with metadata such as timestamps, costs, or confidence scores:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <emma:lattice emma:initial="1" emma:final="8"> <emma:arc emma:from="1" emma:to="2" emma:start="1087995961542" emma:end="1087995962042" emma:cost="30"> flights </emma:arc> <emma:arc emma:from="2" emma:to="3" emma:start="1087995962042" emma:end="1087995962542" emma:cost="20"> to </emma:arc> <emma:arc emma:from="3" emma:to="4" emma:start="1087995962542" emma:end="1087995963042" emma:cost="50"> boston </emma:arc> <emma:arc emma:from="3" emma:to="4" emma:start="1087995963042" emma:end="1087995963742" emma:cost="60"> austin </emma:arc> ... </emma:lattice> </emma:interpretation> </emma:emma>
Costs are typically application and device dependent. There are a variety of ways that individual arc costs can be combined to produce costs for specific paths through the lattice. This specification does not standardize the way for these costs to be combined; it is up to the applications and devices to determine how such derived costs would be computed and used.
For some lattice formats, it is also desirable to annotate the nodes in the lattice themselves with information such as costs. For example, in speech recognition, costs may be placed on nodes as a result of word penalties or redistribution of costs. For this purpose EMMA also provides an <emma:node> element which can host annotations such as emma:cost. <emma:node> elements must have an attribute emma:node-number which indicates the number of the node. There can only be one <emma:node> specification for a given numbered node in the lattice. In our example, if there were a cost of 100 on the final node, this could be represented as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <emma:lattice emma:initial="1" emma:final="8"> <emma:arc emma:from="1" emma:to="2" emma:start="1087995961542" emma:end="1087995962042" emma:cost="30"> flights </emma:arc> <emma:arc emma:from="2" emma:to="3" emma:start="1087995962042" emma:end="1087995962542" emma:cost="20"> to </emma:arc> <emma:arc emma:from="3" emma:to="4" emma:start="1087995962542" emma:end="1087995963042" emma:cost="50"> boston </emma:arc> <emma:arc emma:from="3" emma:to="4" emma:start="1087995963042" emma:end="1087995963742" emma:cost="60"> austin </emma:arc> ... <emma:node emma:node-number="8" emma:cost="100"/> </emma:lattice> </emma:interpretation> </emma:emma>
The relative timestamp mechanism in EMMA can be used to provide temporal information about arcs in a lattice in relative terms, using offsets in milliseconds. To do this, the absolute time should be specified on <emma:interpretation>. The emma:time-ref-uri and emma:time-ref-anchor attributes apply to <emma:lattice> and can be used there to set the anchor point for offsets to the start of the absolute time specified on <emma:interpretation>. The offset in milliseconds to the beginning of each arc can then be indicated on each <emma:arc> in the emma:offset-to-start attribute.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation emma:id="interp1" emma:start="1087995961542" emma:end="1087995963042"> <emma:lattice emma:time-ref-uri="#interp1" emma:time-ref-anchor="start" emma:initial="1" emma:final="4"> <emma:arc emma:from="1" emma:to="2" emma:offset-to-start="0"> flights </emma:arc> <emma:arc emma:from="2" emma:to="3" emma:offset-to-start="500"> to </emma:arc> <emma:arc emma:from="3" emma:to="4" emma:offset-to-start="1000"> boston </emma:arc> </emma:lattice> </emma:interpretation> </emma:emma>
Note that the offset for the first <emma:arc> will always be zero since the EMMA attribute emma:offset-to-start indicates the number of milliseconds from the anchor point to the start of the piece of input associated with the <emma:arc>, in this case the word "flights".
| Annotation | emma:info |
| --- | --- |
| Definition | An element with attribute emma:id of type xsd:ID. The element acts as a container for vendor and/or application specific metadata regarding a user's input. |
| Applies to | EMMA elements emma:emma, emma:interpretation, emma:group, emma:one-of, emma:sequence |
In Section 3.2, a series of attributes are defined for representation of metadata about user inputs in a standardized form. EMMA also provides an extensibility mechanism for annotation of user inputs with vendor- or application-specific metadata not covered by the standard set of EMMA annotations. The element <emma:info> should be used as a container for these annotations. For example, if an input to a dialog system needed to be annotated with the number from which the call originated, the caller's state, an indication of the type of customer, and the name of the service, these pieces of information could be represented within <emma:info> as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:info> <caller_id> <phone_number>2121234567</phone_number> <state>NY</state> </caller_id> <customer_type>residential</customer_type> <service_name>acme_travel_service</service_name> </emma:info> <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"> <emma:interpretation id="int1" emma:confidence="0.75"> <origin>Boston</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> <emma:interpretation id="int2" emma:confidence="0.68"> <origin>Austin</origin> <destination>Denver</destination> <date>03112003</date> </emma:interpretation> </emma:one-of> </emma:emma>
It is important to have an EMMA container element for application/vendor specific annotations since EMMA elements provide a structure for representation of multiple possible interpretations of the input. As a result it is cumbersome to state application/vendor specific metadata as part of the application data within each <emma:interpretation>. An element is used rather than an attribute so that internal structure can be given to the annotations within <emma:info>.
In addition to <emma:emma>, <emma:info> can also appear as a child of other structural elements such as <emma:interpretation>, <emma:one-of>, and so on. When <emma:info> appears as a child of one of these elements, the application/vendor specific annotations contained within <emma:info> are assumed to apply to all of the <emma:interpretation> elements within the containing element. The semantics of conflicting annotations in <emma:info>, for example when different values are found within <emma:emma> and <emma:interpretation>, are left to the developer of the vendor/application specific annotations.
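As a sketch of this scoping (the vendor annotation names here are hypothetical), an <emma:info> on the root applies to every interpretation in the document, while an <emma:info> inside a particular <emma:interpretation> applies only to that interpretation:

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:info>
        <!-- applies to all interpretations in this document -->
        <service_name>acme_travel_service</service_name>
      </emma:info>
      <emma:interpretation id="int1">
        <emma:info>
          <!-- hypothetical vendor annotation; applies to int1 only -->
          <recognition_pass>second</recognition_pass>
        </emma:info>
        <destination>Denver</destination>
      </emma:interpretation>
    </emma:emma>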
| Annotation | emma:endpoint-info |
| --- | --- |
| Definition | An element with the attribute emma:id of type xsd:ID. The element acts as a container for all application specific annotation regarding the communication environment. |
| Applies to | EMMA container elements emma:emma, emma:interpretation, emma:group, emma:one-of, emma:sequence |
| Annotation | emma:endpoint |
| --- | --- |
| Definition | An element with attribute emma:id of type xsd:ID. The element acts as a container for application specific endpoint information. |
| Applies to | EMMA container elements emma:endpoint-info, emma:interpretation, emma:group, emma:one-of, emma:sequence |
In order to conduct multimodal interaction, there is a need in EMMA to specify the properties of the endpoint that receives the input which leads to the EMMA annotation. This allows subsequent components to utilize the endpoint properties, as well as the annotated inputs, to conduct meaningful multimodal interaction. The EMMA element <emma:endpoint> can be used for this purpose. It can specify the endpoint properties based on a set of common endpoint property attributes in EMMA, such as emma:endpoint-address, emma:port-num, emma:port-type, etc. Moreover, it provides an extensible annotation structure that allows the inclusion of application- and vendor-specific endpoint properties.
It should be noted that the usage of the term "endpoint" in this context is different from the way that the term is used in speech processing, where it refers to the end of a speech input. As used here, "endpoint" refers to a network location which is the source or recipient of an EMMA document.
In multimodal interaction, multiple devices can be used, and each device can open multiple communication endpoints at the same time. These endpoints are used to transmit and receive data, such as raw input, EMMA documents, etc. Moreover, these communication endpoints can be based on a variety of protocols and data formats, such as SIP, TCP, SOAP, HTTP, SMTP, MRCP, etc. The EMMA element <emma:endpoint> provides a generic representation of endpoint information which is relevant to multimodal interaction. It allows the annotation to be interoperable, and it removes the need for EMMA processors to create their own specialized annotations for existing protocols, potential protocols, or yet undefined private protocols that they may use.
Moreover, <emma:endpoint-info> provides a container to hold all annotations regarding the endpoint information, including <emma:endpoint> and other application and vendor specific annotations that are related to the communication, allowing the same communication environment to be referenced and used in multiple interpretations.
It should be noted that EMMA provides two locations (i.e. <emma:info> and <emma:endpoint-info>) for specifying vendor/application specific annotations. If the annotation is specifically related to the description of the endpoint, then the vendor/application specific annotation should be placed within <emma:endpoint-info>, otherwise it should be placed within <emma:info>.
The following example illustrates the annotation of endpoint reference properties in EMMA.
<emma:emma emma:version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns: ex="http://www.example.com/emma/port"> <emma:interpretation id="int1" emma:start="1087995961542" emma:end="1087995963542"> <emma:endpoint-info emma:id="audio-channel-1"> <emma:endpoint emma:id="endpoint1" emma:endpoint-role="sink" emma:endpoint-address="135.61.71.103" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint2" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> <emma:endpoint emma:id="endpoint2" emma:endpoint-role="source" emma:endpoint-address="136.62.72.104" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint1" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> </emma:endpoint-info> <destination>Chicago</destination> </emma:interpretation> </emma:emma>
The <ex:app-protocol> element is provided by the application or the vendor specification. It specifies that the application layer protocol used to establish the speech transmission from the "source" port to the "sink" port is the Session Initiation Protocol (SIP). This is specific to SIP-based VoIP communication, in which the actual media transmission and the call signaling that controls the communication sessions are separated and typically based on different protocols. In the above example, the Real-time Transport Protocol (RTP) is used in the media transmission between the source port and the sink port.
| Annotation | emma:grammar-ref |
| --- | --- |
| Definition | An attribute of type xsd:IDREF referring to the id attribute of an emma:grammar element. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence) |
The emma:grammar-ref annotation associates the EMMA result in the container element with an emma:grammar element.
Example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:grammar id="gram1" href="someURI"/> <emma:grammar id="gram2" href="anotherURI"/> <emma:one-of id="r1"> <emma:interpretation id="int1" grammar-ref="gram1"> <origin>Boston</origin> </emma:interpretation> <emma:interpretation id="int2" grammar-ref="gram1"> <origin>Austin</origin> </emma:interpretation> <emma:interpretation id="int3" grammar-ref="gram2"> <command>help</command> </emma:interpretation> </emma:one-of> </emma:emma>
| Annotation | emma:tokens |
| --- | --- |
| Definition | An attribute of type xsd:string holding a sequence of input tokens. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
The emma:tokens annotation holds a list of input tokens. In the following description, the term tokens is used in the computational and syntactic sense of units of input, and not in the sense of XML tokens.
The value held in emma:tokens is the list of the tokens of input as produced by the processor which generated the EMMA document. In the case where a grammar is used to constrain input, the value will correspond to tokens as defined by the grammar. So for an EMMA document produced by input to a W3C SRGS grammar [SRGS], the value of emma:tokens will be the list of words and/or phrases that are defined as tokens in SRGS (through white-space-delimited character data or the <token> element; see SRGS section 2.1 Tokens). Items in the emma:tokens list are delimited by white space and/or quotation marks for phrases containing white space. For example:
emma:tokens="arriving at 'Liverpool Street'"
where the three tokens of input are arriving, at and Liverpool Street.
The tokens annotation may be applied not just to the lexical words and phrases of language but to any level of input processing. Other examples of tokenization include phonemes, ink strokes, gestures and any other discrete units of input at any level.
Example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation emma:tokens="From Cambridge to London tomorrow"> <origin emma:tokens="From Cambridge">Cambridge</origin> <destination emma:tokens="to London">London</destination> <date emma:tokens="tomorrow">20030315</date> </emma:interpretation> </emma:emma>
| Annotation | emma:process |
| --- | --- |
| Definition | An attribute of type xsd:anyURI referencing the process used to generate the interpretation. |
| Applies to | emma:interpretation |
A reference to the information concerning the processing that was used for generating an interpretation can be made as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw"> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> <emma:interpretation id="better" emma:process="http://example.com/mysemproc1.xml"> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> <emma:derived-from emma:resource="#raw"/> </emma:interpretation> <emma:interpretation id="best" emma:process="http://example.com/mysemproc2.xml"> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> <emma:derived-from emma:resource="#better"/> </emma:interpretation> </emma:emma>
The process description document referenced by the emma:process annotation can include information on the process itself, such as the grammar, the type of parser, etc. EMMA is not normative about the format of the process description document.
| Annotation | emma:no-input |
| --- | --- |
| Definition | An attribute of type xsd:boolean that is true if there was no input. |
| Applies to | emma:interpretation, application instance data |
The case of lack of input can be annotated as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="int1" emma:no-input="true" /> </emma:emma>
| Annotation | emma:uninterpreted |
| --- | --- |
| Definition | An attribute of type xsd:boolean that is true if the input could not be interpreted. |
| Applies to | emma:interpretation, application instance data |
Input that cannot be interpreted is annotated as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="interp1" emma:uninterpreted="true"/> </emma:emma>
The notation for uninterpretable input can refer to any possible stage of interpretation processing, including raw transcriptions. For instance, if input speech cannot be correctly recognized or the spoken input is not matched by a grammar (or by a language constraint given to the recognition), it can be tagged as emma:uninterpreted as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw" emma:process="http://example.com/myasr.xml" emma:uninterpreted="true" emma:tokens="From Cambridge to London tomorrow"/> </emma:emma>
Note that sometimes an input is classified as "uninterpreted" because its score falls below a confidence threshold set in the processor. In this case it still may be useful for further stages of processing to know what the highest scoring interpretation was, even if that interpretation's confidence did not exceed the threshold. If the interpretation is a raw speech recognition result, an emma:tokens attribute can be used to represent the best scoring result, as in the above example. If the interpretation is a semantic result, the best scoring interpretation can be included within the emma:interpretation element, as in the following example:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="interp1" emma:uninterpreted="true"> <source>philadelphia</source> <destination>boston</destination> </emma:interpretation> </emma:emma>
| Annotation | emma:lang |
| --- | --- |
| Definition | An attribute of type xsd:language indicating the language of the input. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
The emma:lang annotation is used to indicate the human language of the input that it annotates. The values of the emma:lang attribute are language identifiers as defined by [IETF RFC 1766]. For example, emma:lang="fr" denotes French, and emma:lang="en-US" denotes US English. emma:lang can be applied to any emma:interpretation element, and its annotative scope follows the annotative scope of that element. In contrast, the attribute xml:lang in XML 1.0 specifies the language used in the contents and attribute values of any element in an XML document. The attribute emma:lang must be used where xml:lang cannot apply, for example when the contents and attribute values of an element in the EMMA document are in different languages, such as when the input is in French while the annotation attribute values are in English.
The following example shows the use of emma:lang for annotating an input interpretation.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="int1" emma:lang="fr"> <answer>arretez</answer> </emma:interpretation> </emma:emma>
The following example shows the annotation of different interpretations derived from the same input in a multilingual application:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="int1"> <rawtext>please stop arretez s'il vous plait</rawtext> </emma:interpretation> <emma:interpretation id="int2" emma:lang="en" emma:process="http:/example.com/EnglishInterpreter.xml" > <command> CANCEL </command> <emma:derived-from resource="#int1"/> </emma:interpretation> <emma:interpretation id="int3" emma:lang="fr" emma:process="http:/example.com/FrenchInterpreter.xml"> <command> CANCEL </command> <emma:derived-from resource="#int1"/> </emma:interpretation> </emma:emma>
| Annotation | emma:signal |
| --- | --- |
| Definition | An attribute of type xsd:anyURI referencing the input signal. |
| Applies to | emma:interpretation, application instance data |
A URI reference to the signal that originated the input recognition process may be represented in EMMA using the emma:signal annotation.
Here is an example where the reference to the signal is applied to the emma:interpretation element:
<emma:emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"> <emma:interpretation id="intp1" emma:signal="http://example.com/signals/sg23.bin"> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
| Annotation | emma:media-type |
| --- | --- |
| Definition | An attribute of type xsd:string holding the MIME type associated with the signal's data format. |
| Applies to | emma:interpretation, application instance data |
The data format of the signal that originated the input may be represented in EMMA using the emma:media-type annotation. An initial set of MIME media types is defined by [RFC2046].
Here is an example where the media type for the ETSI ES 202 212 audio codec for Distributed Speech Recognition (DSR) is applied to the emma:interpretation element. The example also specifies an optional sampling rate of 8 kHz and maxptime of 40 milliseconds.
<emma:emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"> <emma:interpretation id="intp1" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40"> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
| Annotation | emma:confidence |
| --- | --- |
| Definition | An attribute of type xsd:decimal in the range 0.0 to 1.0, indicating the processor's confidence in the result. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
The confidence score in EMMA indicates the quality of the input, and is the value assigned to emma:confidence in the EMMA namespace. The confidence score is a number in the range from 0.0 to 1.0 inclusive. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence. Note that emma:confidence should not be assumed to mean only the confidence of the speech recognizer, but rather the confidence of whatever processor was responsible for creating the EMMA result, based on whatever evidence it has. For a natural language interpretation, for example, this might include semantic heuristics in addition to speech recognition scores. Moreover, the confidence score values do not have to be interpreted as probabilities. In fact, confidence score values are platform-dependent, since their computation is likely to differ between platforms and different EMMA processors. Confidence scores are annotated explicitly in EMMA in order to provide this information to subsequent processes for multimodal interaction. The example below illustrates how confidence scores are annotated in EMMA.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of> <emma:interpretation id="meaning1" emma:confidence="0.6"> <location>Boston</location> </emma:interpretation> <emma:interpretation id="meaning2" emma:confidence="0.4"> <location> Austin </location> </emma:interpretation> </emma:one-of> </emma:emma>
In addition to its use as an attribute on the EMMA container elements, emma:confidence can also be used to assign confidences to elements in instance data in the application namespace. This can be seen in the following example, where the <destination> and <origin> elements have confidences.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="meaning1" emma:confidence="0.6"> <destination emma:confidence="0.8"> Boston</destination> <origin emma:confidence="0.6"> Austin </origin> </emma:interpretation> </emma:emma>
Although in general instance data can be represented in XML using a combination of elements and attributes in the application namespace, EMMA does not provide a standard way to annotate processors' confidences in attributes. Consequently, instance data that is expected to be assigned confidences should be represented using elements, as in the above example.
| Annotation | emma:source |
| --- | --- |
| Definition | An attribute of type xsd:anyURI referencing the source of input. |
| Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
The source of an interpreted input may be represented in EMMA as a URI resource using the emma:source annotation.
Here is an example that shows different input sources for different input interpretations.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:myapp="http://www.example.com/myapp"> <emma:one-of> <emma:interpretation id="intp1" emma:source="http://example.com/microphone/NC-61"> <myapp:destination>Boston</myapp:destination> </emma:interpretation> <emma:interpretation id="intp2" emma:source="http://example.com/microphone/NC-4024"> <myapp:destination>Austin</myapp:destination> </emma:interpretation> </emma:one-of> </emma:emma>
| Annotation | emma:start, emma:end |
| --- | --- |
| Definition | Attributes indicating the absolute start and end times of an input, in milliseconds since 1 January 1970 00:00:00 GMT. |
| Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, emma:node |

| Annotation | emma:time-ref-uri |
| --- | --- |
| Definition | An attribute of type xsd:anyURI indicating the URI used to anchor the relative timestamp. |
| Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:lattice |

| Annotation | emma:time-ref-anchor |
| --- | --- |
| Definition | An attribute with a value of "start" or "end", defaulting to "start". It indicates whether to measure the time from the start or the end of the interval designated with emma:time-ref-uri. |
| Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:lattice |

| Annotation | emma:offset-to-start |
| --- | --- |
| Definition | An attribute with a signed integer value, defaulting to zero. It specifies the offset in milliseconds for the start of input from the anchor point designated with emma:time-ref-uri and emma:time-ref-anchor. |
| Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, emma:node |

| Annotation | emma:duration |
| --- | --- |
| Definition | An attribute with an unsigned integer value, defaulting to zero. It specifies the duration of the input in milliseconds. |
| Applies to | emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc |
The start and end times for input can be indicated using either absolute timestamps or relative timestamps. Both are expressed in milliseconds for ease of processing. Note that the absolute time may be conveniently determined using the ECMAScript Date object's getTime() function.
Here is an example of a timestamp for an absolute time.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" <emma:interpretation id="int1" emma:start="1087995961542" emma:end="1087995963542"> <destination>Chicago</destination> </emma:interpretation> </emma:emma>
Relative timestamps define the start of an input relative to the start or end of a reference interval such as another input.
The reference interval is designated with emma:time-ref-uri. This can be combined with emma:time-ref-anchor to specify whether the anchor point is the start or end of this interval. The start of an input relative to this anchor point is then specified with emma:offset-to-start. Finally, the duration of an input can be specified with emma:duration. The emma:duration attribute can be used independently of absolute or relative timestamps, e.g. for annotation of speech corpora.
Here is an example where the referenced input is in the same document:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="int1"> <origin>Denver</origin> </emma:interpretation> <emma:interpretation id="int2" emma:time-ref-uri="#int1" emma:time-ref-anchor="start" emma:offset-to-start="5000"> <destination>Chicago</destination> </emma:interpretation> </emma:emma>
Note that the reference point refers to an input, but not necessarily to a complete input. For example, if a speech recognizer timestamps each word in an utterance, the anchor point might refer to the timestamp for just one word.
The absolute and relative timestamps are not mutually exclusive; that is, it is possible to have both relative and absolute timestamp attributes on the same EMMA container element.
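A sketch of such a combination (timestamp values illustrative), which also shows emma:duration alongside the other timestamp attributes:

    <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
      <emma:interpretation id="int1" emma:start="1087995961542" emma:end="1087995963542">
        <origin>Denver</origin>
      </emma:interpretation>
      <!-- int2 carries absolute timestamps and, redundantly but legally,
           a relative timestamp anchored to the end of int1 -->
      <emma:interpretation id="int2"
          emma:start="1087995968542" emma:end="1087995969542"
          emma:time-ref-uri="#int1" emma:time-ref-anchor="end"
          emma:offset-to-start="5000" emma:duration="1000">
        <destination>Chicago</destination>
      </emma:interpretation>
    </emma:emma>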
Timestamps of inputs collected by different devices will be subject to variation if the times maintained by the devices are not synchronized. This concern is outside of the scope of the EMMA working group.
Annotation | emma:medium |
---|---|
Definition | An attribute of type xsd:string constrained to values in the set {acoustic, tactile, visual}. |
Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
Annotation | emma:mode |
Definition | An attribute of type xsd:string constrained to values in the open set {speech, dtmf_keypad, ink, gui, keys, video, photograph, ...}. |
Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
Annotation | emma:function |
Definition | An attribute of type xsd:string constrained to values in the open set {recording, transcription, dialog, verification, ...}. |
Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
Annotation | emma:verbal |
Definition | An attribute of type xsd:boolean. |
Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data |
EMMA provides two properties for the annotation of input modality: one indicating the broader medium or channel (medium), and another indicating the specific mode of communication used on that channel (mode). The input medium is defined from the user's perspective and indicates whether they use their voice (acoustic), touch (tactile), or visual appearance/motion (visual) as input. Tactile includes most hands-on input device types, such as pen, mouse, keyboard, and touch screen. Visual is used for camera input.
emma:medium ::= [acoustic|tactile|visual]
The mode property provides the ability to distinguish between the different modes of communication that may be used within a particular medium. For example, in the tactile medium, modes include electronic ink (ink) and pointing and clicking on a graphical user interface (gui).
emma:mode ::= [speech|dtmf_keypad|ink|gui|keys|video|photograph| ... ]
Orthogonal to the mode, user inputs can also be classified with respect to their communicative function. Factoring out the function in this way keeps the mode classification itself simple.
emma:function ::= [recording|transcription|dialog|verification| ... ]
For example, speech can be used for recording (e.g. voicemail), transcription (e.g. dictation), dialog (e.g. interactive spoken dialog systems), and verification (e.g. identifying the user through their voiceprint).
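As an illustration, a dictation result might be annotated as follows (a constructed example; the <text> element is hypothetical application markup):

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="dictation1"
      emma:medium="acoustic" emma:mode="speech"
      emma:function="transcription" emma:verbal="true">
    <text>please send the report by friday</text>
  </emma:interpretation>
</emma:emma>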
EMMA also supports an additional property, verbal, which distinguishes verbal use of an input mode from non-verbal use. This can be used to distinguish the use of electronic ink to convey handwritten commands from the use of electronic ink for symbolic gestures such as circles and arrows. Handwritten commands, such as writing "downtown" in order to change a map display to show the downtown area, are classified as verbal (verbal="true"). Pen gestures (arrows, lines, circles, etc.), such as circling a building, are classified as non-verbal dialog (function="dialog" verbal="false"). The use of handwritten words to transcribe an email message is classified as transcription (function="transcription").
emma:verbal ::= [true|false|0|1]
Handwritten words and ink gestures are typically recognized by different kinds of recognition components (a handwriting recognizer vs. a gesture recognizer), and the verbal annotation will be added by the recognition component which classifies the input. The original input source, a pen in this case, will not be aware of this difference. The input source identifier will tell you that the input was from a pen of some kind, but not whether the mode of input was handwriting (e.g. writing show downtown) or gesture (e.g. circling an object or area).
Here is an example of the EMMA annotation for a pen input where the user's ink is recognized as either a word or as an arrow:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of> <emma:interpretation id="interp1" emma:confidence="0.6" emma:medium="tactile" emma:mode="ink" emma:function="dialog" emma:verbal="true"> <location>Boston</location> </emma:interpretation> <emma:interpretation id="interp2" emma:confidence="0.4" emma:medium="tactile" emma:mode="ink" emma:function="dialog" emma:verbal="false"> <direction>45</direction> </emma:interpretation> </emma:one-of> </emma:emma>
Here is an example of the EMMA annotation for a spoken command which is recognized as either Boston or Austin:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of> <emma:interpretation id="interp1" emma:confidence="0.6" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true"> <location>Boston</location> </emma:interpretation> <emma:interpretation id="interp2" emma:confidence="0.4" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true"> <location>Austin</location> </emma:interpretation> </emma:one-of> </emma:emma>
The following table shows the relationship between the medium, mode, and function properties and serves as an aid for classifying inputs. For the dialog function it also shows some examples of the classification of inputs as verbal vs. non-verbal.
Medium | Device | Mode | Function: recording | Function: dialog | Function: transcription | Function: verification
---|---|---|---|---|---|---
acoustic | microphone | speech | audiofile (e.g. voicemail) | spoken command / query / response (verbal = true); singing a note (verbal = false) | dictation | speaker recognition
tactile | keypad | dtmf | audiofile / character stream | typed command / query / response (verbal = true); command key "Press 9 for sales" (verbal = false) | text entry (T9-Tegic, word completion, or word grammar) | password / PIN entry
tactile | keyboard | keys | character / key-code stream | typed command / query / response (verbal = true); command key "Press S for sales" (verbal = false) | typing | password / PIN entry
tactile | pen | ink | trace, sketch | handwritten command / query / response (verbal = true); gesture (e.g. circling a building) (verbal = false) | handwritten text entry | signature, writer recognition
tactile | pen | gui | N/A | tapping on named button (verbal = true); drag and drop, tapping on map (verbal = false) | soft keyboard | password / PIN entry
tactile | mouse | ink | trace, sketch | handwritten command / query / response (verbal = true); gesture (e.g. circling a building) (verbal = false) | handwritten text entry | N/A
tactile | mouse | gui | N/A | clicking named button (verbal = true); drag and drop, clicking on map (verbal = false) | soft keyboard | password / PIN entry
tactile | joystick | ink | trace, sketch | gesture (e.g. circling a building) (verbal = false) | N/A | N/A
tactile | joystick | gui | N/A | pointing, clicking button / menu (verbal = false) | soft keyboard | password / PIN entry
visual | page scanner | photograph | image | handwritten command / query / response (verbal = true); drawings and images (verbal = false) | optical character recognition, object/scene recognition (markup, e.g. SVG) | N/A
visual | still camera | photograph | image | objects (verbal = false) | visual object/scene recognition | face id, retinal scan
visual | video camera | video | movie | sign language (verbal = true); face / hand / arm / body gesture (e.g. pointing, facing) (verbal = false) | audio/visual recognition | face id, gait id, retinal scan
One of the most powerful aspects of multimodal interfaces is their ability to provide support for user inputs which are distributed over the available input modes. These composite inputs are contributions made by the user within a single turn which have component parts in different modes. For example, the user might say "zoom in here" in the speech mode while drawing an area on a graphical display in the ink mode. One of the central motivating factors for this kind of input is that different kinds of communicative content are best suited to different input modes. In the example of a user drawing an area on a map and saying "zoom in here", the zoom command is easiest to provide in speech but the spatial information, the specific area, is easier to provide in ink.
Enabling composite multimodality is critical in ensuring that multimodal systems support more natural and effective interaction for users. In order to support composite inputs, a multimodal architecture must provide some kind of multimodal integration mechanism. In the W3C Multimodal Interaction Framework, multimodal integration can be handled by an integration component which follows the application of speech understanding and other kinds of interpretation procedures for individual modes.
Given the broad range of different techniques being employed for multimodal integration, and the extent to which this is an ongoing research problem, standardization of the specific method or algorithm used for multimodal integration is not appropriate at this time. In order to facilitate the development and interoperation of different multimodal integration mechanisms, EMMA provides markup enabling application-independent specification of elements in the application markup where content from another mode needs to be integrated. These representational 'hooks' can then be used by different kinds of multimodal integration components and algorithms to drive the process of multimodal integration. In the processing of a composite multimodal input, the result of applying a mode-specific interpretation component to each of the individual modes will be EMMA markup describing the possible interpretations of that mode. In the case of speech, this markup can be assigned through the application of SRGS rules and their associated semantic interpretation (SI) code. For some modes, some of those interpretations may contain an application semantics which is incomplete until content is added from another input mode. In the example mentioned above, the speech command "zoom in here" is incomplete until it is combined with the pen input of the user circling an area.
Annotation | emma:hook |
---|---|
Definition | An attribute of type xsd:string constrained to values in the open set {speech, dtmf_keypad, ink, gui, keys, video, photograph, ...}, or the wildcard 'any' |
Applies to | Application instance data |
The attribute emma:hook is used to mark the elements in the application semantics within an <emma:interpretation> which must be integrated with content from input in another mode. The mode to be integrated at that point in the application semantics is indicated as the value of the emma:hook attribute. In the example above, the annotation would be emma:hook="ink". The possible values of emma:hook are the list of input modes that can be values of emma:mode, such as speech, dtmf_keypad, ink, gui, and keys. In addition to these, the value of emma:hook can also be the wildcard any, indicating that the other content can come from any source. The annotation emma:hook differs in semantics from emma:mode as follows: annotating an element in the application semantics with emma:mode="ink" indicates that that part of the semantics came from the ink mode, while annotating an element with emma:hook="ink" indicates that that part of the semantics needs to be integrated with content from the ink mode.
To illustrate the use of emma:hook consider an example composite input in which the user says "zoom in here" in the speech input mode while drawing an area on a graphical display in the ink input mode. One possible way to represent the application semantics for "zoom in here" would be as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation emma:mode="speech"> <command> <action>zoom</action> <location emma:hook="ink"> <type>area</type> </location> </command> </emma:interpretation> </emma:emma>
This representation would be assigned to the spoken input "zoom in here" by a natural language understanding component. For example, the semantics could be generated using the W3C Speech Recognition Grammar Specification (SRGS) together with Semantic Interpretation (SI) tags to build the application semantics with the emma:hook attribute. For a more detailed explanation and an example, see Appendix: emma:hook and SRGS.
Note that the elements in the application markup here such as <action>, <location>, and <points> are in no way intended to be standardized. What is standardized is the use of emma:hook="ink" to indicate where multimodal integration is required. The action to be performed is indicated in an element <action>. The location on which to perform the action is indicated by the element <location>. The annotation emma:hook="ink" on the <location> element indicates that content needs to be added to this element through integration with content from the ink input mode. In our example, the interpretation of an area gesture could be represented as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation emma:mode="ink"> <location> <type>area</type> <points>42.1345 -37.128 42.1346 -37.120 ... </points> </location> </emma:interpretation> </emma:emma>
This representation could be generated by a pen modality component performing gesture recognition and interpretation. The input to the component would be an InkML specification of the ink trace and the output would be the EMMA document above.
There are two components to the process of integrating these two pieces of semantic markup. The first is to ensure that the two are compatible; that is, that no semantic constraints are violated. The second is to fuse the content from the two sources. In our example, the <type>area</type> element is intended to indicate that this speech command requires integration with an area gesture rather than, for example, a line gesture, which would have the subelement <type>line</type>. This constraint needs to be enforced by whatever mechanism is responsible for multimodal integration. In our example, the result should be the semantics of the speech with the addition of new information from the gesture, in this case the <points> element and its contents:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation emma:mode="multimodal"> <command> <action>zoom</action> <location> <type>area</type> <points>42.1345 -37.128 42.1346 -37.120 ... </points> </location> </command> </emma:interpretation> </emma:emma>
Many different techniques could be used for achieving this integration of the semantic interpretation of the pen input, a <location> element, with the corresponding <location> element in the speech. The hook simply serves to indicate the existence of this relationship.
One way to achieve both the compatibility checking and the fusion of content from the two modes is to use a well-defined general purpose matching mechanism such as unification. Graph unification is a mathematical operation defined over directed acyclic graphs which captures both components of integration in a single operation: the application of the semantic constraints and the fusing of content. One possible semantics for the emma:hook markup indicates that content from the required mode needs to be unified with that position in the application semantics. In order to unify, two elements must not have any conflicting values for subelements or attributes. This procedure is defined recursively, so that elements within the subelements must also not clash, and so on. The result of unification is the union of all of the elements and attributes of the two elements being unified.
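The following TypeScript sketch illustrates such a recursive unification; it is not a normative algorithm, and the Term model is an assumption made for the sketch: an element is a map from child names to subterms, and a leaf is a string.

// A minimal sketch of recursive unification over simple element trees.
type Term = string | { [name: string]: Term };

function unify(a: Term, b: Term): Term | null {
  // Leaves unify only with identical leaves; a leaf clashes with a tree.
  if (typeof a === "string" || typeof b === "string") {
    return a === b ? a : null;
  }
  // The result is the union of both trees; shared children must unify.
  const result: { [name: string]: Term } = { ...a };
  for (const name of Object.keys(b)) {
    if (name in result) {
      const merged = unify(result[name], b[name]);
      if (merged === null) return null; // conflicting subelement: fail
      result[name] = merged;
    } else {
      result[name] = b[name];           // new content is simply added
    }
  }
  return result;
}

// The <location> from speech (marked emma:hook="ink") and from ink:
const speechLocation: Term = { type: "area" };
const inkLocation: Term = { type: "area", points: "42.1345 -37.128 ..." };
console.log(unify(speechLocation, inkLocation));
// => { type: "area", points: "42.1345 -37.128 ..." }
console.log(unify(speechLocation, { type: "line" })); // => null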
In addition to the unification operation, in the resulting <emma:interpretation> the emma:hook attribute needs to be removed and the emma:mode attribute changed to multimodal.
Instead of the unification operation, for a specific application semantics, integration could be achieved using some other algorithm or script. The benefit of using the unification semantics for emma:hook is that it provides a general purpose mechanism for checking the compatibility of elements and fusing them, whatever the specific elements are in the application specific semantic representation.
The benefit of using the emma:hook annotation for authors is that it provides an application independent method for indicating where integration with content from another mode is required. If a general purpose integration mechanism is used, such as the unification approach described above, authors should be able to use the same integration mechanism for a range of different applications without having to change the integration rules or logic. For each application the speech grammar rules (SRGS) need to assign emma:hook to the appropriate elements in the semantic representation of the speech. The general purpose multimodal integration mechanism will use the emma:hook annotations in order to determine where to add in content from other modes. Another benefit of the emma:hook mechanism is that it facilitates interoperability among different multimodal integration components, so long as they are all general purpose and utilize emma:hook in order to determine where to integrate content.
The following provides a more detailed example of the use of the emma:hook annotation. In this example, spoken input is combined with two gestures made in ink. The semantic representation assigned to the spoken input "send this file to this" indicates, using emma:hook="ink", two locations where content is required from ink input:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <command> <action>send</action> <arg1> <object emma:hook="ink"> <type>file</type> <number>1</number> </object> </arg1> <arg2> <object emma:hook="ink"> <number>1</number> </object> </arg2> </command> </emma:interpretation> </emma:emma>
The user gesturing on the two locations on the display can be represented using <emma:sequence>:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:sequence> <emma:interpretation emma:mode="ink"> <object> <type>file</type> <number>1</number> <id>test.pdf</id> <object> </emma:interpretation> <emma:interpretation emma:mode="ink"> <object> <type>printer</type> <number>1</number> <id>lpt1</id> <object> </emma:interpretation> </emma:sequence> </emma:emma>
A general purpose unification-based multimodal integration algorithm could use the emma:hook annotation as follows. It identifies the elements marked with emma:hook in document order, and for each of those in turn it attempts to unify the element with the corresponding element, in order, in the <emma:sequence>. Since none of the subelements conflict, the unification succeeds, and the result is the following EMMA for the composite input:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation> <command> <action>send</action> <arg1> <object> <type>file</type> <number>1</number> <id>test.pdf</id> </object> </arg1> <arg2> <object> <type>printer</type> <number>1</number> <id>lpt1</id> </object> </arg2> </command> </emma:interpretation> </emma:emma>
Annotation | emma:cost |
---|---|
Definition | An attribute of type xsd:decimal in the range 0.0 to 10000000, indicating the processor's cost or weight associated with an input or part of an input. |
Applies to | EMMA container elements (emma:interpretation, emma:group, emma:one-of, emma:sequence), emma:arc, emma:node, and application instance data |
The cost annotation in EMMA is used to indicate the weight or cost associated with a user's input or part of their input. The most common use of emma:cost is for representing the costs encoded on a lattice output from speech recognition or other recognition or understanding processes. emma:cost can also be used to indicate the total cost associated with particular recognition results or semantic interpretations.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:one-of> <emma:interpretation id="meaning1" emma:cost="1600"> <location>Boston</location> </emma:interpretation> <emma:interpretation id="meaning2" emma:cost="400"> <location> Austin </location> </emma:interpretation> </emma:one-of> </emma:emma>
Annotation | emma:endpoint-role |
---|---|
Definition | An attribute of type xsd:string constrained to values in the set {source, sink, reply-to, router}. |
Applies to | EMMA container element emma:endpoint and application instance data |
Annotation | emma:endpoint-address |
Definition | An attribute of type xsd:anyURI that uniquely specifies the network address of the <emma:endpoint>. |
Applies to | EMMA container element emma:endpoint and application instance data |
Annotation | emma:port-type |
Definition | An attribute of type xsd:QName that specifies the type of the port |
Applies to | EMMA container element emma:endpoint and application instance data |
Annotation | emma:port-num |
Definition | An attribute of type xsd:nonNegativeInteger that specifies the port number |
Applies to | EMMA container element emma:endpoint and application instance data |
Annotation | emma:message-id |
Definition | An attribute of type xsd:anyURI that specifies the message ID associated with the data |
Applies to | EMMA container elements (emma:endpoint, emma:endpoint-info, emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data. |
Annotation | emma:service-name |
Definition | An attribute of type xsd:string that specifies the name of the service |
Applies to | EMMA container elements (emma:endpoint, emma:endpoint-info, emma:interpretation, emma:group, emma:one-of, emma:sequence), and application instance data. |
Annotation | emma:endpoint-pair-ref |
Definition | An attribute of type xsd:anyURI that specifies the pairing between sink and source endpoints |
Applies to | EMMA container element emma:endpoint and application instance data |
The emma:endpoint-role attribute specifies the role that a particular <emma:endpoint> performs in a multimodal interaction. The role value "sink" indicates that the endpoint is the receiver of the input data. The role value "source" indicates that the endpoint is the sender of the input data. The role value "reply-to" indicates that the endpoint is the one to which the reply should be sent. The same emma:endpoint-address can appear in multiple <emma:endpoint> specifications, provided that the same endpoint address serves multiple roles (e.g. sink, source, reply-to, router) or is associated with multiple interpretations.
The emma:endpoint-address attribute specifies the network address of the <emma:endpoint>, and emma:port-type specifies its port type. The emma:port-num attribute annotates the port number of the endpoint (e.g. the typical port number for an HTTP endpoint is 80). The emma:message-id attribute annotates the message ID associated with the annotated input. This meta-information is used to establish and maintain the communication context for both inbound processing and outbound operation. The service specification of the <emma:endpoint> is annotated by emma:service-name, which names the service that the <emma:endpoint> performs. The matching of the "sink" endpoint and its paired "source" endpoint is annotated by the emma:endpoint-pair-ref attribute. One sink endpoint can link to multiple source endpoints through emma:endpoint-pair-ref. Further bundling of <emma:endpoint> elements can be achieved through the annotation of <emma:group> [Ref: <emma:group>].
The following example illustrates the use of these attributes in a multimodal interaction where multiple modalities are used.
<emma:emma emma:version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns: ex="http://www.example.com/emma/port"> <emma:group> <emma:interpretation emma:id="int1" emma:start="1087995961542" emma:end="1087995963542"> <emma:endpoint-info emma:id="audio-channel-1" > <emma:endpoint emma:id="endpoint1" emma:endpoint-role="sink" emma:endpoint-address="135.61.71.103" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint2" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> <emma:endpoint emma:id="endpoint2" emma:endpoint-role="source" emma:endpoint-address="136.62.72.104" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint1" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> </emma:endpoint-info> <destination>Chicago</destination> </emma:interpretation> <emma:interpretation emma:id="int2" emma:start="1087995961542" emma:end="1087995963542"> <emma:endpoint-info emma:id="ink-channel-1"> <emma:endpoint emma:id="endpoint3" emma:endpoint-role="sink" emma:endpoint-address="http://emma.example/sink" emma:endpoint-pair-ref="#endpoint4" emma:port-num="80" emma:port-type="http" emma:message-id="uuid:2e5678" emma:service-name="travel" emma:mode="ink"/> <emma:endpoint emma:id="endpoint4" emma:endpoint-role="source" emma:port-address="http://emma.example/source" emma:endpoint-pair-ref="#endpoint3" emma:port-num="80" emma:port-type="http" emma:message-id="uuid:2e5678" emma:service-name="travel" emma:mode="ink"/> </emma:endpoint-info> <location> <type>area</type> <points>34.13 -37.12 42.13 -37.12 ... </points> </location> </emma:interpretation> </emma:group> </emma:emma>
Annotation | emma:endpoint-info-ref |
---|---|
Definition | An attribute of type xsd:IDREF referring to the id attribute of an emma:endpoint-info element |
Applies to | EMMA container elements emma:interpretation, emma:group, emma:one-of, emma:sequence and application instance data |
The emma:endpoint-info-ref annotation associates the EMMA result in the container element with the referenced <emma:endpoint-info> element.
Example:
<emma:emma emma:version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns: ex="http://www.example.com/emma/port"> <emma:one-of> <emma:interpretation emma:id="int1" emma:start="1087995961542" emma:end="1087995963542"> <emma:endpoint-info emma:id="audio-channel-1" > <emma:endpoint emma:id="endpoint1" emma:endpoint-role="sink" emma:endpoint-address="135.61.71.103" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint2" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> <emma:endpoint emma:id="endpoint2" emma:endpoint-role="source" emma:endpoint-address="136.62.72.104" emma:port-num="50204" emma:port-type="rtp" emma:endpoint-pair-ref="#endpoint1" emma:media-type="audio/dsr-202212; rate:8000; maxptime:40" emma:service-name="travel" emma:mode="speech"> <ex:app-protocol>SIP</ex:app-protocol> </emma:endpoint> </emma:endpoint-info> <destination>Chicago</destination> </emma:interpretation> <emma:interpretation emma:id="int2" emma:start="1087995961542" emma:end="1087995963542" emma:endpoint-info-ref="#audio-channel-1"> <destination>Austin</destination> </emma:interpretation> </emma:one-of> </emma:emma>
This section concerns the scope of EMMA annotations across derivations of user input connected using the <emma:derived-from> element (Section 3.2), which can be used to capture both sequential and composite derivations. Sequential derivations involve processing steps that do not involve multimodal integration, such as applying natural language understanding and then reference resolution to a speech transcription.
Annotation scope in sequential derivations is addressed in Section 4.1. Composite derivations involve the combination of inputs from different input modes; these are addressed in Section 4.2 below. Note that an EMMA derivation may include both sequential and composite derivation steps. EMMA derivations describe only single turns of user input and are not intended to describe a sequence of dialogue turns.
In order to indicate whether an <emma:derived-from/> element describes a sequential or a composite derivation step, the <emma:derived-from/> element has a composite attribute with a boolean value. A composite <emma:derived-from/> must be marked as composite="true", while a sequential <emma:derived-from/> is marked as composite="false". If this attribute is not specified, the value defaults to "false".
This section concerns the scope of EMMA annotations in sequential derivations. EMMA enables the annotation of whole derivations of user input. For example an EMMA document could contain <emma:interpretation> elements for the transcription, interpretation, and reference resolution of a speech input, utilizing the id values: raw, better, and best respectively:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw" emma:process="http://example.com/myasr1.xml"/>> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> <emma:interpretation id="better" emma:process="http://example.com/mynlu1.xml"> <emma:derived-from resource="#raw" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> </emma:interpretation> <emma:interpretation id="best" emma:process="http://example.com/myrefresolution1.xml"> <emma:derived-from resource="#better" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
Each member of the derivation chain is linked to the previous one by an <emma:derived-from> element (Section 3.1.5), whose resource attribute provides a pointer to the <emma:interpretation> from which it is derived. The emma:process annotation (Section 3.2.2) provides a pointer to the process used for each stage of the derivation.
The scope of EMMA annotations becomes clearer in EMMA documents with a more fully specified set of EMMA annotations, as illustrated in the following example.
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw" emma:process="http://example.com/myasr1.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.6" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:tokens="from boston to denver tomorrow" emma:lang="en-US"> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> <emma:interpretation id="better" emma:process="http://example.com/mynlu1.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:tokens="from boston to denver tomorrow" emma:lang="en-US"> <emma:derived-from resource="#raw" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> </emma:interpretation> <emma:interpretation id="best" emma:process="http://example.com/myrefresolution1.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:tokens="from boston to denver tomorrow" emma:lang="en-US"> <emma:derived-from resource="#better" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
EMMA annotations on earlier stages of the derivation may still be true of later stages of the derivation. Although this can be captured in EMMA by repeating the annotations on each emma:interpretation within the derivation, as in the example above, there are two disadvantages of this approach to annotation. First, the repetition of annotations makes the resulting EMMA documents significantly more verbose. Second, EMMA processors used for intermediate tasks such as natural language understanding and reference resolution will need to read in all of the annotations and write them all out again.
EMMA overcomes these problems by assuming that annotations on earlier stages of a derivation automatically apply to later stages of the derivation unless a new value is specified. Later stages of the derivation essentially inherit annotations from earlier stages in the derivation. For example, if there was an emma:source annotation on the transcription (raw) it would also apply to the later stages of the derivation such as the result of natural language understanding (better) or reference resolution (best).
Because of the assumption in EMMA that annotations have scope over later stages of a sequential derivation, the example EMMA document above can be equivalently represented as follows:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="raw" emma:process="http://example.com/myasr1.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.6" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:tokens="from boston to denver tomorrow" emma:lang="en-US"> <answer>From Boston to Denver tomorrow</answer> </emma:interpretation> <emma:interpretation id="better" emma:process="http://example.com/mynlu1.xml" emma:confidence="0.8"> <emma:derived-from resource="#raw" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>tomorrow</date> </emma:interpretation> <emma:interpretation id="best" emma:process="http://example.com/myrefresolution1.xml"> <emma:derived-from resource="#better" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
The reduced form above, in which only annotations with new values are specified at each stage, is equivalent to the fully specified derivation that precedes it. These two EMMA documents should yield the same result when processed by an EMMA processor.
The emma:confidence annotation is respecified on the better interpretation. This indicates the confidence score for natural language understanding, whereas emma:confidence on the raw interpretation indicates the speech recognition confidence score.
In order to determine the full set of annotations that apply to an <emma:interpretation> element, an EMMA processor or script needs to access the annotations directly on that element and, for any that are not specified, follow the reference in the resource attribute of the <emma:derived-from> element to add in annotations from earlier stages of the derivation.
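A TypeScript sketch of this resolution procedure follows; it is one possible implementation using the DOM, not a normative algorithm, and it presumes that id attributes are recognized as XML IDs by the parser:

// Sketch only: collect the effective annotations for an interpretation by
// walking back along the emma:derived-from chain.
const EMMA_NS = "http://www.w3.org/2003/04/emma";

function effectiveAnnotations(interp: Element, doc: Document): Map<string, string> {
  const annotations = new Map<string, string>();
  let stage: Element | null = interp;
  while (stage !== null) {
    // Annotations specified directly on a later stage take precedence.
    for (let i = 0; i < stage.attributes.length; i++) {
      const attr = stage.attributes[i];
      if (attr.namespaceURI === EMMA_NS && !annotations.has(attr.localName)) {
        annotations.set(attr.localName, attr.value);
      }
    }
    // Follow the resource pointer ("#id") to the earlier stage, if any.
    const derived = stage.getElementsByTagNameNS(EMMA_NS, "derived-from")[0];
    const resource = derived ? derived.getAttribute("resource") : null;
    stage = resource && resource.startsWith("#")
      ? doc.getElementById(resource.substring(1))
      : null;
  }
  return annotations;
}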
EMMA annotations break down into three groups with respect to their scope in sequential derivations. One group of annotations always holds true for all members of a sequential derivation. A second group is always respecified at each stage of the derivation. A third group may or may not be respecified.
Classification | Annotation |
---|---|
Applies to whole derivation | emma:signal |
emma:source | |
emma:medium | |
emma:mode | |
emma:function | |
emma:verbal | |
emma/xml:lang | |
emma:tokens | |
emma:start | |
emma:end | |
emma:from-start-of | |
emma:from-end-of | |
emma:start-offset | |
emma:end-offset | |
Specified at each stage of derivation | <emma:derived-from> |
emma:process | |
May be respecified | emma:confidence |
emma:model | |
emma:no-input | |
emma:uninterpreted |
One potential problem with this annotation scoping mechanism is that earlier annotations could be lost if earlier stages of a derivation were dropped in order to reduce message size. This problem can be overcome by considering annotation scope at the point where earlier derivation stages are discarded and populating the final interpretation in the derivation with all of the annotations which it could inherit. For example, if the raw and better stages were dropped the resulting EMMA document would be:
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"> <emma:interpretation id="best" emma:start="1087995961542" emma:end="1087995963542" emma:process="http://example.com/myrefresolution1.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.8" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:tokens="from boston to denver tomorrow" emma:lang="en-US"> <emma:derived-from resource="#better" composite="false"/> <origin>Boston</origin> <destination>Denver</destination> <date>03152003</date> </emma:interpretation> </emma:emma>
In addition to representing sequential derivations, the EMMA <emma:derived-from> element can also be used to capture composite derivations, which involve the combination of inputs from different modes. In the following composite derivation example, the user said "destination" and circled Boston on a map:
<emma:emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"> <emma:interpretation id="speech1" emma:start="1087995961542" emma:end="1087995963542" emma:process="http://example.com/myasr.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.6" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:tokens="destination"> <rawinput>destination</rawinput> </emma:interpretation> <emma:interpretation id="pen1" emma:start="1087995961542" emma:end="1087995963542" emma:process="http://example.com/mygesturereco.xml" emma:source="http://example.com/pen/wacom123" emma:signal="http://example.com/signals/ink5.inkml" emma:confidence="0.5" emma:medium="tactile" emma:mode="ink" emma:function="dialog" emma:verbal="false"> <rawinput>Boston</rawinput> </emma:interpretation> <emma:interpretation id="multimodal1" emma:process="http://example.com/myintegrator.xml"> <emma:derived-from resource="#speech1" composite="true"/> <emma:derived-from resource="#pen1" composite="true"/> <destination>Boston</destination> </emma:interpretation> </emma:emma>
In this example, annotations on the multimodal interpretation indicate the process used for the integration and there are two <emma:derived-from> elements, one pointing to the speech and one pointing to the pen gesture.
In EMMA, while annotations are assumed to have scope over later stages in a sequential derivation, they are not assumed to have scope over composite derivation steps. This is because the combining inputs often have different values for a given annotation, as with emma:signal, emma:source, emma:confidence, emma:start, and emma:end. For some of these annotations, no single value can be determined for the multimodal interpretation, for example emma:signal and emma:source. For others a single value may be computed for the multimodal interpretation, but it may involve more than simple inheritance. For example, the value of emma:start for the multimodal interpretation should be the earlier of the two start values from the two combining inputs (in the above example, emma:start="1087995961542"), and the value of emma:end should be the later of the two end values (emma:end="1087995963542"). In the case of emma:confidence, the value for the composite is the result of a numerical function defined by the author of the multimodal integration component or script. In the case of other annotations such as emma:verbal, if either of the inputs has the value true then the multimodal interpretation is emma:verbal="true"; in other words, the annotation for the composite input is the result of an inclusive OR of the boolean values of the annotations on the inputs.
If an annotation is only specified in one of the combining inputs then it can be assumed to apply to the multimodal interpretation of the composite input. For example, emma:lang="en-US" is only specified for the speech input.
Given the complexity of annotation scope across composite derivation steps, EMMA does not require any annotations to have scope over composite derivation steps. However, guidance is provided here for authors of multimodal integration components as to how EMMA annotations should be handled in composite derivations. The following table breaks down EMMA annotations into categories depending on their behavior in composite derivations.
Classification | Annotation | Function for value |
---|---|---|
1. Always has different values | emma:signal | 'multiple' |
emma:source | ||
emma:tokens | ||
emma:process | New value(s) describing composite integration | |
<emma:derived-from> | ||
2. Sometimes has different values | emma:medium | Common value or 'multiple' if they conflict |
emma:mode | ||
emma/xml:lang | ||
emma:model | ||
3. Function combines values | emma:start | The earlier of the two start timestamps (standard) |
emma:end | The later of the two end timestamps (standard) | |
emma:from-start-of | TBD (see open issue below) | |
emma:from-end-of | TBD (see open issue below) | |
emma:start-offset | TBD (see open issue below) | |
emma:end-offset | TBD (see open issue below) | |
emma:confidence | combination of confidence scores (author-defined) | |
emma:function | some functions are dominant (e.g. 'dialog') (standard) | |
emma:verbal | inclusive OR of values (standard) | |
4. Not integrated | emma:uninterpreted | Not applicable |
emma:no-input |
When a multimodal integration component generates the EMMA document for a composite interpretation, each of these sets of EMMA annotations should be handled as indicated below; a sketch of one possible combination step follows the list.
1. Always has different values: The value of the annotation on the multimodal interpretation should be multiple, indicating the presence of the conflict. In the case of emma:process and <emma:derived-from>, there will be new value(s) describing the integration process and references to the combined inputs.
2. Sometimes has different values: If the values of an annotation are the same for the combined inputs, then that value should be used in the annotation on the composite. If they are not the same, then the annotation value on the multimodal interpretation should be multiple, indicating the presence of the conflict. If an annotation only appears on one of the inputs, then the value from the input that has the annotation should be used for the composite.
3. Function combines values: The values should be combined in accordance with the specific function required for that annotation. For some annotations the combination function is standard; e.g. the earliest value for emma:start, the latest value for emma:end, and inclusive OR for emma:verbal. For others, such as emma:confidence, there is no standard function and the function used will be defined by the application developer.
4. Not integrated: Inputs with these annotations will not be part of composite inputs and so they will not need to be annotated in composite interpretations.
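The TypeScript sketch below illustrates one possible combination step following this guidance. The flat annotation maps and the multiplication used for confidence are assumptions made for the sketch, not normative behaviour:

// Sketch only: combining the annotations of two inputs for a composite
// interpretation, per the table above.
interface Annotations { [name: string]: string | undefined; }

function combineAnnotations(a: Annotations, b: Annotations): Annotations {
  const out: Annotations = {};
  const oneOrBoth = (name: string, conflict: string) => {
    const va = a[name], vb = b[name];
    if (va === undefined && vb === undefined) return;
    if (va === undefined) { out[name] = vb; return; }
    if (vb === undefined) { out[name] = va; return; }
    out[name] = va === vb ? va : conflict;
  };
  // Groups 1 and 2: keep a common or singly specified value, otherwise
  // mark the conflict with 'multiple'.
  for (const name of ["signal", "source", "tokens", "medium", "mode", "lang", "model"]) {
    oneOrBoth(name, "multiple");
  }
  // Group 3: standard combination functions (absolute timestamps assumed
  // to be present on both inputs).
  out["start"] = String(Math.min(Number(a["start"]), Number(b["start"])));
  out["end"] = String(Math.max(Number(a["end"]), Number(b["end"])));
  out["verbal"] = String(a["verbal"] === "true" || b["verbal"] === "true");
  out["function"] =
    a["function"] === "dialog" || b["function"] === "dialog"
      ? "dialog"                          // 'dialog' dominates
      : a["function"] ?? b["function"];
  // emma:confidence has no standard function; multiplication is the
  // author-defined choice used in the example below (0.6 * 0.5 = 0.3).
  out["confidence"] = String(Number(a["confidence"]) * Number(b["confidence"]));
  return out;
}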
For 1. and 2. above, conflicts are indicated in the annotations on the composite using the value multiple. If the values of the annotations on the combining inputs are needed, they can be accessed through the pointers in the resource attributes of the <emma:derived-from> elements. However, if the early stages of the derivation have been dropped or are only remotely accessible, this may not be feasible. Unlike the sequential derivation case, since the values may clash, the problem cannot be avoided by fully instantiating the <emma:interpretation> at the end of the derivation chain.
In order to address this problem, values of conflicting annotations must be indicated directly on the <emma:derived-from> element. There will be one <emma:derived-from> element for each combining input, providing a placeholder for annotations with conflicting values.
The fully specified EMMA document for the composite input described above is as follows:
<emma:emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"> <emma:interpretation id="speech1" emma:start="1087995961542" emma:end="1087995963542" emma:process="http://example.com/myasr.xml" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:confidence="0.6" emma:medium="acoustic" emma:mode="speech" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:tokens="destination"> <rawinput>destination</rawinput> </emma:interpretation> <emma:interpretation id="pen1" emma:start="1087995961542" emma:end="1087995963542" emma:process="http://example.com/mygesturereco.xml" emma:source="http://example.com/pen/wacom123" emma:signal="http://example.com/signals/ink5.inkml" emma:confidence="0.5" emma:medium="tactile" emma:mode="ink" emma:function="dialog" emma:verbal="false"> <rawinput>Boston</rawinput> </emma:interpretation> <emma:interpretation id="multimodal1" emma:source="multiple" emma:signal="multiple" emma:confidence="0.3" emma:medium="multiple" emma:mode="multiple" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:tokens="destination"> <emma:derived-from resource="#speech1" composite="true" emma:source="http://example.com/microphone/NC-61" emma:signal="http://example.com/signals/sg23.wav" emma:medium="acoustic" emma:mode="speech"/> <emma:derived-from resource="#pen1" composite="true" emma:source="http://example.com/pen/wacom123" emma:signal="http://example.com/signals/ink5.inkml" emma:medium="tactile" emma:mode="ink"/> <destination>Boston</destination> </emma:interpretation> </emma:emma>
In this example, the annotations for emma:source, emma:signal, emma:medium, and emma:mode all have conflicting values on the inputs (#speech1 and #pen1) and are marked as "multiple" on the composite interpretation (#multimodal1). The emma:lang and emma:tokens annotations are only specified on the speech (#speech1) and are therefore inherited by the composite interpretation (#multimodal1). The emma:start and emma:end annotations are combined by standard functions yielding the earliest and latest time values respectively on #multimodal1. The emma:verbal and emma:function annotations are determined by standard combination functions. Since the emma:verbal annotation is "true" on the speech (#speech1) and "false" on the pen (#pen1), the annotation on the composite interpretation is "true". Since both the speech and pen have emma:function="dialog", the composite is annotated as emma:function="dialog". The emma:confidence annotation on the composite is determined by a non-standard function defined by the author of the integration component; in this case the function is multiplication and the resulting annotation is emma:confidence="0.3".
In implementing an EMMA processor for composite input, the EMMA annotations for timestamps, emma:function, and emma:verbal on the EMMA document representing the composite input should be handled as indicated in the table above. This is a constraint on documents representing composite derivations in EMMA.
(TBD)
Conformance issues are deferred until a later revision of the specification.
This section defines the formal syntax for EMMA documents in terms of a normative XML Schema.
This schema is also available as emma.xsd.
<?xml version="1.0" encoding="UTF-8"?> <xs:schema attributeFormDefault="unqualified" elementFormDefault="unqualified" targetNamespace="http://www.w3.org/2003/04/emma" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:annotation> <xs:documentation> EMMA 1.0 schema (20041130) </xs:documentation> </xs:annotation> <xs:annotation> <xs:documentation> Copyright 1998-2004 W3C (MIT, ERCIM, Keio), All Rights Reserved. Permission to use, copy, modify and distribute the EMMA schema and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that the above copyright notice and this paragraph appear in all copies. The copyright holders make no representation about the suitability of the schema for any purpose. It is provided "as is" without expressed or implied warranty. </xs:documentation> </xs:annotation> <xs:annotation> <xs:documentation> property annotations </xs:documentation> </xs:annotation> <xs:attribute name="grammar-ref" type="xs:IDREF"/> <xs:attribute name="tokens" type="xs:string"/> <xs:attribute name="lang" type="xs:language"/> <xs:attribute name="confidence"> <xs:simpleType> <xs:restriction base="xs:decimal"> <xs:minInclusive value="0.0"/> <xs:maxInclusive value="1.0"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="source" type="xs:anyURI"/> <xs:attribute name="process" type="xs:anyURI"/> <xs:attribute name="no-input" type="xs:boolean"/> <xs:attribute name="uninterpreted" type="xs:boolean"/> <xs:attribute name="signal" type="xs:anyURI"/> <xs:attribute name="media-type" type="xs:string"/> <xs:annotation> <xs:documentation> endpoint annotations </xs:documentation> </xs:annotation> <xs:attribute name="endpoint-address" type="xs:anyURI"/> <xs:attribute name="port-num" type="xs:nonNegativeInteger"/> <xs:attribute name="port-type" type="xs:QName"/> <xs:attribute name="endpoint-role"> <xs:simpleType> <xs:restriction base="xs:NMTOKEN"> <xs:enumeration value="source"/> <xs:enumeration value="sink"/> <xs:enumeration value="reply-to"/> <xs:enumeration value="router"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="message-id" type="xs:anyURI"/> <xs:attribute name="service-name" type="xs:string"/> <xs:attribute name="endpoint-pair-ref" type="xs:anyURI"/> <xs:attribute name="endpoint-info-ref" type="xs:IDREF"/> <xs:annotation> <xs:documentation> timestamp annotations </xs:documentation> </xs:annotation> <xs:attribute name="start" type="xs:unsignedLong"/> <xs:attribute name="end" type="xs:unsignedLong"/> <xs:attribute name="time-ref-uri" type="xs:anyURI"/> <xs:attribute default="start" name="time-ref-anchor"> <xs:simpleType> <xs:restriction base="xs:NMTOKEN"> <xs:enumeration value="start"/> <xs:enumeration value="end"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:attribute name="offset-to-start" type="xs:integer"/> <xs:attribute default="0" name="duration" type="xs:nonNegativeInteger"/> <xs:annotation> <xs:documentation> medium, mode, and function annotations </xs:documentation> </xs:annotation> <xs:attribute name="medium" type="xs:string"/> <xs:attribute name="mode" type="xs:string"/> <xs:attribute name="function" type="xs:string"/> <xs:attribute name="verbal" type="xs:boolean"/> <xs:annotation> <xs:documentation> hook for composite integration </xs:documentation> </xs:annotation> <xs:attribute name="hook" type="xs:string"/> <xs:annotation> <xs:documentation> cost </xs:documentation> </xs:annotation> <xs:attribute name="cost"> <xs:simpleType> 
<xs:restriction base="xs:decimal"> <xs:minInclusive value="0.0"/> <xs:maxInclusive value="10000000"/> </xs:restriction> </xs:simpleType> </xs:attribute> <xs:annotation> <xs:documentation> group annotations </xs:documentation> </xs:annotation> <xs:attributeGroup name="group.attribs"> <xs:attribute ref="emma:message-id"/> <xs:attribute ref="emma:service-name"/> <xs:attribute ref="emma:endpoint-info-ref"/> <xs:attribute ref="emma:grammar-ref"/> <xs:attribute ref="emma:tokens"/> <xs:attribute ref="emma:lang"/> <xs:attribute ref="emma:confidence"/> <xs:attribute ref="emma:source"/> <xs:attribute ref="emma:start"/> <xs:attribute ref="emma:end"/> <xs:attribute ref="emma:time-ref-uri"/> <xs:attribute ref="emma:time-ref-anchor"/> <xs:attribute ref="emma:offset-to-start"/> <xs:attribute ref="emma:duration"/> <xs:attribute ref="emma:medium"/> <xs:attribute ref="emma:mode"/> <xs:attribute ref="emma:function"/> <xs:attribute ref="emma:verbal"/> <xs:attribute ref="emma:cost"/> </xs:attributeGroup> <xs:annotation> <xs:documentation> interpretation annotations </xs:documentation> </xs:annotation> <xs:attributeGroup name="interpretation.attribs"> <xs:attributeGroup ref="emma:group.attribs"/> <xs:attribute ref="emma:process"/> <xs:attribute ref="emma:no-input"/> <xs:attribute ref="emma:uninterpreted"/> <xs:attribute ref="emma:signal"/> <xs:attribute ref="emma:media-type"/> </xs:attributeGroup> <xs:attributeGroup name="lattice.attribs"> <xs:attribute name="initial" type="xs:nonNegativeInteger" use="required"/> <xs:attribute name="final" type="xs:nonNegativeInteger" use="required"/> </xs:attributeGroup> <xs:annotation> <xs:documentation> endpoint definition </xs:documentation> </xs:annotation> <xs:complexType name="endpoint"> <xs:sequence> <xs:choice maxOccurs="unbounded" minOccurs="1"> <xs:any namespace="##other" processContents="lax"/> </xs:choice> </xs:sequence> <xs:attribute name="id" type="xs:ID"/> <xs:attribute ref="emma:endpoint-address"/> <xs:attribute ref="emma:endpoint-role"/> <xs:attribute ref="emma:port-num"/> <xs:attribute ref="emma:port-type"/> <xs:attribute ref="emma:message-id"/> <xs:attribute ref="emma:service-name"/> <xs:attribute ref="emma:endpoint-pair-ref"/> </xs:complexType> <xs:annotation> <xs:documentation> endpoint-info definition </xs:documentation> </xs:annotation> <xs:complexType name="endpoint-info"> <xs:sequence> <xs:choice maxOccurs="unbounded" minOccurs="1"> <xs:element ref="emma:endpoint"/> </xs:choice> </xs:sequence> <xs:attribute name="id" type="xs:ID"/> </xs:complexType> <xs:annotation> <xs:documentation> lattice definition </xs:documentation> </xs:annotation> <xs:complexType name="lattice"> <xs:sequence> <xs:choice maxOccurs="unbounded" minOccurs="0"> <xs:element ref="emma:arc"/> </xs:choice> <xs:choice maxOccurs="1" minOccurs="0"> <xs:element ref="emma:node"/> </xs:choice> </xs:sequence> <xs:attributeGroup ref="emma:lattice.attribs"/> </xs:complexType> <xs:attributeGroup name="arc.attribs"> <xs:attribute name="from" type="xs:nonNegativeInteger" use="required"/> <xs:attribute name="to" type="xs:nonNegativeInteger" use="required"/> </xs:attributeGroup> <xs:complexType mixed="true" name="arc"> <xs:attributeGroup ref="emma:arc.attribs"/> <xs:attributeGroup ref="emma:interpretation.attribs"/> </xs:complexType> <xs:attributeGroup name="node.attribs"> <xs:attribute name="node-number" type="xs:nonNegativeInteger" use="required"/> <xs:attribute ref="emma:cost"/> </xs:attributeGroup> <xs:complexType name="node"> <xs:attributeGroup ref="emma:node.attribs"/> </xs:complexType> 
  <xs:annotation>
    <xs:documentation> group-info definition </xs:documentation>
  </xs:annotation>

  <xs:attributeGroup name="group-info.attribs">
    <xs:attribute name="ref" type="xs:anyURI"/>
  </xs:attributeGroup>

  <xs:complexType name="group-info">
    <xs:sequence>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:any namespace="##other" processContents="lax"/>
      </xs:choice>
    </xs:sequence>
    <xs:attributeGroup ref="emma:group-info.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> model definition </xs:documentation>
  </xs:annotation>

  <xs:attributeGroup name="model.attribs">
    <xs:attribute name="ref" type="xs:anyURI" use="required"/>
  </xs:attributeGroup>

  <xs:complexType name="model">
    <xs:attributeGroup ref="emma:model.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> derived-from definition </xs:documentation>
  </xs:annotation>

  <xs:attributeGroup name="derived-from.attribs">
    <xs:attribute name="resource" type="xs:anyURI" use="required"/>
  </xs:attributeGroup>

  <xs:complexType name="derived-from">
    <xs:attributeGroup ref="emma:derived-from.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> grammar definition </xs:documentation>
  </xs:annotation>

  <xs:attributeGroup name="grammar.attribs">
    <xs:attribute name="id" type="xs:ID" use="required"/>
    <xs:attribute name="href" type="xs:anyURI" use="required"/>
  </xs:attributeGroup>

  <xs:complexType name="grammar">
    <xs:attributeGroup ref="emma:grammar.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> info definition </xs:documentation>
  </xs:annotation>

  <xs:complexType name="info">
    <xs:sequence>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:any namespace="##other" processContents="lax"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> interpretation definition </xs:documentation>
  </xs:annotation>

  <xs:group name="interpretation.class">
    <xs:sequence>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:model"/>
      </xs:choice>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:endpoint-info"/>
      </xs:choice>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:info"/>
      </xs:choice>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:element ref="emma:derived-from" maxOccurs="unbounded"/>
      </xs:choice>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:lattice"/>
        <xs:any namespace="##other" processContents="lax" maxOccurs="unbounded"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="interpretation">
    <xs:group ref="emma:interpretation.class"/>
    <xs:attribute name="id" type="xs:ID" use="required"/>
    <xs:attributeGroup ref="emma:interpretation.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> group definition </xs:documentation>
  </xs:annotation>

  <xs:group name="group.class">
    <xs:sequence>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:group-info"/>
      </xs:choice>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:element ref="emma:model"/>
        <xs:element ref="emma:endpoint-info"/>
        <xs:element ref="emma:info"/>
        <xs:element ref="emma:interpretation"/>
        <xs:element ref="emma:one-of"/>
        <xs:element ref="emma:group"/>
        <xs:element ref="emma:sequence"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="group">
    <xs:group ref="emma:group.class"/>
    <xs:attribute name="id" type="xs:ID" use="required"/>
    <xs:attributeGroup ref="emma:group.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> one-of definition </xs:documentation>
  </xs:annotation>

  <xs:group name="one-of.class">
    <xs:sequence>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:element ref="emma:model"/>
        <xs:element ref="emma:endpoint-info"/>
        <xs:element ref="emma:info"/>
        <xs:element ref="emma:interpretation"/>
        <xs:element ref="emma:group"/>
        <xs:element ref="emma:sequence"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="one-of">
    <xs:group ref="emma:one-of.class"/>
    <xs:attribute name="id" type="xs:ID" use="required"/>
    <xs:attributeGroup ref="emma:group.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> sequence definition </xs:documentation>
  </xs:annotation>

  <xs:group name="sequence.class">
    <xs:sequence>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:element ref="emma:model"/>
        <xs:element ref="emma:endpoint-info"/>
        <xs:element ref="emma:info"/>
        <xs:element ref="emma:interpretation"/>
        <xs:element ref="emma:group"/>
        <xs:element ref="emma:one-of"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="sequence">
    <xs:group ref="emma:sequence.class"/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attributeGroup ref="emma:group.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> emma definition </xs:documentation>
  </xs:annotation>

  <xs:attributeGroup name="emma.root.attribs">
    <xs:annotation>
      <xs:documentation/>
    </xs:annotation>
    <xs:attribute name="version" type="xs:string" use="required"/>
  </xs:attributeGroup>

  <xs:group name="emma.class">
    <xs:annotation>
      <xs:documentation> emma content model </xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:choice maxOccurs="unbounded" minOccurs="0">
        <xs:element ref="emma:grammar"/>
        <xs:element ref="emma:info"/>
        <xs:element ref="emma:endpoint-info"/>
      </xs:choice>
      <xs:choice maxOccurs="1" minOccurs="0">
        <xs:element ref="emma:interpretation" maxOccurs="unbounded"/>
        <xs:element ref="emma:one-of"/>
        <xs:element ref="emma:group"/>
        <xs:element ref="emma:sequence"/>
      </xs:choice>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="emma">
    <xs:annotation>
      <xs:documentation> emma content model and attributes </xs:documentation>
    </xs:annotation>
    <xs:group ref="emma:emma.class"/>
    <xs:attributeGroup ref="emma:emma.root.attribs"/>
  </xs:complexType>

  <xs:annotation>
    <xs:documentation> EMMA elements </xs:documentation>
  </xs:annotation>

  <xs:element name="info" type="emma:info"/>
  <xs:element name="endpoint-info" type="emma:endpoint-info"/>
  <xs:element name="endpoint" type="emma:endpoint"/>
  <xs:element name="model" type="emma:model"/>
  <xs:element name="derived-from" type="emma:derived-from"/>
  <xs:element name="group-info" type="emma:group-info"/>
  <xs:element name="lattice" type="emma:lattice"/>
  <xs:element name="arc" type="emma:arc"/>
  <xs:element name="node" type="emma:node"/>
  <xs:element name="interpretation" type="emma:interpretation"/>
  <xs:element name="one-of" type="emma:one-of"/>
  <xs:element name="group" type="emma:group"/>
  <xs:element name="sequence" type="emma:sequence"/>
  <xs:element name="grammar" type="emma:grammar"/>
  <xs:element name="emma" type="emma:emma"/>

</xs:schema>
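For illustration, a minimal document accepted by this schema might look as follows. This is a non-normative sketch: the namespace URI bound to the emma prefix, the application namespace http://example.com/app, and the payload element answer are assumptions made for the example, not part of the schema above.

<emma:emma version="1.0"
           xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- interpretation requires an id attribute; its application content
       must come from a namespace other than the EMMA namespace, per the
       xs:any namespace="##other" wildcard in interpretation.class -->
  <emma:interpretation id="int1">
    <app:answer xmlns:app="http://example.com/app">yes</app:answer>
  </emma:interpretation>
</emma:emma>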
Leading and trailing spaces in utterances are not significant. This will be defined in the schema by specifying xml:space="default".
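By way of illustration, one way to express this in XML Schema is to import the schema for the XML namespace and reference the xml:space attribute with a default value. The placement shown here on the interpretation type is a sketch, not part of the normative schema above:

<!-- Sketch only: declaring xml:space with a default of "default".
     The import and the choice of type to attach it to are illustrative. -->
<xs:import namespace="http://www.w3.org/XML/1998/namespace"
           schemaLocation="http://www.w3.org/2001/xml.xsd"/>

<xs:complexType name="interpretation">
  <!-- ... content model and attributes as defined above ... -->
  <xs:attribute ref="xml:space" default="default"/>
</xs:complexType>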
(TBD)
(This section is informative)
Normative References
Informative References
One way to build an EMMA representation of a spoken input such as "zoom in here" is to use grammar rules written in the W3C Speech Recognition Grammar Specification (SRGS), with Semantic Interpretation (SI) tags building the application semantics, including the emma:hook attribute. In this approach, ECMAScript is used to build up an object representing the semantics; the resulting ECMAScript object is then translated to XML.
For our example of "zoom in here", the following SRGS rule could be used.
<rule id="zoom">
  zoom in here
  <tag>
    $.command = new Object();
    $.command.action = "zoom";
    $.command.location = new Object();
    $.command.location._attributes = new Object();
    $.command.location._attributes.hook = new Object();
    $.command.location._attributes.hook._namespace = "emma";
    $.command.location._attributes.hook._value = "ink";
    $.command.location.type = "area";
  </tag>
</rule>
Application of this rule results in the following ECMAScript object being built:
command: {
  action: "zoom",
  location: {
    _attributes: {
      hook: {
        _namespace: "emma",
        _value: "ink"
      }
    },
    type: "area"
  }
}
SI processing in an XML environment would serialize this object as XML, mapping the _attributes, _namespace and _value properties onto the emma:hook attribute, and generate the following document:
<command>
  <action>zoom</action>
  <location emma:hook="ink">
    <type>area</type>
  </location>
</command>
This XML fragment might then appear within an EMMA document as follows:
<emma:interpretation emma:mode="speech"> <command> <action>zoom</action> <location emma:hook="ink"> <type>area</type> </location> </command> </emma:interpretation>
The emma:hook annotation indicates that this speech input needs to be combined with ink input such as the following:
<emma:interpretation emma:mode="ink"> <location> <type>area</type> <points>42.1345 -37.128 42.1346 -37.120 ... </points> </location> </emma:interpretation>
Combining these two inputs results in the following EMMA document for the multimodal speech and ink input.
<emma:interpretation emma:mode="multimodal"> <command> <action>zoom</action> <location> <type>area</type> <points>42.1345 -37.128 42.1346 -37.120 ... </points> </location> </command> </emma:interpretation>
New EMMA elements and attributes are added in Sections 3.1.9.2, 3.1.9.3, 3.1.10, 3.1.11, 3.2.14 and 3.2.14.1. New elements are added to the scope of the attributes described in Section 3.2.10.
The editors would like to recognize the contributions of the following members of the W3C Multimodal Interaction Working Group (listed in alphabetical order):
Paolo Baggia, Loquendo
Daniel Burnett, Nuance Communications
Max Froumentin, W3C
Katriina Halonen, Nokia
Gerald McCobb, IBM
Stephen Potter, Microsoft
Yuan Shao, Canon