Copyright © 2009 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The EMMA: Extensible MultiModal Annotation specification defines an XML markup language for capturing and providing metadata on the interpretation of inputs to multimodal systems. Throughout the implementation report process and discussion since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged. These include the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs, and interactions with multiple users. In this document, we describe these use cases and illustrate how the EMMA language could be extended to support them.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is a W3C Working Group Note published on 15 December 2009. This is the first publication of this document and it represents the views of the W3C Multimodal Interaction Working Group at the time of publication. The document may be updated as new technologies emerge or mature. Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document is one of a series produced by the Multimodal Interaction Working Group, part of the W3C Multimodal Interaction Activity. Since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged, e.g., the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs and interactions with multiple users. Therefore the Working Group has been working on a document capturing use cases and issues for a series of possible extensions to EMMA. The intention of publishing this Working Group Note is to seek feedback on the various use cases.
Comments on this document can be sent to www-multimodal@w3.org, the public forum for discussion of the W3C's work on Multimodal Interaction. To subscribe, send an email to www-multimodal-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document presents a set of use cases for possible new features of the Extensible MultiModal Annotation (EMMA) markup language. EMMA 1.0 was designed primarily to be used as a data interchange format by systems that provide semantic interpretations for a variety of inputs, including but not necessarily limited to, speech, natural language text, GUI and ink input. EMMA 1.0 provides a set of elements for containing the various stages of processing of a user's input and a set of elements and attributes for specifying various kinds of metadata such as confidence scores and timestamps. EMMA 1.0 became a W3C Recommendation on February 10, 2009.
A number of possible extensions to EMMA 1.0 have been identified through discussions with other standards organizations, implementers of EMMA, and internal discussions within the W3C Multimodal Interaction Working Group. This document focusses on the following use cases: incremental and streaming results, biometric results, emotion annotation, semantic enrichment of interpretations, representation of multimodal system output, representation of dialogs, logging and human annotation, interpretation of multi-sentence inputs, interactions with multiple participants, sensor data such as GPS, representation of search and database results, and alternative semantic representation formats such as JSON.
It may be possible to achieve support for some of these features
without modifying the language, through the use of the
extensibility mechanisms of EMMA 1.0, such as the
<emma:info>
element and application-specific
semantics; however, this would significantly reduce
interoperability among EMMA implementations. If features are of
general value then it would be beneficial to define standard ways
of implementing them within the EMMA language. Additionally,
extensions may be needed to support additional new kinds of input
modalities such as multi-touch and accelerometer input.
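For illustration, the following non-normative sketch shows how such application-specific data might be carried today within <emma:info>; the app namespace and the <app:streamStatus> element are hypothetical, and relying on them would not be interoperable across implementations:

  <emma:emma version="1.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns:app="http://www.example.com/app">
    <emma:interpretation id="int1"
        emma:medium="acoustic" emma:mode="voice"
        emma:tokens="hi joe">
      <!-- application-specific metadata carried via the EMMA 1.0
           extensibility element rather than standard annotations -->
      <emma:info>
        <app:streamStatus position="1" progress="begin"/>
      </emma:info>
      <emma:literal>hi joe</emma:literal>
    </emma:interpretation>
  </emma:emma>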
The W3C Membership and other interested parties are invited to review this document and send comments to the Working Group's public mailing list www-multimodal@w3.org (archive).
In EMMA 1.0, EMMA documents were assumed to be created for completed inputs within a given modality. However, there are important use cases where it would be beneficial to represent some level of interpretation of partial results before the input is complete. For example, in a dictation application, where inputs can be lengthy, it is often desirable to show partial results to give feedback to the user while they are speaking. In this case, each new word is appended to the previous sequence of words. Another use case would be incremental ASR, either for dictation or dialog applications, where previous results might be replaced as more evidence is collected. As more words are recognized and provide more context, earlier word hypotheses may be updated. In this scenario it may be necessary to replace the previous hypothesis with a revised one.
In this section, we discuss how the EMMA standard could be extended to support incremental or streaming results in the processing of a single input. Some key considerations and areas for discussion are: how should the chunks of a stream be identified, and is the existing emma:source attribute sufficient? Subsequent messages (carrying information for a particular stream) may need to have the same identifier.
In the example below for dictation, we show how three new
attributes emma:streamId
,
emma:streamSeqNr
, and emma:streamProgress
could be used to annotate each result with metadata regarding its
position and status within a stream of input. In this example, the
emma:streamId
is an identifier which can be used to
show that different emma:interpretation
elements are
members of the same stream. The emma:streamSeqNr
attribute provides a numerical order to elements in the stream
while emma:streamProgress
indicates the start of the
stream (and whether to expect more interpretations within the same
stream), and the end of the stream. This is an instance of the
'append' scenario for partial results in EMMA.
Participant | Input | EMMA |
User | Hi Joe the meeting has moved |
<emma:emma > version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice" emma:function="transcription" emma:confidence="0.75" emma:tokens="Hi Joe the meeting has moved" emma:streamId="id1" emma:streamSeqNr="0" emma:streamProgress="begin"> <emma:literal> Hi Joe the meeting has moved </emma:literal> </emma:interpretation> </emma:emma> |
User | to friday at four |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int2" emma:medium="acoustic" emma:mode="voice" emma:function="transcription" emma:confidence="0.75" emma:tokens="to friday at four" emma:streamId="id1" emma:streamSeqNr="1" emma:streamProgress="end"> <emma:literal> to friday at four </emma:literal> </emma:interpretation> </emma:emma> |
In the example below, a speech recognition hypothesis for the
whole string is updated once more words have been recognized. This
is an instance of the 'replace' scenario for partial results in
EMMA. Note that the emma:streamSeqNr
is the same for
each interpretation in this case.
Participant | Input | EMMA |
User | Is there a Pisa |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int1" emma:medium="acoustic" emma:mode="voice" emma:function="dialog" emma:confidence="0.7" emma:tokens="is there a pisa" emma:streamId="id2" emma:streamSeqNr="0" emma:streamProgress="begin"> <emma:literal> is there a pisa </emma:literal> </emma:interpretation> </emma:emma> |
User | Is there a pizza restaurant |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int2" emma:medium="acoustic" emma:mode="voice" emma:function="dialog" emma:confidence="0.9" emma:tokens="is there a pizza restaurant" emma:streamId="id2" emma:streamSeqNr="0" emma:streamProgress="end"> <emma:literal> is there a pizza restaurant </emma:literal> </emma:interpretation> </emma:emma> |
One issue for the 'replace' case of incremental results is how to specify that a result replaces several of the previously received results. For example, a system could receive partial results consisting of each word in turn of an utterance, and then a final result which is the final recognition for the whole sequence of words. One approach to this problem would be to allow emma:streamSeqNr to specify a range of inputs to be replaced. For example, if the emma:streamSeqNr for each of three single-word results were 1, 2, and 3, a final revised result could be marked as emma:streamSeqNr="1-3", indicating that it is a revised result for those three words.
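As a rough, non-normative sketch of this possibility (emma:streamId, emma:streamSeqNr, and emma:streamProgress are the proposed extensions under discussion, and the range value "1-3" is hypothetical), a revised result replacing three earlier single-word results might look as follows:

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <!-- hypothetical: the range "1-3" replaces the results previously
         sent with emma:streamSeqNr 1, 2, and 3 -->
    <emma:interpretation id="int4"
        emma:medium="acoustic" emma:mode="voice"
        emma:function="transcription"
        emma:confidence="0.9"
        emma:tokens="is there a pizza restaurant"
        emma:streamId="id3"
        emma:streamSeqNr="1-3"
        emma:streamProgress="end">
      <emma:literal>is there a pizza restaurant</emma:literal>
    </emma:interpretation>
  </emma:emma>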
One issue is whether timestamps might be used to track ordering
instead of introducing new attributes. One problem is that
timestamp attributes are not required and may not always be
available. Also, as shown in the example, chunks of input in a stream may not always be in sequential order. Even with timestamps providing an order, some kind of 'begin' and 'end' flag (like emma:streamProgress) is needed to indicate the beginning and end of transmission of streamed input. Moreover, timestamps do not provide sufficient information to detect whether a message has been lost.
Another possibility to explore for representation of incremental
results would be to use an <emma:sequence>
element containing the interim results and a derived result which
contains the combination.
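A rough sketch of this alternative, reusing the dictation example above (placing the interim results in an <emma:sequence> inside <emma:derivation> is only one possible arrangement, not a defined convention):

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <!-- combined result derived from the sequence of interim results -->
    <emma:interpretation id="combined1"
        emma:tokens="Hi Joe the meeting has moved to friday at four">
      <emma:literal>Hi Joe the meeting has moved to friday at four</emma:literal>
      <emma:derived-from resource="#seq1" composite="false"/>
    </emma:interpretation>
    <emma:derivation>
      <emma:sequence id="seq1">
        <emma:interpretation id="int1" emma:tokens="Hi Joe the meeting has moved">
          <emma:literal>Hi Joe the meeting has moved</emma:literal>
        </emma:interpretation>
        <emma:interpretation id="int2" emma:tokens="to friday at four">
          <emma:literal>to friday at four</emma:literal>
        </emma:interpretation>
      </emma:sequence>
    </emma:derivation>
  </emma:emma>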
Another issue to explore is the relationship between incremental results and the MMI lifecycle events within the MMI Architecture.
Biometric technologies include systems designed to identify
someone or verify a claim of identity based on their physical or
behavioral characteristics. These include speaker verification,
speaker identification, face recognition, and iris recognition,
among others. EMMA 1.0
provided some capability for representing the results of biometric
analysis through values of the emma:function
attribute
such as "verification". However, it did not discuss the specifics
of this use case in any detail. It may be worth exploring further
considerations and consequences of using EMMA to represent
biometric results. As one example, if different biometric results
are represented in EMMA, this would simplify the process of fusing
the outputs of multiple biometric technologies to obtain a more
reliable overall result. It should also make it easier to
take into account non-biometric claims of identity, such as a
statement like "this is Kazuyuki", represented in EMMA, along with
a speaker verification result based on the speaker's voice, which
would also be represented in EMMA. In the following example, we
have extended the set of values for emma:function
to
include "identification" for an interpretation showing the results
of a biometric component that picks out an individual from a set of
possible individuals (who are they). This contrasts with
"verification" which is used for verification of a particular user
(are they who they say they are).
Participant | Input | EMMA |
user | an image of a face |
<emma:emma> version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" <emma:interpretation id=“int1" emma:confidence="0.75” emma:medium="visual" emma:mode="photograph" emma:verbal="false" emma:function="identification"> <person>12345</person> <name>Mary Smith</name> </emma:interpretation> </emma:emma> |
One direction to explore further is the relationship between work on messaging protocols for biometrics within the OASIS Biometric Identity Assurance Services (BIAS) standards committee and EMMA.
In addition to speech recognition, and other tasks such as speaker verification and identification, another kind of interpretation of speech that is of increasing importance is determination of the emotional state of the speaker, based on, for example, their prosody, lexical choice, or other features. This information can be used, for example, to make the dialog logic of an interactive system sensitive to the user's emotional state. Emotion detection can also use other modalities such as vision (facial expression, posture) and physiological sensors such as skin conductance measurement or blood pressure. Multimodal approaches where evidence is combined from multiple different modalities are also of significance for emotion classification.
The creation of a markup language for emotion has been a recent focus of attention in W3C. Work that began in the W3C Emotion Markup Language Incubator Group (EmotionML XG) has now transitioned to the W3C Multimodal Interaction Working Group, and the EmotionML language has been published as a Working Draft. One of the major use cases for that effort is: "Automatic recognition of emotions from sensors, including physiological sensors, speech recordings, facial expressions, etc., as well as from multi-modal combinations of sensors."
Given the similarities to the technologies and annotations used for other kinds of input processing (recognition, semantic classification) which are now captured in EMMA, it makes sense to explore the use of EMMA for capturing the emotional classification of inputs. However, just as EMMA does not standardize the application markup for semantic results, it does not make sense to try to standardize emotion markup within EMMA. One promising approach is to combine the containers and metadata annotation of EMMA with EmotionML markup, as shown in the following example.
Participant | Input | EMMA |
user | expression of boredom |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" xmlns:emo="http://www.w3.org/2009/10/emotionml"> <emma:interpretation id="emo1" emma:start="1241035886246" emma:end="1241035888246" emma:medium="acoustic" emma:mode="voice" emma:verbal="false" emma:signal="http://example.com/input345.amr" emma:media-type="audio/amr; rate:8000;" emma:process="engine:type=emo_class&vn=1.2”> <emo:emotion> <emo:intensity value="0.1" confidence="0.8"/> <emo:category set="everydayEmotions" name="boredom" confidence="0.1"/> </emo:emotion> </emma:interpretation> </emma:emma> |
In this example, we use the capabilities of EMMA for describing the input signal, its temporal characteristics, modality, sampling rate, audio codec, etc., while EmotionML is used to provide the specific representation of the emotion. Other EMMA container
elements also have strong use cases for emotion recognition. For
example, <emma:one-of>
can be used to represent
N-best lists of competing classifications of emotion. The
<emma:group>
element could be used to combine a
semantic interpretation of a user input with an emotional
classification, as illustrated in the following example. Note that
all of the general properties of the signal can be specified on the
<emma:group>
element.
Participant | Input | EMMA |
user | spoken input "flights to boston tomorrow" to dialog system in angry voice |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" xmlns:emo="http://www.w3.org/2009/10/emotionml"> <emma:group id="result1" emma:start="1241035886246" emma:end="1241035888246" emma:medium="acoustic" emma:mode="voice" emma:verbal="false" emma:signal="http://example.com/input345.amr" emma:media-type="audio/amr; rate:8000;"> <emma:interpretation id="asr1" emma:tokens="flights to boston tomorrow" emma:confidence="0.76" emma:process="engine:type=asr_nl&vn=5.2”> <flight> <dest>boston</dest> <date>tomorrow</date> </flight> </emma:interpretation> <emma:interpretation id="emo1" emma:process="engine:type=emo_class&vn=1.2”> <emo:emotion> <emo:intensity value="0.3" confidence="0.8"/> <emo:category set="everydayEmotions" name="anger" confidence="0.8"/> </emo:emotion> </emma:interpretation> <emma:group-info> meaning_and_emotion </emma:group-info> </emma:group> </emma:emma> |
The element <emma:group>
can also be used to
capture groups of emotion detection results from individual
modalities for combination by a multimodal fusion component or when
automatic recognition results are described together with manually
annotated data. This use case is inspired by
Use case 2b (II) of the Emotion Incubator Group Report. The
following example illustrates the grouping of three
interpretations, namely: a speech analysis emotion classifier, a
physiological emotion classifier measuring blood pressure, and a
human annotator viewing video, for two different media files (from
the same episode) that are synchronized via emma:start
and emma:end
attributes. In this case, the
physiological reading is for a subinterval of the video and audio
recording.
Participant | Input | EMMA |
user | audio, video, and physiological sensor of a test user acting with a new design. |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" xmlns:emo="http://www.w3.org/2009/10/emotionml"> <emma:group id="result1"> <emma:interpretation id="speechClassification1" emma:medium="acoustic" emma:mode="voice" emma:verbal="false" emma:start="1241035884246" emma:end="1241035887246" emma:signal="http://example.com/video_345.mov" emma:process="engine:type=emo_voice_classifier”> <emo:emotion> <emo:category set="everydayEmotions" name="anger" confidence="0.8"/> </emo:emotion> </emma:interpretation> <emma:interpretation id="bloodPressure1" emma:medium="tactile" emma:mode="blood_pressure" emma:verbal="false" emma:start="1241035885300" emma:end="1241035886900" emma:signal="http://example.com/bp_signal_345.cvs" emma:process="engine:type=emo_physiological_classifier”> <emo:emotion> <emo:category set="everydayEmotions" name="anger" confidence="0.6"/> </emo:emotion> </emma:interpretation> <emma:interpretation id="humanAnnotation1" emma:medium="visual" emma:mode="video" emma:verbal="false" emma:start="1241035884246" emma:end="1241035887246" emma:signal="http://example.com/video_345.mov" emma:process="human:type=labeler&id=1”> <emo:emotion> <emo:category set="everydayEmotions" name="fear" confidence="0.6"/> </emo:emotion> </emma:interpretation> <emma:group-info> several_emotion_interpretations </emma:group-info> </emma:group> </emma:emma> |
A combination of <emma:group>
and
<emma:derivation>
could be used to represent a
combined emotional analysis resulting from analysis of multiple
different modalities of the user's behavior. The
<emma:derived-from>
and
<emma:derivation>
elements can be used to
capture both the fused result and combining inputs in a single EMMA
document. In the following example, visual analysis of user
activity and analysis of their speech have been combined by a
multimodal fusion component to provide a combined multimodal
classification of the user's emotional state. The specifics of the
multimodal fusion algorithm are not relevant here, or to EMMA in
general. Note though that in this case, the multimodal fusion
appears to have compensated for uncertainty in the visual analysis
which gave two results with equal confidence, one for fear and one
for anger. The emma:one-of
element is used to capture
the N-best list of multiple competing results from the video
classifier.
Participant | Input | EMMA |
user | multimodal fusion of emotion classification of user based on analysis of voice and video |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" xmlns:emo="http://www.w3.org/2009/10/emotionml"> <emma:interpretation id="multimodalClassification1" emma:medium="acoustic,visual" emma:mode="voice,video" emma:verbal="false" emma:start="1241035884246" emma:end="1241035887246" emma:process="engine:type=multimodal_fusion”> <emo:emotion> <emo:category set="everydayEmotions" name="anger" confidence="0.7"/> </emo:emotion> <emma:derived-from ref="mmgroup1" composite="true"/> </emma:interpretation> <emma:derivation> <emma:group id="mmgroup1"> <emma:interpretation id="speechClassification1" emma:medium="acoustic" emma:mode="voice" emma:verbal="false" emma:start="1241035884246" emma:end="1241035887246" emma:signal="http://example.com/video_345.mov" emma:process="engine:type=emo_voice_classifier”> <emo:emotion> <emo:category set="everydayEmotions" name="anger" confidence="0.8"/> </emo:emotion> </emma:interpretation> <emma:one-of id="video_nbest" emma:medium="visual" emma:mode="video" emma:verbal="false" emma:start="1241035884246" emma:end="1241035887246" emma:signal="http://example.com/video_345.mov" emma:process="engine:type=video_classifier"> <emma:interpretation id="video_result1" <emo:emotion> <emo:category set="everydayEmotions" name="anger" confidence="0.5"/> </emo:emotion> </emma:interpretation> <emma:interpretation id="video_result2" <emo:emotion> <emo:category set="everydayEmotions" name="fear" confidence="0.5"/> </emo:emotion> </emma:interpretation> </emma:one-of> <emma:group-info> emotion_interpretations </emma:group-info> </emma:group> </emma:derivation> </emma:emma> |
One issue which needs to be addressed is the relationship between EmotionML confidence attribute values and emma:confidence values. Could the emma:confidence value be used as an overall confidence value for the emotion result, or should confidence values appear only within the EmotionML markup, since confidence is used for different dimensions of the result? If a series of possible emotion classifications are contained in emma:one-of, should they be ordered by the EmotionML confidence values?
Enriching the semantic information represented in EMMA would be helpful for certain use cases. For example, the concepts in an EMMA application semantics representation might include references to concepts in an ontology such as WordNet. In the following example, inputs to a machine translation system are annotated in the application semantics with specific WordNet senses which are used to distinguish among different senses of the words. A translation system might make use of a sense disambiguator to represent the probabilities of different senses of a word; for example, "spicy" in the example has two possible WordNet senses.
Participant | Input | EMMA |
user | I love to eat Mexican food because it is spicy |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example" xmlns="http://example.com/universal_translator"> <emma:interpretation id="spanish"> <result xml:lang="es"> Adoro alimento mejicano porque es picante. </result> <emma:derived-from resource="#english" composite="false"/> </emma:interpretation> <emma:derivation> <emma:interpretation id="english" emma:tokens="I love to eat Mexican food because it is spicy"> <assertion> <interaction wordnet="1828736" wordnet-desc="love, enjoy (get pleasure from)" token="love"> <experiencer reference="first" token="I"> <attribute quantity="single"/> </experiencer> <attribute time="present"/> <content> <interaction wordnet="1157345" wordnet-desc="eat (take in solid food)" token="to eat"> <object id="obj1" wordnet="7555863" wordnet-desc="food, solid food (any solid substance (as opposed to liquid) that is used as a source of nourishment)" token="food"> <restriction wordnet="3026902" wordnet-desc="Mexican (of or relating to Mexico or its inhabitants)" token="Mexican"/> </object> </interaction> </content> <reason token="because"> <experiencer reference="third" target="obj1" token="it"/> <attribute time="present"/> <one-of token="spicy"> <modification wordnet="2397732" wordnet-desc="hot, spicy (producing a burning sensation on the taste nerves)" confidence="0.8"/> <modification wordnet="2398378" wordnet-desc="piquant, savory, savoury, spicy, zesty (having an agreeably pungent taste)" confidence="0.4"/> </one-of> </reason> </interaction> </assertion> </emma:interpretation> </emma:derivation> </emma:emma> |
In addition to sense disambiguation it could also be useful to relate concepts to superordinate concepts in some ontology. For example, it could be useful to know that O'Hare is an airport and Chicago is a city, even though they might be used interchangeably in an application. For example, in an air travel application a user might say "I want to fly to O'Hare" or "I want to fly to Chicago".
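As an informal sketch (the concept and superordinate attributes, and the identifiers they carry, are hypothetical application markup, not proposed EMMA annotations), such ontology references might look as follows:

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <emma:interpretation id="int1"
        emma:tokens="I want to fly to O'Hare">
      <!-- hypothetical application markup: "concept" identifies the
           recognized entity and "superordinate" relates it to a
           broader concept in the ontology -->
      <destination concept="airport:ORD" superordinate="city:Chicago">
        O'Hare
      </destination>
    </emma:interpretation>
  </emma:emma>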
EMMA 1.0 was explicitly limited in scope to representation of the interpretation of user inputs. Most interactive systems also produce system output and one of the major possible extensions of the EMMA language would be to provide support for representation of the outputs made by the system in addition to the user inputs. One advantage of having EMMA representation for system output is that system logs can have unified markup representation across input and output for viewing and analyzing user/system interactions. In this section, we consider two different use cases for addition of output representation to EMMA.
It is desirable for a multimodal dialog designer to be able to isolate dialog flow (for example SCXML code) from the details of specific utterances produced by a system. This can be achieved by using a presentation or media planning component that takes the abstract intent from the system and creates one or more modality-specific presentations. In addition to isolating dialog logic from specific modality choice, this can also make it easier to support different technologies for the same modality. For example, in the example below, the GUI technology is HTML, but abstracting output would also support using a different GUI technology such as Flash or SVG. If EMMA is extended to support output, then EMMA documents could be used for communication from the dialog manager to the presentation planning component, and also potentially for the documents generated by the presentation component, which could embed specific markup such as HTML and SSML. Just as there can be multiple different stages of processing of a user input, there may be multiple stages of processing of an output, and the mechanisms of EMMA can be used to capture and provide metadata on these various stages of output processing.
Potential benefits of this approach include the isolation of dialog flow from modality-specific presentation details, easier support for multiple rendering technologies within the same modality, and a unified EMMA representation for communication between the dialog manager and presentation components.
In the following example, we consider the introduction of a new
EMMA element, <emma:presentation>
which is the
output equivalent of the input element
<emma:interpretation>
. Like
<emma:interpretation>
this element can take
emma:medium
and emma:mode
attributes
classifying the specific modality. It could also potentially take
timestamp annotations indicating the time at which the output
should be produced. One issue is whether timestamps should be used
for the intended time of production or for the actual time of
production and how to capture both. Relative timestamps could be
used to anchor the planned time of presentation to another element
of system output. In this example we show how the
emma:semantic-rep
attribute proposed in Section 2.12 could potentially be used to indicate the
markup language of the output.
Participant | Output | EMMA |
IM (step 1) | semantics of "what would you like for lunch?" |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation> <question> <topic>lunch</topic> <experiencer>second person</experiencer> <object>questioned</object> </question> </emma:presentation> </emma:emma> or, more simply, without natural language generation: <emma:emma> <emma:presentation> <text>what would you like for lunch?</text> </emma:presentation> </emma:emma> |
presentation manager (voice output) | text "what would you like for lunch?" |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation emma:medium="acoustic" emma:mode="voice" emma:verbal="true" emma:function="dialog" emma:semantic-rep="ssml"> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> what would you like for lunch</speak> </emma:presentation> </emma:emma> |
presentation manager (GUI output) | text "what would you like for lunch?" |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation emma:medium="visual" emma:mode="graphics" emma:verbal="true" emma:function="dialog" emma:semantic-rep="html"> <html> <body> <p>what would you like for lunch?"</p> <input name="" type="text"> <input type="submit" name="Submit" value="Submit"> </body> </html> </emma:presentation> </emma:emma> |
A critical issue for effective multimodal output is the synchronization of outputs in different output media. For example, text-to-speech output or prompts may be
coordinated with graphical outputs such as highlighting of items in
an HTML table. EMMA markup could potentially be used to indicate
that elements in each medium should be coordinated in their
presentation. In the following example, a new attribute
emma:sync
is used to indicate the relationship between
a <mark>
in SSML and an element to
be highlighted in HTML content. The emma:process
attribute could be used to identify the presentation planning
component. Again emma:semantic-rep
is used to indicate
the embedded markup language.
Participant | Output | EMMA |
system | Coordinated presentation of table with TTS |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:group id=“gp1" emma:medium="acoustic,visual" emma:mode="voice,graphics" emma:process="http://example.com/presentation_planner"> <emma:presentation id=“pres1" emma:medium="acoustic" emma:mode="voice" emma:verbal="true" emma:function="dialog" emma:semantic-rep="ssml"> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Item 4 <mark emma:sync="123"/> costs fifteen dollars. </speak> </emma:presentation> <emma:presentation id=“pres2" emma:medium="visual" emma:mode="graphics" emma:verbal="true" emma:function="dialog" emma:semantic-rep="html" <table xmlns="http://www.w3.org/1999/xhtml"> <tr> <td emma:sync="123">Item 4</td> <td>15 dollars</td> </tr> </table> </emma:presentation> </emma:group> </emma:emma> |
One issue to be considered is the potential role of the Synchronized Multimedia Integration Language (SMIL) for capturing multimodal output synchronization. SMIL markup for multimedia presentation could potentially be embedded within EMMA markup coming from an interaction manager to a client for rendering.
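As one possible non-normative sketch, a SMIL fragment might be embedded within <emma:presentation> in the same way that SSML and HTML are embedded in the examples above; the use of "smil" as a value of emma:semantic-rep and the media URLs shown are assumptions for illustration only:

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <emma:presentation id="pres1"
        emma:medium="acoustic,visual" emma:mode="voice,graphics"
        emma:semantic-rep="smil">
      <!-- minimal SMIL body: play the prompt audio and show the
           rendered table image in parallel -->
      <smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
        <body>
          <par>
            <audio src="http://example.com/prompts/item4.wav"/>
            <img src="http://example.com/render/table4.png"/>
          </par>
        </body>
      </smil>
    </emma:presentation>
  </emma:emma>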
The scope of EMMA 1.0
was explicitly limited to representation of single turns of user
input. For logging, analysis, and training purposes it could be
useful to be able to represent multi-stage dialogs in EMMA. The
following example shows a sequence of two EMMA documents where the first is a request from the system and the second is the user
response. A new attribute emma:in-response-to
is used
to relate the system output to the user input. EMMA already has an
attribute emma:dialog-turn
used to provide an
indicator of the turn of interaction.
Participant | Input | EMMA |
system | where would you like to go? |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation id="pres1" emma:dialog-turn="turn1" emma:in-response-to="initial"> <prompt> where would you like to go? </prompt> </emma:presentation> </emma:emma> |
user | New York |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int1" emma:dialog-turn="turn2" emma:tokens="new york" emma:in-response-to="pres1"> <location> New York </location> </emma:interpretation> </emma:emma> |
In this case, each utterance is still a single EMMA document,
and markup is being used to encode the fact that the utterances are
part of an ongoing dialog. Another possibility would be to use EMMA
markup to contain a whole dialog within a single EMMA document. For
example, a flight query dialog could be represented as follows
using <emma:sequence>
:
Participant | Input | EMMA |
user | flights to boston |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:sequence> <emma:interpretation id="user1" emma:dialog-turn="turn1" emma:in-response-to="initial"> <emma:literal> flights to boston </emma:literal> </emma:interpretation> <emma:presentation id="sys1" emma:dialog-turn="turn2" emma:in-response-to="user1"> <prompt> traveling to boston, which departure city </prompt> </emma:presentation> <emma:interpretation id="user2" emma:dialog-turn="turn3" emma:in-response-to="sys1"> <emma:literal> san francisco </emma:literal> </emma:interpretation> <emma:presentation id="sys2" emma:dialog-turn="turn4" emma:in-response-to="user2"> <prompt> departure date </prompt> </emma:presentation> <emma:interpretation id="user3" emma:dialog-turn="turn5" emma:in-response-to="sys2"> <emma:literal> next thursday </emma:literal> </emma:interpretation> </emma:sequence> </emma:emma> |
system | traveling to Boston, which departure city? | |
user | San Francisco | |
system | departure date | |
user | next thursday |
Note that in this example with
<emma:sequence>
the
emma:in-response-to
attribute is still important since
there is no guarantee that an utterance in a dialog is a response
to the previous utterance. For example, a sequence of utterances
may all be from the user.
One issue that arises with the representation of whole dialogs
is that the resulting EMMA documents with full sets of metadata may
become quite large. One possible extension that could help with
this would be to allow the value of emma:in-response-to
to be URI valued so it can refer to another EMMA document.
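A rough sketch of this option (URI values for emma:in-response-to are not currently defined, and the URI shown is purely illustrative):

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <!-- the hypothetical URI value points to the separate EMMA
         document containing the system prompt being answered -->
    <emma:interpretation id="int1"
        emma:dialog-turn="turn2"
        emma:tokens="new york"
        emma:in-response-to="http://example.com/log/session42/turn1.emma#pres1">
      <location>New York</location>
    </emma:interpretation>
  </emma:emma>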
EMMA was initially designed to facilitate communication among
components of an interactive system. It has become clear over time
that the language can also play an important role in logging of
user/system interactions. In this section, we consider possible
advantages of EMMA for log analysis and illustrate how elements
such as <emma:derived-from>
could be used to
capture and provide metadata on annotations made by human
annotators.
The proposal above for representing system output in EMMA would support after-the-fact analysis of dialogs. For example, if both the system's and the user's utterances are represented in EMMA, it should be much easier to examine relationships between factors such as how the wording of prompts might affect users' responses or even the modality that users select for their responses. It would also
be easier to study timing relationships between the system prompt
and the user's responses. For example, prompts that are confusing
might consistently elicit longer times before the user starts
speaking. This would be useful even without a presentation manager
or fission component. In the following example, it might be useful
to look into the relationship between the end of the prompt and the
start of the user's response. We use here the
emma:in-response-to
attribute suggested in Section 2.6 for the representation of dialogs in
EMMA.
Participant | Input | EMMA |
system | where would you like to go? |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation id="pres1" emma:dialog-turn="turn1" emma:in-response-to="initial" emma:start="1241035886246" emma:end="1241035888306"> <prompt> where would you like to go? </prompt> </emma:presentation> </emma:emma> |
user | New York |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int1" emma:dialog-turn="turn2" emma:in-response-to="pres1" emma:start="1241035891246" emma:end="1241035893000""> <destination> New York </destination> </emma:interpretation> </emma:emma> |
EMMA is generally used to show the recognition, semantic
interpretation etc. assigned to inputs based on machine
processing of the user input. Another potential use case is to
provide a mechanism for showing the interpretation assigned to an
input by a human annotator and using
<emma:derived-from>
to show the relationship
between the input received and the annotation. The
<emma:one-of>
element can then be used to show
multiple competing annotations for an input. The
<emma:group>
element could be used to contain
multiple different kinds of annotation on a single input. One
question here is whether emma:process
can be used for
identification of the labeller, and whether there is a need for any
additional EMMA machinery to better support this use case. In
these examples, <emma:literal>
contains mixed
content with text and elements. This is in keeping with the EMMA
1.0 schema.
One issue that arises concerns the meaning of an
emma:confidence
value on an annotated interpretation.
It may be preferable to have another attribute for annotator
confidence rather than overloading the current
emma:confidence
.
Another issue concerns the mixing of system results and human annotation. Should these be grouped, or is the annotation derived from the system's interpretation? Also, it would be useful to capture the time of the annotation. The current timestamps are used for the time of the input itself. Where should annotation timestamps be recorded?
It would also be useful to have a way to specify open ended
information about the annotator such as their native language,
profession, experience, etc. One approach would be to have a
new attribute e.g. emma:annotator
with a URI value
that could point to a description of the annotator.
For very common annotations, it could be useful to have, in addition to emma:tokens, another dedicated element to indicate the annotated transcription, for example, emma:annotated-tokens or emma:transcription.
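The following non-normative sketch combines these ideas; emma:annotator and <emma:annotated-tokens> are hypothetical extensions, and the annotator description URI is illustrative only:

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <emma:interpretation id="annotation1"
        emma:annotator="http://example.com/annotators/michael.xml"
        emma:confidence="0.90">
      <!-- hypothetical dedicated element for the annotated transcription -->
      <emma:annotated-tokens>
        flights from san francisco to boston on the fourth of september
      </emma:annotated-tokens>
      <emma:literal>
        flights from <src>san francisco</src> to <dest>boston</dest>
        on <date>the fourth of september</date>
      </emma:literal>
    </emma:interpretation>
  </emma:emma>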
In the following example, we show how
emma:interpretation
and emma:derived-from
could be used to capture the annotation of an input.
Participant | Input | EMMA |
user |
In this example the user has said: "flights from san francisco to boston on the fourth of september" and the semantic interpretation here is a semantic tagging of the utterance done by a human annotator. emma:process is used to provide details about the annotation |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="annotation1" emma:process="annotate:type=semantic&annotator=michael" emma:confidence="0.90"> <emma:literal> flights from <src>san francisco</src> to <dest>boston</dest> on <date>the fourth of september</date> </emma:literal> <emma:derived-from resource="#asr1"/> </emma:interpretation> <emma:derivation> <emma:interpretation id="asr1" emma:medium="acoustic" emma:mode="voice" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:start="1241690021513" emma:end="1241690023033" emma:media-type="audio/amr; rate=8000" emma:process="smm:type=asr&version=watson6" emma:confidence="0.80"> <emma:literal> flights from san francisco to boston on the fourth of september </emma:literal> </emma:interpretation> </emma:derivation> </emma:emma> |
Taking this example a step further,
<emma:group>
could be used to group annotations
made by multiple different annotators of the same utterance:
Participant | Input | EMMA |
user |
In this example the user has said: "flights from san francisco to boston on the fourth of september" and the semantic interpretation here is a semantic tagging of
the utterance done by two different human annotators.
|
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:group emma:confidence="1.0"> <emma:interpretation id="annotation1" emma:process="annotate:type=semantic&annotator=michael" emma:confidence="0.90"> <emma:literal> flights from <src>san francisco</src> to <dest>boston</dest> on <date>the fourth of september</date> </emma:literal> <emma:derived-from resource="#asr1"/> </emma:interpretation> <emma:interpretation id="annotation2" emma:process="annotate:type=semantic&annotator=debbie" emma:confidence="0.90"> <emma:literal> flights from <src>san francisco</src> to <dest>boston</dest> on <date>the fourth of september</date> </emma:literal> <emma:derived-from resource="#asr1"/> </emma:interpretation> <emma:group-info>semantic_annotations</emma:group-info> </emma:group> <emma:derivation> <emma:interpretation id="asr1" emma:medium="acoustic" emma:mode="voice" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:start="1241690021513" emma:end="1241690023033" emma:media-type="audio/amr; rate=8000" emma:process="smm:type=asr&version=watson6" emma:confidence="0.80"> <emma:literal> flights from san francisco to boston on the fourth of september </emma:literal> </emma:interpretation> </emma:derivation> </emma:emma> |
For certain applications, it is useful to be able to represent the semantics of multi-sentence inputs, which may be in one or more modalities such as speech (e.g. voicemail), text (e.g. email), or handwritten input. One application use case is for summarizing a voicemail or email. We develop this example below.
There are at least two possible approaches to addressing this use case. The first is to place the full text of the input in the emma:tokens attribute of an <emma:interpretation> or <emma:one-of> element, where the semantics of the input is represented as the value of the <emma:interpretation>. Although in principle there is no upper limit on the length of an emma:tokens attribute, in practice, this approach might be cumbersome for longer or more complicated texts. The second is to represent the individual parts of the input (for example, its sentences) as separate <emma:interpretation> elements under an <emma:sequence> element. A single unified semantics representing the meaning of the entire input could then be represented with the sequence as the value of <emma:derived-from>. The example below illustrates the first approach; a sketch of the second approach follows it.
.Participant | Input | EMMA |
user |
Hi Group, You are all invited to lunch tomorrow at Tony's Pizza at 12:00. Please let me know if you're planning to come so that I can make reservations. Also let me know if you have any dietary restrictions. Tony's Pizza is at 1234 Main Street. We will be discussing ways of using EMMA. Debbie |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation emma:tokens="Hi Group, You are all invited to lunch tomorrow at Tony's Pizza at 12:00. Please let me know if you're planning to come so that I can make reservations. Also let me know if you have any dietary restrictions. Tony's Pizza is at 1234 Main Street. We will be discussing ways of using EMMA." > <business-event>lunch</business-event> <host>debbie</host> <attendees>group</attendees> <location> <name>Tony's Pizza</name> <address> 1234 Main Street</address> </location> <date> tuesday, March 24</date> <needs-rsvp>true</needs-rsvp> <needs-restrictions>true</need-restrictions> <topic>ways of using EMMA</topic> </emma:interpretation> </emma:emma> |
EMMA 1.0 primarily
focussed on the interpretation of inputs from a single user. Both
for annotation of human-human dialogs and for the emerging systems
which support dialog or multimodal interaction with multiple
participants (such as multimodal systems for meeting analysis), it
is important to support annotation of interactions involving
multiple different participants. The proposals above for capturing
dialog can play an important role. One possible further extension
would be to add specific markup for annotation of the user making a
particular contribution. In the following example, we use an
attribute emma:participant
to identify the participant
contributing each response to the prompt.
Participant | Input | EMMA |
system | Please tell me your lunch orders |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:presentation id="pres1" emma:dialog-turn="turn1" emma:in-response-to="initial" emma:start="1241035886246" emma:end="1241035888306"> <prompt>please tell me your lunch orders</prompt> </emma:presentation> </emma:emma> |
user1 | I'll have a mushroom pizza |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int1" emma:dialog-turn="turn2" emma:in-response-to="pres1" emma:participant="user1" emma:start="1241035891246" emma:end="1241035893000""> <pizza> <topping> mushroom </topping> </pizza> </emma:interpretation> </emma:emma> |
user3 | I'll have a pepperoni pizza. |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id="int2" emma:dialog-turn="turn3" emma:in-response-to="pres1" emma:participant="user2" emma:start="1241035896246" emma:end="1241035899000""> <pizza> <topping> pepperoni </topping> </pizza> </emma:interpretation> </emma:emma> |
The multimodal examples described in the EMMA 1.0 specification include the combination of spoken input with a location specified by touch or pen. With the increase in availability of GPS and other location sensing technology such as cell tower triangulation in mobile devices, it is desirable to provide a method for annotating inputs with the device location and, in some cases, fusing the GPS information with the spoken command in order to derive a complete interpretation. GPS information could potentially be determined using the Geolocation API Specification from the Geolocation Working Group and then encoded into an EMMA result sent to a server for fusion.
One possibility using the current EMMA capabilities is to use
<emma:group>
to associate GPS markup with the
semantics of a spoken command. For example, the user might say
"where is the nearest pizza place?" and the interpretation of the
spoken command is grouped with markup capturing the GPS sensor
data. This example uses the existing
<emma:group>
element and extends the set of
values of emma:medium
and emma:mode
to
include "sensor"
and "gps"
respectively.
Participant | Input | EMMA |
user | where is the nearest pizza place? |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:group> <emma:interpretation emma:tokens="where is the nearest pizza place" emma:confidence="0.9" emma:medium="acoustic" emma:mode="voice" emma:start="1241035887111" emma:end="1241035888200" emma:process="reco:type=asr&version=asr_eng2.4" emma:media-type="audio/amr; rate=8000" emma:lang="en-US"> <category>pizza</category> </emma:interpretation> <emma:interpretation emma:medium="sensor" emma:mode="gps" emma:start="1241035886246" emma:end="1241035886246"> <lat>40.777463</lat> <lon>-74.410500</lon> <alt>0.2</alt> </emma:interpretation> <emma:group-info>geolocation</emma:group-info> </emma:group> </emma:emma> |
GPS | (GPS coordinates) |
Another, more abbreviated, way to incorporate sensor information
would be to have spatial correlates of the timestamps and allow for
location stamping of user inputs, e.g. emma:lat
and
emma:lon
attributes that could appear on EMMA
container elements to indicate the location where the input was
produced.
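A minimal sketch of this alternative (emma:lat and emma:lon are hypothetical attributes, not existing EMMA 1.0 annotations):

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <!-- hypothetical location stamping: the input was produced at
         this latitude/longitude -->
    <emma:interpretation id="int1"
        emma:tokens="where is the nearest pizza place"
        emma:medium="acoustic" emma:mode="voice"
        emma:start="1241035887111" emma:end="1241035888200"
        emma:lat="40.777463" emma:lon="-74.410500">
      <category>pizza</category>
    </emma:interpretation>
  </emma:emma>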
In many of the use cases considered so far, EMMA is used for
representation of the results of speech recognition and then for
the results of natural language understanding, and possibly
multimodal fusion. In systems used for voice search, the next step
is often to conduct search and extract a set of records or
documents. Strictly speaking, this stage of processing is out of
scope for EMMA. It is odd though to have the mechanisms of EMMA
such as <emma:one-of>
for ambiguity all the way
up to NLU or multimodal fusion, but not to have access to the same
apparatus for representation of the next stage of processing which
can often be search or database lookup. Just as we can use
<emma:one-of>
and emma:confidence
to represent N-best recognitions or semantic interpretations,
similarly we can use them to represent a series of search results
along with their relative confidence. One issue is whether we need some measure other than confidence for relevance ranking, or whether the same confidence attribute can be used.
One issue that arises is whether it would be useful to have some recommended or standardized element to use for query results, e.g. <result> as in the following example. Another issue is how to annotate information about the database and the query that was issued. The database could be indicated as part of the emma:process value, as in the following example. For web search, the query URL could be annotated on the result, e.g. <result url="http://cnn.com"/>. For database queries, the query (SQL, for example) could be annotated on the results or on the containing <emma:group>.
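As an informal sketch of these options (the <query> element within <emma:info> and the url attribute on <result> are illustrative only, not defined EMMA or application markup), the issued query and a source URL might be carried as follows:

  <emma:emma version="2.0"
      xmlns:emma="http://www.w3.org/2003/04/emma"
      xmlns="http://www.example.com/example">
    <emma:interpretation id="db_nbest1"
        emma:confidence="0.80"
        emma:tokens="john smith"
        emma:process="db:type=mysql">
      <!-- illustrative only: the issued SQL query carried in emma:info -->
      <emma:info>
        <query>SELECT name, room, number FROM directory WHERE name = 'John Smith'</query>
      </emma:info>
      <!-- for a web search the source could be annotated on the result
           instead, e.g. <result url="http://cnn.com"/> -->
      <result url="http://example.com/directory/12345">
        <name>John Smith</name>
        <room>dx513</room>
        <number>123-456-7890</number>
      </result>
    </emma:interpretation>
  </emma:emma>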
The following example shows the use of EMMA to represent the
results of database retrieval from an employee directory. The user
says "John Smith". After ASR, NLU, and then database look up, the
system returns the XML here which shows the N-best lists associated
with each of these three stages of processing. Here
<emma:derived-from> is used to indicate the
relations between each of the <emma:one-of>
elements. However, if you want to see which specific ASR result a
record is derived from, you would need to put
<emma:derived-from>
on the individual
elements.
Participant | Input | EMMA |
user | User says "John Smith" |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:one-of id="db_results1" emma:process="db:type=mysql&database=personel_060109.db> <emma:interpretation id="db_nbest1" emma:confidence="0.80" emma:tokens="john smith"> <result> <name>John Smith</name> <room>dx513</room> <number>123-456-7890>/number> </result> </emma:interpretation> <emma:interpretation id="db_nbest2" emma:confidence="0.70" emma:tokens="john smith"> <result> <name>John Smith</name> <room>ef312</room> <number>123-456-7891>/number> </result> </emma:interpretation> <emma:interpretation id="db_nbest3" emma:confidence="0.50" emma:tokens="jon smith"> <result> <name>Jon Smith</name> <room>dv900</room> <number>123-456-7892>/number> </result> </emma:interpretation> <emma:interpretation id="db_nbest4" emma:confidence="0.40" emma:tokens="joan smithe"> <result> <name>Joan Smithe</name> <room>lt567</room> <number>123-456-7893>/number> </result> </emma:interpretation> <emma:derived-from resource="#nlu_results1/> </emma:one-of> <emma:derivation> <emma:one-of id="nlu_results1" emma:process="smm:type=nlu&version=parser"> <emma:interpretation id="nlu_nbest1" emma:confidence="0.99" emma:tokens="john smith"> <fn>john</fn><ln>smith</ln> </emma:interpretation> <emma:interpretation id="nlu_nbest2" emma:confidence="0.97" emma:tokens="jon smith"> <fn>jon</fn><ln>smith</ln> </emma:interpretation> <emma:interpretation id="nlu_nbest3" emma:confidence="0.93" emma:tokens="joan smithe"> <fn>joan</fn><ln>smithe</ln> </emma:interpretation> <emma:derived-from resource="#asr_results1/> </emma:one-of> <emma:one-of id="asr_results1" emma:medium="acoustic" emma:mode="voice" emma:function="dialog" emma:verbal="true" emma:lang="en-US" emma:start="1241641821513" emma:end="1241641823033" emma:media-type="audio/amr; rate=8000" emma:process="smm:type=asr&version=watson6"> <emma:interpretation id="asr_nbest1" emma:confidence="1.00"> <emma:literal>john smith</emma:literal> </emma:interpretation> <emma:interpretation id="asr_nbest2" emma:confidence="0.98"> <emma:literal>jon smith</emma:literal> </emma:interpretation> <emma:interpretation id="asr_nbest3" emma:confidence="0.89" > <emma:literal>joan smithe</emma:literal> </emma:interpretation> </emma:one-of> </emma:derivation> </emma:emma> |
In the EMMA 1.0
specification, the semantic representation of an input is
represented either in XML in some application namespace or as a
literal value using emma:literal
. In some
circumstances it could be beneficial to allow for semantic
representation in other formats such as JSON. Serializations such
as JSON could potentially be contained within
emma:literal
using CDATA, and a new EMMA annotation
e.g. emma:semantic-rep
used to indicate the semantic
representation language being used.
Participant | Input | EMMA |
user | semantics of spoken input |
<emma:emma version="2.0" xmlns:emma="http://www.w3.org/2003/04/emma" xmlns="http://www.example.com/example"> <emma:interpretation id=“int1" emma:confidence=".75” emma:medium="acoustic" emma:mode="voice" emma:verbal="true" emma:function="dialog" emma:semantic-rep="json" <emma:literal> <![CDATA[ { drink: { liquid:"coke", drinksize:"medium"}, pizza: { number: "3", pizzasize: "large", topping: [ "pepperoni", "mushrooms" ] } } ]]> </emma:literal> </emma:interpretation> </emma:emma> |
EMMA 1.0 Requirements http://www.w3.org/TR/EMMAreqs/
EMMA Recommendation http://www.w3.org/TR/emma/
Thanks to Jim Larson (W3C Invited Expert) for his contribution to the section on EMMA for multimodal output.