ECE 259B Fundamentals of Speech Recognition
Lecture 1: Introduction/Overview of Automatic Speech Recognition
Why Digital Processing of
Speech?
digital processing of speech signals (DPSS)
enjoys an extensive theoretical and
experimental base developed over the past 75
years
much research has been done since 1965 on
the use of digital signal processing in speech
communication problems
highly advanced implementation technology
(VLSI) exists that is well matched to the
computational demands of DPSS
there are abundant applications that are in
widespread use commercially
The Speech Stack
Speech Applications: coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
Speech Algorithms: speech-silence, voiced-unvoiced, pitch, formants
Speech Representations: temporal, spectral, homomorphic, LPC
Fundamentals: acoustics, linguistics, pragmatics, speech perception
Speech Recognition-2001
(Stanley Kubrick View in 1968)
Apple Navigator -- 1988
The Speech Advantage
Reduce costs
reduce labor expenses while still providing
customers an easy-to-use and natural way to
access information and services
New revenue opportunities
24x7 high-quality customer care automation
access to information without a keyboard or
touch-tones
Customer retention
provide personal services for customer
preferences
improve customer experience
The Speech Circle
[Diagram: the speech circle. Customer voice request ("I dialed a wrong number") -> ASR (Automatic Speech Recognition): words spoken -> SLU (Spoken Language Understanding): meaning ("billing credit") -> DM & SLG (Dialog Management (actions) and Spoken Language Generation (words)): "What's next?", determine correct number -> TTS (Text-to-Speech Synthesis): voice reply to customer ("What number did you want to call?") -> back to the customer.]
Automatic Speech Recognition
Goal: Accurately and efficiently convert a
speech signal into a text message
independent of the device, speaker or the
environment.
Applications: Automation of complex
operator-based tasks, e.g., customer care,
dictation, form filling applications,
provisioning of new services, customer
help lines, e-commerce, etc.
Pattern Matching Problems
[Block diagram: speech -> A-to-D Converter -> Feature Analysis -> Pattern Matching -> symbols]
speech recognition
speaker recognition
speaker verification
word spotting
automatic indexing of speech recordings
Basic ASR Formulation (Bayes Method)
[Diagram: speaker's intention -> speech production mechanisms -> s(n) -> Acoustic Processor -> X -> Linguistic Decoder (with speaker model) -> Ŵ; the acoustic processor and linguistic decoder together form the speech recognizer]

Ŵ = argmax_W P(W | X)
  = argmax_W P(X | W) P(W) / P(X)
  = argmax_W P_A(X | W) P_L(W)

Step 1: the acoustic model P_A(X | W); Step 2: the language model P_L(W); Step 3: the search for the maximizing word sequence W.
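To make the maximization concrete, here is a minimal sketch of the decision rule in the log domain. The candidate word sequences and their scores are hypothetical placeholders for what the acoustic model, language model, and search would actually compute; the language model weight is a common practical detail not shown on the slide.

```python
# Hypothetical per-candidate scores; a real decoder evaluates these with
# HMM acoustic models (Step 1) and an N-gram language model (Step 2).
candidates = {
    "I dialed a wrong number": {"log_p_acoustic": -250.0, "log_p_lm": -12.0},
    "I dial the wrong number": {"log_p_acoustic": -255.0, "log_p_lm": -14.5},
    "eye dialed a wrong numb": {"log_p_acoustic": -249.0, "log_p_lm": -30.0},
}

def decode(candidates, lm_weight=1.0):
    # Step 3: argmax over W of log P_A(X|W) + lm_weight * log P_L(W).
    # P(X) is the same for every W, so it drops out of the argmax.
    return max(candidates,
               key=lambda w: candidates[w]["log_p_acoustic"]
                             + lm_weight * candidates[w]["log_p_lm"])

print(decode(candidates))   # -> "I dialed a wrong number"
```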
Steps in Speech Recognition
Step 1 - Acoustic Modeling: assign probabilities to acoustic realizations of a sequence of words. Compute P_A(X|W) using statistical models (hidden Markov models) of acoustic signals and words.
Step 2 - Language Modeling: assign probabilities to sequences of words in the language. Train P_L(W) from generic text or from transcriptions of task-specific dialogues.
Step 3 - Hypothesis Search: find the word sequence with the maximum a posteriori probability. Search through all possible word sequences to determine the argmax over W.
Step 1-The Acoustic Model
we build acoustic models by learning statistics of the acoustic features, X, from a training set where we compute the variability of the acoustic features during the production of the sounds represented by the models
it is impractical to create a separate acoustic model, P_A(X|W), for every possible word in the language; it requires too much training data for words in every possible context
instead we build acoustic-phonetic models for the ~50 phonemes in the English language and construct the model for a word by concatenating (stringing together sequentially) the models for the constituent phones in the word
similarly we build sentences (sequences of words) by concatenating word models
Step 2-The Language Model
the language model describes the probability of a sequence of words that form a valid sentence in the language
a simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N words; this is an N-gram language model:

P_L(W) = P_L(w_1, w_2, ..., w_k) = ∏_{n=1}^{k} P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N})

where P_L(w_n | w_{n-1}, w_{n-2}, ..., w_{n-N}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text
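A minimal sketch of this counting procedure for a bigram model (each word conditioned on one previous word), using a toy corpus of my own; real systems also smooth the estimates for unseen N-tuples, which the slide does not cover.

```python
from collections import Counter

corpus = [
    "yes on my credit card please".split(),
    "no credit please".split(),
    "yes please".split(),
]

# Count bigrams and history (previous-word) occurrences, with a sentence-start token.
bigrams, histories = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence
    for prev, word in zip(words[:-1], words[1:]):
        bigrams[(prev, word)] += 1
        histories[prev] += 1

def p_bigram(word, prev):
    # Relative-frequency estimate of P_L(w_n | w_{n-1}); 0 if the history is unseen.
    return bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0

print(p_bigram("please", "credit"))  # 0.5 in this toy corpus
```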
Step 3-The Search Problem
the search problem is one of searching the space of
all valid sound sequences, conditioned on the word
grammar, the language syntax, and the task
constraints, to find the word sequence with the
maximum likelihood
the size of the search space can be astronomically
large and take inordinate amounts of computing
power to solve by heuristic methods
the use of methods from the field of finite state automata theory provides finite state networks (FSNs) that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems
Basic ASR Formulation
The basic equation of Bayes rule-based speech recognition is

Ŵ = argmax_W P(W | X) = argmax_W P(W) P(X | W) / P(X) = argmax_W P(W) P(X | W)

where X = X_1, X_2, ..., X_N is the acoustic observation (feature vector) sequence, W = w_1 w_2 ... w_M is the corresponding word sequence, P(X|W) is the acoustic model, and P(W) is the language model.
[Diagram: s(n), W -> Speech Analysis -> X_n -> Decoder -> Ŵ]
The Speech Recognition Process
[Block diagram: input speech -> Feature Analysis (spectral analysis) -> Pattern Classification (decoding, search) -> Utterance Verification (confidence scores) -> recognized output, e.g., "Hello World" (0.9) (0.8); the pattern classifier uses the Acoustic Model (HMM), the Word Lexicon, and the Language Model (N-gram); the recognized string Ŵ feeds SLU, DM, and TTS]
Speech Recognition Processes
Choose task => sounds, word vocabulary, task syntax (grammar), task semantics
Text training data set => word lexicon, word grammar (language model), task grammar
Speech training data set => acoustic models
Speech testing data set => evaluate performance
Training algorithm => build models from training set of text and speech
Testing algorithm => evaluate performance from testing set of speech
Feature Extraction
Goal: extract robust features (information) from the speech that are relevant for ASR.
Method: spectral analysis through either a bank of filters or LPC, followed by a non-linearity and normalization (cepstrum).
Result: signal compression, where for each window of speech samples 30 or so cepstral features are extracted (64,000 b/s -> 5,200 b/s).
Challenges: robustness to environment (office, airport, car), devices (speakerphones, cellphones), speakers (accents, dialects, styles, speaking defects), noise and echo. Feature set for recognition: cepstral features or those from a high-dimensionality space.
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
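A minimal numpy sketch of the bank-of-filters/cepstrum idea: short-time spectral analysis, a simplified (linearly spaced, not mel-spaced) triangular filter bank, a log non-linearity, and a DCT to obtain cepstral coefficients per frame. The frame sizes, filter shapes, and 13 static coefficients are illustrative choices, not the slides' specification; a real front end adds delta features to reach the 30 or so features mentioned above.

```python
import numpy as np

def cepstral_features(signal, fs=8000, frame_len=200, hop=80,
                      n_filters=20, n_ceps=13):
    """Toy cepstral front end: window -> |FFT|^2 -> filter bank -> log -> DCT."""
    n_fft = 256
    window = np.hamming(frame_len)
    # Simplified triangular filter bank, linearly spaced across the spectrum.
    bins = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[m - 1, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    # DCT-II matrix that turns log filter-bank energies into cepstra.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_filters))
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        energies = np.log(fbank @ power + 1e-10)   # log non-linearity
        feats.append(dct @ energies)               # cepstral coefficients
    return np.array(feats)

x = np.random.randn(8000)            # 1 s of fake speech at 8 kHz
print(cepstral_features(x).shape)    # (frames, 13): one vector per 25 ms window
```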
Robustness
Problem: a mismatch in the speech signal between the training phase and the testing phase can result in performance degradation.
Methods: traditional techniques for improving system robustness are based on signal enhancement, feature normalization, and/or model adaptation.
Perception approach: extract fundamental acoustic information in narrow bands of speech; robustly integrate features across time and frequency.
[Challenge areas: Robustness, Rejection, Unlimited Vocabulary]
Methods for Robust Speech Recognition
[Diagram: the training/testing mismatch can be attacked at three points: signal enhancement (signal), feature normalization (features), and model adaptation (model)]
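Feature normalization is one of the three attack points named above. A minimal sketch of one common instance, per-utterance cepstral mean (and variance) normalization, applied to a frames x coefficients feature matrix such as the one produced by the earlier front-end sketch; this particular technique is listed later in the course outline and is assumed here as an example.

```python
import numpy as np

def cepstral_mean_variance_normalize(features, eps=1e-8):
    """Remove the per-utterance mean (and scale) from each cepstral dimension.

    A roughly stationary channel or microphone acts as an additive offset in
    the cepstral domain, so subtracting the utterance mean reduces the
    train/test mismatch described on the slide.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Usage: feats = cepstral_features(x); feats = cepstral_mean_variance_normalize(feats)
```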
Acoustic Model
Goal: map acoustic features into distinct phonetic labels (e.g., /s/, /aa/).
Hidden Markov Model (HMM): a statistical method for characterizing the spectral properties of speech by a parametric random process. A collection of HMMs is associated with each phone. HMMs are also assigned for modeling extraneous events.
Advantages: a powerful statistical method for dealing with a wide range of data and reliably recognizing speech.
Challenges: understanding the role of classification models (ML training) versus discriminative models (MMI training). What comes after the HMM: are there data-driven models that work better for some or all vocabularies?
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
HMM for Speech
Phone model for /Z/: a three-state HMM with states z1, z2, z3.
Word model for "is" (/IH/ /Z/): the concatenation of the /IH/ phone model (states ih1, ih2, ih3) and the /Z/ phone model (states z1, z2, z3).
Isolated Word HMM
[Diagram: a five-state left-right HMM with self-transitions a11, a22, a33, a44, a55 = 1; forward transitions a12, a23, a34, a45; skip transitions a13, a24, a35; and observation densities b1(Ot) through b5(Ot) attached to the states]
Left-right HMM: highly constrained state sequences
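A minimal sketch of the five-state left-right transition structure in the figure above; the probability values are illustrative, not taken from the slide, and the final comment restates how phone models are concatenated into word models as on the previous slide.

```python
import numpy as np

# Row i gives the transition probabilities a_ij out of state i (0-indexed).
# Only self-loops (a_ii), steps (a_i,i+1), and single-state skips (a_i,i+2)
# are allowed, so state sequences can only move left to right.
A = np.array([
    [0.6, 0.3, 0.1, 0.0, 0.0],   # a11 a12 a13
    [0.0, 0.6, 0.3, 0.1, 0.0],   # a22 a23 a24
    [0.0, 0.0, 0.6, 0.3, 0.1],   # a33 a34 a35
    [0.0, 0.0, 0.0, 0.7, 0.3],   # a44 a45
    [0.0, 0.0, 0.0, 0.0, 1.0],   # a55 = 1 (absorbing final state)
])
assert np.allclose(A.sum(axis=1), 1.0)   # each row is a probability distribution
assert np.allclose(A, np.triu(A))        # left-right: no backward transitions

# A word model (e.g., "is" = /IH/ + /Z/) is built by concatenating phone models:
# the final state of the /IH/ model feeds the initial state of the /Z/ model.
```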
Word Lexicon
Goal: map legal phone sequences into words according to phonotactic rules. For example,
  David: /d/ /ey/ /v/ /ih/ /d/
Multiple pronunciations: several words may have multiple pronunciations. For example,
  Data: /d/ /ae/ /t/ /ax/
  Data: /d/ /ey/ /t/ /ax/
Challenges: how do you generate a word lexicon automatically; how do you add new variant dialects and word pronunciations?
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
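A minimal sketch of a word lexicon as a mapping from words to one or more legal phone sequences, using the two examples from the slide; the lookup helper is hypothetical.

```python
# Word lexicon: each word maps to a list of legal pronunciations (phone sequences).
lexicon = {
    "david": [["d", "ey", "v", "ih", "d"]],
    "data":  [["d", "ae", "t", "ax"],      # one pronunciation
              ["d", "ey", "t", "ax"]],     # alternate pronunciation
}

def pronunciations(word):
    """Return all phone sequences for a word (empty list if out of vocabulary)."""
    return lexicon.get(word.lower(), [])

print(pronunciations("Data"))
```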
Language Model
Goal: map words into phrases and sentences based on task syntax.
Handcrafted: deterministic grammars that are knowledge-based. For example,
  "Yes on my credit (card) please"
Statistical: compute estimates of word probabilities (N-gram model). For example,
  "Yes on my credit card please" (with word probabilities such as 0.6 and 0.4)
Challenges: how do you build a language model rapidly for a new task?
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
Pattern Classification
Goal: combine information (probabilities) from the acoustic model, language model, and word lexicon to generate an optimal word sequence (highest probability).
Method: the decoder searches through all possible recognition choices using a Viterbi decoding algorithm.
Challenges: how do we build efficient structures (FSMs) for decoding and searching large-vocabulary, complex-language-model tasks?
  features x HMM units x phones x words x sentences can lead to search networks with 10^22 states
  FSM methods can compile the network down to 10^8 states, 14 orders of magnitude more efficient
  what is the theoretical limit of efficiency that can be achieved?
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
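A minimal sketch of Viterbi decoding over HMM states in the log domain. The transition matrix, initial distribution, and per-frame emission log-likelihoods (which a real system obtains from the acoustic model) are assumed inputs; the toy usage values are arbitrary.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most likely state sequence and its log score.

    log_A:  (S, S) log transition probabilities
    log_pi: (S,)   log initial state probabilities
    log_B:  (T, S) log emission likelihoods log b_j(O_t) for each frame
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]              # best score ending in each state
    back = np.zeros((T, S), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A    # scores[i, j]: best path into j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Trace back the best path from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy usage: a two-state left-right model and random emission scores.
rng = np.random.default_rng(0)
log_A = np.log(np.array([[0.7, 0.3], [0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 0.0]) + 1e-12)
log_B = rng.normal(size=(6, 2))
print(viterbi(log_A, log_pi, log_B))
```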
Unlimited Vocabulary ASR
The basic problem in ASR is to find the sequence of words that explains the input signal. This implies the following mapping: features -> HMM states -> HMM units -> phones -> words -> sentences.
For the WSJ 64,000-word vocabulary, this results in a network of 10^22 bytes!
State-of-the-art methods, including fast match, multi-pass decoding, and A* stack decoding, provide tremendous speed-up at a cost of increased complexity and less portability.
Advances in weighted finite state transducers have enabled us to represent this network in a unified mathematical framework with only 10^8 bytes!
[Challenge areas: Robustness, Rejection, Unlimited Vocabulary]
Weighted Finite State Transducers (WFST)
Unified mathematical framework for ASR
Efficiency in time and space
[Diagram: Word:Phrase, Phone:Word, HMM:Phone, and State:HMM WFSTs are combined and optimized into a single search network]
Weighted Finite State Transducer
Word pronunciation transducer for "data":
  d:ε/1 -> (ey:ε/0.4 | ae:ε/0.6) -> (dx:ε/0.8 | t:ε/0.2) -> ax:data/1
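A minimal sketch of the pronunciation transducer in the figure, represented simply as a list of weighted arcs (source state, destination state, input phone, output word, weight), with a helper that enumerates the phone sequences it accepts. A production system would build this with an FST toolkit and then compose and optimize it with the other transducers; the arc representation and helper here are my own illustration.

```python
EPS = "<eps>"

# (src, dst, input phone, output word, weight): pronunciation transducer for "data".
arcs = [
    (0, 1, "d",  EPS,    1.0),
    (1, 2, "ey", EPS,    0.4),
    (1, 2, "ae", EPS,    0.6),
    (2, 3, "dx", EPS,    0.8),
    (2, 3, "t",  EPS,    0.2),
    (3, 4, "ax", "data", 1.0),
]
FINAL_STATE = 4

def paths(state=0, phones=(), words=(), weight=1.0):
    """Enumerate (phone sequence, word sequence, weight) accepted by the transducer."""
    if state == FINAL_STATE:
        yield phones, words, weight
        return
    for src, dst, i, o, w in arcs:
        if src == state:
            new_words = words if o == EPS else words + (o,)
            yield from paths(dst, phones + (i,), new_words, weight * w)

for phones, words, weight in paths():
    print(" ".join(phones), "->", " ".join(words), f"(weight {weight:.2f})")
```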
Algorithmic Speed-up for Speech Recognition
[Chart: relative speed (0-30x) versus year (1994-2002), comparing AT&T algorithmic speed-ups and community results against Moore's Law (hardware). Task: North American Business, vocabulary: 40,000 words, branching factor: 85.]
Utterance Verification
Goal: identify possible recognition errors and out-of-vocabulary events; potentially improves the performance of ASR, SLU, and DM.
Method: a confidence score based on a hypothesis test is associated with each recognized word. For example:
  Label:       credit please
  Recognized:  credit fees
  Confidence:  (0.9)  (0.3)
Challenges: rejection of extraneous acoustic events (noise, background speech, door slams) without rejection of valid user input speech.
[Block diagram: Feature Extraction -> Pattern Classification, supported by the Acoustic Model, Language Model, Word Lexicon, and Utterance Verification]
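A minimal sketch of how per-word confidence scores like those in the example can be used to accept or flag recognized words. The threshold value is an illustrative choice; a real system derives the scores from a hypothesis test rather than a fixed table.

```python
recognized = [("credit", 0.9), ("fees", 0.3)]   # (word, confidence) from the recognizer

def verify(hypothesis, threshold=0.5):
    """Split a hypothesis into accepted words and words flagged for rejection
    or re-prompting, based on a simple confidence threshold."""
    accepted = [w for w, c in hypothesis if c >= threshold]
    flagged = [w for w, c in hypothesis if c < threshold]
    return accepted, flagged

print(verify(recognized))   # (['credit'], ['fees'])
```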
Rejection
Problem: extraneous acoustic events, noise, background speech, and out-of-domain speech deteriorate system performance.
Measure of confidence: associating word strings with a verification cost provides an effective measure of confidence (utterance verification).
Effect: improvement in the performance of the recognizer, the understanding system, and the dialogue manager.
[Challenge areas: Robustness, Rejection, Unlimited Vocabulary]
State-of-the-Art Performance?
[Block diagram: input speech -> Feature Extraction -> Pattern Classification (decoding, search) -> Utterance Verification -> recognized sentence, using the Acoustic Model, Word Lexicon, and Language Model; the output feeds SLU, DM, and TTS]
Word Error Rates

| CORPUS                                     | TYPE                     | VOCABULARY SIZE    | WORD ERROR RATE |
| Connected Digit Strings (TI Database)      | Spontaneous              | 11 (zero-nine, oh) | 0.3%            |
| Connected Digit Strings (Mall Recordings)  | Spontaneous              | 11 (zero-nine, oh) | 2.0%            |
| Connected Digit Strings (HMIHY)            | Conversational           | 11 (zero-nine, oh) | 5.0%            |
| RM (Resource Management)                   | Read Speech              | 1,000              | 2.0%            |
| ATIS (Airline Travel Information System)   | Spontaneous              | 2,500              | 2.5%            |
| NAB (North American Business)              | Read Text                | 64,000             | 6.6%            |
| Broadcast News                             | News Show                | 210,000            | 13-17%          |
| Switchboard                                | Conversational Telephone | 45,000             | 25-29%          |
| Call Home                                  | Conversational Telephone | 28,000             | 40%             |

Note the factor of ~17 increase in digit error rate from the read TI database (0.3%) to the conversational HMIHY data (5.0%).
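The error rates in the table count substitutions, deletions, and insertions against the reference transcription. A minimal sketch of that computation via word-level Levenshtein alignment; the example strings reuse the utterance-verification example.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn the first i reference words into the
    # first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("credit please", "credit fees"))   # 0.5
```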
North American Business
[Figure: North American Business task; vocabulary: 40,000 words; branching factor: 85]
Broadcast News
Dictation Machine
Algorithmic Accuracy for Speech Recognition
[Chart: word accuracy (30-80%) versus year (1996-2002) for the Switchboard/Call Home task; vocabulary: 40,000 words; perplexity: 85]
Vocabulary Size
[Chart: growth in effective recognition vocabulary size, on a log scale from 1 to 10,000,000 words, versus year (1960-2010)]
Human Speech Recognition vs ASR
Human Speech Recognition vs ASR
[Chart: machine error rate (%) versus human error rate (%) on log-log axes, with reference lines at x1, x10, and x100 and a region labeled "Machines Outperform Humans" below the x1 line; tasks plotted include Digits, RM-LM, RM-null, WSJ, WSJ-22dB, NAB-mic, NAB-omni, and SWBD]
Voice-Enabling Services: Technology Components
[Diagram: the speech circle (customer voice request -> ASR -> SLU -> DM & SLG -> TTS -> voice reply to customer), as shown on the earlier Speech Circle slide]
Spoken Language Understanding (SLU)
Goal: interpret the meaning of key words and phrases in the recognized speech string, and map them to actions that the speech understanding system should take
  accurate understanding can often be achieved without correctly recognizing every word
  SLU makes it possible to offer services where the customer can speak naturally without learning a specific set of terms
Methodology: exploit task grammar (syntax) and task semantics to restrict the range of meanings associated with the recognized word string; exploit salient words and phrases to map high-information word sequences to the appropriate meaning
Performance evaluation: accuracy of the speech understanding system on various tasks and in various operating environments
Applications: automation of complex operator-based tasks, e.g., customer care, catalog ordering, form filling systems, provisioning of new services, customer help lines, etc.
Challenges: what goes beyond simple classification systems but below full natural language voice dialogue systems?
Voice-Enabling Services: Technology Components
[Diagram: the speech circle, repeated (customer voice request -> ASR -> SLU -> DM & SLG -> TTS -> voice reply to customer)]
Dialog Management (DM)
Goal: combine the meaning of the current input with the interaction history to decide what the next step in the interaction should be
  DM makes viable complex services that require multiple exchanges between the system and the customer
  dialog systems can handle user-initiated topic switching (within the domain of the application)
Methodology: exploit models of dialog to determine the most appropriate spoken text string to guide the dialog forward towards a clear and well-understood goal or system interaction
Performance evaluation: speed and accuracy of attaining a well-defined task goal, e.g., booking an airline reservation, renting a car, purchasing a stock, obtaining help with a service
Applications: customer care (HMIHY), travel planning, conference registration, scheduling, voice access to unified messaging
Challenges: is there a science of dialogues; how do you keep a dialog efficient (turns, time, progress towards a goal); how do you attain goals (get answers)? Is there an art of dialogues? How does the user interface play into the art/science of dialogues? Sometimes it is better/easier/faster/more efficient to point, use a mouse, or type than to speak (multimodal interactions with machines).
Customer Care IVR and HMIHY(sm)
[Comparison of call flows. Conventional customer care IVR: sparkle tone, "Thank you for calling AT&T", network menu, LEC misdirect announcement, account verification routine, main menu, LD sub-menu, reverse directory routine; total time to get to reverse directory lookup: 2:55 minutes. HMIHY: sparkle tone, "AT&T, how may I help you?", account verification routine, reverse directory routine; total time to get to reverse directory lookup: 28 seconds.]
HMIHY(sm): How Does It Work?
Prompt is "AT&T. How may I help you?"
User responds with totally unconstrained fluent speech
System recognizes the words and determines the meaning of the user's speech, then routes the call
Dialog technology enables task completion
[Diagram: HMIHY routes the call to destinations such as Account Balance, Calling Plans, Local, Unrecognized Number, ...]
HMIHY(sm) Example Dialogs
Example dialogs: irate customer, rate plan, account balance, local service, unrecognized number, threshold billing, billing credit
Customer satisfaction: decreased repeat calls (37%), decreased OUTPIC rate (18%), decreased CCA (Call Control Agent) time per call (10%), decreased customer complaints (78%)
Customer Care Scenario
TTS: Closest to the Customer's Ear
[Diagram: the speech circle (customer voice request -> ASR -> SLU -> DM & SLG -> TTS -> voice reply to customer), with TTS delivering the voice reply, e.g., "What number did you want to call?"]
Speech Synthesis
[Block diagram: text -> Linguistic Rules -> DSP Computer -> D-to-A Converter -> speech]
Speech Synthesis
Synthesis of speech for effective human-machine communications:
reading email messages over a telephone
telematics feedback in automobiles
talking agents for completion of transactions
call center help desks and customer care
handheld devices such as foreign language
phrasebooks, dictionaries, crossword puzzle
helpers
announcement machines that provide things
like stock quotes, airlines schedules, updates
of arrivals and departures of flights
Giving Machines High Quality Voices and Faces
[Audio demos: U.S. English female, U.S. English male, Spanish female, and natural speech]
Speech Synthesis Examples
Soliloquy from Hamlet:
Gettysburg Address:
Third Grade Story:
Speech Recognition Demos
Au Clair de la Lune
Information Kiosk
Multimodal Language Processing
Unified multimodal experience: access to information through a voice interface, gesture, or both.
Multimodal finite state: combination of speech, gesture, and meaning using finite state technology.
MATCH: Multimodal Access To City Help (example query: "Are there any cheap Italian places in this neighborhood?")
MIPad Demo--Microsoft
Voice-Enabled Services
Desktop applications: dictation, command and control of the desktop, control of document properties (fonts, styles, bullets, ...)
Agent technology: simple tasks like stock quotes, traffic reports, weather; access to communications, e.g., voice dialing, voice access to directories (800 services); access to messaging (text and voice messages); access to calendars and appointments
Voice portals: convert any web page to a voice-enabled site where any question that can be answered on-line can be answered via a voice query; protocols like VXML, SALT, SMIL, SOAP and others are key
E-contact services: call centers, customer care (HMIHY) and help desks where calls are triaged and answered appropriately using natural language voice dialogues
Telematics: command and control of automotive features (comfort systems, radio, windows, sunroof)
Small devices: control of cellphones and PDAs from voice commands
Milestones in Speech and Multimodal Technology Research
[Timeline, roughly 1962-2002:]
Small vocabulary, acoustic-phonetics based: isolated words; filter-bank analysis, time normalization, dynamic programming
Medium vocabulary, template-based: isolated words, connected digits, continuous speech; pattern recognition, LPC analysis, clustering algorithms, level building
Large vocabulary, statistical-based: connected words, continuous speech; hidden Markov models, stochastic language modeling
Large vocabulary; syntax, semantics: continuous speech, speech understanding; stochastic language understanding, finite-state machines, statistical learning
Very large vocabulary; semantics, multimodal dialog, TTS: spoken dialog, multiple modalities; concatenative synthesis, machine learning, mixed-initiative dialog
Future of Speech Recognition Technologies
[Timeline:]
2002: very large vocabulary, limited tasks, controlled environment (dialog systems)
2005: very large vocabulary, limited tasks, arbitrary environment (robust systems)
2008: unlimited vocabulary, unlimited tasks, many languages (multilingual systems)
2011: multimodal speech-enabled devices
Issues in Speech Recognition
Issues in Speech Recognition
Speaker characteristics
  speaker trained vs. speaker independent
  amount of training material
Human factors
  highly motivated, cooperative
  casual
Speaking environment
  quiet office
  home
  noisy surroundings (factory floor, cellular environments, speakerphones)
Transducer and transmission system
  high quality microphone, close talking/noise cancelling microphone, telephone (carbon button/electret)
  switched telephone network, IP network (VoIP), cellular network
Feedback to users
  instructions, requests for repeats, rejections
  tolerance for recognition errors
Recognition task, vocabulary size and complexity (perplexity)
  small (2-50 words), medium (50-250 words), large (250-2,000,000 words)
  syntax constrained (language model), viable semantics
Input speech format / recognition mode
  isolated words/phrases
  connected word sequences
  continuous speech (essentially unconstrained)
Fail-soft systems
  human intervention on errors/confusion
  correction mechanisms built in
System complexity
  computation/hardware
  real-time response capability
Overview of Speech
Recognition Processes
Overview of Speech Recognition Processes
[Block diagram: speech (with transcriptions for training) -> feature measurement (temporal, spectral, cepstral, LPC features, e.g., log energy log E_n, zero crossings Z_n) -> pattern matching (template distance d(X,Y) with DTW, or VQ/HMM/finite-state models) against word/sound models or templates, constrained by a dictionary and syntax -> recognized input; templates and models are built during training]
Statistical Pattern Recognition
The basic speech recognition task may be defined as follows:
a sequence of measurements (speech analysis frames)
on the (endpoint detected) speech signal of an utterance
defines a pattern for that utterance
this pattern is to be classified as belonging to one of
several possible categories (classes) (for word/phrase
recognition) or to a sequence of possible categories (for
continuous speech recognition)
the rules for this classification are formulated on the
basis of a labeled set of training patterns or models
The type of measurement (temporal, spectral, cepstral, LPC
features) and the classification rules (pattern alignment and
distance, model alignment and probability) are the main
factors that distinguish one method of speech recognition from
another
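One classical instance of the "pattern alignment and distance" classification rule is dynamic time warping of a test pattern against stored templates, covered later in the course. A minimal sketch, assuming each pattern is a sequence of feature vectors (frames x dimensions) and using Euclidean frame distance; the templates in the usage example are random stand-ins.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping distance between two feature sequences.

    X: (Tx, D) test pattern, Y: (Ty, D) reference template.
    Accumulates the cost of the best monotonic frame-to-frame alignment.
    """
    Tx, Ty = len(X), len(Y)
    D = np.full((Tx + 1, Ty + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],      # stretch the template
                                 D[i, j - 1],      # stretch the test pattern
                                 D[i - 1, j - 1])  # advance both
    return D[Tx, Ty]

# Classify an utterance by the nearest template (isolated-word recognition).
rng = np.random.default_rng(1)
templates = {"yes": rng.normal(size=(40, 13)), "no": rng.normal(size=(35, 13))}
test = templates["yes"] + 0.1 * rng.normal(size=(40, 13))
print(min(templates, key=lambda w: dtw_distance(test, templates[w])))  # "yes"
```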
Issues in Pattern Recognition
Training phase is required; the more training data, the better the
patterns (templates, models)
Patterns are sensitive to the speaking environment, transmission
environment, transducer (microphone), etc. (This problem is known
as the speech robustness problem).
No speech specific knowledge is required or exploited, except in the
feature extraction stage
Computational load is (more or less) linearly proportional to the
number of patterns being recognized (at least for simple recognition
problems, e.g., isolated word tasks)
Pattern recognition techniques are applicable to a range of speech
units, including phrases, words, and sub-word units (phonemes,
syllables, dyads, etc.)
Extensions possible to large vocabulary, fluent speech recognition
using word lexicons (dictionaries) and language models (grammars
or syntax)
Extensions possible to natural language understanding systems
Speech Recognition Processes
1. Fundamentals (Lectures 2-6)
speech production (acoustic-phonetics, linguistics)
speech perception (auditory (ear) models, neural
models)
pattern recognition (statistical, template-based)
neural networks (classification methods)
2. Speech/Endpoint Detection (Lecture 7)
algorithms
speech features
Speech Recognition Processes
3. Speech Analysis/Feature Extraction (Lectures 7-9)
temporal parameters (log energy, zero crossings,
autocorrelation)
spectral parameters (STFT, OLA, FBS, spectrograms)
cepstral parameters (cepstrum, delta cepstrum, delta-delta cepstrum)
LPC parameters (reflection coefs, area coefs, LSP)
4. Distance/Distortion Measures (Lecture 10)
temporal (quadratic, weighted)
spectral (log spectral distance)
cepstral (cepstral distance)
LPC (Itakura distance)
Speech Recognition Processes
5. Time Alignment Algorithms (Lectures 11-12)
linear alignments
dynamic time warping (DTW, dynamic programming)
HMM alignments (Viterbi alignment)
6. Model Building/Training (Lectures 13-14)
template methods
clustering methods
HMM methods (Viterbi, forward-backward)
vector quantization (VQ) methods
Speech Recognition Processes
7. Connected word modeling (Lecture 15)
dynamic programming
level building
one pass method
8. Testing/Evaluation Methods (Lecture 16)
word/sound error rates
dictionary of words
task syntax
task semantics
task perplexity
Speech Recognition Processes
9. Large Vocabulary Recognition (Lectures 17-18)
phoneme models
context dependent models
discrete, mixed, continuous density models
N-gram language models
Natural language understanding
insertions, deletions, substitutions
other factors
Putting It All Together
Speech Recognition Course
Topics
What We Will Be Learning
speech production model: acoustics, articulatory concepts, speech production models
speech perception model: ear models, auditory signal processing, equivalent acoustic processing models
signal processing approaches to speech recognition: acoustic-phonetic methods, pattern recognition methods, statistical methods, neural network methods
fundamentals of pattern recognition
signal processing methods: bank-of-filters model, short-time Fourier transforms, LPC methods, cepstral methods, perceptual linear prediction, mel cepstrum, vector quantization
pattern recognition issues: speech detection, distortion measures, time alignment and normalization, dynamic time warping
speech system design issues: source coding, template training, discriminative methods
robustness issues: spectral subtraction, cepstral mean subtraction, model adaptation
Hidden Markov Model (HMM) fundamentals: design issues
connected word models: dynamic programming, level building, one pass method
grammar networks: finite state machine (FSM) basics
large vocabulary speech recognition: training, language models, perplexity, acoustic models for context-dependent sub-word units
task-oriented designs: natural language understanding, mixed initiative systems, dialog management
text-to-speech synthesis: based on unit selection methods