The comparison between the Deletion-Based Method and the Mixing-Based Method for Audio CAPTCHAs
Interspeech 2010, Mon-Ses2-P3, Makuhari, Chiba, Japan
1. The comparison between
the Deletion-Based Method
and the Mixing-Based Method
for Audio CAPTCHAs
Takuya NISHIMOTO (Univ. Tokyo, Japan)
Takayuki WATANABE (TWCU, Japan)
Interspeech 2010 Mon-Ses2-P3
2. CAPTCHA
Completely Automated Public Turing test
to tell Computers and Humans Apart
popular security techniques on the Web
prevent automated programs from abusing
image-based CAPTCHAs
image containing distorted characters
prevent use by persons with visual disabilities
audio CAPTCHAs were created
create better audio CAPTCHA tasks
safeness: the difference in recognition performance between human and machine
usability: mental workload of humans when listening to speech
3. Performance gap model
performance of machine should be lower
than the intelligibility of human
gap: safeness
should be large
exposed ratio (ER)
0%: random answer
chance-level; no gap
100%: best guess
easy for both; no gap
practical condition
0 < ER < 100
[Figure: Intelligibility (%) vs. Exposed Ratio (%) (Provided Information), both axes 0-100, with curves for Human and ASR]
4. Safeness: ER control
machine is becoming strong
statistical ASR method is the mainstream
supervised machine learning (Hidden Markov Models)
techniques to cope with the noise
CAPTCHA tasks should be created systematically
it should not be created by trial and error
controllability of Exposed Ratio is essential
Mixing-based method: best way to control ER?
mixing noises / distorting signals
can hide portion of information, however...
difficult to measure the ER, performance is not easy to predict
alternatives must be investigated
5. Usability: Mental workload
CAPTCHAs should not increase mental workload
the workload may increase, if they are...
difficult to listen / memorize the task
long task (many characters)
difficult to remember
safer, but higher mental workload
requirements
information can be obtained in short time, easily
investigation required
human auditory sensation
language cognition
6. Top-down knowledge
incomplete stimulus
knowledge helps to guess the information
visual sensation
if part of image is missing, or part of the word is hidden
common knowledge can complement image
about the character and the vocabulary
speech perception
if "word familiarity" is high: easy to guess
phonemic restoration
may help the human listening
7. Deletion-based method
delete some parts on temporal axis little by little
if 30 msec out of every 100 msec period is replaced with
silence, 30% of the information is deleted (D70)
if the ratio of remaining sections goes down, the degree of
listening difficulty may increase.
Exposed Ratio can be controlled easily
however, not easy to understand....
deletion (original)
Festival engine
KAL (HMM-based)
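The deletion step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to silence the first part of each period, and the plain list-of-floats signal representation are all assumptions.

```python
def delete_periodically(samples, sample_rate, period_ms=100, deleted_ms=30):
    """Replace part of every period with silence.

    Assumption: the silenced section sits at the start of each period;
    the slides do not say where in the period the deletion occurs.
    """
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = 0.0  # silence
    return out


def exposed_ratio(period_ms, deleted_ms):
    """Percentage of each period that is left untouched (the ER)."""
    return 100.0 * (period_ms - deleted_ms) / period_ms
```

With a 100 msec period and 30 msec deleted, exposed_ratio returns 70, which corresponds to the D70 condition; deleting 70 msec of every 100 msec would give D30.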
8. Phonemic restoration
interrupted speech and noise maskers combined
the fence effect
continuity of speech signal perceived
may help human listening
does not affect machine performance
expected to enlarge the gap
performance difference of human and machine
deletion +
phonemic restoration
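One way to combine interrupted speech with a noise masker is to fill the deleted sections with noise instead of silence. The sketch below assumes uniform white noise, a fixed noise level, and deletion at the start of each period; none of these details are given in the slides, and the function name is hypothetical.

```python
import random


def fill_gaps_with_noise(samples, sample_rate, period_ms=100, deleted_ms=30,
                         noise_level=0.1, seed=0):
    """Replace part of every period with noise rather than silence.

    Assumptions: uniform white noise in [-noise_level, noise_level],
    placed at the start of each period. A fixed seed keeps the output
    reproducible.
    """
    rng = random.Random(seed)
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = rng.uniform(-noise_level, noise_level)
    return out
```

The exposed ratio is the same as in the silence-based deletion, since the same sections of speech are removed; only what fills the gaps differs.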
9. NASA-TLX evaluation
mental workload
rating 6 subscales
Mental, Physical, and Temporal Demands,
Frustration, Effort, and Performance
range: 0-100
weights of subscales (6-1)
for each participant
placing an order
how the 6 dimensions are related
to personal definition of workload
weighted workload (WWL)
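The weighted workload can be sketched as a weighted mean of the six ratings. The rank-based weights (6 for the most important subscale down to 1) follow the slide; dividing by the sum of the weights (21) is an assumption made here so that the result stays in the 0-100 range. Names are hypothetical.

```python
SUBSCALES = ["mental", "physical", "temporal",
             "frustration", "effort", "performance"]


def weighted_workload(ratings, importance_order):
    """Compute a weighted workload (WWL) score.

    ratings: dict mapping each subscale to a 0-100 rating.
    importance_order: the six subscales, most important first,
        as ordered by the participant.
    Assumption: WWL is the weighted mean, i.e. divided by 6+5+...+1 = 21.
    """
    if sorted(importance_order) != sorted(SUBSCALES):
        raise ValueError("importance_order must list each subscale once")
    n = len(importance_order)
    # The first-ranked subscale gets weight 6, the last gets weight 1.
    weights = {name: n - rank for rank, name in enumerate(importance_order)}
    total = sum(weights.values())
    return sum(ratings[name] * weights[name] for name in SUBSCALES) / total
```

Because the weights come from each participant's own ordering, two participants with identical ratings can still have different WWL scores.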
10. Deletion vs Mixing (Exp1)
objective: compare intelligibility and mental workload
Deletion-Based Method (DBM)
Mixing-Based Method (MBM)
effect of SNR (signal-to-noise ratio) in MBM
human intelligibility test
75 utterances: 3,4,5 digits numbers (3 x 25)
Japanese recorded speech
subjects: 15 (5 x 3) undergraduate students
mental workload (WWL) by NASA-TLX
normalized within every subject
their average and SD become 50 and 10 respectively
automatic speech recognition using HMM
task: numbers (1-7 digits) in Japanese
training: 8440 utterances, 18 states, 20 mixtures
evaluation: 1001 utterances, sentence recognition
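The within-subject normalization of WWL mentioned above (average 50, SD 10) can be sketched as a linear rescaling of each subject's scores. Whether the original analysis used the population or the sample standard deviation is not stated; the population SD is assumed here, and the function name is hypothetical.

```python
import statistics


def normalize_within_subject(scores, target_mean=50.0, target_sd=10.0):
    """Rescale one subject's scores to the target mean and SD.

    Assumption: population standard deviation (statistics.pstdev).
    If all scores are equal, every value maps to the target mean.
    """
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return [target_mean for _ in scores]
    return [target_mean + target_sd * (s - mean) / sd for s in scores]
```

This removes differences in how individual subjects use the rating scale, so that scores can be compared across subjects.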
11. Setup (Exp1)
compare DBM and MBM within a person
acoustic presentation: given by headphone
at the subject’s preferred reference loudness level
MBM disturbing signals
utterances of Japanese sentences
fragmented as short periods, shuffled and combined
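The construction of the MBM disturbing signal and its mixing at a given SNR can be sketched as below. The fragment length, the repetition of the masker to cover the target, and the power-based SNR definition are assumptions; the slides only state that sentence utterances were fragmented, shuffled and combined. Function names are hypothetical.

```python
import math
import random


def make_masker(sentences, fragment_len, length, seed=0):
    """Cut sentence signals into fragments, shuffle them, and concatenate.

    sentences: list of sample lists (the disturbing utterances).
    The result is repeated if needed and trimmed to `length` samples.
    """
    rng = random.Random(seed)
    fragments = []
    for signal in sentences:
        for i in range(0, len(signal), fragment_len):
            fragments.append(signal[i:i + fragment_len])
    rng.shuffle(fragments)
    masker = [x for frag in fragments for x in frag]
    if not masker:
        raise ValueError("no masker samples available")
    while len(masker) < length:
        masker = masker + masker
    return masker[:length]


def mix_at_snr(speech, masker, snr_db):
    """Add the masker to the speech, scaled to the requested SNR in dB.

    Assumption: SNR is the ratio of mean signal power to mean masker power.
    """
    p_speech = sum(x * x for x in speech) / len(speech)
    p_masker = sum(x * x for x in masker) / len(masker)
    gain = math.sqrt(p_speech / (p_masker * 10 ** (snr_db / 10)))
    return [s + gain * m for s, m in zip(speech, masker)]
```

Calling mix_at_snr with snr_db set to 0, -10 and -20 corresponds to the M0, Mm10 and Mm20 conditions.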
Group   Trial 1: D30   Trial 2: M0, Mm10, Mm20
G1      DBM 30%        MBM SNR 0dB
G2      DBM 30%        MBM SNR -10dB
G3      DBM 30%        MBM SNR -20dB
[Figure: MBM (Exp1), sentence recognition using HTK (%), for M0, Mm10, Mm20]
12. Performance (Exp1)
DBM (T1): marginally significant (p<0.1) (G1>G2)
DBM 30% task is harder than MBM 0dB, -10dB, -20dB
MBM (T2): effect of SNR conditions is significant, however,
only between 0dB & -10dB (p<0.05) (G1>G2)
[Figure: per-subject results for T1 and T2 (subjects s101-s105, s201-s205, s301-s305); panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]
13. Workload difference (Exp1)
WWL: individual difference cancelled
subtraction of DBM (D30) score
from MBM (M0, Mm10 and Mm20) score was performed
[Figure: per-subject WWL difference (subjects s101-s105, s201-s205, s301-s305); panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]
[Figure: average WWL difference for M0-D30, Mm10-D30, Mm20-D30; values shown in the plot: 0.7, 1.0, (16.2)]
WWL: MBM 0dB < DBM 30% ?
no significance (ANOVA)
MBM: task difficulty is not easy to control
14. DBM exposed ratio (Exp2)
DBM: Exposed Ratio can control the gap size
[Figure: Human Ave. (%) and Machine (%) at DBM 30%, 50%, 70%]
[Figure: Workload at DBM 30%, 50%, 70%]
DBM 30%
gap is very large, however,
workload is very high.
Significant difference (p<0.05)
15. Discussion
D30 (DBM) & Mm10 (MBM) can be the benchmarks
for the purpose of comparison between MBM and DBM
performance differences are close (43.7pt & 44.8pt)
WWL are also very close (WMm10 - WD30 = 0.7)
[Figure: performance difference between human and machine (pt) for M0, Mm10, Mm20, D70, D50, D30]
16. Conclusion
audio CAPTCHA task using phonemic restoration
deletion-based method (DBM)
evaluation of CAPTCHA task
performance + mental workload (NASA-TLX)
comparison between DBM and MBM
DBM: easier to control the task
future work
improve the noise
investigation of phonemic restoration
really improving performance? only decreasing workload?
word familiarity, speech rate, synthesized speech, ...