Nishimoto Interspeech 2010 v3

The comparison between
the Deletion-Based Method
and the Mixing-Based Method
for Audio CAPTCHAs

Takuya NISHIMOTO (Univ. Tokyo, Japan)
Takayuki WATANABE (TWCU, Japan)
Interspeech 2010 Mon-Ses2-P3

1

CAPTCHA
 Completely Automated Public Turing test
to tell Computers and Humans Apart
 a popular security technique on the Web
 prevents abuse by automated programs
 image-based CAPTCHAs
 images containing distorted characters
 prevent use by persons with visual disabilities
 audio CAPTCHAs were created
 create better audio CAPTCHA tasks
 safeness: the difference in recognition performance between humans and machines
 usability: the mental workload of humans listening to speech

2

Performance gap model
 performance of the machine should be lower
than the intelligibility for humans
 gap: safeness
 should be large
 exposed ratio (ER)
 0%: random answer
 chance-level; no gap
 100%: best guess
 easy for both; no gap
 practical condition
 0 < ER < 100
[Figure: intelligibility (%) of human and ASR versus exposed ratio (%) (provided information); both axes run from 0 to 100]
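
The gap described on this slide can be written as a simple difference in percentage points. The sketch below is only an illustration: the function names are hypothetical, and the numbers in `example` are placeholder values chosen to mirror the shape of the model (no gap at ER 0% and 100%), not measurements from this study.

```python
def performance_gap(human_pct, machine_pct):
    """Gap in percentage points between human intelligibility and machine accuracy."""
    return human_pct - machine_pct


def widest_gap(conditions):
    """Return (exposed_ratio, gap) for the condition with the largest gap.

    `conditions` maps an exposed ratio (%) to a (human_pct, machine_pct) pair.
    """
    gaps = {er: performance_gap(h, m) for er, (h, m) in conditions.items()}
    best = max(gaps, key=gaps.get)
    return best, gaps[best]


# Hypothetical placeholder values, NOT results from the experiments.
example = {0: (10.0, 10.0), 50: (80.0, 30.0), 100: (100.0, 100.0)}
```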
3

Safeness: ER control
 machines are becoming stronger
 statistical ASR methods are the mainstream
 supervised machine learning (Hidden Markov Models)
 techniques to cope with noise
 CAPTCHA tasks should be created systematically
 they should not be created by trial and error
 controllability of the Exposed Ratio is essential
 Mixing-based method: best way to control ER?
 mixing noises / distorting signals
 can hide a portion of the information, however...
 ER is difficult to measure, and performance is not easy to predict
 alternatives must be investigated

4

Usability: Mental workload
 CAPTCHAs should not increase mental workload
 the workload may increase if tasks are...
 difficult to listen to or memorize
 long (many characters)
 difficult to remember
 safer, but higher mental workload
 requirements
 information can be obtained quickly and easily
 investigation required
 human auditory sensation
 language cognition

5

Top-down knowledge
 incomplete stimulus
 knowledge helps to guess the information
 visual sensation
 if part of an image is missing, or part of a word is hidden
 common knowledge can fill in the missing part
 about the character and the vocabulary
 speech perception
 if "word familiarity" is high: easy to guess
 phonemic restoration
 may help human listening

6

Deletion-based method
 delete some parts along the temporal axis, little by little
 if 30 ms of every 100 ms period is replaced with
silence, 30% of the information is deleted (D70)
 if the ratio of remaining sections goes down, the degree of
listening difficulty may increase.
 Exposed Ratio can be controlled easily
 however, the result is not easy to understand
[Audio sample: deletion (original); Festival engine, KAL (HMM-based)]
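
The periodic deletion described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and placing the silent section at the start of each period is an assumption (the slide only gives the proportion).

```python
def exposed_ratio(period_ms, deleted_ms):
    """Percentage of each period that remains audible (e.g. 70 for D70)."""
    return 100.0 * (period_ms - deleted_ms) / period_ms


def delete_periodically(samples, sample_rate, period_ms=100, deleted_ms=30):
    """Replace `deleted_ms` of every `period_ms` with silence (zeros).

    Assumption: the silent section sits at the start of each period.
    """
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = 0
    return out
```

With the defaults, 30 ms of every 100 ms is silenced, which corresponds to the D70 condition; D30 would use `deleted_ms=70`.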
7

Phonemic restoration
 interrupted speech and noise maskers combined
 the fence effect
 continuity of speech signal perceived
 may help human listening
 does not affect machine performance
 expected to enlarge the gap
 performance difference of human and machine

[Audio sample: deletion + phonemic restoration]
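
One way to combine interrupted speech with a noise masker is to fill each deleted section with noise instead of silence. The sketch below assumes uniform white noise, a hypothetical `noise_amplitude` parameter, and gaps at the start of each period; the slides do not specify the masker that was actually used.

```python
import random


def fill_gaps_with_noise(samples, sample_rate, period_ms=100, deleted_ms=30,
                         noise_amplitude=0.1, seed=0):
    """Replace `deleted_ms` of every `period_ms` with uniform white noise.

    Assumptions: gaps sit at the start of each period, and the masker is
    uniform noise in [-noise_amplitude, +noise_amplitude].
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    period = int(sample_rate * period_ms / 1000)
    deleted = int(sample_rate * deleted_ms / 1000)
    out = list(samples)
    for start in range(0, len(out), period):
        for i in range(start, min(start + deleted, len(out))):
            out[i] = rng.uniform(-noise_amplitude, noise_amplitude)
    return out
```

The remaining speech sections are left untouched, so the exposed ratio is the same as in the plain deletion case.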

8

NASA-TLX evaluation
 mental workload
 rating on 6 subscales
 Mental, Physical, and Temporal Demands,
Frustration, Effort, and Performance
 range: 0-100
 weights of subscales (6 to 1)
 set for each participant
 by ranking how strongly the 6 dimensions relate
to their personal definition of workload
 weighted workload (WWL)
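
A rank-weighted workload score can be sketched as below. The weights 6 to 1 by rank follow the slide; dividing by the sum of the weights (21) so that WWL stays in the 0-100 range is an assumption, and the function and constant names are hypothetical.

```python
SUBSCALES = ("Mental", "Physical", "Temporal",
             "Frustration", "Effort", "Performance")


def weighted_workload(ratings, ranking):
    """Weighted workload (WWL) from 0-100 ratings and a rank order.

    `ratings` maps each subscale to a 0-100 rating.
    `ranking` lists the subscales from most to least relevant; they get
    weights 6, 5, ..., 1. Normalizing by the weight sum is an assumption.
    """
    if sorted(ranking) != sorted(SUBSCALES):
        raise ValueError("ranking must contain each subscale exactly once")
    weights = {name: 6 - i for i, name in enumerate(ranking)}
    total = sum(weights.values())  # 6 + 5 + 4 + 3 + 2 + 1 = 21
    return sum(weights[name] * ratings[name] for name in SUBSCALES) / total
```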

9

Deletion vs Mixing (Exp1)
 objective: compare intelligibility and mental workload
 Deletion-Based Method (DBM)
 Mixing-Based Method (MBM)
 effect of SNR (signal-to-noise ratio) in MBM
 human intelligibility test
 75 utterances: 3-, 4-, and 5-digit numbers (3 x 25)
 Japanese recorded speech
 subjects: 15 (5 x 3) undergraduate students
 mental workload (WWL) by NASA-TLX
 normalized within each subject
 so that the average and SD become 50 and 10, respectively
 automatic speech recognition using HMM
 task: numbers (1-7 digits) in Japanese
 training: 8440 utterances, 18 states, 20 mixtures
 evaluation: 1001 utterances, sentence recognition
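
The within-subject normalization of WWL (average 50, SD 10) amounts to a rescaled z-score. The sketch below assumes the population standard deviation; the slides do not say whether the population or sample SD was used, and the function name is hypothetical.

```python
import statistics


def normalize_scores(scores, target_mean=50.0, target_sd=10.0):
    """Rescale one subject's scores to the given mean and SD.

    Assumption: population SD (statistics.pstdev) is used.
    """
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        # All scores identical: every score maps to the target mean.
        return [target_mean for _ in scores]
    return [target_mean + target_sd * (s - mean) / sd for s in scores]
```

Applying this per subject removes differences in how each person uses the rating scale before conditions are compared.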
10

Setup (Exp1)
 compare DBM and MBM within a person
 acoustic presentation: given by headphone
 at the subject’s preferred reference loudness level
 MBM disturbing signals
 utterances of Japanese sentences,
fragmented into short segments, shuffled, and combined
Group  Trial 1: D30  Trial 2: M0, Mm10, Mm20
G1     DBM 30%       MBM SNR 0dB
G2     DBM 30%       MBM SNR -10dB
G3     DBM 30%       MBM SNR -20dB
[Figure: MBM (Exp1): sentence recognition using HTK (%) for M0, Mm10, and Mm20; vertical axis 0 to 80]
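
Mixing a disturbing signal at a target SNR, as in the M0, Mm10, and Mm20 conditions, can be sketched by scaling the noise so that the power ratio matches. This is an illustration under stated assumptions: SNR is taken as the ratio of mean powers over the whole utterance, and the function names are hypothetical.

```python
import math


def mean_power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that the mix has the given SNR (dB).

    Assumption: SNR = 10 * log10(speech power / noise power), with power
    averaged over the whole utterance.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]
```

For example, `mix_at_snr(speech, noise, -10)` would correspond to the Mm10 condition, where the disturbing signal is 10 dB stronger than the speech.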
11

Performance (Exp1)
DBM (T1): marginally significant difference between groups (p<0.1) (G1>G2)
DBM 30% task is harder than MBM at 0dB, -10dB, and -20dB
MBM (T2): the effect of SNR condition is significant, but
only between 0dB and -10dB (p<0.05) (G1>G2)
[Figure: performance of each subject (s101 to s305) in T1 and T2; panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB; vertical axis 30 to 100]

12

Workload difference (Exp1)
 WWL: individual differences cancelled
 the DBM (D30) score was subtracted
from the MBM (M0, Mm10, and Mm20) score
[Figure: WWL difference for each subject (s101 to s305); panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB; vertical axis -60 to 20]
[Figure: average WWL difference: M0-D30 (16.2), Mm10-D30 0.7, Mm20-D30 1.0; vertical axis -20 to 20]
WWL: MBM 0dB < DBM 30%?
no significance (ANOVA)
MBM: task difficulty is not easy to control
13

DBM exposed ratio (Exp2)
 DBM: Exposed Ratio can control the gap size
[Figure: human average (%) and machine (%) performance at DBM 30%, 50%, and 70%; vertical axis 30 to 100]
[Figure: workload at DBM 30%, 50%, and 70%; vertical axis 30 to 70]
Significant difference (p<0.05)
DBM 30%: gap is very large, however, workload is very high.

14

Discussion
 D30 (DBM) & Mm10 (MBM) can be the benchmarks
 for the purpose of comparison between MBM and DBM
 performance differences are close (43.7 pt and 44.8 pt)
 WWL values are also very close (W_Mm10 - W_D30 = 0.7)
[Figure: performance difference between human and machine (pt) for M0, Mm10, Mm20, D70, D50, and D30; vertical axis 0 to 80]

15

Conclusion
 audio CAPTCHA task using phonemic restoration
 deletion-based method (DBM)
 evaluation of CAPTCHA task
 performance + mental workload (NASA-TLX)
 comparison between DBM and MBM
 DBM: easier to control the task
 future work
 improve the noise
 investigation of phonemic restoration
 does it really improve performance, or only decrease workload?
 word familiarity, speech rate, synthesized speech, ...

16
