The comparison between
 the Deletion-Based Method
and the Mixing-Based Method
     for Audio CAPTCHAs

  Takuya NISHIMOTO (Univ. Tokyo, Japan)
   Takayuki WATANABE (TWCU, Japan)
       Interspeech 2010 Mon-Ses2-P3

CAPTCHA
   Completely Automated Public Turing test
    to tell Computers and Humans Apart
       a popular security technique on the Web
            prevents automated programs from abusing services
       image-based CAPTCHAs
            image containing distorted characters
            prevent use by persons with visual disabilities
       audio CAPTCHAs were created
   goal: create better audio CAPTCHA tasks
       safeness: the difference in recognition performance between human and machine
       usability: mental workload of humans when listening to speech



Performance gap model
   performance of machine should be lower
       than the intelligibility of human
   gap: safeness
       should be large
   exposed ratio (ER)
       0%: random answer
            chance-level; no gap
       100%: best guess
            easy for both; no gap
   practical condition
       0 < ER < 100
   [Figure: Intelligibility (%) of Human and ASR versus Exposed Ratio (%) (Provided Information), both axes from 0 to 100]
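The gap model above can be written compactly. The symbols G, I_human and I_ASR are notation introduced here for illustration and do not appear on the slides; the two boundary cases restate the "no gap" bullets.

```latex
% Assumed notation: I_human(ER) and I_ASR(ER) are the intelligibility (%)
% of human listeners and of the ASR system at exposed ratio ER (%).
\[
  G(\mathrm{ER}) = I_{\mathrm{human}}(\mathrm{ER}) - I_{\mathrm{ASR}}(\mathrm{ER}),
  \qquad 0 \le \mathrm{ER} \le 100
\]
\[
  G(0) \approx 0 \ \text{(both at chance level)}, \qquad
  G(100) \approx 0 \ \text{(easy for both)}
\]
```

A practical task would then choose an ER strictly between 0 and 100 where G is large.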
Safeness: ER control
   machines are becoming stronger
       statistical ASR methods are the mainstream
       supervised machine learning (Hidden Markov Models)
       techniques to cope with noise
   CAPTCHA tasks should be created systematically
       they should not be created by trial and error
       controllability of Exposed Ratio is essential
   Mixing-based method: best way to control ER?
       mixing noises / distorting signals
            can hide a portion of the information, however...
            the ER is difficult to measure and performance is not easy to predict
       alternatives must be investigated

Usability: Mental workload
   CAPTCHAs should not increase mental workload
   the workload may increase if the tasks are
       difficult to listen to or to memorize
   long tasks (many characters)
       difficult to remember
       safer, but higher mental workload
   requirements
       information can be obtained easily and in a short time
   investigation required
       human auditory sensation
       language cognition

Top-down knowledge
   incomplete stimulus
       knowledge helps to guess the information
   visual sensation
       if part of an image is missing, or part of a word is hidden
       common knowledge can complete the image
            knowledge about the characters and the vocabulary
   speech perception
       if "word familiarity" is high: easy to guess
   phonemic restoration
       may help human listening



Deletion-based method
   delete small parts along the temporal axis, little by little
       if 30 msec out of every 100 msec period is replaced with
        silence, 30% of the information is deleted (D70)
       if the ratio of remaining sections goes down, the degree of
        listening difficulty may increase
   Exposed Ratio can be controlled easily
   however, the result is not easy to understand
   [Audio samples: deletion (original); Festival engine, KAL (HMM-based)]
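The periodic deletion described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the plain list-of-floats signal representation, and the choice to silence the tail (rather than the head) of each period are all assumptions.

```python
def exposed_ratio(period_ms, deleted_ms):
    """Exposed Ratio (%) that remains after periodic deletion."""
    return 100.0 * (period_ms - deleted_ms) / period_ms


def delete_periodically(samples, sample_rate, period_ms=100.0, deleted_ms=30.0):
    """Replace the last `deleted_ms` of every `period_ms` window with silence.

    Assumption: the deleted part sits at the end of each period; the slides
    do not say where in the period the silence is placed.
    """
    if period_ms <= 0 or not 0.0 <= deleted_ms <= period_ms:
        raise ValueError("need 0 <= deleted_ms <= period_ms and period_ms > 0")
    period = round(sample_rate * period_ms / 1000.0)
    kept = round(sample_rate * (period_ms - deleted_ms) / 1000.0)
    if period <= 0:
        raise ValueError("period is shorter than one sample")
    # Keep the first `kept` samples of each period, silence the rest.
    return [x if i % period < kept else 0.0 for i, x in enumerate(samples)]


# D70: 30 msec of every 100 msec is silenced, so 70% stays exposed.
signal = [1.0] * 16000  # one second of a dummy signal at 16 kHz
d70 = delete_periodically(signal, 16000, period_ms=100.0, deleted_ms=30.0)
```

Under this naming, D30 would correspond to `deleted_ms=70.0` with the same 100 msec period, if the period is kept fixed; the slides do not state the period used for each condition.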
Phonemic restoration
   interrupted speech and noise maskers combined
       the fence effect
       continuity of speech signal perceived
       may help human listening
       does not affect machine performance
   expected to enlarge the gap
       performance difference between human and machine
   [Audio sample: deletion + phonemic restoration]
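Combining interrupted speech with noise maskers can be sketched by writing noise into the deleted sections instead of silence. The noise type (uniform white noise), its amplitude, the seed, and the position of the gap within each period are assumptions made for this illustration; the slides do not specify them.

```python
import random


def fill_gaps_with_noise(samples, sample_rate, period_ms=100.0,
                         deleted_ms=30.0, noise_amplitude=0.5, seed=0):
    """Replace the last `deleted_ms` of every period with uniform noise.

    Assumptions: uniform white noise in [-noise_amplitude, noise_amplitude]
    and a gap placed at the end of each period.
    """
    if period_ms <= 0 or not 0.0 <= deleted_ms <= period_ms:
        raise ValueError("need 0 <= deleted_ms <= period_ms and period_ms > 0")
    rng = random.Random(seed)  # seeded so the output is reproducible
    period = round(sample_rate * period_ms / 1000.0)
    kept = round(sample_rate * (period_ms - deleted_ms) / 1000.0)
    if period <= 0:
        raise ValueError("period is shorter than one sample")
    out = []
    for i, x in enumerate(samples):
        if i % period < kept:
            out.append(x)  # exposed speech is passed through unchanged
        else:
            out.append(rng.uniform(-noise_amplitude, noise_amplitude))
    return out


masked = fill_gaps_with_noise([1.0] * 1600, 16000)
```

The exposed sections are identical to those of plain deletion, so the amount of speech information provided does not change; only the content of the gaps differs.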
NASA-TLX evaluation
   mental workload
       rating 6 subscales
            Mental, Physical, and Temporal Demands,
             Frustration, Effort, and Performance
       range: 0-100
   weights of subscales (6 to 1)
       set for each participant
       by ranking how strongly the 6 dimensions
        are related to the participant's
        personal definition of workload
   weighted workload (WWL)
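A weighted workload score from ranked subscales could be computed as below. This is a sketch under assumptions: the slide only states weights "6 to 1" from an ordering, so the mapping of rank to weight, the normalization by the weight sum (21), and all names are illustrative. The original NASA-TLX procedure derives weights from pairwise comparisons instead.

```python
SUBSCALES = ("mental", "physical", "temporal",
             "performance", "effort", "frustration")


def weighted_workload(ratings, ranking):
    """Weighted average of six 0-100 ratings.

    ratings: dict mapping each subscale name to a rating in 0..100.
    ranking: list of the six subscale names, most relevant first.
    Assumption: the top-ranked subscale gets weight 6, the last gets 1,
    and the result is divided by the weight sum so it stays in 0..100.
    """
    if sorted(ranking) != sorted(SUBSCALES):
        raise ValueError("ranking must list each subscale exactly once")
    weights = {name: len(ranking) - pos for pos, name in enumerate(ranking)}
    total = sum(weights.values())  # 6 + 5 + 4 + 3 + 2 + 1 = 21
    return sum(weights[name] * ratings[name] for name in SUBSCALES) / total


example_ratings = {name: 50 for name in SUBSCALES}  # placeholder values
example_wwl = weighted_workload(example_ratings, list(SUBSCALES))
```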


Deletion vs Mixing (Exp1)
   objective: compare intelligibility and mental workload
       Deletion-Based Method (DBM)
       Mixing-Based Method (MBM)
            effect of SNR (signal-to-noise ratio) in MBM
   human intelligibility test
       75 utterances: 3-, 4-, and 5-digit numbers (3 x 25)
            Japanese recorded speech
       subjects: 15 (5 x 3) undergraduate students
       mental workload (WWL) by NASA-TLX
            normalized within each subject
            so that the mean and SD become 50 and 10, respectively
   automatic speech recognition using HMM
       task: numbers (1-7 digits) in Japanese
       training: 8440 utterances, 18 states, 20 mixtures
       evaluation: 1001 utterances, sentence recognition
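The within-subject normalization of WWL scores (mean 50, SD 10) can be sketched as a linear rescaling. The function name and the use of the population standard deviation are assumptions; the slides do not say which SD estimator was used.

```python
from statistics import mean, pstdev


def normalize_scores(scores, target_mean=50.0, target_sd=10.0):
    """Rescale one subject's scores to the target mean and SD.

    Assumption: population SD (pstdev); with sample SD the rescaled
    values would differ slightly.
    """
    m = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        # All scores identical: every value maps to the target mean.
        return [target_mean for _ in scores]
    return [target_mean + target_sd * (s - m) / sd for s in scores]


normalized = normalize_scores([20.0, 40.0, 60.0])  # placeholder scores
```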
Setup (Exp1)
   compare DBM and MBM within each subject
        acoustic presentation: via headphones
             at the subject's preferred reference loudness level
   MBM disturbing signals
        utterances of Japanese sentences
         fragmented into short segments, shuffled and combined
   Group   Trial 1: D30   Trial 2: M0, Mm10, Mm20
   G1      DBM 30%        MBM SNR 0dB
   G2      DBM 30%        MBM SNR -10dB
   G3      DBM 30%        MBM SNR -20dB
   [Figure: MBM (Exp1), sentence recognition using HTK (%) for M0, Mm10, Mm20]
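Building a disturbing signal from shuffled fragments and mixing it with the target at a given SNR can be sketched as follows. The function names, the fragment length, the seed, and the power-based definition of SNR over the whole utterance are assumptions for illustration; the slides do not describe how the SNR was measured.

```python
import math
import random


def shuffle_fragments(samples, fragment_len, seed=0):
    """Cut a signal into fixed-length fragments, shuffle, and rejoin them."""
    fragments = [samples[i:i + fragment_len]
                 for i in range(0, len(samples), fragment_len)]
    random.Random(seed).shuffle(fragments)
    return [v for fragment in fragments for v in fragment]


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain applied to `noise` so that speech/noise power equals snr_db."""
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    return math.sqrt(mean_power(speech) / (p_noise * 10 ** (snr_db / 10)))


def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to speech; noise is truncated to the speech length."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0] * 100             # dummy target signal
other = [0.5, 0.5, -0.5, -0.5] * 50    # dummy "other utterance"
masker = shuffle_fragments(other, fragment_len=20)
mm10 = mix_at_snr(speech, masker, snr_db=-10.0)  # corresponds to Mm10
```

At negative SNR values such as -10 dB or -20 dB the masker carries more power than the target speech.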
Performance (Exp1)
      DBM (T1): marginally significant (p<0.1) (G1>G2)
      DBM 30% task is harder than MBM 0dB, -10dB, -20dB
      MBM (T2): effect of SNR conditions is significant, however,
      only between 0dB & -10dB (p<0.05) (G1>G2)
      [Figure: per-subject performance (axis 30 to 100) in T1 and T2 for subjects s101 to s305, in three panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]
Workload difference (Exp1)
   WWL: individual difference cancelled
        subtraction of DBM (D30) score
         from MBM (M0, Mm10 and Mm20) score was performed
   [Figure: per-subject WWL difference (axis -60 to 20) for subjects s101 to s305, in three panels: DBM 30% vs MBM 0dB, DBM 30% vs MBM -10dB, DBM 30% vs MBM -20dB]
   [Figure: average WWL difference: M0-D30 (16.2), Mm10-D30 0.7, Mm20-D30 1.0]
   WWL: MBM 0dB < DBM 30% ?
        no significance (ANOVA)
   MBM: task difficulty is not easy to control
DBM exposed ratio (Exp2)
   DBM: Exposed Ratio can control the gap size
   [Figure: Human Ave. (%) and Machine (%) at DBM 30%, 50%, 70% (axis 30 to 100); Workload at DBM 30%, 50%, 70% (axis 30 to 70)]
   Significant difference (p<0.05)
   DBM 30%
        gap is very large, however,
        workload is very high.
Discussion
   D30 (DBM) & Mm10 (MBM) can be the benchmarks
       for the purpose of comparison between MBM and DBM
            performance differences are close (43.7pt & 44.8pt)
            WWL are also very close (WMm10 - WD30 = 0.7)
   [Figure: performance difference between human and machine (pt), axis 0 to 80, for M0, Mm10, Mm20, D70, D50, D30]
Conclusion
   audio CAPTCHA task using phonemic restoration
       deletion-based method (DBM)
   evaluation of CAPTCHA task
       performance + mental workload (NASA-TLX)
   comparison between DBM and MBM
       DBM: easier to control the task
   future work
       improve the noise
       investigation of phonemic restoration
            does it really improve performance, or only decrease workload?
       word familiarity, speech rate, synthesized speech, ...
