How can you get a 100 per cent success rate when speech recognition itself isn't 100 per cent successful?
Audio tweaked just 0.1% to fool speech recognition engines
The development of AI adversaries continues apace: a paper by Nicholas Carlini and David Wagner of the University of California, Berkeley has shown off a technique to trick speech recognition by changing the source waveform by just 0.1 per cent. The pair wrote at arXiv that their attack achieved a first: not merely an attack …
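To make the scale of that tweak concrete, here is a toy numpy sketch. It is not the authors' actual attack (their perturbation is crafted by an optimiser against the recogniser's loss function); it only shows what "0.1 per cent of the waveform" means in amplitude terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# One second of a plain 440 Hz tone at 16 kHz, standing in for speech.
t = np.linspace(0.0, 1.0, 16000)
signal = np.sin(2 * np.pi * 440 * t)

# A real adversarial delta would come from an optimiser; here we just
# scale random noise so its peak is 0.1% of the signal's peak amplitude.
delta = rng.standard_normal(signal.size)
delta *= 0.001 * np.max(np.abs(signal)) / np.max(np.abs(delta))

adversarial = signal + delta
distortion = np.max(np.abs(delta)) / np.max(np.abs(signal))
print(f"relative distortion: {distortion:.4%}")
```

A perturbation that small is essentially inaudible to a human listener, which is the whole point of the attack.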
COMMENTS
-
-
Thursday 11th January 2018 07:36 GMT bazza
"Think “Alexa, stream smut to the TV” when your friend only hears you say “What's the weather, Alexa?”"
Judging by some of the clips on YouTube, Alexa is perfectly capable of doing that already...
This kind of thing should act as a real warning to anyone planning an automated call centre: it means fraud is a real risk. If "Tell me my balance" can be tweaked into being interpreted as "transfer the funds", followed by a "No" being tweaked into a "Yes", a bank could get into deep trouble. Any playback in a court case would show that the punter had said one thing and that the bank interpreted it all wrongly...
Generally speaking, at least here in the UK / Europe, it'd be interesting to see if a recording of someone's voice (as made by a voice recognition system) counted as a personal data record. If so then a failure to process it accurately (and to the detriment of the customer) would be a Data Protection Act problem. £5000 fine.
-
Thursday 11th January 2018 11:34 GMT Anonymous Coward
"Generally speaking, at least here in the UK / Europe, it'd be interesting to see if a recording of someone's voice (as made by a voice recognition system) counted as a personal data record."
HMRC are using voice recognition as a taxpayer's ID to get through to their help desks. It appears to use one set phrase.
-
-
Friday 12th January 2018 15:44 GMT Alan Brown
Re: So, it's crap?
'If a system can be "fooled" in this way, it's probably not doing it right.'
Humans can be trivially fooled into mishearing things
(Excuse me while I kiss this guy, We built this city on sausage rolls, We're calling a trout, Le freak c'est sheep, I'll never leave your pizza burning)
They can also be relatively easily fooled into misidentifying the speaker.
-
Thursday 11th January 2018 08:15 GMT Nick Kew
Just like human senses
Remind me: was that dress blue and black, or gold and white, or ... ? I'm sure someone can remember the story of the dress that hit the headlines when it fooled the human eye.
And when I worked in speech recognition research, we could easily confuse a speech recogniser for a digger, because the latter could wreck a nice beach (say it a couple of times if you don't get it).
-
Thursday 11th January 2018 09:29 GMT Anonymous Coward
Re: Just like human senses
It was both. I managed to see either dress, depending on the conditions when I looked at the picture. On a few occasions I hit the sweet spot where it was just on the point of changing.
I had a copy pinned up for a while because the ongoing optical illusion delighted me that much.
-
Thursday 11th January 2018 12:43 GMT handleoclast
Re: Just like human senses
@Anonymous Coward
because the ongoing optical illusion delighted me that much.
If you liked that, you'll love this. If you're looking at it on a phone, it's not using the front-facing camera to track your eyes; it works on a desktop setup with no cam. Doesn't work in a printout, though. :)
-
-
Thursday 11th January 2018 12:40 GMT Nick Kew
Re: Just like human senses
Following up to myself (sorry).
Just heard Rutherford & Fry ont'wireless discussing human vs machine perception. Specifically, facial recognition.
They made a crucial distinction. Humans (and sheep) are very good at recognising faces we know, but very bad at recognising strangers. The latter has led to criminal convictions on eye-witness evidence that have subsequently been proven entirely wrong. Machines can of course be fooled too, as studies like the one in this article demonstrate.
I reckon that means the real human/machine distinctions come from secondary influences. Like suggestibility and prejudices in humans, or tampering in machines.
-
Thursday 11th January 2018 14:54 GMT Charles 9
Re: Just like human senses
"They made a crucial distinction. Humans (and sheep) are very good at recognising faces we know, but very bad at recognising strangers."
We also lose the ability to recognize even faces we know if enough cues disappear. A famous case around the early '90s pretty much shot eyewitness testimony all to hell by showing that a sufficiently covered (e.g. beard and glasses) celebrity face was mistaken by nearly everyone for the defendant.
-
-
-
Thursday 11th January 2018 09:22 GMT foxyshadis
El Reg is showing a pattern here
While this is a major step up from the last two "machine learning fail" studies The Register has breathlessly reported on -- at least this time it's not just testing some crap created from scratch by the researchers themselves -- they chose DeepSpeech, of all the speech-to-text algorithms, widely considered so bad that this might be the first study to actually bother testing it. It's no surprise that it fails so badly. Even if they have to confine themselves to open source (which makes no sense in this case, since they neither analyze the algorithms nor modify the code), CMU Sphinx and Kaldi are the gold standards.
No one cares how DeepSpeech fails; it's widely regarded as a failure, so testing it is a waste of time. Wait until it has had another year or two to mature before it's worth testing.
-
Thursday 11th January 2018 10:09 GMT Anonymous Coward
Potential for the future
Although there are obvious flaws and issues in this case, it shows that there is potential for people who want to communicate in different languages. I think there are already some earpieces that do a similar thing, and with added AI they could only improve on the accuracy and range of things you can say.
-
Thursday 11th January 2018 10:30 GMT Joerg
This happens because AI algorithms are a joke...
This happens because AI algorithms are a joke... yep, that is the truth. None of the AI algorithms currently in use are really AI at all. They are just very complex (aka messed up) combinations of various conditions, with tweaks and hacks to make them look like an AI taking its own decisions. They really are no different from thousands of very simple nested if-then-else conditions. All the neural network stuff looks shiny and cool theory-wise, but it is not as advanced as the marketing wants people to believe.
-
Thursday 11th January 2018 11:48 GMT Matthew Taylor
He who laughs last...
To everyone who is chuckling that "it just goes to show, AI is useless after all", consider this. One of the problems with training these systems is a lack of good training data, so this "attack" is a boon to AI researchers. They just need to add lots of adversarially hobbled speech samples to their network's training set, and it will learn to classify speech much more robustly.
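The commenter's idea amounts to adversarial data augmentation, which can be sketched in a few lines of Python. The `perturb` helper here is a hypothetical stand-in for a real attack; the key point is that the perturbed copies keep their original, correct labels:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturb(waveform, eps=0.001):
    # Hypothetical stand-in for a real adversarial attack:
    # random noise scaled to eps times the signal's peak amplitude.
    noise = rng.standard_normal(waveform.size)
    return waveform + eps * np.max(np.abs(waveform)) * noise / np.max(np.abs(noise))

# Toy "dataset": three waveforms paired with their correct transcriptions.
clean_set = [(np.sin(np.linspace(0, 20, 1000)) * (i + 1), f"label_{i}")
             for i in range(3)]

# Augment: each perturbed copy keeps the ORIGINAL label, so training
# teaches the network that the tweaked audio still means the same thing.
augmented = clean_set + [(perturb(x), y) for x, y in clean_set]
```

In a real pipeline the perturbations would be regenerated each epoch against the current model, but the labelling principle is the same.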
-
Thursday 11th January 2018 13:25 GMT Alistair
Re: He who laughs last...
Umm:
They just need to add lots of adversarially hobbled speech samples to their network's training set,
I'd suggest that they start collecting recordings from drive through restaurant communications systems.
If they get that working, well, then they can *definitely* claim a victory.
-
-
Thursday 11th January 2018 12:25 GMT mark l 2
I wonder whether this technique will work with audio other than voice. If so, it could also be used to get around YouTube's annoying song recognition, where your videos have to have parts of the audio replaced or muted, or you have to allow them to be monetised by the song's copyright owner, because there happened to be a bit of music playing in the background when you made your video.
I can understand them not wanting people to make money off others' work, but sometimes you can't turn down the music if what you're filming is occurring in real time and is not a repeatable event.
-
Thursday 11th January 2018 12:33 GMT Packet
I am not surprised by this.
What disturbs me is how many banks have shifted to voice print identification for their telephone systems.
And unlike, say, a consumer system like Siri, the telephone banking system can't be trained to recognize your voice (or maybe it can?). I use Siri as an example of something that learns your voice the more you use it.
I get the motivation though - every CTO, etc is looking for the next great cure-all for all the security shite of IT, and so they jump onto the newest flavour of the month.
Telephone banking is definitely insecure when it relies on just a PIN of 3-5 digits, so along comes voice recognition - and it sounds very Star Trek, too.
But in the meantime they ignore their web/app banking security... (I know of a bank that, for the longest time, did not differentiate between upper-case and lower-case characters in passwords...)
-
Friday 12th January 2018 15:38 GMT Alan Brown
Speech recognition
Try "Mmm, yes. Special we are" played backwards.
As far as the steganography in audio is concerned:
This kind of thing is due to the listener having far too high a dynamic range as well as too wide a bandwidth. Injecting synthetic masking noise would nobble the hidden speech detection (and also kill off false Google/Siri/Cortana/Alexa hits), and filtering to 300-3000Hz (actually 300-1500Hz is all you'd need) would probably improve accuracy.
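A minimal sketch of the bandpass idea, assuming a crude FFT brick-wall filter rather than the proper IIR/FIR design a real telephony channel would use:

```python
import numpy as np

def bandpass(signal, rate, lo=300.0, hi=3000.0):
    # Brick-wall filter: zero every FFT bin outside [lo, hi] Hz.
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / rate)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

rate = 16000
t = np.arange(rate) / rate
# A 1 kHz tone (inside the telephony band) plus a 5 kHz tone (outside it).
mixed = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 5000 * t)
filtered = bandpass(mixed, rate)
```

After filtering, the 5 kHz component is gone and the 1 kHz component survives, which is exactly the channel a landline would impose on any hidden wideband payload.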
You only need 12dB of dynamic range to handle intelligible speech (that's why 12dB SINAD used to be the squelch point in land mobile systems). Old-style telephony LD circuits used to have only about 40dB of dynamic range.
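For reference, converting those dB figures into plain ratios (standard decibel arithmetic, nothing specific to this thread):

```python
def db_to_power_ratio(db):
    # Decibels express a power ratio: dB = 10 * log10(P1 / P0).
    return 10 ** (db / 10)

def db_to_amplitude_ratio(db):
    # For amplitudes (voltage, sample values): dB = 20 * log10(A1 / A0).
    return 10 ** (db / 20)

print(db_to_power_ratio(12))      # ~16x power: the speech floor above
print(db_to_amplitude_ratio(12))  # ~4x amplitude
print(db_to_power_ratio(40))      # 10,000x power: old LD circuits
```

So a 12dB window is a span of only about 4x in amplitude, which is why a channel that narrow leaves so little room to hide an inaudible second signal.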