WebRTC calls transmit voice, video, and data peer-to-peer over the Internet.
But since this is an automated test, of course we could not have someone speak into a microphone every time the test runs; we had to feed in a known input file so that we had something to compare the recorded output audio against.
Could we duct-tape a small stereo to the mic and play our audio file on the stereo? That's not very maintainable or reliable, not to mention annoying for anyone in the vicinity. What about some kind of fake device driver that makes a microphone-like device appear at the device level? The problem with that is that it's hard to control a driver from the userspace test program. Also, the test would be more complex and flaky, and the driver interaction would not be portable.[1]
Instead, we chose to sidestep this problem. We used a solution where we load an audio file with WebAudio and play it straight into the peer connection through the WebAudio-PeerConnection integration. That way we start playback of the file from the same renderer process as the call itself, which makes it a lot easier to time the start and end of the file. We still needed to be careful not to start playing the file too early or too late, so we don't clip the audio at the start or end - that would destroy our PESQ scores! - but it turned out to be a workable approach.[2]
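To make this more concrete, here's a minimal sketch (in TypeScript, for the renderer side) of what such a WebAudio-to-PeerConnection hookup can look like. The file URL and function name are placeholders for illustration; the real test's code is different.

```typescript
// Hypothetical sketch: feed a known audio file into an RTCPeerConnection via
// WebAudio. The URL and function name are placeholders, not the real test code.
async function playReferenceFileIntoCall(pc: RTCPeerConnection): Promise<void> {
  const context = new AudioContext();

  // Fetch and decode the reference file into an AudioBuffer.
  const response = await fetch('reference_audio.wav');  // placeholder path
  const audioBuffer = await context.decodeAudioData(await response.arrayBuffer());

  // Route the decoded audio into a MediaStream rather than to the speakers.
  const destination = context.createMediaStreamDestination();
  const source = context.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(destination);

  // Add the resulting audio track to the call, as if it were a microphone.
  const [track] = destination.stream.getAudioTracks();
  pc.addTrack(track, destination.stream);

  // Since this runs in the same renderer as the call, we control exactly
  // when playback starts relative to call setup.
  source.start();
}
```

The nice property here is that the file enters the pipeline at the same point a microphone track would, without touching any audio hardware on the way in.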
Recording the Output
Alright, so now we could get a WebRTC call set up with a known audio file and decent control over when the file starts playing. Next we had to record the output. There are a number of possible solutions. The most end-to-end way is to straight up record what the system sends to the default audio out (like speakers or headphones). Alternatively, we could write a hook in our application to dump the audio as late as possible, like when we're just about to send it to the sound card.
We went with the former. Our colleagues on the Chrome video stack team in Kirkland had already found that it's possible to configure a Windows or Linux machine to send the system's audio output (i.e. what plays on the speakers) to a virtual recording device. If we make that virtual recording device the default one, simply invoking SoundRecorder.exe or arecord, respectively, will record what the system is playing out.
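To illustrate the Linux half of that (just a sketch with made-up file names, not the actual harness), the recording step boils down to an arecord invocation along these lines, here wrapped in a small Node-style TypeScript helper:

```typescript
// Illustrative sketch: record the system's audio output on Linux, assuming the
// default capture device has been pointed at a loopback/"monitor" device.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Record `seconds` of 48 kHz, 16-bit, stereo audio into `outputWav`.
async function recordSystemOutput(outputWav: string, seconds: number): Promise<void> {
  await run('arecord', [
    '-f', 'S16_LE',        // 16-bit little-endian samples
    '-r', '48000',         // 48 kHz sample rate
    '-c', '2',             // stereo
    '-d', String(seconds), // stop after a fixed duration
    outputWav,
  ]);
}
```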
They found this works well if one also uses the sox utility to eliminate the silence around the actual audio content (recall that we had some safety margins at both ends to ensure we record the whole input file as it plays through the WebRTC call). We adopted the same approach, since it records what the user would hear, and yet uses only standard tools. This means we don't have to install additional software on the myriad machines that will run this test.[3]
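The silence removal can be done with sox's silence effect; a common idiom is to trim the front, reverse the file, trim again, and reverse back. Here's a rough sketch; the threshold values are illustrative, not the ones the real test uses:

```typescript
// Illustrative sketch: strip leading and trailing silence from a recording with
// sox. The 0.1 s / 1% threshold values are examples, not the real test's settings.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

async function removeSilence(inputWav: string, outputWav: string): Promise<void> {
  await run('sox', [
    inputWav, outputWav,
    'silence', '1', '0.1', '1%',  // trim silence from the front,
    'reverse',                    // flip the file,
    'silence', '1', '0.1', '1%',  // trim the (now leading) trailing silence,
    'reverse',                    // and flip it back.
  ]);
}
```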
Analyzing Audio
The only remaining step was to compare the silence-eliminated recording with the input file. When we first did this, we got a really bad score (like 2.0 out of 5.0, which means PESQ thinks the audio is barely intelligible). This didn't seem to make sense, since both the input and the recording sounded very similar. It turns out we hadn't thought about the following:
We were comparing a full-band (24 kHz) input file to a wide-band (8 kHz) result (although both files were sampled at 48 kHz). This essentially amounted to a low-pass filtering of the result file.
Both files were in stereo, but PESQ is only mono-aware.
The files were 32-bit, but the PESQ implementation is designed for 16 bits.
As you can see, it's important to pay attention to what format arecord and SoundRecorder.exe record in, and to make sure the input file is recorded in the same way. After correcting the input file and "rebasing", we got the score up to about 4.0.[4]
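For illustration, here is one way the format normalization and the PESQ comparison could be wired up, assuming the stock ITU PESQ reference binary (which only accepts mono, 16-bit input at 8 or 16 kHz). The paths and the 'pesq' binary name are assumptions, and the real test's pipeline may well differ:

```typescript
// Illustrative sketch: normalize both files to mono, 16-bit, 16 kHz and run the
// ITU PESQ reference binary on them. Paths and the 'pesq' binary name are assumed.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Downmix/convert a wav file into the format the PESQ tool expects.
async function toPesqFormat(inputWav: string, outputWav: string): Promise<void> {
  await run('sox', [inputWav, '-c', '1', '-b', '16', '-r', '16000', outputWav]);
}

// Compare the reference against the degraded (recorded) file; +wb selects the
// P.862.2 wide-band mode in recent builds of the reference tool.
async function runPesq(referenceWav: string, degradedWav: string): Promise<string> {
  const { stdout } = await run('pesq', ['+16000', '+wb', referenceWav, degradedWav]);
  return stdout;  // the scores are printed on the tool's last output line
}
```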
Thus, we ended up with an automated test that runs continuously on the torrent of Chrome change lists and protects WebRTC's ability to transmit sound. You can see the finished code here. With automated tests and cleverly chosen metrics you can protect against most regressions a user would notice. If your product includes video and audio handling, such a test is a great addition to your testing mix.
How the components of the test fit together.
Future work
It might be possible to write a Chrome extension that dumps the audio from Chrome to a file. That way we would get a simpler-to-maintain and portable solution. It would be less end-to-end, but more than worth it due to the simplified maintenance and setup. Also, the recording tools we use are not perfect and add some distortion, which makes the score less accurate.
There are algorithms other than PESQ to consider - for instance, POLQA, the successor to PESQ, is better at analyzing high-bandwidth audio signals.
We are working on a solution that will run this test under simulated network conditions. Combining simulated networks with this test is a really powerful way to test our behavior under various packet loss and delay scenarios and to ensure we deliver a good experience to all our users, not just those with great broadband connections. Stay tuned for future articles on that topic!
We would also like to investigate the feasibility of running this setup on mobile devices.
1
It would be tolerable if the driver were just looping the input file, eliminating the need for the test to control the driver (i.e. the test doesn't have to tell the driver to start playing the file). This is actually what we do in the video quality test. It's a much better fit to take this approach on the video side, since each recorded video frame is independent of the others. We can easily embed barcodes into each frame and evaluate them independently.
This seems much harder for audio. We could possibly do audio watermarking, or we could embed a kind of start marker (for instance, using DTMF tones) in the first two seconds of the input file and play the real content after that, and then do some fancy audio processing on the receiving end to figure out the start and end of the input audio. We chose not to pursue this approach due to its complexity.
2
Unfortunately, this also means we will not test the capturer path (which handles microphones, etc. in WebRTC). This is an example of the frequent tradeoffs one has to make when designing an end-to-end test. Often we have to trade end-to-endness (how close the test is to the user experience) against robustness and simplicity of the test. It's not worth it to cover 5% more of the code if the test becomes unreliable or radically more expensive to maintain. Another example: a WebRTC call will generally involve two peers on different devices separated by the real-world internet. Writing such a test and making it reliable would be extremely difficult, so we make the test single-machine and hope we catch most of the bugs anyway.
3
It's important to keep the continuous build setup simple and the build machines easy to configure - otherwise you will inevitably pay a heavy price in maintenance when you try to scale your testing up.
4
When sending audio over the internet, we have to compress it, since lossless audio consumes way too much bandwidth. WebRTC audio generally sounds great, but there are still compression artifacts if you listen closely (and, in fact, the recording tools are not perfect and add some distortion as well). Given that this test is more about detecting regressions than measuring some absolute notion of quality, we'd like to downplay those artifacts. As our Kirkland colleagues found, one way to do that is to "rebase" the input file. That means we start with a pristine recording, feed it through the WebRTC call, and record what comes out on the other end. After manually verifying the quality, we use that recording as our input file for the actual test. In our case, this pushed our PESQ score up from 3 to about 4 (out of 5), which gives us a bit more sensitivity to regressions.