Google Testing Blog: 2015

All presentation videos and slides are posted on the Video Recordings and Presentations pages. All videos have professionally transcribed closed captions, and the YouTube descriptions have the slides links. Enjoy and share!

We had over 1,300 applicants, more than 200 of whom applied to speak. Over 250 people filled our venue to capacity, and the live stream peaked at about 400 concurrent viewers, with about 3,300 total viewing hours.

Our goal in hosting GTAC is to make the conference highly relevant and useful for both attendees and the larger test engineering community as a whole. Our post-conference survey shows that we are close to achieving that goal; thanks to everyone who completed the feedback survey!

Our 82 survey respondents were mostly (81%) test-focused professionals, with experience ranging from 1 to 40 years.
Another 76% of respondents rated the conference as a whole as above average, with marked satisfaction for the venue, the food (those Diwali treats!), and the breadth and coverage of the talks themselves.

The top five most popular talks were:

The Uber Challenge of Cross-Application/Cross-Device Testing (Apple Chow and Bian Jiang)
Your Tests Aren't Flaky (Alister Scott)
Statistical Data Sampling (Celal Ziftci and Ben Greenberg)
Coverage is Not Strongly Correlated with Test Suite Effectiveness (Laura Inozemtseva)
Chrome OS Test Automation Lab (Simran Basi and Chris Sosa).

Our social events also proved to be crowd pleasers. The social events were a direct response to feedback from GTAC 2014 for organized opportunities for socialization among the GTAC attendees.

This isn’t to say there isn’t room for improvement. Of our respondents, 11% expressed frustration with event communications and provided some long, thoughtful suggestions for what we could do to improve next year. Also, many of the long-form comments asked for a better mix of technologies, noting that mobile had a big presence in the talks this year.

If you have any suggestions on how we can improve, please comment on this post, or better yet – fill out the survey, which remains open. Based on feedback from last year urging more transparency in speaker selection, we included an individual outside of Google in the speaker evaluation. Feedback is precious, we take it very seriously, and we will use it to improve next time around.

Thank you to all the speakers, attendees, and online viewers who made this a special event once again. To receive announcements about the next GTAC, currently planned for early 2017, subscribe to the Google Testing Blog.

by Michael Klepikov and Lesley Katzen on behalf of the GTAC Committee


Figure 1. How Auto Gain Control works [code here].

This is an example of automatic control engineering (another example would be the classic PID controller) and happens in real time. Therefore, if you move closer to the mic while speaking, the AGC will notice the output stream is too loud, and reduce mic volume and/or digital gain. When you move further away, it tries to adapt up again. The fancy voice activity detector is there so we only amplify speech, and not, say, the microwave oven your spouse just started in the other room.
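
To make the feedback loop concrete, here is a minimal sketch of a digital gain controller. It is not the WebRTC AGC: the class name `SimpleAgc`, the fixed step size and the target level are all assumptions made for illustration, the mic volume is ignored, and there is no voice activity detector, so it would happily amplify the microwave oven too.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical, heavily simplified digital gain controller. This only
// illustrates the feedback loop; it is not the WebRTC implementation.
class SimpleAgc {
 public:
  SimpleAgc(double target_dbfs, double max_gain_db, double step_db)
      : target_dbfs_(target_dbfs),
        max_gain_db_(max_gain_db),
        step_db_(step_db) {
    assert(step_db > 0.0);
  }

  // Applies the current gain to one frame in place, measures the resulting
  // output level, and nudges the gain one step toward the target level.
  void ProcessFrame(std::vector<float>* frame) {
    const float linear_gain =
        static_cast<float>(std::pow(10.0, gain_db_ / 20.0));
    double sum_squares = 0.0;
    for (float& sample : *frame) {
      // Amplify and clip to the valid [-1, 1] sample range.
      sample = std::max(-1.0f, std::min(1.0f, sample * linear_gain));
      sum_squares += static_cast<double>(sample) * sample;
    }
    if (sum_squares == 0.0)
      return;  // Silence or an empty frame: nothing to measure.

    const double output_dbfs =
        10.0 * std::log10(sum_squares / static_cast<double>(frame->size()));
    if (output_dbfs < target_dbfs_) {
      gain_db_ = std::min(gain_db_ + step_db_, max_gain_db_);  // Too quiet.
    } else if (output_dbfs > target_dbfs_) {
      gain_db_ = std::max(gain_db_ - step_db_, 0.0);  // Too loud.
    }
  }

  double gain_db() const { return gain_db_; }

 private:
  const double target_dbfs_;
  const double max_gain_db_;
  const double step_db_;
  double gain_db_ = 0.0;  // This sketch only ever adds gain (0 dB and up).
};
```

Feeding it a run of quiet frames makes the gain climb until it reaches `max_gain_db`; feeding it loud frames afterwards makes the gain fall back toward 0 dB, which mirrors the "move closer, move away" behavior described above.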

Testing the AGC

Now, how do we make sure the AGC works? The first thing is obviously to write unit tests and integration tests. You didn’t think about building that end-to-end test first, did you? Once we have the lower-level tests in place, we can start looking at a bigger test. While developing the WebRTC implementation in Chrome, we had several bugs where the AGC code was working by itself, but was misconfigured in Chrome. In one case, it was simply turned off for all users. In another, it was only turned off in Hangouts.

Only an end-to-end test can catch these integration issues, and we already had stable, low-maintenance audio quality tests with the ability to record Chrome’s output sound for analysis. I encourage you to read that article, but the bottom line is that those tests can run a WebRTC call in two tabs and record the audio output to a file. Those tests run the PESQ algorithm on input and output to see how similar they are.

That’s a good framework to have, but I needed to make two changes:

Add file support to Chrome’s fake audio input device, so we can play a known file. The original audio test avoided this by using WebAudio, but AGC doesn’t run in the WebAudio path, just the microphone capture path, so that won’t work.
Instead of running PESQ, run an analysis that compares the gain between input and output.

Adding Fake File Support

This is always a big part of the work in media testing: controlling the input and output. It’s unworkable to tape microphones to loudspeakers or point cameras to screens to capture the media, so the easiest solution is usually to add a debug flag. It is exactly what I did here. It was a lot of work, but I won’t go into much detail since Chrome’s audio pipeline is complex. The core is this:

int FileSource::OnMoreData(AudioBus* audio_bus, uint32 total_bytes_delay) {
  // Load the file if we haven't already. This load needs to happen on the
  // audio thread, otherwise we'll run on the UI thread on Mac for instance.
  // This will massively delay the first OnMoreData, but we'll catch up.
  if (!wav_audio_handler_)
    LoadWavFile(path_to_wav_file_);
  if (load_failed_)
    return 0;

  DCHECK(wav_audio_handler_.get());

  // Stop playing if we've played out the whole file.
  if (wav_audio_handler_->AtEnd(wav_file_read_pos_))
    return 0;

  // This pulls data from ProvideInput.
  file_audio_converter_->Convert(audio_bus);
  return audio_bus->frames();
}

This code runs every 10 ms and reads a small chunk from the file, converts it to Chrome’s preferred audio format and sends it on through the audio pipeline. After implementing this, I could simply run:

chrome --use-fake-device-for-media-stream \
       --use-file-for-fake-audio-capture=/tmp/file.wav

and whenever I hit a webpage that used WebRTC, the above file would play instead of my microphone input. Sweet!

The Analysis Stage

Next I had to get the analysis stage figured out. It turned out there was something called an AudioPowerMonitor in the Chrome code, which you feed audio data into and get the average audio power for the data you fed in. This is a measure of how “loud” the audio is. Since the whole point of the AGC is getting to the right audio power level, we’re looking to compute

A_diff = A_out - A_in

Or, really, how much louder or weaker is the output compared to the input audio? Then we can construct different scenarios: A_diff should be 0 if the AGC is turned off and it should be > 0 dB if the AGC is on and we feed in a low power audio file. Computing the average energy of an audio file was straightforward to implement:

  // ...
  size_t bytes_written;
  wav_audio_handler->CopyTo(audio_bus.get(), 0, &bytes_written);
  CHECK_EQ(bytes_written, wav_audio_handler->data().size())
      << "Expected to write entire file into bus.";

  // Set the filter coefficient to the whole file's duration; this will make
  // the power monitor take the entire file into account.
  media::AudioPowerMonitor power_monitor(wav_audio_handler->sample_rate(),
                                         file_duration);
  power_monitor.Scan(*audio_bus, audio_bus->frames());
  // ...
  return power_monitor.ReadCurrentPowerAndClip().first;
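
Outside of Chrome, the same measurement can be approximated without AudioPowerMonitor. The sketch below assumes samples are floats in the range [-1, 1] and uses a plain mean-square over the whole buffer instead of the power monitor's filter; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Average power of the samples in dBFS, where 0 dBFS corresponds to a
// mean-square value of 1.0. Returns -infinity for silence.
double AveragePowerDbfs(const std::vector<float>& samples) {
  assert(!samples.empty());
  double sum_squares = 0.0;
  for (float s : samples)
    sum_squares += static_cast<double>(s) * s;
  const double mean_square = sum_squares / static_cast<double>(samples.size());
  if (mean_square == 0.0)
    return -std::numeric_limits<double>::infinity();
  return 10.0 * std::log10(mean_square);
}

// A_diff = A_out - A_in, in dB. Positive means the output is louder.
double GainDifferenceDb(const std::vector<float>& input,
                        const std::vector<float>& output) {
  return AveragePowerDbfs(output) - AveragePowerDbfs(input);
}
```

Doubling the amplitude of every sample quadruples the power, so an output that is exactly twice as loud as the input gives an A_diff of 10·log10(4), roughly 6 dB.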

I wrote a new test, and hooked up the above logic instead of PESQ. I could compute A_in by running the above algorithm on the reference file (which I fed in using the flag I implemented above) and A_out on the recording of the output audio. At this point I pretty much thought I was done. I ran a WebRTC call with the AGC turned off, expecting to get zero… and got a huge number. Turns out I wasn’t done.

What Went Wrong?

I needed more debugging information to figure out what went wrong. Since the AGC was off, I would expect the power curves for output and input to be identical. All I had was the average audio power over the entire file, so I started plotting the audio power for each 10 millisecond segment instead to understand where the curves diverged. I could then plot the detected audio power over the time of the test. I started by plotting A_diff:

Figure 2. Plot of A_diff.

The difference is quite small in the beginning, but grows in amplitude over time. Interesting. I then plotted A_out and A_in next to each other:

Figure 3. Plot of A_out and A_in.

A-ha! The curves drift apart over time; the above shows about 10 seconds of time, and the drift is maybe 80 ms at the end. The more they drift apart, the bigger the diff becomes. Exasperated, I asked our audio engineers about the above. Had my fancy test found its first bug? No, as it turns out - it was by design.

Clock Drift and Packet Loss

Let me explain. As a part of WebRTC audio processing, we run a complex module called NetEq on the received audio stream. When sending audio over the Internet, there will inevitably be packet loss and clock drift. Packet losses always happen on the Internet, depending on the network path between sender and receiver. Clock drift happens because the sample clocks on the sending and receiving sound cards are not perfectly synced.

In this particular case, the problem was not packet loss since we have ideal network conditions (one machine, packets go over the machine’s loopback interface = zero packet loss). But how can we have clock drift? Well, recall the fake device I wrote earlier that reads a file? It never touches the sound card like when the sound comes from the mic, so it runs on the system clock. That clock will drift against the machine’s sound card clock, even when we are on the same machine.

NetEq uses clever algorithms to conceal clock drift and packet loss. Most commonly it applies time compression or stretching on the audio it plays out, which means it makes the audio a little shorter or longer when needed to compensate for the drift. We humans mostly don’t even notice that, whereas a drift left uncompensated would result in a depleted or flooded receiver buffer – very noticeable. Anyway, I digress. This drift of the recording vs. the reference file was natural and I would just have to deal with it.

Silence Splitting to the Rescue!

I could probably have solved this with math and postprocessing of the results (least squares maybe?), but I had another idea. The reference file happened to be comprised of five segments with small pauses between them. What if I made these pauses longer, split the files on the pauses and trimmed away all the silence? This would effectively align the start of each segment with its corresponding segment in the reference file.

Figure 4. Before silence splitting.

Figure 5. After silence splitting.

We would still have NetEq drift, but as you can see, its effects will not stack up towards the end, so if the segments are short enough we should be able to mitigate this problem.
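
The idea of splitting on silence can be sketched on an in-memory sample buffer. This is only an illustration under assumptions: the function name, the amplitude threshold and the minimum pause length are hypothetical, and the real test works on WAV files rather than vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Splits `samples` into non-silent segments. A run of at least
// `min_silence_len` consecutive samples with |x| <= threshold ends the
// current segment. Leading and trailing silence is trimmed from each
// segment; shorter pauses stay inside the segment.
std::vector<std::vector<float>> SplitOnSilence(
    const std::vector<float>& samples,
    float threshold,
    std::size_t min_silence_len) {
  assert(min_silence_len > 0);
  std::vector<std::vector<float>> segments;
  std::vector<float> current;
  std::size_t silent_run = 0;  // Consecutive silent samples seen so far.

  for (float s : samples) {
    if (std::fabs(s) > threshold) {
      silent_run = 0;
      current.push_back(s);
      continue;
    }
    ++silent_run;
    if (current.empty())
      continue;  // Still in the silence before a segment: skip it.
    current.push_back(s);  // Might just be a short pause in the segment.
    if (silent_run >= min_silence_len) {
      // Long pause: drop the trailing silence and close the segment.
      current.resize(current.size() - silent_run);
      segments.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) {
    // Close the last segment, trimming any short trailing silence.
    current.resize(current.size() - silent_run);
    segments.push_back(current);
  }
  return segments;
}
```

Running this on both the reference and the recording yields two lists of segments that can be compared pairwise, so any drift only accumulates within one segment rather than over the whole file.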

Result

Here is the final test implementation:

  base::FilePath reference_file = 
      test::GetReferenceFilesDir().Append(reference_filename);
  base::FilePath recording = CreateTemporaryWaveFile();

  ASSERT_NO_FATAL_FAILURE(SetupAndRecordAudioCall(
      reference_file, recording, constraints,
      base::TimeDelta::FromSeconds(30)));

  base::ScopedTempDir split_ref_files;
  ASSERT_TRUE(split_ref_files.CreateUniqueTempDir());
  ASSERT_NO_FATAL_FAILURE(
      SplitFileOnSilenceIntoDir(reference_file, split_ref_files.path()));
  std::vector<base::FilePath> ref_segments =
      ListWavFilesInDir(split_ref_files.path());

  base::ScopedTempDir split_actual_files;
  ASSERT_TRUE(split_actual_files.CreateUniqueTempDir());
  ASSERT_NO_FATAL_FAILURE(
      SplitFileOnSilenceIntoDir(recording, split_actual_files.path()));

  // Keep the recording and split files if the analysis fails.
  base::FilePath actual_files_dir = split_actual_files.Take();
  std::vector<base::FilePath> actual_segments =
      ListWavFilesInDir(actual_files_dir);

  AnalyzeSegmentsAndPrintResult(
      ref_segments, actual_segments, reference_file, perf_modifier);

  DeleteFileUnlessTestFailed(recording, false);
  DeleteFileUnlessTestFailed(actual_files_dir, true);

Where AnalyzeSegmentsAndPrintResult looks like this:

void AnalyzeSegmentsAndPrintResult(
    const std::vector<base::FilePath>& ref_segments,
    const std::vector<base::FilePath>& actual_segments,
    const base::FilePath& reference_file,
    const std::string& perf_modifier) {
  ASSERT_GT(ref_segments.size(), 0u)
      << "Failed to split reference file on silence; sox is likely broken.";
  ASSERT_EQ(ref_segments.size(), actual_segments.size())
      << "The recording did not result in the same number of audio segments "
      << "after on splitting on silence; WebRTC must have deformed the audio "
      << "too much.";

  for (size_t i = 0; i < ref_segments.size(); i++) {
    float difference_in_decibel = AnalyzeOneSegment(ref_segments[i],
                                                    actual_segments[i],
                                                    i);
    std::string trace_name = MakeTraceName(reference_file, i);
    perf_test::PrintResult("agc_energy_diff", perf_modifier, trace_name,
                           difference_in_decibel, "dB", false);
  }
}

The results look like this:

Figure 6. Average A_diff values for each segment on the y axis, Chromium revisions on the x axis.

We can clearly see the AGC applies about 6 dB of gain to the (relatively low-energy) audio file we feed in. The maximum amount of gain the digital AGC can apply is 12 dB, and 7 dB is the default, so in this case the AGC is pretty happy with the level of the input audio. If we run with the AGC turned off, we get the expected 0 dB of gain. The diff varies a bit per segment, since the segments are different in audio power.

Using this test, we can detect if the AGC accidentally gets turned off or malfunctions on Windows, Mac or Linux. If that happens, the with_agc graph will drop from ~6 dB to 0, and we’ll know something is up. The same goes if the amount of digital gain changes.

A more advanced version of this test would also look at the mic level the AGC sets. This mic level is currently ignored in the test, but it could take it into account by artificially amplifying the reference file when played through the fake device. We could also try throwing curveballs at the AGC, like abruptly raising the volume mid-test (as if the user leaned closer to the mic), and look at the gain for the segments to ensure it adapted correctly.

By: Patrik Höglund


The deadline to apply for GTAC 2015 is this Monday, August 10th, 2015. There is a great deal of interest in both attending and speaking, and we’ve received many outstanding proposals. However, it’s not too late to submit your proposal for consideration. If you would like to speak or attend, be sure to complete the form by Monday.

We will be making regular updates to the GTAC site (developers.google.com/gtac/2015/) over the next several weeks, and you can find conference details there.

For those that have already signed up to attend or speak, we will contact you directly by mid-September.

Posted by Anthony Vallone on behalf of the GTAC Committee


We are pleased to announce that the ninth GTAC (Google Test Automation Conference) will be held in Cambridge (Greatah Boston, USA) on November 10th and 11th (Toozdee and Wenzdee), 2015. So, tell everyone to save the date for this wicked good event.

GTAC is an annual conference hosted by Google, bringing together engineers from industry and academia to discuss advances in test automation and the test engineering computer science field. It’s a great opportunity to present, learn, and challenge modern testing technologies and strategies.

You can browse presentation abstracts, slides, and videos from previous years on the GTAC site.

Stay tuned to this blog and the GTAC website for application information and opportunities to present at GTAC. Subscribing to this blog is the best way to get notified. We're looking forward to seeing you there!

Posted by Anthony Vallone on behalf of the GTAC Committee


Figure 1. Holy dependencies, Batman!

There are many reasons to use software libraries. Why write your own phone number parser when you can use libphonenumber, which is battle-tested by real use in Android and Chrome and available under a permissive license? Using such software frees you up to focus on the core of your software so you can deliver a unique experience to your users. On the other hand, you need to keep your application up to date with changes in the library (you want that latest bug fix, right?), and you also run a risk of such a change breaking your application. This article will examine that integration problem and how you can reduce the risks associated with it.

Updating Dependencies is Hard

The simplest solution is to check in a copy of the library, build with it, and avoid touching it as much as possible. This solution, however, can be problematic because you miss out on bug fixes and new features in the library. What if you need a new feature or bug fix that just made it in? You have a few options:

Update the library to its latest release. If it’s been a long time since you did this, it can be quite risky and you may have to spend significant testing resources to ensure all the accumulated changes don’t break your application. You may have to catch up to interface changes in the library as well.
Cherry-pick the feature/bug fix you want into your copy of the library. This is even riskier because your cherry-picked patches may depend on other changes in the library in subtle ways. Also, you still are not up to date with the latest version.
Find some way to make do without the feature or bug fix.

None of the above options are very good. Using this ad-hoc updating model can work if there’s a low volume of changes in the library and your requirements on the library don’t change very often. Even if that is the case, what will you do if a critical zero-day exploit is discovered in your socket library?

One way to mitigate the update risk is to integrate more often with your dependencies. As an extreme example, let’s look at Chrome.

In Chrome development, there’s a massive amount of change going into its dependencies. The Blink rendering engine lives in a separate code repository from the browser. Blink sees hundreds of code changes per day, and Chrome must integrate with Blink often since it’s an important part of the browser. Another example is the WebRTC implementation, where a large part of Chrome’s implementation resides in the webrtc.org repository. This article will focus on the latter because it’s the team I happen to work on.

How “Rolling” Works

The open-sourced WebRTC codebase is used by Chrome but also by a number of other companies working on WebRTC. Chrome uses a toolchain called depot_tools to manage dependencies, and there’s a checked-in text file called DEPS where dependencies are managed. It looks roughly like this:

{
  # ... 
  'src/third_party/webrtc':
      'https://chromium.googlesource.com/' +
      'external/webrtc/trunk/webrtc.git' + 
      '@' + '5727038f572c517204e1642b8bc69b25381c4e9f',
}

The above means we should pull WebRTC from the specified git repository at the 572703... hash, similar to other dependency-provisioning frameworks. To build Chrome with a new version, we change the hash and check in a new version of the DEPS file. If the library’s API has changed, we must update Chrome to use the new API in the same patch. This process is known as rolling WebRTC to a new version.

Now the problem is that we have changed the code going into Chrome. Maybe getUserMedia has started crashing on Android, or maybe the browser no longer boots on Windows. We don’t know until we have built and run all the tests. Therefore a roll patch is subject to the same presubmit checks as any Chrome patch (i.e. many tests, on all platforms we ship on). However, roll patches can be considerably more painful and risky than other patches.

Figure 2. Life of a Roll Patch.

On the WebRTC team we found ourselves in an uncomfortable position a couple years back. Developers would make changes to the webrtc.org code and there was a fair amount of churn in the interface, which meant we would have to update Chrome to adapt to those changes. Also we frequently broke tests and WebRTC functionality in Chrome because semantic changes had unexpected consequences in Chrome. Since rolls were so risky and painful to make, they started to happen less often, which made things even worse. There could be two weeks between rolls, which meant Chrome was hit by a large number of changes in one patch.

Bots That Can See the Future: “FYI Bots”

We found a way to mitigate this which we called FYI (for your information) bots. A bot is Chrome lingo for a continuous build machine which builds Chrome and runs tests.

All the existing Chrome bots at that point would build Chrome as specified in the DEPS file, which meant they would build the WebRTC version we had rolled to up to that point. FYI bots replace that pinned version with WebRTC HEAD, but otherwise build and run Chrome-level tests as usual. Therefore:

If all the FYI bots were green, we knew a roll most likely would go smoothly.
If the bots didn’t compile, we knew we would have to adapt Chrome to an interface change in the next roll patch.
If the bots were red, we knew we either had a bug in WebRTC or that Chrome would have to be adapted to some semantic change in WebRTC.

The FYI “waterfall” (a set of bots that builds and runs tests) is a straight copy of the main waterfall, which is expensive in resources. We could have cheated and just set up FYI bots for one platform (say, Linux), but the most expensive regressions are platform-specific, so we reckoned the extra machines and maintenance were worth it.

Making Gradual Interface Changes

This solution helped but wasn’t quite satisfactory. We initially had the policy that it was fine to break the FYI bots since we could not update Chrome to use a new interface until the new interface had actually been rolled into Chrome. This, however, often caused the FYI bots to be compile-broken for days. We quickly started to suffer from red blindness [1] and had no idea if we would break tests on the roll, especially if an interface change was made early in the roll cycle.

The solution was to move to a more careful update policy for the WebRTC API. For the more technically inclined, “careful” here means “following the API prime directive” [2]. Consider this example:

class WebRtcAmplifier {
  // ...
  int SetOutputVolume(float volume);
};

Normally we would just change the method’s signature when we needed to:

class WebRtcAmplifier {
  // ...
  int SetOutputVolume(float volume, bool allow_eleven);  // See footnote 1.
};

… but this would compile-break Chrome until it could be updated. So we started doing it like this instead:

class WebRtcAmplifier {
  // ...
  int SetOutputVolume(float volume);
  int SetOutputVolume2(float volume, bool allow_eleven);
};

Then we could:

Roll into Chrome
Make Chrome use SetOutputVolume2
Update SetOutputVolume’s signature
Roll again and make Chrome use SetOutputVolume
Delete SetOutputVolume2

This approach requires several steps but we end up with the right interface and at no point do we break Chrome.
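
As a rough illustration of the intermediate state, the class below keeps both methods alive at once. Having the old method forward to the new one is an assumption of this sketch rather than something the WebRTC code is known to have done, and the meaning given to `allow_eleven` (raising the upper limit from 1.0 to 1.1) is likewise made up for the example.

```cpp
#include <cassert>

// Hypothetical stand-in for the library class during the migration window,
// when both the old and the new method exist side by side.
class WebRtcAmplifier {
 public:
  // Old signature, kept so existing callers keep compiling. It forwards to
  // the new method with the behavior the old one always had.
  int SetOutputVolume(float volume) {
    return SetOutputVolume2(volume, /*allow_eleven=*/false);
  }

  // New signature, added next to the old one instead of replacing it.
  int SetOutputVolume2(float volume, bool allow_eleven) {
    const float max_volume = allow_eleven ? 1.1f : 1.0f;
    if (volume < 0.0f || volume > max_volume)
      return -1;  // Rejected; the stored volume is left unchanged.
    volume_ = volume;
    return 0;
  }

  float volume() const { return volume_; }

 private:
  float volume_ = 0.0f;
};

// Old and new callers can coexist until everyone has migrated.
void MigrationExample() {
  WebRtcAmplifier amplifier;
  assert(amplifier.SetOutputVolume(0.8f) == 0);         // Not yet migrated.
  assert(amplifier.SetOutputVolume2(1.1f, true) == 0);  // Already migrated.
}
```

Because neither signature disappears until its last caller is gone, every intermediate revision of the library compiles against every intermediate revision of the application.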

Results

When we implemented the above, we could fix problems as they came up rather than in big batches on each roll. We could institute the policy that the FYI bots should always be green, and that changes breaking them should be immediately rolled back. This made a huge difference. The team could work more smoothly and roll more often. This reduced our risk quite a bit, particularly when Chrome was about to cut a new version branch. Instead of doing panicked and risky rolls around a release, we could work out issues in good time and stay in control.

Another benefit of FYI bots is more granular performance tests. Before the FYI bots, it would frequently happen that a bunch of metrics regressed. However, it’s not fun to find which of the 100 patches in the roll caused the regression! With the FYI bots, we can see precisely which WebRTC revision caused the problem.

Future Work: Optimistic Auto-rolling

The final step on this ladder (short of actually merging the repositories) is auto-rolling. The Blink team implemented this with their ARB (AutoRollBot). The bot wakes up periodically and tries to do a roll patch. If it fails on the trybots, it waits and tries again later (perhaps the trybots failed because of a flake or other temporary error, or perhaps the error was real but has been fixed).
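
The retry behavior described above can be sketched as a small loop. Everything here is hypothetical: the function name, the callbacks standing in for the repository, the trybots and the wait, and the attempt limit are illustration only and say nothing about how the actual AutoRollBot is built.

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class TryResult { kPassed, kFailed };

// Hypothetical auto-roller loop. The three callbacks are stand-ins for real
// infrastructure: reading the dependency's HEAD revision, running a roll
// patch through the trybots, and sleeping until the next attempt.
bool AutoRoll(const std::function<std::string()>& get_head_revision,
              const std::function<TryResult(const std::string&)>& run_trybots,
              const std::function<void()>& wait_before_retry,
              int max_attempts,
              std::string* rolled_revision) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    // Re-read HEAD on every attempt: if the failure was real, a fix may
    // have landed upstream since the last try.
    const std::string revision = get_head_revision();
    if (run_trybots(revision) == TryResult::kPassed) {
      *rolled_revision = revision;
      return true;
    }
    wait_before_retry();  // Flake or real breakage; either way, try later.
  }
  return false;
}
```

The loop deliberately does not distinguish flakes from real failures; it simply tries again later, which is why the quality of the tests behind `run_trybots` matters so much.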

To pull auto-rolling off, you are going to need very good tests. That goes for any roll patch (or any patch, really), but if you’re edging closer to a release and an unstoppable flood of code changes keeps breaking you, you’re not in a good place.

References

[1] Martin Fowler (May 2006) “Continuous Integration”
[2] Dani Megert, Remy Chi Jian Suen, et al. (Oct 2014) “Evolving Java-based APIs”

Footnotes

1. We actually did have a hilarious bug in WebRTC where it was possible to set the volume to 1.1, but only 0.0-1.0 was supposed to be allowed. No, really. Thus, our WebRTC implementation must be louder than the others since everybody knows 1.1 must be louder than 1.0.

Author: Patrik Höglund

Figure 1. Holy dependencies, Batman!libphonenumber Updating Dependencies is Hard

Update the library to its latest release. If it’s been a long time since you did this, it can be quite risky and you may have to spend significant testing resources to ensure all the accumulated changes don’t break your application. You may have to catch up to interface changes in the library as well.

Cherry-pick the feature/bug fix you want into your copy of the library. This is even riskier because your cherry-picked patches may depend on other changes in the library in subtle ways. Also, you still are not up to date with the latest version.

Find some way to make do without the feature or bug fix.

How “Rolling” Works DEPS{ # ... 'src/third_party/webrtc': 'https://chromium.googlesource.com/' + 'external/webrtc/trunk/webrtc.git' + '@' + '5727038f572c517204e1642b8bc69b25381c4e9f', } 572703...DEPSrollinggetUserMedia has started crashing on Androidmany tests, on all platforms we ship on

Figure 2. Life of a Roll Patch. Bots That Can See the Future: “FYI Bots” DEPS

If all the FYI bots were green, we knew a roll most likely would go smoothly.

If the bots didn’t compile, we knew we would have to adapt Chrome to an interface change in the next roll patch.

If the bots were red, we knew we either had a bug in WebRTC or that Chrome would have to be adapted to some semantic change in WebRTC.

Making Gradual Interface Changes

red blindness [1], API prime directive [2]

class WebRtcAmplifier {
  ...
  int SetOutputVolume(float volume);
}

class WebRtcAmplifier {
  ...
  int SetOutputVolume(float volume, bool allow_eleven¹);
}

class WebRtcAmplifier {
  ...
  int SetOutputVolume(float volume);
  int SetOutputVolume2(float volume, bool allow_eleven);
}

Roll into Chrome

Make Chrome use SetOutputVolume2

Update SetOutputVolume’s signature

Roll again and make Chrome use SetOutputVolume

Delete SetOutputVolume2
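The steps above can be sketched as follows. This is a Python analogue of the intermediate state of the migration, where the old and new methods coexist so that callers on either side of a roll keep working. The method bodies, the return codes (0 for success, -1 for failure), and the meaning given to `allow_eleven` are assumptions made for illustration; the original only shows the C++ signatures.

```python
class WebRtcAmplifier:
    """Intermediate migration state: old and new methods exist side by side."""

    def __init__(self):
        self.volume = 0.0

    def set_output_volume(self, volume):
        # Old signature, kept so existing callers continue to compile and run.
        # It delegates to the new method with the old behavior.
        return self.set_output_volume2(volume, allow_eleven=False)

    def set_output_volume2(self, volume, allow_eleven):
        # New signature. Assumption: allow_eleven raises the upper limit
        # from 1.0 to 1.1; anything outside the range is rejected.
        limit = 1.1 if allow_eleven else 1.0
        if not 0.0 <= volume <= limit:
            return -1
        self.volume = volume
        return 0
```

Because the old method survives until every caller has moved over, no single roll has to change the library and its users at the same time.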



Integration Tests

Unit tests do have one major disadvantage: even if the units work well in isolation, you do not know if they work well together. But even then, you do not necessarily need end-to-end tests. For that, you can use an integration test. An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.

If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that units work together.
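As a rough illustration, the sketch below wires two small units together and checks only their combined behavior. The functions `parse_price` and `total_cents` are hypothetical examples, not taken from any real codebase.

```python
import unittest


def parse_price(text):
    """Unit 1 (hypothetical): turn a string like "$4.50" into cents."""
    return round(float(text.lstrip("$")) * 100)


def total_cents(prices):
    """Unit 2 (hypothetical): sum a list of amounts given in cents."""
    return sum(prices)


class CheckoutIntegrationTest(unittest.TestCase):
    """Integration test: verifies that the two units agree on the unit of money."""

    def test_parsed_prices_are_totalled(self):
        cents = [parse_price(p) for p in ["$4.50", "$0.25"]]
        self.assertEqual(total_cents(cents), 475)
```

If `parse_price` returned dollars while `total_cents` expected cents, each unit could pass its own tests and this small test would still catch the mismatch, without deploying or signing in to anything. It can be run with `python -m unittest`.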

Testing Pyramid

Even with both unit tests and integration tests, you probably still will want a small number of end-to-end tests to verify the system as a whole. To find the right balance between all three test types, the best visual aid to use is the testing pyramid. Here is a simplified version of the testing pyramid from the opening keynote of the 2014 Google Test Automation Conference:

The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests get larger, but at the same time the number of tests (the width of your pyramid) gets smaller.

As a good first guess, Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests. The exact mix will be different for each team, but in general, it should retain that pyramid shape. Try to avoid these anti-patterns:

Inverted pyramid/ice cream cone. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests.
Hourglass. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle.

Just like a regular pyramid tends to be the most stable structure in real life, the testing pyramid also tends to be the most stable testing strategy.
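One way to keep an eye on the shape of a test suite is to count tests per level and name the resulting shape. The sketch below does that; the function name and the exact comparison rules are assumptions chosen for illustration, not an official definition of these shapes.

```python
def classify_test_mix(unit, integration, e2e):
    """Name the shape of a test suite from its test counts per level.

    The rules are a simple assumption: a pyramid narrows towards the top,
    an ice cream cone widens towards the top, and an hourglass is thin
    in the middle.
    """
    if unit >= integration >= e2e:
        return "pyramid"
    if e2e >= integration >= unit:
        return "inverted pyramid / ice cream cone"
    if integration < unit and integration < e2e:
        return "hourglass"
    return "other"
```

For example, `classify_test_mix(70, 20, 10)` corresponds to the suggested 70/20/10 split and is classified as a pyramid, while a suite with 60 unit tests, 5 integration tests, and 35 end-to-end tests is classified as an hourglass.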

by Mike Wacker

End-to-End Tests in Theory

Developers like it because it offloads most, if not all, of the testing to others.

Managers and decision-makers like it because tests that simulate real user scenarios can help them easily determine how a failing test would impact the user.

Testers like it because they often worry about missing a bug or writing a test that does not verify real-world behavior; writing tests from the user's perspective often avoids both problems and gives the tester a greater sense of accomplishment.

End-to-End Tests in Practice

The latest version of the service is built.

This version is then deployed to the team's testing environment.

All end-to-end tests then run against this testing environment.

An email report summarizing the test results is sent to the team.

| Days Left | Pass % | Notes |
| | | Everything is broken! Signing in to the service is broken. Almost all tests sign in a user, so almost all tests failed. |
| | | A partner team we rely on deployed a bad build to their testing environment yesterday. |
| -1 | 54% | A dev broke the save scenario yesterday (or the day before?). Half the tests save a document at some point in time. Devs spent most of the day determining if it's a frontend bug or a backend bug. |
| -2 | 54% | It's a frontend bug, devs spent half of today figuring out where. |
| -3 | 54% | A bad fix was checked in yesterday. The mistake was pretty easy to spot, though, and a correct fix was checked in today. |
| -4 | | Hardware failures occurred in the lab for our testing environment. |
| -5 | 84% | Many small bugs hiding behind the big bugs (e.g., sign-in broken, save broken). Still working on the small bugs. |
| -6 | 87% | We should be above 90%, but are not for some reason. |
| -7 | 89.54% | (Rounds up to 90%, close enough.) No fixes were checked in yesterday, so the tests must have been flaky yesterday. |

Analysis

What Went Well

Customer-impacting bugs were identified and fixed before they reached the customer.

What Went Wrong

The team completed their coding milestone a week late (and worked a lot of overtime).

Finding the root cause for a failing end-to-end test is painful and can take a long time.

Partner failures and lab failures ruined the test results on multiple days.

Many smaller bugs were hidden behind bigger bugs.

End-to-end tests were flaky at times.

Developers had to wait until the following day to know if a fix worked or not.

The True Value of Tests

A failing test does not directly benefit the user. A bug fix directly benefits the user.

| Stage | Failing Test | Bug Opened | Bug Fixed |
| Value Added | No | No | Yes |

Building the Right Feedback Loop

It's fast. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work - nobody is perfect - and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change.

It's reliable. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer's trust in the test, and as a result flaky tests are often ignored, even when they find real product issues.

It isolates failures. To fix a bug, developers need to find the specific lines of code causing the bug. When a product contains millions of lines of code, and the bug could be anywhere, it's like trying to find a needle in a haystack.

Think Smaller, Not Larger

Unit Tests

Unit tests are fast. We only need to build a small unit to test it, and the tests also tend to be rather small. In fact, one tenth of a second is considered slow for unit tests.

Unit tests are reliable. Simple systems and small units in general tend to suffer much less from flakiness. Furthermore, best practices for unit testing - in particular practices related to hermetic tests - will remove flakiness entirely.

Unit tests isolate failures. Even if a product contains millions of lines of code, if a unit test fails, you only need to search that small unit under test to find the bug.

Unit Tests vs. End-to-End Tests

| | Unit | End-to-End |
| Fast | Yes | No |
| Reliable | Yes | No |
| Isolates Failures | Yes | No |
| Simulates a Real User | No | Yes |



by Kevin Graney

Here at Google we have a long history of capitalizing on the latest research and technology to improve the quality of our software. Over our past 16+ years as a company, what started with some humble unit tests has grown into a massive operation. As our software complexity increased, ever larger and more complex tests were dreamed up by our Software Engineers in Test (SETs).

What we have come to realize is that our love of testing is a double-edged sword. On the one hand, large-scale testing keeps us honest and gives us confidence. It ensures our products remain reliable, our users' data is kept safe, and our engineers are able to work productively without fear of breaking things. On the other hand, it's expensive in both engineer and machine time. Our SETs have been working tirelessly to reduce the expense and latency of software tests at Google, while continuing to increase their quality.

Today, we're excited to reveal how Google is tackling this challenge. In collaboration with the Quantum AI Lab, SETs at Google have been busy revolutionizing how software is tested. The theory is relatively simple: bits in a traditional computer are either zero or one, but bits in a quantum computer can be both one and zero at the same time. This is known as superposition, and the classic example is Schrödinger's cat. Through some clever math and cutting-edge electrical engineering, researchers at Google have figured out how to utilize superposition to vastly improve the quality of our software testing and the speed at which our tests run.

Figure 1 Some qubits inside a Google quantum device.

With superposition, tests at Google are now able to simultaneously model every possible state of the application under test. The state of the application can be thought of as an n-bit sequential memory buffer, consistent with the traditional von Neumann architecture of computing. Because each bit under superposition is simultaneously a 0 and a 1, these tests can simulate 2^n different application states at any given instant in time in O(n) space. Each of these application states can be mutated by application logic to another state in constant time using quantum algorithms developed by Google researchers. These two properties together allow us to build a state transition graph of the application under test that shows every possible application state and all possible transitions to other application states. Using traditional computing methods this problem has intractable time complexity, but after leveraging superposition and our quantum algorithms it becomes relatively fast and cheap.

Once we have the state transition graph for the application under test, testing it becomes almost trivial. Given the initial startup state of the application, i.e. the executable bits of the application stored on disk, we can find from the application's state transition graph all reachable states. Assertions that ensure proper behavior are then written against the reachable subset of the transition graph. This paradigm of test writing allows both Google's security engineers and software engineers to work more productively. A security engineer can write a test, for example, that asserts "no executable memory regions become mutated in any reachable state". This one test effectively eliminates the potential for security flaws that result from memory safety violations. A test engineer can write higher level assertions using graph traversal methods that ensure data integrity is maintained across a subset of application state transitions. Tests of this nature can detect data corruption bugs.
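Quantum hardware aside, the reachability step described above is ordinary graph traversal. The sketch below runs a classical breadth-first search over a small 3-bit state graph. The edges in `transitions` are hypothetical; they were chosen only so that the result agrees with the caption of Figure 2 (start state 001, deadlock in 010 and 100).

```python
from collections import deque

# Hypothetical transition graph for a 3-bit application: state -> next states.
# A state with no outgoing transitions is a deadlock.
transitions = {
    "000": ["001"],
    "001": ["010", "101"],
    "010": [],
    "011": ["010"],
    "100": [],
    "101": ["001", "100"],
    "110": ["111"],
    "111": ["011"],
}


def reachable_states(graph, start):
    """Return every state reachable from `start` using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


reachable = reachable_states(transitions, "001")
unreachable = sorted(set(transitions) - reachable)
# An assertion written "against the reachable subset": find reachable deadlocks.
deadlocks = sorted(s for s in reachable if not transitions[s])
```

With these edges, `unreachable` contains 000, 011, 110, and 111, and `deadlocks` contains 010 and 100. The catch, of course, is that a real application has far more than three bits of state, which is exactly why this approach is intractable on classical hardware.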

We're excited about the work our team has done so far to push the envelope in the field of quantum software quality. We're just getting started, but based on early dogfood results among Googlers we believe the potential of this work is huge. Stay tuned!

UPDATE: Hey, this was an April Fools' joke, but in fact we wish we could have realized this idea, and we are looking forward to the day this has been worked out and becomes a reality.
Figure 2 The application state graph for a demonstrative 3-bit application. If the start state is 001 then 000, 110, 111, and 011 are all unreachable states. States 010 and 100 both result in deadlock.

When you use UI tests as E2E tests, you face the following problems:

Very large and slow tests.
High flakiness rate due to timeouts and memory issues.
Hard to debug/investigate failures.
Authentication issues (e.g., authentication from automated tests is very tricky).

Let’s see how these problems can be fixed using the following strategies.

Strategy 2: Hermetic UI Testing using Fake Servers

In this strategy, you avoid network calls and external dependencies, but you need to provide your application with data that drives the UI. Update your application to communicate with a local server rather than an external one, and create a fake local server that provides data to your application. You then need a mechanism to generate the data needed by your application. This can be done using various approaches depending on your system design. One approach is to record server responses and replay them in your fake server.
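The record-and-replay approach can be sketched in a few lines. The example below is a language-neutral illustration written in Python rather than Android code; the class name `ReplayFakeServer`, the request key of method plus path, and the sample recording are all assumptions made for illustration.

```python
class ReplayFakeServer:
    """Hypothetical fake server that replays previously recorded responses."""

    def __init__(self, recordings):
        # recordings maps (method, path) -> (status_code, body), captured
        # earlier from the real server.
        self._recordings = dict(recordings)

    def handle(self, method, path):
        """Return the recorded response, or a 404 if nothing was recorded."""
        try:
            return self._recordings[(method, path)]
        except KeyError:
            return (404, "no recording for %s %s" % (method, path))


# Sample recording (made-up data) used to seed the fake.
fake_server = ReplayFakeServer({
    ("GET", "/profile"): (200, '{"name": "Test User"}'),
})
```

Returning an explicit 404 for unrecorded requests makes a missing recording show up as a clear test failure instead of a timeout.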

Once you have hermetic UI tests talking to a local fake server, you should also have hermetic server tests. This way you split your E2E test into a server-side test, a client-side test, and an integration test to verify that the server and client are in sync (for more details on integration tests, see the backend testing section of the blog).

Now, the client test flow looks like:

While this approach drastically reduces the test size and flakiness rate, you still need to maintain a separate fake server as well as your test. Debugging is still not easy as you have two moving parts: the test and the local server. While test stability will be largely improved by this approach, the local server will cause some flakes.

Let’s see how this could be improved...

Strategy 3: Dependency Injection Design for Apps.

To remove the additional dependency of a fake server running on Android, you should use dependency injection in your application for swapping real module implementations with fake ones. One example is Dagger, or you can create your own dependency injection mechanism if needed.

This will improve the testability of your app for both unit testing and UI testing, providing your tests with the ability to mock dependencies. In instrumentation testing, the test apk and the app under test are loaded in the same process, so the test code has runtime access to the app code. Not only that, but you can also use classpath override (the test classpath takes priority over that of the app under test) to override a certain class and inject test fakes there. For example, to make your test hermetic, your app should support injection of the networking implementation. During testing, the test injects a fake networking implementation into your app, and this fake implementation will provide seeded data instead of communicating with backend servers.
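The networking example can be sketched as plain constructor injection. This is a language-neutral illustration in Python, not Dagger or Android code; `RealNetwork`, `FakeNetwork`, `ProfileScreen`, and the `/profile/name` path are hypothetical names invented for the example.

```python
class RealNetwork:
    """Placeholder for the production implementation that talks to a backend."""

    def fetch(self, path):
        # Placeholder body: a hermetic test should never reach this code.
        raise RuntimeError("network access is not allowed in hermetic tests")


class FakeNetwork:
    """Test fake that serves seeded data instead of calling a backend."""

    def __init__(self, seeded):
        self._seeded = seeded

    def fetch(self, path):
        return self._seeded[path]


class ProfileScreen:
    """Component under test; its networking dependency is injected."""

    def __init__(self, network):
        self._network = network

    def title(self):
        return "Hello, " + self._network.fetch("/profile/name")


# In a test, the fake is injected where production code would pass RealNetwork.
screen = ProfileScreen(FakeNetwork({"/profile/name": "Test User"}))
```

Because `ProfileScreen` never constructs its own networking object, the test controls exactly which data the UI sees, and no fake server process needs to run alongside it.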

Strategy 4: Building Apps into Smaller Libraries

If you want to scale your app into many modules and views, and plan to add more features while maintaining stable and fast builds/tests, then you should build your app into small components/libraries. Each library should have its own UI resources and user dependency management. This strategy not only enables mocking dependencies of your libraries for hermetic testing, but also serves as an experimentation platform for various components of your application.

Once you have small components with dependency injection support, you can build a test app for each component.

The test apps bring up the actual UI of your libraries, along with any fake data needed and mock dependencies. Espresso tests will run against these test apps. This enables testing of smaller libraries in isolation.

For example, let’s consider building smaller libraries for login and settings of your app.

The settings component test now looks like:

Conclusion

UI testing can be very challenging for rich apps on Android. Here are some UI testing lessons learned on the Google+ team:

Don’t write E2E tests instead of UI tests. Instead write unit tests and integration tests beside the UI tests.
Hermetic tests are the way to go.
Use dependency injection while designing your app.
Build your application into small libraries/modules, and test each one in isolation. You can then have a few integration tests to verify integration between components is correct.
Componentized UI tests have proven to be much faster than E2E and 99%+ stable. Fast and stable tests have proven to drastically improve developer productivity.

by Mona El Mahdy


(The logo on the trophy is the same one we put on the printed version of each TotT episode, which you can see by looking for the “printer-friendly version” link in the TotT blog posts).

Congratulations to the winners!

By Andrew Trenk

Erik Kuefler: Test Behaviors, Not Methods and Don't Put Logic in Tests
Alex Eagle: Change-Detector Tests Considered Harmful
