Tags: vui

9

sparkline

Tuesday, August 27th, 2019

Voice User Interface Design by Cheryl Platz

Cheryl Platz is speaking at An Event Apart Chicago. Her inaugural An Event Apart presentation is all about voice interfaces, and I’m going to attempt to liveblog it…

Why make a voice interface?

Successful voice interfaces aren’t necessarily solving new problems. They’re used to solve problems that other devices have already solved. Think about kitchen timers. There are lots of ways to set a timer. Your oven might have one. Your phone has one. Why use a $200 device to solve this mundane problem? Same goes for listening to music, news, and weather.

People are using voice interfaces for solving ordinary problems. Why? Context matters. If you’re carrying a toddler, then setting a kitchen timer can be tricky so a voice-activated timer is quite appealing. But why is voice is happening now?

Humans have been developing the art of conversation for thousands of years. It’s one of the first skills we learn. It’s deeply instinctual. Most humans use speach instinctively every day. You can’t necessarily say that about using a keyboard or a mouse.

Voice-based user interfaces are not new. Not just the idea—which we’ve seen in Star Trek—but the actual implementation. Bell Labs had Audrey back in 1952. It recognised ten words—the digits zero through nine. Why did it take so long to get to Alexa?

In the late 70s, DARPA issued a challenge to create a voice-activated system. Carnagie Mellon came up with Harpy (with a thousand word grammar). But none of the solutions could respond in real time. In conversation, we expect a break of no more than 200 or 300 milliseconds.

In the 1980s, computing power couldn’t keep up with voice technology, so progress kind of stopped. Time passed. Things finally started to catch up in the 90s with things like Dragon Naturally Speaking. But that was still about vocabulary, not grammar. By the 2000s, small grammars were starting to show up—starting an X-Box or pausing Netflix. In 2008, Google Voice Search arrived on the iPhone and natural language interaction began to arrive.

What makes natural language interactions so special? It requires minimal training because it uses the conversational muscles we’ve been working for a lifetime. It unlocks the ability to have more forgiving, less robotic conversations with devices. There might be ten different ways to set a timer.

Natural language interactions can also free us from “screen magnetism”—that tendency to stay on a device even when our original task is complete. Voice also enables fast and forgiving searches of huge catalogues without time spent typing or browsing. You can pick a needle straight out of a haystack.

Natural language interactions are excellent for older customers. These interfaces don’t intimidate people without dexterity, vision, or digital experience. Voice input often leads to more inclusive experiences. Many customers with visual or physical disabilities can’t use traditional graphical interfaces. Voice experiences throw open the door of opportunity for some people. However, voice experience can exclude people with speech difficulties.

Making the case for voice interfaces

There’s a misconception that you need to work at Amazon, Google, or Apple to work on a voice interface, or at least that you need to have a big product team. But Cheryl was able to make her first Alexa “skill” in a week. If you’re a web developer, you’re good to go. Your voice “interaction model” is just JSON.

How do you get your product team on board? Find the customers (and situations) you might have excluded with traditional input. Tell the stories of people whose hands are full, or who are vision impaired. You can also point to the adoption rate numbers for smart speakers.

You’ll need to show your scenario in context. Otherwise people will ask, “why can’t we just build an app for this?” Conduct research to demonstrate the appeal of a voice interface. Storyboarding is very useful for visualising the context of use and highlighting existing pain points.

Getting started with voice interfaces

You’ve got to understand how the technology works in order to adapt to how it fails. Here are a few basic concepts.

Utterance. A word, phrase, or sentence spoken by a customer. This is the true form of what the customer provides.

Intent. This is the meaning behind a customer’s request. This is an important distinction because one intent could have thousands of different utterances.

Prompt. The text of a system response that will be provided to a customer. The audio version of a prompt, if needed, is generated separately using text to speech.

Grammar. A finite set of expected utterances. It’s a list. Usually, each entry in a grammar is paired with an intent. Many interfaces start out as being simple grammars before moving on to a machine-learning model later once the concept has been proven.

Here’s the general idea with “artificial intelligence”…

There’s a human with a core intent to do something in the real world, like knowing when the cookies in the oven are done. This is translated into an intent like, “set a 15 minute timer.” That’s the utterance that’s translated into a string. But it hasn’t yet been parsed as language. That string is passed into a natural language understanding system. What comes is a data structure that represents the customers goal e.g. intent=timer; duration=15 minutes. That’s sent to the business logic where a timer is actually step. For a good voice interface, you also want to send back a response e.g. “setting timer for 15 minutes starting now.”

That seems simple enough, right? What’s so hard about designing for voice?

Natural language interfaces are a form of artifical intelligence so it’s not deterministic. There’s a lot of ruling out false positives. Unlike graphical interfaces, voice interfaces are driven by probability.

How do you turn a sound wave into an understandable instruction? It’s a lot like teaching a child. You feed a lot of data into a statistical model. That’s how machine learning works. It’s a probability game. That’s where it gets interesting for design—given a bunch of possible options, we need to use context to zero in on the most correct choice. This is where confidence ratings come in: the system will return the probability that a response is correct. Effectively, the system is telling you how sure or not it is about possible results. If the customer makes a request in an unusual or unexpected way, our system is likely to guess incorrectly. That’s because the system is being given something new.

Designing a conversation is relatively straightforward. But 80% of your voice design time will be spent designing for what happens when things go wrong. In voice recognition, edge cases are front and centre.

Here’s another challenge. Interaction with most voice interfaces is part conversation, part performance. Most interactions are not private.

Humans don’t distinguish digital speech fom human speech. That means these devices are intrinsically social. Our brains our wired to try to extract social information, even form digital speech. See, for example, why it’s such a big question as to what gender a voice interface has.

Delivering a voice interface

Storyboards help depict the context of use. Sample dialogues are your new wireframes. These are little scripts that not only cover the happy path, but also your edge case. Then you reverse engineer from there.

Flow diagrams communicate customer states, but don’t use the actual text in them.

Prompt lists are your final deliverable.

Functional prototypes are really important for voice interfaces. You’ll learn the real way that customers will ask for things.

If you build a working prototype, you’ll be building two things: a natural language interaction model (often a JSON file) and custom business logic (in a programming language).

Eventually voice design will become a core competency, much like mobile, which was once separate.

Ask yourself what tasks your customers complete on your site that feel clunkly. Remember that voice desing is almost never about new scenarious. Start your journey into voice interfaces by tackling old problems in new, more inclusive ways.

May the voice be with you!

Monday, May 21st, 2018

Google Duplex and the canny rise: a UX pattern – UX Collective

Chris weighs up the ethical implications of Google Duplex:

The social hacking that could be accomplished is mind-boggling. For this reason, I expect that having human-sounding narrow AI will be illegal someday. The Duplex demo is a moment of cultural clarity, where it first dawned on us that we can do it, but with only a few exceptions, we shouldn’t.

But he also offers alternatives for designing systems like this:

  1. Provide disclosure, and
  2. Design a hot signal:

…design the interface so that it is unmistakeable that it is synthetic. This way, even if the listener missed or misunderstood the disclosure, there is an ongoing signal that reinforces the idea. As designer Ben Sauer puts it, make it “Humane, not human.”

Wednesday, January 10th, 2018

My Pod

Merlin mentioned this service on a recent podcast episode. If you have an Amazon Echo, you can authenticate with this service and then point it at an RSS feed …like your Huffduffer feed, for example. From then on, Alexa becomes a Huffduffer player.

Tuesday, January 9th, 2018

Trends in Digital Tech for 2018 - Peter Gasston

Peter looks into his crystal ball for 2018 and sees computers with eyes, computers with ears, and computers with brains.

Friday, December 8th, 2017

The world is not a desktop

This 1993 article by Mark Weiser is relevant to our world today.

Take intelligent agents. The idea, as near as I can tell, is that the ideal computer should be like a human being, only more obedient. Anything so insidiously appealing should immediately give pause. Why should a computer be anything like a human being? Are airplanes like birds, typewriters like pens, alphabets like mouths, cars like horses? Are human interactions so free of trouble, misunderstanding, and ambiguity that they represent a desirable computer interface goal? Further, it takes a lot of time and attention to build and maintain a smoothly running team of people, even a pair of people. A computer I need to talk to, give commands to, or have a relationship with (much less be intimate with), is a computer that is too much the center of attention.

Monday, October 23rd, 2017

Voice Guidelines | Clearleft

I love what Ben is doing with this single-serving site (similar to my design principles collection)—it’s a collection of handy links and resources around voice UI:

Designing a voice interface? Here’s a useful list of lists: as many guiding principles as we could find, all in one place. List compiled and edited by Ben Sauer @bensauer.

BONUS ITEM: Have him run a voice workshop for you!

Friday, September 8th, 2017

A Simple Design Flaw Makes It Astoundingly Easy To Hack Siri And Alexa

This article about a specific security flaw in voice-activated assistants raises a bigger issue:

User-friendliness is increasingly at odds with security.

This is something I’ve been thinking about for a while. “Don’t make me think” is a great mantra for user experience, but a terrible mantra for security.

Our web browsers easily and invisibly collect cookies, allowing marketers to follow us across the web. Our phones back up our photos and contacts to the cloud, tempting any focused hacker with a complete repository of our private lives. It’s as if every tacit deal we’ve made with easy-to-use technology has come with a hidden cost: our own personal vulnerability. This new voice command exploit is just the latest in a growing list of security holes caused by design, but it is, perhaps, the best example of Silicon Valley’s widespread disregard for security in the face of the new and shiny.

Tuesday, June 6th, 2017

Amazon Alexa Voice Design Guide

A style guide for voice interfaces.

Wednesday, January 4th, 2017

Day 14: Posting to my Website from Alexa #100DaysOfIndieWeb • Aaron Parecki

Aaron documents how he posts to his website through his Amazon Echo. No interface left behind.