What They Said
With the increasing availability of, and competition among, voice-controlled smart home assistants ([see the October 18, 2016 LRSJ]; client registration required), Lux recently interviewed Dawn Brun, Senior Manager of Public Relations at Amazon, about its Alexa platform and its future direction. Dawn said that Alexa, like many other voice-based assistants, relies on four key components to drive its conversational interface: Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Dialogue Management, and Text-to-Speech (TTS):
- The first step to answering correctly is speech recognition – hearing correctly. ASR is how we “hear” the user’s speech and convert it to text that we can then process. This is the challenge we had to overcome for Amazon Echo and Alexa – how do you get the machine to understand you from a distance (i.e., in the far-field environment)?
- Second, we need to make sure we understand the user correctly. NLU helps us parse the user’s request into their true intent. This enables us to find the meaning behind the speech. NLU is a particularly interesting problem, as we want to clearly understand what you are saying. A human being is very good at disambiguating multiple responses, but with a voice interface you want to try to make the one right choice for them from the very beginning.
- Third, we need to decide how to respond to the user and take an action to address the request. We call this dialogue management. There’s also a personalization element here. We need to give the user the right response based on past behavior and preferences. So when a user asks to skip a song, we have to quickly deliver a new song that they will like.
- Finally, TTS – we convert text back to speech to respond to the customer’s request. And of course, the TTS needs to be very natural.
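The four-stage flow Dawn describes – ASR, NLU, dialogue management, and TTS – can be sketched as a simple pipeline. The sketch below is purely illustrative: the function names, the toy keyword-matching "NLU," and the stubbed audio handling are assumptions for demonstration, not Amazon's actual Alexa APIs.

```python
# Illustrative sketch of a four-stage voice-assistant pipeline:
# ASR -> NLU -> dialogue management -> TTS. All logic here is a stand-in.

def asr(audio: bytes) -> str:
    """Automatic Speech Recognition: convert captured audio to text.
    (Stubbed: we pretend the audio bytes decode directly to an utterance.)"""
    return audio.decode("utf-8")

def nlu(text: str) -> dict:
    """Natural Language Understanding: map the utterance to an intent.
    (Toy keyword matching in place of a real statistical model.)"""
    if "skip" in text.lower():
        return {"intent": "SkipSong"}
    return {"intent": "Unknown", "utterance": text}

def dialogue_manager(intent: dict, user_profile: dict) -> str:
    """Decide how to respond, personalizing from past behavior/preferences."""
    if intent["intent"] == "SkipSong":
        # e.g., queue a track the user is likely to enjoy
        return f"Playing {user_profile['favorite_artist']} next."
    return "Sorry, I didn't catch that."

def tts(text: str) -> bytes:
    """Text-to-Speech: convert the response text back to audio.
    (Stubbed: encoded bytes stand in for synthesized speech.)"""
    return text.encode("utf-8")

def handle_request(audio: bytes, user_profile: dict) -> bytes:
    """Run one user turn through all four stages."""
    return tts(dialogue_manager(nlu(asr(audio)), user_profile))

audio_out = handle_request(b"skip this song", {"favorite_artist": "Miles Davis"})
print(audio_out.decode("utf-8"))  # Playing Miles Davis next.
```

In a real system each stage would be a learned model running in the cloud; the value of the decomposition is that each stage can be improved independently while the interface between them (text in, intent out, and so on) stays fixed.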
When asked about the initial vision for Alexa’s implementation and its vision going forward, Dawn said, “We wanted to create a computer in the cloud that’s controlled entirely by your voice – you could ask it things, ask it to do things for you, find things for you, and it’s easy to converse with in a natural way. We’re always inventing and looking at ways to make customers’ lives easier. We believe voice is the most natural user interface and can really improve the way people interact with technology.”
Asking how Alexa compared to other voice-based assistants, such as Google Now, Microsoft’s Cortana, Apple’s Siri, or Facebook M, Dawn said, “Alexa is different than a voice assistant on a phone or tablet, which is designed to accompany a screen. Alexa was designed with the assumption that the user is not looking at a screen; therefore, the interactions become very different than with other voice assistants. Alexa isn’t a search engine giving you a list of choices on a screen; she’s making a decision on the best choice and delivering that back to the customer. We also leverage AWS, which is a huge advantage – things like huge processing power, Lambda, IoT.”