“Alexa, what are the stats?”: How the Amazon Echo is powered by data science

Author: Carlos Grajales

About five years ago, Amazon released the first version of Echo, an internet-connected speaker, unique for its proprietary always-on voice assistant [1]. At first glance, Echo looks like a regular Bluetooth speaker, but unlike its predecessors, it is constantly listening for spoken commands that its virtual assistant can react to: playing music, answering a query, activating an app or even interacting with other appliances in the house. More than 28,000 smart home devices, from televisions to microwaves, can now interact with Amazon’s assistant [2].

As is mandatory for the marketing department, the assistant has its own female name: Alexa. The name was chosen mostly for practical reasons: it is a phonetically distinctive word, containing the letter “x”, with a very specific sound. That makes it easier for algorithms to pick out, so the software can recognize and interpret it swiftly [3].

Inspired by science fiction systems, particularly those seen in Star Trek, Alexa came into the world with the aspiration of becoming a cloud-powered computer for our everyday life: a full-time, voice-powered assistant. Statisticians were apparently plentiful and well employed in the Star Trek universe because, to achieve such a dream, Amazon relies heavily on statistical techniques to improve customers’ experience with Alexa.

One of the unique features of Alexa is that it works solely with voice commands, an intuitive interface that is the most appealing part of the system. The technology behind such voice recognition might appear fairly new but, in truth, Natural Language Processing, the field of computer science concerned with machines interpreting and producing language, has been around since the 1950s, when the first attempts at automatic language translation took place [4]. Natural Language Processing, or NLP to its friends, has only become ubiquitous recently thanks to a deeper integration of statistical models into the field, particularly neural networks and other algorithms associated with machine learning.
Alexa’s groundbreaking abilities are almost entirely based on models [5] [6] [7] [8]. When I voice an order to Alexa, such as “Alexa, get me chicken wings with extra habanero”, the first thing that happens is that the seven microphones within the device identify where the sound is coming from, so the system can focus on sounds from that direction. Remember that the system is always listening, so every sound in the house is picked up, but Alexa must react only to commands meant for her. Once it identifies an interesting sound, the speaker applies echo cancellation and noise reduction to get a cleaner, more accurate recording of the command [7]. Every sound is analyzed, but this time the speaker detects that I used the word “Alexa”, one of the triggers that “wakes it up”. Detecting the “wake word” is the very first process that relies on neural nets [8]. Based on the sound of the speech and the phonics of the words used, the algorithm decides whether Alexa has been called upon. If the model detects that I activated the device, a dim light turns on at the top of the speaker and the magic begins.
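
To make the idea concrete, here is a minimal sketch of what a wake-word detector could look like: a small neural network that scores a short window of audio features and “wakes” the device when the score clears a threshold. Everything here (feature sizes, the random weights, the threshold) is invented for illustration; Amazon’s actual detector is far more sophisticated.

```python
import numpy as np

# Illustrative wake-word detector: a tiny feed-forward network scores a
# short window of audio features and fires when the score clears a
# threshold. The weights below are random stand-ins; a real detector
# would be trained on thousands of labelled utterances.

rng = np.random.default_rng(0)

# Pretend we extracted 13 spectral features per frame, 50 frames per window.
N_FEATURES, N_FRAMES, HIDDEN = 13, 50, 32
W1 = rng.normal(0, 0.1, (N_FEATURES * N_FRAMES, HIDDEN))
w2 = rng.normal(0, 0.1, HIDDEN)

def wake_word_score(window: np.ndarray) -> float:
    """Probability-like score that the window contains the wake word."""
    h = np.maximum(window.reshape(-1) @ W1, 0.0)   # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2)))         # sigmoid output

THRESHOLD = 0.85  # invented; trades off false wakes vs. missed wakes

window = rng.normal(size=(N_FRAMES, N_FEATURES))   # stand-in audio features
if wake_word_score(window) > THRESHOLD:
    print("Wake word detected: start streaming audio to the cloud")
```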

So far, my voice has been analyzed solely within the device, whose tiny computer runs the noise filtering and the wake word recognition model [7] [8]. But at this moment Alexa calls on something bigger, as the task is about to become more burdensome. The reason Alexa requires an internet connection is that it sends the recording of my command over the internet, directly to Amazon’s Alexa Voice Services, where the rest of the processing occurs. There, my recorded command is analyzed with more sophisticated Natural Language Processing models, also powered by probabilistic concepts. In the cloud, Amazon first translates my words into text by analyzing characteristics of my speech such as frequency and pitch. This algorithm generates a text that serves as input for a two-step model that will “understand” what I just said [7] [8].
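
The cited sources do not publish Amazon’s acoustic front end, but the standard first step of speech-to-text, slicing the recording into short overlapping frames and measuring the energy at each frequency, can be sketched as below. The frame sizes and sample rate are common defaults, not Amazon’s.

```python
import numpy as np

# Illustrative front end of cloud-side speech-to-text: cut the command
# into short overlapping frames and take the magnitude spectrum of each,
# yielding the frequency features an acoustic model consumes.

SAMPLE_RATE = 16_000          # 16 kHz is a common rate for speech audio
FRAME_LEN, HOP = 400, 160     # 25 ms frames, 10 ms hop at 16 kHz

def spectrogram(audio: np.ndarray) -> np.ndarray:
    frames = [
        audio[start:start + FRAME_LEN] * np.hanning(FRAME_LEN)
        for start in range(0, len(audio) - FRAME_LEN, HOP)
    ]
    # One row per frame; columns are energy at increasing frequencies.
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

one_second_of_audio = np.random.default_rng(1).normal(size=SAMPLE_RATE)
features = spectrogram(one_second_of_audio)
print(features.shape)  # (frames, frequency bins) fed to the acoustic model
```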

In the first step, a model analyzes the words in the text to detect their most likely sequence, based on the huge corpus of text that feeds the algorithm. This produces a “prior” estimate of meaning, of sorts. The prior is then fed to another neural network model, which not only uses this preliminary identification of the words but also relies on the audio and its characteristics, using the specifics of the recording (pitch, tone, pauses) to classify the command’s content [7] [8]. Besides audio and priors, Amazon’s model also uses some form of “context” to better understand my food order of wings. It identifies who said the command, in order to adapt the meaning to the user, but it also relies on information about the type of device the order came from. This means the exact same command can be interpreted differently when asked from an Echo speaker than when asked from a smart TV. In the end, all this information is processed to identify the key components of my command, allowing the system to understand what I need and how.
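
A toy sketch of that two-step idea, with every probability and rule invented for illustration: a language-model “prior” over candidate transcriptions, followed by a classifier that also weighs device context.

```python
# Illustrative two-step interpretation. Step 1: a language model scores
# candidate word sequences (the "prior"). Step 2: an intent classifier
# combines that prior with extra cues such as the device type. All
# numbers and rules below are made up for the example.

candidates = {
    "get me chicken wings with extra habanero": 0.72,  # LM prior
    "get me chicken wings with extra banana":   0.05,
}

def classify_intent(transcript: str, prior: float, device: str) -> tuple[str, float]:
    # Toy rules standing in for a neural classifier: food words plus a
    # kitchen-adjacent device make "order_food" more plausible.
    score = prior
    if "wings" in transcript:
        score += 0.15
    if device == "echo_speaker":   # same words, different device, different reading
        score += 0.05
    return ("order_food", min(score, 1.0))

best = max(candidates, key=candidates.get)
intent, confidence = classify_intent(best, candidates[best], device="echo_speaker")
print(intent, round(confidence, 2))
```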

In the cloud, Amazon now knows (ideally) that I want wings with extra habanero, but the process isn’t over yet. Alexa must still generate an appropriate audio response. This is also done in the cloud, using text-to-speech synthesis (TTS), a technology that converts sequences of words into natural-sounding, intelligible audio [7], again with modeling techniques that pair words with their most appropriate sounds. An audio recording is then prepared and sent to my device. I would therefore be greeted with an announcement that my wings are already on their way, though it is entirely plausible that one of the processes just described could not settle on a meaning with enough confidence, in which case Alexa prepares an audio response asking the user for more details; in this case, probably where the wings should be ordered from.
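
That fallback logic can be sketched as a simple confidence threshold: answer when the interpretation is confident enough, ask a clarifying question otherwise. The threshold and function names here are hypothetical.

```python
# Illustrative response step: if interpretation confidence is high,
# confirm the order; otherwise ask a clarifying question. The text is
# then handed to a text-to-speech system. The threshold is invented.

CONFIDENCE_THRESHOLD = 0.80

def respond(intent: str, confidence: float, slots: dict) -> str:
    if intent == "order_food" and confidence >= CONFIDENCE_THRESHOLD:
        return f"Your {slots['dish']} with {slots['extra']} are on their way."
    return "Sorry, where would you like to order the wings from?"

text = respond("order_food", 0.92,
               {"dish": "chicken wings", "extra": "extra habanero"})
# A TTS model would now convert `text` to audio and stream it to the speaker.
print(text)
```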

The whole process is optimized to be as fast and frictionless as possible for consumers, but it is easy to spot gaps that could produce less than optimal responses from the system. For this reason, the different models Alexa uses require constant updates. Curiously, this need to constantly update the statistical models behind Alexa is also what has raised the most concern among consumers.

For a statistical model, a bigger sample is a great way to improve accuracy, which is why speech recognition models are constantly updated with new data points. It is therefore important to keep feeding the algorithm with new interactions between users and Alexa, interactions which must first be labelled and catalogued [9]. Anyone with experience fitting models knows how important it is to ensure the data is appropriately labelled and classified in terms of the response variable. And when it comes to speech recognition, the gold standard is still a human being listening to and understanding what the user is asking for. For this purpose, Amazon hires thousands of employees around the globe whose sole job is to listen to user interactions with Alexa and judge whether the speech was correctly identified and understood and, if not, what the appropriate topic or response would have been [9] [10]. For instance, say I ask Alexa for “the freshest clams in town to entice my girlfriend with a home cooked dinner”. The model might use the words “dinner”, “clams” and “girlfriend” to suggest seafood restaurants nearby, even though that’s not what I’m looking for. A human listener can then realize the software made a mistake, catalogue it correctly, and ensure this scenario is included in the dataset that feeds the speech recognition model, reducing inaccuracy the next time a user wants to seduce a woman with seafood.
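
A sketch of what such an annotation record might look like, with field names invented for the example: the reviewer’s label is compared against the model’s prediction, and corrected examples rejoin the training data.

```python
# Illustrative annotation record: a human reviewer marks whether the
# system's interpretation was right and, if not, supplies the correct
# label so the example can be fed back into training. All names here
# are hypothetical, not Amazon's internal schema.

from dataclasses import dataclass

@dataclass
class ReviewedInteraction:
    transcript: str
    predicted_intent: str
    human_intent: str           # what the reviewer says it should have been

    @property
    def was_misunderstood(self) -> bool:
        return self.predicted_intent != self.human_intent

reviewed = ReviewedInteraction(
    transcript="the freshest clams in town to entice my girlfriend...",
    predicted_intent="find_restaurant",
    human_intent="find_grocery",
)

training_set = []
if reviewed.was_misunderstood:
    # Corrected examples are the most valuable ones to retrain on.
    training_set.append((reviewed.transcript, reviewed.human_intent))
print(training_set)
```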

Of course, when this was first exposed, privacy concerns arose: why are human beings within Amazon listening to my recordings? [9] [10] Even though updating the speech recognition software is crucial to ensure it works properly with every language and colloquialism, this certainly blurs the line between what is private and what is not. Amazon is not the only culprit of such invasion: other voice assistant manufacturers rely on similar procedures to constantly upgrade their models, meaning that using any assistant and agreeing to its terms of service implies you are OK with your voice being randomly selected as part of the sample sent every day to humans to listen to and reclassify. Other algorithms, such as image recognition models, rely on similar procedures, yet voice assistants are surely more sensitive to privacy concerns by their very nature. It is for this reason that Amazon is exploring new ways of updating its models, particularly the use of semi-supervised algorithms [5] [8].

The idea is that, instead of humans classifying the results of new interactions, another model decides whether an interaction went smoothly or not. Since it will find and point out errors, let’s call this other model the “scolding mother” model for clarity. As the “scolding mother” model might make mistakes as well, Amazon will take advantage of its probabilistic nature and feed its speech recognition algorithms only with interactions that did not go well according to the “scolding mother” model, i.e., interactions where the probability of failure is high. The speech recognition system is then retrained with data labelled by another model, not a human. In some contexts, Amazon is also trying other approaches, such as clustering algorithms for words [5]. These new approaches aim to reduce the need for human involvement in retraining models, not only because of privacy concerns, but also as part of the ongoing effort to reduce costs. Still, the success of such approaches is yet to be evaluated.
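
Here is a minimal sketch of that semi-supervised loop, with invented numbers and names: the “scolding mother” model assigns each interaction a probability of failure, and only the likely failures are machine-labelled and queued for retraining.

```python
# Illustrative semi-supervised loop: a second model (the "scolding
# mother") estimates each interaction's probability of failure; only
# high-probability failures are machine-labelled and fed back for
# retraining, so no human needs to listen. Everything here is invented.

FAILURE_THRESHOLD = 0.7

interactions = [
    {"transcript": "play some jazz",         "failure_prob": 0.05},
    {"transcript": "play despacito no wait", "failure_prob": 0.91},
]

def machine_label(interaction: dict) -> tuple[str, str]:
    # Stand-in for the second model proposing a corrected label.
    return (interaction["transcript"], "corrected_intent")

retraining_batch = [
    machine_label(i) for i in interactions
    if i["failure_prob"] > FAILURE_THRESHOLD   # keep only likely failures
]
print(retraining_batch)
```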

As a plan B, Amazon has another approach to improving its algorithms’ performance, this time by analyzing how users themselves “label” Alexa’s interactions. The system does not let the public “grade” how well it works, but certain user responses do signal a failing interaction. If Alexa provides an unsatisfying response to a command, the user might simply cut the response off and rephrase the request. If the customer then allows the new response to play out, it suggests the first request didn’t go quite so well, and that the final response should be associated with the first command [5].
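
A sketch of how such an implicit-feedback rule might work, with hypothetical fields and thresholds: a response the user cut off, followed quickly by a rephrase whose response was allowed to play out, marks the first interaction as a failure.

```python
# Illustrative implicit-feedback rule: detect the "barge in, then
# rephrase" pattern and pair the original command with the response
# that finally satisfied the user. Fields and the retry window are
# invented for this sketch.

def label_from_user_behaviour(first, second):
    barged_in = first["response_interrupted"]
    quick_retry = second["seconds_after_first"] < 30   # invented window
    accepted = not second["response_interrupted"]
    if barged_in and quick_retry and accepted:
        # The first command gets labelled with the response that worked.
        return (first["command"], second["response"])
    return None  # nothing learned from this pair

first = {"command": "play my dinner playlist", "response_interrupted": True}
second = {"command": "play the Italian dinner playlist",
          "response": "playing Italian dinner playlist",
          "seconds_after_first": 8, "response_interrupted": False}
print(label_from_user_behaviour(first, second))
```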

It is estimated that more than 100 million devices running Alexa have been sold to date [11]. This means that millions of people are using one of the most statistically driven platforms ever designed on a day-to-day basis. The success and popularity of the system is just more proof of the game-changing effect of adding statistics to product development. Voice assistants are simply the latest field in which statistics are changing the world. As seen on Star Trek.

REFERENCES:

[1] Etherington, Darrell. Amazon Echo Is A $199 Connected Speaker Packing An Always-On Siri-Style Assistant. TechCrunch Website (November, 2014)
https://techcrunch.com/2014/11/06/amazon-echo/
[2] Barrett, Brian. The year Alexa grew up. Wired Website (December, 2018)
https://www.wired.com/story/amazon-alexa-2018-machine-learning/
[3] The Exec Behind Amazon’s Alexa: Full Transcript of Fortune’s Interview. Fortune Website (July, 2016)
http://fortune.com/2016/07/14/amazon-alexa-david-limp-transcript/
[4] Natural Language Processing. Wikipedia, The Free Encyclopedia (April, 2019).
https://en.wikipedia.org/wiki/Natural_language_processing
[5] Sarikaya, Ruhi. How Alexa learns. Scientific American Blog (March, 2019)
https://blogs.scientificamerican.com/observations/how-alexa-learns/
[6] Marr, Bernard. Machine Learning in Practice: How Does Amazon’s Alexa Really Work? Forbes (October, 2018)
https://www.forbes.com/sites/bernardmarr/2018/10/05/how-does-amazons-alexa-really-work/#2fff93861937
[7] Gonfalonieri, Alexandre. How Amazon Alexa works? Your guide to Natural Language Processing (AI). Towards Data Science (November, 2018)
https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3
[8] How our scientists are making Alexa smarter. The Amazon Blog (March, 2018)
https://blog.aboutamazon.com/amazon-ai/how-our-scientists-are-making-alexa-smarter
[9] Day, Matt et al. Amazon Workers Are Listening to What You Tell Alexa. Bloomberg (April, 2019)
https://www.bloomberg.com/news/articles/2019-04-10/is-anyone-listening-to-you-on-alexa-a-global-team-reviews-audio
[10] Statt, Nick. Amazon’s Alexa isn’t just AI — thousands of humans are listening. The Verge website (April, 2019)
https://www.theverge.com/2019/4/10/18305378/amazon-alexa-ai-voice-assistant-annotation-listen-private-recordings
[11] Clifford, Catherine. Jeff Bezos: Amazon is still ‘small’—90% of US retail sales happen in brick and mortar stores. CNBC Website (April, 2019)
https://www.cnbc.com/2019/04/12/amazons-jeff-bezos-most-us-sales-still-in-brick-and-mortar-stores.html