The role statistics plays in speech recognition

Features

  • Author: Dr Catherine Breslin
  • Date: 23 Apr 2014

Speech recognition has been a research topic in universities and research labs since the 1950s, though commercial products have been slow to appear due to poor performance. In recent years, increases in computational power and storage, together with improvements to the underlying technology and analytical methodology, have given rise to large improvements in accuracy. The release of Siri in 2011 and Google Now in 2012 made voice interfaces widely available to smartphone users, and speech recognition is quickly moving into settings like cars, hospitals and offices around the world.

Speech recognition is a supervised machine learning problem. That means that, from a large corpus of transcribed audio, statistical models learn patterns in the audio that make up the sounds of speech. These models are then used to automatically transcribe new speech. There are two parts to any speech recognition system: the language model and the acoustic model.

The language model tells us which word sequences are common, and which are rare. For example, the phrase “How are you today?” is relatively common in English, while “Noise proves problematic for speech recognition” is far rarer. The language model assigns appropriate probabilities to these, and all other, sentences in the language by learning from a text corpus of millions, or even billions, of words.
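
To make this concrete, below is a minimal sketch of how such probabilities might be estimated: a bigram model counts how often each word follows another in a toy corpus and multiplies those conditional probabilities together. The tiny corpus and the absence of smoothing are simplifications for illustration; real language models are trained on millions or billions of words and use smoothing, back-off or neural networks to handle word sequences they have never seen.

    # Minimal bigram language model estimated from a toy corpus (illustration only).
    from collections import Counter

    corpus = [
        "how are you today",
        "how are you doing",
        "noise proves problematic for speech recognition",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                 # counts of each history word
        bigrams.update(zip(words[:-1], words[1:]))  # counts of adjacent word pairs

    def sentence_probability(sentence):
        """P(sentence) as a product of bigram probabilities P(word | previous word)."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        prob = 1.0
        for prev, word in zip(words[:-1], words[1:]):
            if unigrams[prev] == 0:
                return 0.0                          # unseen history word
            prob *= bigrams[(prev, word)] / unigrams[prev]
        return prob

    print(sentence_probability("how are you today"))  # relatively likely: 0.33...
    print(sentence_probability("you how today are"))  # 0.0 without smoothing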

The purpose of the acoustic model is to model the sounds we make when we speak. For a small vocabulary, say just the digits zero to nine, it’s possible to model the acoustics of individual words. As the vocabulary size grows, however, it becomes impossible to record enough spoken examples of all words, and so we need to model acoustics at a lower granularity.

Our speech is made up of phonemes, which are individual sounds, like the ‘k’ sound at the beginning of ‘cat’, or the ‘oo’ sound in the middle of ‘book’. A dictionary maps words in the language to their phonetic pronunciation, and enables us to model acoustics at the phoneme level. There are approximately 50 phonemes in the English language, compared to hundreds of thousands of words, so it is far more feasible to collect enough spoken examples of 50 phonemes than of every word.
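
As an illustration, a pronunciation dictionary can be represented as a simple mapping from words to phoneme sequences. The entries and phoneme symbols below are informal stand-ins: real systems use a standard phoneme set (such as ARPAbet for English) and often list several alternative pronunciations for a word.

    # Toy pronunciation dictionary; phoneme symbols are informal stand-ins.
    pronunciations = {
        "cat":   ["k", "ah", "t"],
        "cot":   ["k", "oh", "t"],
        "book":  ["b", "oo", "k"],
        "shoot": ["sh", "oo", "t"],
    }

    def to_phonemes(words):
        """Expand a word sequence into the phoneme sequence the acoustic model scores."""
        return [p for w in words for p in pronunciations[w]]

    print(to_phonemes(["book", "cat"]))  # ['b', 'oo', 'k', 'k', 'ah', 't']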

To build an acoustic model, the audio is first converted from a continuous sound wave to a representation that has one vector for each 25ms segment of audio. Each vector encodes the mix of frequencies present in that short fragment of sound. The acoustic model then learns, for sequences of these vectors, the probability that a particular sequence of phonemes was spoken. For a recording of the word ‘cat’, we’d expect to see the acoustic model assign a high probability to the phoneme sequence ‘k ah t’, and a low probability to the sequence ‘b oo k’. We’d probably also see a reasonably high probability for the phoneme sequence ‘k oh t’, as the words ‘cat’ and ‘cot’ sound very similar when spoken out loud.
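
The sketch below shows one plausible version of this front-end step: slicing the waveform into 25 ms frames and computing the frequency content of each. The 10 ms hop between frames and the use of a raw magnitude spectrum are assumptions made for illustration; production front ends typically add a window function and mel-filterbank or MFCC processing on top.

    # Turn a waveform into one frequency vector per 25 ms frame (simplified sketch).
    import numpy as np

    def frame_features(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
        frame_len = int(sample_rate * frame_ms / 1000)  # samples per 25 ms frame
        hop_len = int(sample_rate * hop_ms / 1000)      # step between frame starts
        vectors = []
        for start in range(0, len(waveform) - frame_len + 1, hop_len):
            frame = waveform[start:start + frame_len]
            spectrum = np.abs(np.fft.rfft(frame))       # mix of frequencies in this frame
            vectors.append(spectrum)
        return np.array(vectors)

    audio = np.random.randn(16000)        # one second of noise standing in for speech
    print(frame_features(audio).shape)    # (98, 201): 98 frames x 201 frequency bins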

In practice, rather than model individual phonemes, we model triphones: a phoneme in the context of its previous and following phonemes. The model for ‘oo’ in the middle of ‘book’ is different from the model for the same ‘oo’ in the middle of ‘shoot’. The physical restrictions on moving your mouth and tongue to form speech mean that the actual sounds spoken are highly dependent on the phonemes that come before and after, and our models improve by taking this into account.
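
A small sketch of the bookkeeping involved: expanding a phoneme sequence into triphone labels, each recording the centre phoneme together with its left and right neighbours. The left-centre+right notation and the silence padding at the boundaries are illustrative conventions rather than a fixed standard.

    # Expand a phoneme sequence into context-dependent triphone labels.
    def to_triphones(phonemes, boundary="sil"):
        padded = [boundary] + phonemes + [boundary]
        return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
                for i in range(1, len(padded) - 1)]

    print(to_triphones(["b", "oo", "k"]))   # ['sil-b+oo', 'b-oo+k', 'oo-k+sil']
    print(to_triphones(["sh", "oo", "t"]))  # ['sil-sh+oo', 'sh-oo+t', 'oo-t+sil']
    # The two 'oo' models differ: 'b-oo+k' versus 'sh-oo+t'.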

Putting both acoustic and language model together in a probabilistic framework allows us to search for the most likely words spoken when we receive some new audio. Essentially, the speech recognition system has to search over all possible word sequences to determine which sentence is the most probable. This is very computationally intensive, and so the search space has to be pruned intelligently. There’s a trade-off between speed and accuracy that can be changed depending on the application. For a digital personal assistant, recognition should be fast so that the system can respond to the user in real time. For an offline task like automatic subtitling of recorded lectures, recognition can take longer as it’s not crucial to have an immediate transcription.
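
Formally, the recogniser looks for the word sequence W that maximises P(W | audio), which by Bayes’ rule is proportional to P(audio | W) × P(W), with the acoustic model supplying P(audio | W) and the language model supplying P(W). The toy sketch below scores a handful of made-up candidate transcriptions in this way and picks the best; the hypotheses, log-probabilities and language model scale are invented for illustration, and a real decoder searches a vast, heavily pruned space of word sequences rather than a fixed list.

    # Combine acoustic and language model scores to choose the best transcription.
    import math

    acoustic_logprob = {
        # log P(audio | words): invented values for illustration
        "how are you today": -120.0,
        "how are ewe today": -118.5,
        "noise proves problematic": -300.0,
    }

    language_logprob = {
        # log P(words): invented values for illustration
        "how are you today": math.log(1e-6),
        "how are ewe today": math.log(1e-11),
        "noise proves problematic": math.log(1e-9),
    }

    lm_scale = 10.0  # how heavily the language model is weighted against the acoustics

    best = max(acoustic_logprob,
               key=lambda w: acoustic_logprob[w] + lm_scale * language_logprob[w])
    print(best)  # 'how are you today': the language model outweighs the small acoustic gap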

A big challenge for speech recognition is recognizing speech from different acoustic conditions: from people with different accents, in varying types of background noise, and recorded with different microphones. There are ways to quickly adapt acoustic models to new conditions, but the most effective route to good general performance is to collect a varied corpus of speech from many different conditions and use it to train the acoustic models.

The underlying speech recognition technology is the same no matter the language spoken, but languages have different characteristics that must be taken into account. For example, written Arabic usually omits short vowels, Mandarin is a tonal language in which the pitch of a syllable changes its meaning, and German has a large number of compound words that inflate the vocabulary size. Hence there’s often some language-specific work to be done when moving from English to another language.

Speech recognition performance has improved a lot in the past few years, and there are several reasons for this. The move towards cloud computing has allowed companies like Google, Microsoft and Apple to do their speech processing in the cloud, so that large amounts of computational power can be used without individual users having to buy expensive machines. As speech recognition systems have become more commonplace, the data collected from real users has been very valuable for improving the models. Finally, improved modeling techniques, like the use of deep learning, have had a big impact on speech recognition accuracy.

Performance will continue to improve in the next few years as the amount of data and available computational power increase further, and even better modeling techniques are devised. The next challenge will be to improve voice interfaces so they are more natural to use, including better natural language understanding, more expressive responses from the computer, and better management of human-computer dialogue.
