What Is Speech Recognition?

Automatic Speech Recognition (ASR) software converts spoken commands and utterances into digital data that a computer can process as input. Across a range of applications, speech recognition lets users navigate a voice-user interface or interact with a computer system through spoken directives.

How It Works

Given the sheer number of words in every language and the variation in pronunciation from region to region, ASR software faces a very difficult task in trying to understand us.

The software must first convert the analog voice signal into a digital format. It then has to distinguish between words and between the sounds within words, which it does using phonemes, the smallest units of sound in a language (e.g., the word “the” splits into “th” and “uh”).
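The analog-to-digital step can be illustrated with a minimal sketch. It assumes a 16 kHz sample rate and 16-bit samples, which are common choices for speech but are not specified above; the function names digitize and tone are hypothetical, and a synthetic 440 Hz tone stands in for a real voice signal.

```python
import math

def digitize(signal, duration_s, sample_rate_hz=16000, bit_depth=16):
    """Sample a continuous signal and quantize each sample to a signed integer.

    `signal` is a function of time (in seconds) that returns an amplitude
    in the range [-1.0, 1.0]. The defaults are assumptions for illustration.
    """
    max_level = 2 ** (bit_depth - 1) - 1          # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                     # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t))) # clip to the valid range
        samples.append(round(amplitude * max_level))
    return samples

def tone(t):
    """Stand-in for a voice signal: a 440 Hz tone at half amplitude."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)

# 10 milliseconds of "audio" becomes 160 integer samples at 16 kHz.
pcm = digitize(tone, duration_s=0.01)
print(len(pcm), pcm[:5])
```

Everything downstream, including the search for phonemes, operates on a list of numbers like this one rather than on the sound wave itself.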

ASR compares each phoneme with the phonemes around it and also analyzes the preceding and following words for context. The software then uses statistical models, such as Hidden Markov Models, to find the most likely word.
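To give a feel for how a Hidden Markov Model picks the likely sequence, here is a toy sketch of the Viterbi algorithm, a standard way to find the most probable hidden-state path in an HMM. The two phoneme states, the “hiss” and “buzz” observation labels and all probabilities are invented for illustration; a real system works with far more states, with numeric acoustic features and usually with log probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Start from the best final state and follow the back-pointers.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path

# Hypothetical model: two phonemes and two coarse acoustic observations.
states = ("th", "uh")
start_p = {"th": 0.8, "uh": 0.2}
trans_p = {"th": {"th": 0.5, "uh": 0.5}, "uh": {"th": 0.1, "uh": 0.9}}
emit_p = {"th": {"hiss": 0.7, "buzz": 0.3}, "uh": {"hiss": 0.1, "buzz": 0.9}}

prob, path = viterbi(["hiss", "hiss", "buzz", "buzz"], states, start_p, trans_p, emit_p)
print(path)  # ['th', 'th', 'uh', 'uh']
```

The context effect shows up in the transition probabilities: a frame is labeled not only by how it sounds but also by which phoneme is likely to follow the previous one.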

Some speech recognition systems are speaker-dependent, meaning they require a training period to adjust to specific users’ voices for optimum performance. Other systems are speaker-independent, meaning they work without a training period and for any user.

ASR systems typically incorporate noise reduction to separate actual speech from background noise.
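As a rough illustration of separating speech from background noise, the sketch below uses a simple energy gate: it estimates a noise floor from the first few frames and flags later frames as speech only when they are clearly louder. This is a deliberately simplified stand-in, not the method any particular ASR product uses. The function names, the frame size, the threshold factor and the assumption that the recording opens with noise only are all hypothetical.

```python
def frame_energies(samples, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def speech_frames(samples, frame_size=4, noise_frames=2, factor=4.0):
    """Flag each frame as speech (True) or background noise (False).

    Assumes the first `noise_frames` frames contain background noise only.
    """
    energies = frame_energies(samples, frame_size)
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    return [energy > factor * noise_floor for energy in energies]

# Quiet background, a louder burst standing in for speech, then quiet again.
samples = [1, -1, 1, -1,  -1, 1, -1, 1,  8, -9, 10, -8,  9, -10, 8, -9,  1, -1, 1, -1]
print(speech_frames(samples))  # [False, False, True, True, False]
```

Frames flagged False could be dropped or attenuated before recognition, so that the recognizer spends its effort on the frames most likely to contain speech.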

Speech Recognition vs. Voice Biometrics

While speech recognition software identifies what a speaker is saying, voice biometrics software identifies who’s speaking.

ASR Systems vs. Speech Recognition Engines

A speech recognition engine is one component of a larger speech recognition system, which combines a speech recognition engine, a text-to-speech engine and a dialog manager. The engine itself has several components: a language model or grammar, an acoustic model and a decoder.
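The way the three engine components fit together can be sketched in a few lines. In this toy example the acoustic model scores how well each candidate word matches a stretch of audio, the language model scores how plausible each word is after the previous one, and the decoder searches for the word sequence with the best combined score. The tables, the probabilities and the decode function are all hypothetical, and the exhaustive search is only workable at this tiny scale; real decoders use far more efficient search.

```python
import math
from itertools import product

# Acoustic model (hypothetical): for each audio segment, P(audio | word).
ACOUSTIC = [
    {"recognize": 0.6, "wreck": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Language model (hypothetical bigram): P(word | previous word).
# "<s>" marks the start of the utterance.
BIGRAM = {
    ("<s>", "recognize"): 0.6, ("<s>", "wreck"): 0.4,
    ("recognize", "speech"): 0.9, ("recognize", "beach"): 0.1,
    ("wreck", "speech"): 0.2, ("wreck", "beach"): 0.8,
}

def decode(acoustic, bigram, lm_weight=1.0):
    """Decoder: try every word sequence and keep the best-scoring one."""
    best_score, best_words = -math.inf, None
    for words in product(*(segment.keys() for segment in acoustic)):
        score, prev = 0.0, "<s>"
        for segment, word in zip(acoustic, words):
            # Log probabilities are added instead of multiplying probabilities.
            score += math.log(segment[word])
            score += lm_weight * math.log(bigram[(prev, word)])
            prev = word
        if score > best_score:
            best_score, best_words = score, list(words)
    return best_words, best_score

words, score = decode(ACOUSTIC, BIGRAM)
print(words)  # ['recognize', 'speech']
```

Note that the acoustic model alone cannot separate “speech” from “beach” here, since both score 0.5; it is the language model that tips the decision.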

Speech Recognition Applications

Most visibly, ASR is a key technology in the latest mobile devices with personal assistants (such as Siri) and in interactive voice response (IVR) systems, which often couple ASR with speech synthesis.

Uses include data entry (such as giving a password to an IVR system), voice dialing or texting, speech-to-text (dictation), device control (such as home appliances) and direct voice input (such as voice commands in aviation).

History

ASR has been around since the 1960s and has seen steady, incremental improvement over the years. It has benefited greatly from increases in computer processing speed, and it entered the marketplace in the mid-2000s.

Early systems were based on acoustic phonetics and worked with small vocabularies to identify isolated words. Over the years, vocabularies have grown and ASR systems have become statistics-based, relying on techniques such as Hidden Markov Models. They now handle large vocabularies and can recognize continuous speech.