Automated speech recognition (ASR) has to effectively distinguish spoken words from background noise in a real-world (i.e., noisy) environment.
In an ideal listening environment (i.e., no background noise) such as a lab where a technician is training a new system, speech recognition systems can easily identify and convert spoken words into strings of text that computers can use.
In a real-world environment, the ASR system receives speech mixed with noise rather than speech alone. A speech recognition engine has to distinguish between words and noise (wind, other speakers, cars, and so on) before it can even begin identifying the words.
The disparity between lab and real-world conditions is a major challenge in optimizing ASR technology. Speech recognition engines approach this problem in a few different ways:
- Seeding the engine with non-ideal listening environments during the training process;
- Tuning the speech engine for specific environments to minimize the impact of noise on reliability in those environments;
- Reducing noise in the audio stream before it’s converted to text.
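
As a rough illustration of the first approach in the list above, one common way to expose an engine to non-ideal conditions during training is to mix clean recordings with background noise at a chosen signal-to-noise ratio (SNR). The sketch below is a minimal example under stated assumptions, not any particular engine's method: the function names `rms` and `mix_at_snr` are hypothetical, audio is represented as plain lists of float samples, and a sine tone and Gaussian noise stand in for real speech and real background recordings.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the target SNR in dB.

    Hypothetical helper for illustration; not taken from any ASR toolkit.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0 or noise_rms == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR(dB) = 20 * log10(speech_rms / scaled_noise_rms), solved for the noise level.
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Stand-in signals (assumptions): a 440 Hz tone for "speech", Gaussian noise for "background".
rate = 16000
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]

# One second of "speech" with noise added at 10 dB SNR.
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Lower `snr_db` values give noisier training examples. In practice the noise would come from recordings of the environments the engine is expected to work in, such as wind, traffic, or other speakers.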