It happens all the time with technology: two different processes that share some similarities get conflated. People say one thing and really mean the other, and vice versa. One place we see this in audio-related processing is with automatic speech recognition (ASR) and transcription.
Both terms suffer from a definition problem. In voice applications and IVR, transcription and ASR serve very different purposes. To create the best possible end-user experience, it’s important to understand what each one is and how they differ.
Automatic Speech Recognition
Several characteristics set ASR apart from transcription. First, ASR makes speech a valid type of data input. That means what an end-user says directly influences the call-flow and redirects the caller based on what they’ve said. In other words, when it comes to IVR, ASR propels the call-flow forward in some capacity.
Another differentiator for ASR is that it’s programmable and based on keywords or expected responses. This concept of keywords is critical because it narrows the set of answers the ASR engine has to consider, which makes processing speech faster and easier.
Speech recognition engines use grammars, which are basically a collection of possible responses, to control speech input. You can program ASR to recognize the answer to a yes/no question, but in order to be safe and mitigate errors it’s necessary to account for all the various ways that people may say “yes” or “no.”
The grammar, therefore, includes words like “yes”, “yeah”, and “yup” to cover all the bases. If an end-user says something that isn’t in the grammar, the application returns an error.
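To make the idea of a grammar concrete, here is a minimal Python sketch of a yes/no grammar steering a call-flow. It is an illustration only, not how any particular speech engine is configured: the names `YES_NO_GRAMMAR`, `match_grammar`, `next_step`, and the step names are all hypothetical, the “no” variants are our own additions, and the sketch assumes the engine has already turned the caller’s speech into text.

```python
# Hypothetical grammar: each accepted utterance maps to a canonical answer.
YES_NO_GRAMMAR = {
    "yes": "yes", "yeah": "yes", "yup": "yes",
    "no": "no", "nope": "no", "nah": "no",
}


class NoMatchError(Exception):
    """Raised when the caller says something the grammar does not cover."""


def match_grammar(utterance, grammar):
    """Return the canonical answer for an utterance, or raise NoMatchError."""
    # Normalize case, surrounding whitespace, and trailing punctuation.
    key = utterance.strip().lower().rstrip(".!?")
    if key not in grammar:
        raise NoMatchError(f"utterance not in grammar: {utterance!r}")
    return grammar[key]


def next_step(utterance):
    """Pick the next call-flow step (step names are made up for this sketch)."""
    answer = match_grammar(utterance, YES_NO_GRAMMAR)
    return "confirm_order" if answer == "yes" else "main_menu"
```

Calling `next_step("Yup")` sends the caller down the “yes” branch, while an utterance such as “maybe” raises `NoMatchError`, which is the point where a real application would typically re-prompt the caller.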
ASR is commonly used when you need to capture alphanumeric data, like a person’s name or address. It is also useful for hands-free calling. So, if a lot of your end-users tend to call while on the go, it might make sense to look into enabling ASR.
Transcription
Transcription, on the other hand, works more like leaving a voicemail message. There are no predetermined grammars, keywords, or expected responses. What a caller says does not direct the call-flow in any way. A transcription application simply records an audio file and then attempts to convert the speech in the recording into written text.
Here at Plum Voice, we use Nuance technology for both ASR and transcription. When it comes to the latter, the accuracy of the transcription depends on the quality of the recording. Recordings made over a poor phone connection or with a lot of background noise are harder for the speech engine to transcribe accurately.
Transcriptions tend to fall into one of three categories, based on the engine’s confidence in processing the audio file: high, medium, or low. A clear, well-recorded file tends to fall into the high confidence bucket, while a recording with poor sound quality tends to fall into the low confidence bucket. Medium confidence is somewhere in between.
You can choose to have the computer transcribe audio recordings, you can have a human transcribe them, or you can use a hybrid approach. If the speech engine can’t decipher the audio of a low confidence recording, the application can either throw an error or send the message to a human to transcribe. While using human transcription tends to be more accurate, it also takes longer and is more expensive.
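The hybrid approach can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a description of how Nuance or any other engine reports results: the 0.0 to 1.0 confidence scale, the cutoff values, and the names `Transcript`, `bucket`, and `route` are all hypothetical.

```python
from dataclasses import dataclass

# Assumed cutoffs for illustration only; a real engine defines its own scale.
HIGH_CUTOFF = 0.8
LOW_CUTOFF = 0.5


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed to range from 0.0 to 1.0


def bucket(confidence):
    """Sort a confidence score into the high, medium, or low bucket."""
    if confidence >= HIGH_CUTOFF:
        return "high"
    if confidence >= LOW_CUTOFF:
        return "medium"
    return "low"


def route(transcript, human_fallback=True):
    """Decide what happens to a machine transcription."""
    if bucket(transcript.confidence) != "low":
        return "accept_machine_text"
    if human_fallback:
        # Hybrid approach: a person transcribes what the engine could not.
        return "send_to_human"
    raise ValueError("low-confidence transcription and no human fallback")
```

With these assumed cutoffs, a transcript scored at 0.9 is accepted as is, while one scored at 0.2 goes to a human transcriber, or raises an error if `human_fallback` is turned off. Where the cutoffs sit is a cost and accuracy trade-off, since every recording sent to a human takes longer and costs more.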
Transcription is typically used for open-ended questions, such as those in surveys or customer feedback programs, e.g. voice of the customer programs.
Clearly, both technologies have a place in modern business communications. It’s up to you to decide whether you want callers to be able to use speech to navigate through your IVR or if you want to transcribe customer feedback. But understanding the difference between ASR and transcription helps you to make an informed decision about what you want your IVR to accomplish and how that’s done.