With the rise of virtual personal assistants like Siri and Cortana, the way they “just work” has created a perception that this type of technology should be ubiquitous in all voice applications. It’s a bit like how crime procedural TV shows have led juries to expect “CSI-like” evidence at trials.
Let’s face it, sometimes wants and expectations exceed feasibility. But it’s not enough to just dismiss an idea without explanation.
With that in mind, let’s address the elephant in the room: WHY is natural language processing (NLP) so difficult to implement over a standard voice line?
To begin answering that question we need to take a look at how automatic speech recognition (ASR) functions and compare it to NLP. Now it’s worth bearing in mind that ASR is a rather broad heading that includes NLP. In other words, all NLP is ASR, but not all ASR is NLP. How’s that for alphabet soup?
We’ll cover how ASR and NLP work, and then move on to discuss the financial and technological constraints that NLP faces.
How Traditional ASR Works
When you interact with a standard voice application over the phone, audio data is sent to a computer. The computer turns that waveform into digital information. Then the frequencies derived from that audio are matched to phonemes. Grammars are compiled into trigrams that reflect larger phonetic sounds that can be matched, e.g. tio, nde, sth. The computer compares the audio phonemes to the trigrams to determine what was said. Or, at the very least, it creates a list of what it thinks, statistically, was said, along with some possible alternatives.
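To make that matching step a little more concrete, here is a deliberately simplified Python sketch. The phoneme strings, the tiny grammar, and the overlap scoring are toy assumptions made up for illustration, not how a production ASR engine scores audio, but they show the shape of the idea: compare what was heard against trigrams compiled from a grammar and rank the candidates.

```python
# A toy illustration of matching recognized phonemes against a grammar's
# trigrams. Everything here (phoneme spellings, phrases, scoring) is a
# simplified assumption, not a production ASR algorithm.

def trigrams(phonemes: str) -> set:
    """Split a phoneme-like string into overlapping three-character chunks."""
    return {phonemes[i:i + 3] for i in range(len(phonemes) - 2)}

# A tiny "grammar": phrases a caller might say, compiled into trigrams.
GRAMMAR = {
    "billing": trigrams("bihlihng"),
    "sales": trigrams("seylz"),
    "operator": trigrams("ahpereytter"),
}

def rank_hypotheses(heard: str) -> list:
    """Score each grammar entry by trigram overlap with what was heard,
    returning a best guess plus alternatives (an 'n-best' list)."""
    heard_tris = trigrams(heard)
    scores = [
        (phrase, round(len(heard_tris & tris) / max(len(tris), 1), 2))
        for phrase, tris in GRAMMAR.items()
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Noisy audio might come back as a slightly mangled phoneme string.
print(rank_hypotheses("bihlehng"))  # 'billing' should still score highest
```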
Armed with that candidate list, the recognizer matches it against a grammar. All interactive voice response (IVR) applications have grammars that define the words the ASR can recognize. Once the information is matched against the appropriate grammar, the application generates a list of potential matches, and the IVR takes the appropriate action depending on whether the result is valid according to the program’s code.
Building a grammar for an IVR application isn’t a trivial matter, which is why we provide them to our customers. For example, a grammar that recognizes street numbers has to handle all the different ways someone might say the number 411.
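As a rough illustration of why even a “simple” grammar takes work, here is a hypothetical Python fragment covering just a handful of ways to say one street number. The phrase list is an assumption for illustration; a production grammar (often written in a format like SRGS) covers every number and far more phrasings.

```python
# A hypothetical slice of a street-number grammar: spoken variants that
# should all resolve to the same canonical value. A real grammar covers
# every number, plus digits spoken with pauses, corrections, and so on.

STREET_NUMBER_VARIANTS = {
    "four one one": 411,
    "four eleven": 411,
    "four hundred eleven": 411,
    "four hundred and eleven": 411,
}

def normalize(utterance: str):
    """Map a recognized phrase to its canonical street number, if in grammar."""
    return STREET_NUMBER_VARIANTS.get(utterance.lower().strip())

print(normalize("Four Hundred Eleven"))  # -> 411
print(normalize("four twelve"))          # -> None (out of grammar)
```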
Basically, what the ASR is doing here is trying to match what a user says with an IVR’s built-in grammar.
Therein lies the primary difference between ASR and NLP. NLP software attempts to understand not just what was said, but what the speaker intended.
How NLP Works
Think about how you would ask an application like Siri, which is essentially a natural language IVR, to remind you about an appointment on Wednesday at 3pm. There are thousands of different ways someone could voice this request. Some pieces of the request are more static than others, such as the day of the week or the time of day. But the words and phrases that surround those pieces of information can be extremely dynamic. This means that deducing a single intention, an appointment reminder, requires a huge amount of information.
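To see why, here is a small Python sketch. Pulling out the “static” slots (the day and the time) is easy with a couple of patterns; it’s everything around them, the carrier phrases and the intent itself, that varies endlessly. The regular expressions and example phrasings are assumptions for illustration only.

```python
import re

# Toy slot extraction: the fixed pieces (day, time) can be found with
# simple patterns, but nothing here captures the speaker's intent --
# that is where the real NLP effort (and data) goes.

DAY = r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
TIME = r"(\d{1,2}(?::\d{2})?\s*(?:am|pm))"

def extract_slots(utterance: str) -> dict:
    text = utterance.lower()
    day = re.search(DAY, text)
    time = re.search(TIME, text)
    return {
        "day": day.group(1) if day else None,
        "time": time.group(1) if time else None,
    }

# Very different phrasings, same slots -- but is it a reminder, a
# cancellation, or a question? The slots alone can't tell you.
print(extract_slots("Remind me about my appointment Wednesday at 3pm"))
print(extract_slots("Hey, don't let me forget that 3pm thing on Wednesday"))
```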
All of those thousands of phonetic possibilities are converted into trigrams, and the trigrams that are most likely to be associated, statistically speaking, with that intention are grouped together. The more an individual uses NLP, the more accurately the software can refine and predict that person’s intent for that specific, well… intention.
For example, there’s a band called Camera Obscura. If I tell Siri to “play Camera Obscura,” it tends to place the emphasis of the intent on “camera” and opens the camera application. That makes sense for most people, though. It then falls to the user to clarify the intent of the request. In this case, “play music by Camera Obscura” generates the desired outcome.
NLP software essentially creates a series of bins where it stores all the different phonetic pieces that can then be grouped together to make more coherent intentions. A single intention will have a huge phrase list. Replicate this process for every possible intention and we’re talking about a lot of data.
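Here is a loose sketch of that “bins” idea, with made-up intents and deliberately tiny phrase lists; a real system holds enormous, statistically weighted phrase inventories per intent, which is exactly where the data problem comes from.

```python
# Each intent 'bin' owns a list of phrase fragments that statistically
# point to it. The intents and fragments below are placeholders; a real
# deployment would hold vastly larger, weighted lists per intent.

INTENT_BINS = {
    "set_reminder": ["remind me", "don't let me forget", "make a note to"],
    "play_music": ["play", "put on", "listen to"],
    "get_weather": ["weather", "is it going to rain", "how hot is it"],
}

def guess_intent(utterance: str):
    """Pick the intent whose bin shares the most fragments with the input."""
    text = utterance.lower()
    best, best_hits = None, 0
    for intent, fragments in INTENT_BINS.items():
        hits = sum(1 for frag in fragments if frag in text)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best

print(guess_intent("Don't let me forget the dentist on Wednesday"))
# -> 'set_reminder'
```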
How Much NLP Costs
Now if you’re thinking about how much it would cost to compile, analyze, and maintain such a large corpus of data, then the numbers on your mental cash register should be spinning at hyper speed.
There’s a reason that NLP isn’t more common. It takes a lot of money to build out the technology for a reliable system. We’re talking several million dollars to get it up and running and another $1–2M per year to continually fine-tune that database once it is built.
Remember that Apple is thought to have paid around $200M for the Siri technology. Even before that sale, Siri, Inc., the startup that spun out of SRI International (formerly the Stanford Research Institute), where the technology was first developed, raised $24M in funding for it. Pocket change this is not. When you consider that the Siri team at Apple remains one of the largest at the company, the cost of developing NLP keeps going up.
This is why “Big Data” companies with deep pockets and seemingly infinite resources, like Apple and Google, can afford to operate NLP technology. It’s also worth noting that the companies that develop NLP typically do only NLP. Because they already have a large corpus of data at their disposal, a good chunk of the work is already done, and re-using components across projects makes each new effort much easier. The focus becomes refining and updating the product.
With such prohibitive ramp-up costs, most companies can’t realistically implement their own NLP solution. Most companies don’t have the necessary type and volume of data to bring to the table, either. For technology built on determining intent, it takes a lot of resources to get there.
Limitations of the PSTN
As if the financial burden wasn’t significant enough, there are technological limitations with the PSTN that make natural language processing technology virtually impossible to implement over a standard telephone connection.
With speech recognition, the quality of the audio matters. It really matters. This stands to reason: if you’re taking pictures, the more capable your camera, the better your pictures will be. A high-resolution camera with a high frame rate will yield better results than a standard point-and-shoot.
The same goes for speech recognition. The bare minimum for satisfactory results is true 16-bit, 16kHz audio, but the better the audio quality, the better speech recognition will work. To put this into context, professional studio recordings are made at 24-bit, 96kHz, and CD-quality audio is 16-bit, 44.1kHz.
The quality of the audio transmitted through the PSTN, however, is far lower: typically 8-bit, 8kHz compressed audio. The compression increases the capacity of the PSTN, but it doesn’t do speech recognition any favors. Compressed audio loses a ton of data. It also cuts out background noise, which may sound like a benefit, but for speech recognition background noise is important for noise cancellation.
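To put rough numbers on that gap, here’s a quick back-of-the-envelope comparison of raw, uncompressed mono data rates using the figures above; any further compression the carrier applies only widens the difference for the PSTN case.

```python
# Back-of-the-envelope: raw data rate for one uncompressed mono channel.

def raw_kbps(bit_depth: int, sample_rate_hz: int) -> float:
    """Kilobits per second = bit depth x samples per second / 1000."""
    return bit_depth * sample_rate_hz / 1000

print(raw_kbps(16, 16_000))  # 256.0 kbps -- the practical floor for good ASR
print(raw_kbps(8, 8_000))    #  64.0 kbps -- a PSTN call, before compression
```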
A virtual assistant like Siri doesn’t send raw audio over a network. The device captures the audio locally at high quality, with all the benefits of noise cancellation, too. The audio-to-phoneme processing happens locally on the device as well; it’s just the digital phoneme data, the 1s and 0s, that is sent back to the server for processing.
The limitations of the PSTN eliminate the feasibility of achieving true transcription from a typical handset, and open-ended transcription is particularly complicated. Many transcription providers simply leave errors in the transcription, like what you may see in your visual voicemail. Others use a hybrid approach whereby machines do the bulk of the transcription and humans correct the errors.
Choosing the Right Solution
The ease with which virtual assistants function these days makes natural language processing alluring to all kinds of companies. But understanding the financial implications of developing or incorporating that type of technology is vitally important. And even the deepest pockets can’t overcome the PSTN’s inability to deliver the audio quality NLP requires.
Fortunately, all is not lost for those concerned with providing stellar customer service. There are plenty of ways to create great voice applications that use ASR and defined grammars.