Improving Speech Interface Implementations
By N.D. Ludlow
April 26, 2017
Do you remember the HAL 9000 computer saying “I’m sorry, Dave. I’m afraid I can’t do that.” in the movie 2001: A Space Odyssey? Speech processing and Natural Language Understanding (NLU) have progressed over the last couple of decades into ubiquitous use today in nearly all smartphones. But how can we improve upon this technology’s use in our business applications?
Despite the progress of Artificial Intelligence (AI), one of the biggest advances for speech processing is due to increased computing power. When I visited speech processing research labs in the early 1990s, each speaker was matched to a different listener model: one for a male speaker, one for a female, one for British English, and another for American English, not to mention all the others for different languages and accents. These listener models used AI methods such as Hidden Markov Models, Dynamic Time Warping, and neural network classification. Speakers had to select the appropriate listener model and were sometimes even asked to speak standardized training sentences.
Today, we still use a similar approach, but listener model selection and training happen behind the scenes, hidden from the speaker. The speech system activates a recognizer that runs several candidate listener models to quickly discern which fits best. This requires computational horsepower that we simply didn’t have two decades ago. As Gordon Moore’s law accurately predicted, computational speed went up and size went down, so much so that speech recognition now works on a mobile phone and will likely work for your application too!
The other major advance was a blending of the syntactic and semantic rules of language with statistical approaches that predict the likely next word or grammar structure. Noam Chomsky, Professor Emeritus at MIT and the father of modern linguistics, identified many of the rules that make up the mathematical underpinnings of language. This principled approach was used to build software around formal grammars, essentially diagramming sentences in hopes of designing a generic system. While monumental work, from an applied AI perspective it only got us so far: there were too many exceptions to the rules, and it was hard to represent enough real-world knowledge through pragmatics to make such systems useful.
Statistical approaches, such as calculating trigram frequencies, an idea borrowed from cryptanalysis, can provide the likelihood of a certain word appearing given the two words identified before it. Another technique, breaking a sentence into chunks (known as shallow processing), also helped blend the two approaches. Researchers such as Stephen Pulman, Professor of Computational Linguistics at Oxford University, used these techniques to improve the performance of NLU systems. When I worked with Professor Pulman at Cambridge University, his team coupled trigram statistics and shallow processing with Chomskyan methods to build NLU systems that outperformed any single approach. While these statistical methods are arguably not true intelligence, they produced rather favorable results. Many of the best systems use some form of blended, complementary methods.
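The trigram idea can be sketched in a few lines of code: count how often each word follows each pair of preceding words, then use the counts to predict the next word. The tiny corpus below is purely illustrative; a real system would be trained on a very large text collection.

```python
from collections import Counter, defaultdict

# Toy corpus for illustration only; real trigram models train on millions of sentences.
corpus = [
    "the patient has a fracture",
    "the patient has a fracture",
    "the patient has a fever",
]

# Count how often each word follows a given pair of preceding words.
trigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the most likely next word and its estimated probability."""
    counts = trigram_counts[(w1, w2)]
    if not counts:
        return None, 0.0
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

# predict_next("has", "a") -> ("fracture", 2/3), since "fracture"
# follows "has a" in two of the three matching trigrams.
```

In practice, trigram counts are smoothed to handle word pairs never seen in training, but the core idea is exactly this frequency table.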
You may ask, “So how can my company make better use of AI speech tools?” Here are some suggestions that may improve your system from a marginal demo to something that can truly be a marketable and usable add-on to your product offerings.
- Limit the domain. Roger Schank, a professor at Yale and later Northwestern University, is known for introducing the psychological theory of scripts to AI. Scripts organize the procedural events of a given situation and can help an NLU system by restricting it to a smaller domain. He used the example of a restaurant script, where certain actions happen in a particular order: you’re seated, given a menu, order drinks, drinks arrive, order food, food is delivered, you eat, you receive a bill, you pay, and you leave. Clearly, minor deviations occur, and there would have to be other scripts, such as one for fast-food restaurants. But by expecting the likely context, the system narrows nearly infinite possibilities down to something manageable.
Futurist and author of How to Create a Mind (2012), Ray Kurzweil developed early commercial speech understanding systems that let radiologists transcribe their reviews of X-rays. He made this invention work by having his software expect speech input in a certain order and accept only a limited vocabulary at each juncture. If the radiologist said, “The Patriots won the Super Bowl,” it wouldn’t understand; if they said “left proximal tibia fracture,” it did remarkably well. The more you can narrow your domain, the better your speech system will perform.
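A minimal sketch of this domain-limiting idea is a slot grammar: each position in the utterance accepts only a small vocabulary, and anything outside the expected pattern is rejected rather than guessed at. The slot names and vocabularies below are hypothetical, not Kurzweil’s actual grammar.

```python
# Hypothetical slot grammar: each position accepts only a few words,
# mirroring the idea of expecting input in a certain order.
SLOTS = [
    ("side", {"left", "right"}),
    ("position", {"proximal", "distal", "medial"}),
    ("bone", {"tibia", "femur", "radius"}),
    ("finding", {"fracture", "lesion", "normal"}),
]

def parse_dictation(utterance):
    """Accept an utterance only if every word fits its expected slot."""
    words = utterance.lower().split()
    if len(words) != len(SLOTS):
        return None  # outside the domain: reject rather than guess
    result = {}
    for word, (slot, vocab) in zip(words, SLOTS):
        if word not in vocab:
            return None
        result[slot] = word
    return result

# parse_dictation("left proximal tibia fracture") fills all four slots;
# parse_dictation("The Patriots won the Super Bowl") returns None.
```

With only a handful of candidate words per slot, even a mediocre acoustic recognizer can pick the right one, which is precisely why narrow domains perform so well.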
- Apply a few reasonable and simple rules to the speaker. Some machine learning systems adapt to the user, but equally important, and often overlooked, is that most users quickly adapt to a new interface. Most users are quite willing to obey certain rules, such as saying “OK Google” to activate Android’s speech system or saying “Alexa, stop” to cancel a process started through Amazon’s in-home speech interface, the Alexa Voice Service. Keep the rules simple. By asking the speaker to follow a few of them, you can turn an intractable problem into something that works quite well.
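The wake-word rule can be modeled as a tiny state machine: ignore everything until the wake word is heard, then process input until the cancel word resets the gate. The wake and cancel words below are placeholders; real assistants implement this with always-on keyword-spotting models, but the control logic looks much like this sketch.

```python
# Hypothetical wake/cancel words chosen for illustration.
WAKE_WORD = "computer"
CANCEL_WORD = "stop"

class SpeechGate:
    """Ignore all input until the wake word is heard; the cancel word resets."""

    def __init__(self):
        self.active = False

    def hear(self, utterance):
        words = utterance.lower().split()
        if not self.active:
            if WAKE_WORD in words:
                self.active = True
                return "listening"
            return None  # rule: no wake word, no processing
        if CANCEL_WORD in words:
            self.active = False
            return "cancelled"
        return f"processing: {utterance}"
```

The rule costs the speaker one extra word, and in exchange the system never has to decide whether overheard conversation was meant for it.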
- Personalize the interface. Having some brief information about the speaker makes the system much more personable and the user more willing to engage with it. Knowing their first name or nickname, having access to their calendar, or knowing their home address and a few of their preferences can help predict the user’s next likely request and make them more likely to use the speech interface.
When asking for personal information, let the user supply it voluntarily. The more data they supply, the better the AI interaction will be. You may be surprised how much personal information people are willing to share when it improves the performance of their own speech application. Apple’s Siri and Microsoft’s Cortana are good examples of this.
- Find situations in which the customer prefers to interact with AI rather than with a human. When bank ATMs were first introduced, there was concern that customers would lament the loss of face-to-face interaction with a teller. It turned out the opposite was true: customers often preferred the ATM. Even when the bank was open, most found it more comfortable to hit a few keys for their weekend cash than to make chit-chat with the local bank teller.
For tasks that are repetitive, or for which the customer doesn’t want to explain their rationale, an AI interface is better than a human one. The same goes for personal information, such as looking up credit card balances or the results of medical tests: we often prefer that a machine share this information with us, as it gives us a feeling of increased privacy. Speech recognition technology is best used in situations where customers prefer to interact with a machine rather than a human.
- Have the ability to bail out. Just as in a fighter jet, give the pilot of your application, your end user, an eject handle: a way to exit the AI app and reach a person. Whether the customer is old-school, feels the problem is too complicated to explain, or has a deep-seated need to speak to a person, there will be those who want to hit “zero” and be transferred to an operator. Simply having the ability to bail out gives customers more comfort in adopting the speech system, and you might be surprised how many choose the AI speech interface over your call center on their next visit.
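The eject handle amounts to two simple routing rules: transfer whenever the caller explicitly asks for a person, and transfer whenever the recognizer is unsure. The escape phrases and confidence threshold below are illustrative assumptions, not values from any particular product.

```python
# Hypothetical escape phrases and confidence threshold; tune per application.
OPERATOR_PHRASES = {"zero", "operator", "agent", "representative"}
CONFIDENCE_THRESHOLD = 0.6

def route(utterance, confidence):
    """Decide whether the AI keeps the call or a human takes over."""
    words = set(utterance.lower().split())
    if words & OPERATOR_PHRASES:
        return "transfer_to_human"  # the customer pulled the eject handle
    if confidence < CONFIDENCE_THRESHOLD:
        return "transfer_to_human"  # don't guess: bail out gracefully
    return "handle_with_ai"
```

Falling back on low confidence matters as much as honoring the explicit request: a system that guesses wrong twice in a row loses the customer’s trust, while one that hands off promptly keeps them willing to try the AI again.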
HAL 9000 also said, “That's a very nice rendering, Dave. I think you've improved a great deal.” By limiting the domain, adding a few simple rules of use, personalizing the interface, finding the areas in which your customers prefer to interact with a machine, and providing an easy way to exit, hopefully you too can say that the customer experience with your speech processing system has improved a great deal.
Nelson Ludlow has a PhD in Artificial Intelligence from the University of Edinburgh, Scotland, and did postdoctoral work in Natural Language Processing at Cambridge University, England. He was the Director of AI at the US Air Force Research Laboratory and has been CEO of both private and publicly held software companies building AI applications. He is currently an adjunct faculty member lecturing in Computer Science and AI at Wright State University, Washington State University, and the University of Washington.