Can We Talk?
by Walter Rolandi, PhD
March 7, 2018
In the words of my friend Bruce Balentine, “Speech recognition is the technology of the future. Always has been, always will be.” Generally speaking, and considering the past 20 years of progress, I am inclined to agree with his summation.
While the pace of this progress has been slow, it is nonetheless real. Significant things are happening in the industry, particularly in the field of virtual assistants.
The terms “natural language processing” (NLP) and “conversational dialogue” have been bandied about in some form for a while, and their meanings have often changed.
Typically, an innovation will come along that enables voice user interface designers to more closely, if ever so slightly, approximate human-to-human conversations. The last big leap in interactive voice response (IVR) technology was statistical language models (SLMs). Well-designed SLMs permit users to express themselves, and be correctly understood, in an almost limitless number of ways. This ability suggests intelligence and can be impressive.
Yet an SLM still only allows users to say one of the n things that can be said during a particular IVR state. SLMs, as used today, are just single-state grammars that allow for great variation in the ways users specify their wishes. They simply afford many ways to make menu selections.
SLMs arguably support more natural language processing, but it’s difficult to assert (at least of their use in IVRs) that they somehow support greater conversational dialogues.
What Does It Mean to Converse?
In its most meaningful sense, conversation should consist of open-ended, mixed-initiative verbal interaction. Conversations may or may not entail the exchange of information or result in any behavioral consequence or action.
At this point in voice user interface history, most conversations seem to be command-and-control requests. The user asks (or tells) the computer something, and the computer responds with an answer or comment.
Many interactions boil down to basic speech-to-text search engine requests. This is not to say that this ability is unimpressive—it represents a significant step forward in the history of automated speech recognition. Yet such exchanges do not really constitute conversational dialogue.
Conversational dialogue requires domain and contextual knowledge that is not typically present in a two-turn dialogue query. During real conversation, one participant might say something that requires the other participant to collect additional information to complete his understanding of what has been said. This is something we humans do constantly and unconsciously. A recent interaction I had with Apple’s Siri provides an example.
While simply exploring what Siri can do and how she responds to what might be considered reasonable requests, I had the following conversation:
User: “Send an email to Walter Rolandi.”
Siri: “What’s the subject of the email?”
User: “This is a test message.”
Siri: “OK. What would you like the email to say?”
User: “‘Good morning Walter. How are you today?’”
Siri: “OK, Walter. Here’s your message to Walter Rolandi. Ready to send it?” [Siri displays dictated message and allows manual edits.]
Siri: “OK. I’ll send it.”
Granted, the interaction was not exactly a lively philosophical debate, yet it does represent a goal-driven, multi-turn dialogue that was completely consistent with the expectations a user might have, based on prior and similar human-to-human conversational dialogues. Siri’s turn-taking response time was conversationally natural, her responses were contextually appropriate, her speech recognition was 100 percent accurate, and I received my test email within seconds of telling her to send it. I had discovered an impressive and useful feature of Siri, and I was eager to discover more.
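The exchange above can be sketched as a small goal-driven dialogue manager that tracks which pieces of information are still missing and prompts for the next one until the goal is reached. This is a hypothetical simplification for illustration only; Siri's actual implementation is of course unknown:

```python
# Hypothetical sketch of the email exchange above as a goal-driven,
# multi-turn dialogue: the system tracks which slots are still empty
# and prompts for the next one until the goal (send the email) is met.

SLOTS = ["recipient", "subject", "body"]

PROMPTS = {
    "recipient": "Who should I send it to?",
    "subject": "What's the subject of the email?",
    "body": "OK. What would you like the email to say?",
}

def next_prompt(state: dict) -> str:
    """Return the prompt for the first unfilled slot, or a confirmation."""
    for slot in SLOTS:
        if slot not in state:
            return PROMPTS[slot]
    return f"Here's your message to {state['recipient']}. Ready to send it?"

# Walking through the dialogue turn by turn:
state = {"recipient": "Walter Rolandi"}  # "Send an email to Walter Rolandi."
print(next_prompt(state))                # asks for the subject
state["subject"] = "This is a test message"
print(next_prompt(state))                # asks for the body
state["body"] = "Good morning Walter. How are you today?"
print(next_prompt(state))                # confirms before sending
```

Even this toy version captures what made the interaction feel conversational: each prompt is contextually appropriate because the system knows what it already has and what it still needs.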
Discovering the Abilities of Virtual Assistants
And that brings us to our purpose here. Siri, and many other virtual assistants, possess impressive powers. The problem is how to discover those powers without experiencing the setbacks and frustrations that often occur during human-machine NLP interactions.
This is the speech industry’s major challenge today: how to naturally and painlessly reveal the powers of a voice user interface.
Future articles in this space will explore this question. We intend to review particular commercially available VUIs from the perspective of a generic user. We hope to include the greater VUI community in the process, having our own ongoing dialogue about dialogues.
This article, by human factors expert and Swampfox UI lead Walter Rolandi, was recently published in SpeechTek Magazine.