Speech Recognition: Now...or never?

mgervase · ‎01-20-2009

A few months ago, I posted a 2-part blog on cisco.com on Cisco's vision for speech recognition in unified communications (Part 1, Part 2). While we have already incorporated speech recognition in a number of our Cisco Unified Communications solutions, our vision for the future involves speech playing a key role in all aspects of unified communications.

The question is...will speech recognition be essential to your organization, or will it be a feature that is nice to have?

Does speech recognition have the power to transform your communications and collaboration experience in a measurable way?

Or based on your experience with speech recognition today, do you feel that speech is a technology that will never be quite ready for "prime time"?

I'd love to read your thoughts on the past, present, and future of speech recognition and what it means to your unified communications experience.

Mark

Unified Communications Solutions Marketing

gmcgill · ‎02-27-2009

We tried speech recognition for contract numbers several times over the last few years. Single consonants like B, D, P seem to be difficult. We need better than 98% accuracy to even begin to justify the expense and were getting about 80%. I am sure our application was atypical but that was our application.
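As a rough illustration of why spelled-out identifiers are so unforgiving: if each character is recognized independently with probability p, an n-character number comes through intact with probability p^n. The sketch below rests on two assumptions that are not from the post above: that errors are independent, and that a contract number is 10 characters long (a hypothetical length).

```python
def whole_string_accuracy(per_char_accuracy: float, length: int) -> float:
    """Probability that every character in a string is recognized correctly,
    assuming errors on each character are independent (a simplification)."""
    return per_char_accuracy ** length


def required_per_char_accuracy(target: float, length: int) -> float:
    """Per-character accuracy needed to reach a whole-string accuracy target,
    under the same independence assumption."""
    return target ** (1.0 / length)


if __name__ == "__main__":
    length = 10  # hypothetical contract-number length, not from the thread
    for p in (0.98, 0.99, 0.998):
        whole = whole_string_accuracy(p, length)
        print(f"per-character {p:.3f} -> whole number {whole:.3f}")
    needed = required_per_char_accuracy(0.98, length)
    print(f"per-character accuracy needed for 98% whole-number: {needed:.4f}")
```

Under these assumptions, 98% accuracy per character gives only about 82% for the whole number, so a 98% target for complete contract numbers demands per-character accuracy well above 99%.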

The other complaint we heard from customers was that it was slower than digit input. It seems they were able to learn the script well enough to pre-load several prompts and then multitask while the IVR script caught up. A few seconds per inquiry does add up.

mgervase · ‎03-25-2009

Greg, I agree with your comments. Speech recognition has a long way to go to truly transform our interactions. I read an article where a respected vendor called 1999 the year of speech. Here we are, a decade later, and we're not all the way there yet. As you point out, even under the best of acoustic situations, sounds like "B" "D" and "P" can be difficult for speech engines to distinguish. When you add mobile phones or IP-based softphones to the equation, the chances of speech recognition success go even lower. On the bright side, there are a number of organizations and researchers working to improve speech recognition and some of the solutions I've tried recently show some real promise. I'm excited to soon enhance our Cisco Unified Communications portfolio with these more advanced speech solutions so that our customers can truly improve their business operations.

As to your second point, it's true that if speech rec is slower than using the telephone keypad, the seconds will add up quickly - and time is critical for customer service. Some of this is technology, some of it is the user paradigm. For example, when we start up a computer or walk up to an ATM, we know what to do. When we're using a speech rec solution for customer service, we don't always know what to say, how fast to say it, when to stop, and how much detail to provide. People have shown time after time that we're willing to learn new technology when it is truly compelling (Nintendo Wii, Apple iPod and iPhone, digital video recorders, etc.), so the current state of speech rec shows me that the technology is not quite compelling or pervasive enough for us to learn how to interact with a speech rec solution. I can't dispute that. But I'm optimistic for the future, and I'd love to hear what the rest of you think.

Mark

Unified Communications Solutions Marketing

gps03 · ‎03-02-2009

Maybe getting a little off-point for speech recognition, but what I really need is for WebEx to be able to capture the conversation and convert it to closed captions, without the overhead ($$) of a captionist sitting in the meeting.

Accessibility is a big deal for us and this would go a long way to fixing a significant issue.

I've tried various solutions, including using Windows built-in tools to at least capture the conversation to Word while I present, with mixed results. Word needs focus to capture the text, so you have to have two PCs running, and there is far too much training to get accurate results. As soon as a participant speaks up, accuracy drops way off.

mgervase · ‎03-25-2009

Gregory, you've pointed out a key use for speech recognition. If a speech solution could successfully transcribe live conversations - whether over the phone, videoconference, or face-to-face - it would transform the way many organizations operate and the way people collaborate. In the legal world, the benefits of transcribing depositions and trials are worth the cost, so stenographers capture every word (whether you like it or not!). But this is cost-prohibitive for most everyday business despite our best efforts to maximize accessibility for everyone. WebEx already allows you to record the audio and slideshow from a WebEx conference, but to your point, the next evolution would be live transcription. I'm optimistic that the technology will get there in the near future, and I'd love to keep this conversation going with others. What do you think?

Mark

Unified Communications Solutions Marketing

wireless1 · ‎03-27-2009

Mark, I don't know much about speech recognition in the applications you are talking about, but I can tell you that SR technology is really beginning to gain some traction in supply chain logistics. In that market, SR is replacing barcode scanning because it allows a worker to use both hands for their primary task, which is to move inventory. With SR, they don't have to pause to pick up a barcode scanner. And SR vendors have gotten the accuracy into the upper 90% range.

Supply chain applications differ in some very important aspects from what you are looking for.

- They deal with a very limited vocabulary. The recognizer has to deal with numbers, letters, and simple phrases, such as "Pick twelve".

- Many of them are user-specific. That is, the speech recognizer is trained to recognize an individual's specific accent and tonality.

- Most often, the recognition engine runs on a local device, generally with a wired headset. These headsets sample the voice at higher rates than standard headsets to improve fidelity, and a close connection to the recognizer eliminates many transmission issues.
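The limited-vocabulary point in the list above can be sketched in a few lines: when only a handful of words are valid, a slightly garbled hypothesis can simply be snapped to the nearest allowed word. This is a minimal illustration using Python's standard `difflib` on text, not how any vendor's recognizer works; the vocabulary, the function names, and the two-word "verb quantity" grammar are all hypothetical.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical command vocabulary for a picking workflow (not from any vendor).
VERBS: List[str] = ["pick", "skip", "repeat"]
NUMBERS: List[str] = [
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]


def snap_to_vocabulary(heard: str, vocabulary: List[str],
                       cutoff: float = 0.6) -> Optional[str]:
    """Return the closest allowed word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(heard.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_command(utterance: str) -> Optional[Tuple[str, int]]:
    """Parse a two-word 'verb quantity' command such as 'Pick twelve'.

    Returns (verb, quantity), or None if the utterance does not fit the grammar.
    """
    words = utterance.split()
    if len(words) != 2:
        return None
    verb = snap_to_vocabulary(words[0], VERBS)
    quantity_word = snap_to_vocabulary(words[1], NUMBERS)
    if verb is None or quantity_word is None:
        return None
    return verb, NUMBERS.index(quantity_word) + 1


if __name__ == "__main__":
    for text in ("Pick twelve", "pik twelv", "open the pod bay doors"):
        print(repr(text), "->", parse_command(text))
```

With only three verbs and twelve numbers to choose from, a near miss still resolves to a valid command, while anything outside the grammar is rejected rather than guessed at. An open-ended task such as transcribing a meeting has no such safety net.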

But the point is that vertical technologies, where speech recognition delivers concrete economic benefits to an enterprise, are going to drive technology improvements. If SR technology is not where you need it to be today, it is quickly approaching that point. One of Cisco's solution partners, Datria, now has a speech recognition application for supply chain that runs on a remote server using a standard VoIP handset as the client device. That model can be less expensive to implement than the traditional speech recognition model. If Datria is successful, we will see other vendors adopt the model. And that puts us one step closer to where you want to be.

mgervase · ‎03-28-2009

wireless1,

Great post! I agree with you 100% that speech rec applications with a specific function and a clearly understood user interface, like the supply chain example you provided, are driving the technology and value forward. Often, a new technology tries to be too many things, but by keeping the use case focused, there are real benefits to be gained by a specific group of users. The iPhone App Store is a parallel example - hundreds of specific, mostly single-function applications that promise and deliver a benefit to the user. In specific use cases, speech technology is dependable and valuable - just as you pointed out.

Mark

UC Solutions Marketing

Kevinriches · ‎04-06-2009

Gregory,

Have you looked at using SpinVox alongside the WebEx service? Could be an ideal solution for you?

Kevin

Victor Lam · ‎03-27-2009

Not quite what you're looking for, I'm sure... but, definitely needs to be mentioned.

Everyone knows what CAPTCHA is, and most people understand how it's supposed to reduce the deluge of spam and potentially other malicious use of Internet bandwidth. Yet, over 95% of e-mail traffic is spam (and worse, phishing, etc.). There have been multiple talks about how easy it is to defeat CAPTCHA in an automated way, not just with an army of monkeys.

Similarly, real-time audio makes hacking VoIP much less attractive than other sources of valuable data (e.g., web applications). It's more difficult to parse real-time audio than it is to put a CAPTCHA through an OCR engine. Armies of monkeys are required to actually listen to (and interpret the context of) real-time audio to extract valuable data. Real-time speech recognition might actually facilitate such parsing, analogous to an OCR engine for audio. Could you imagine how much faster credit card fraud could occur by just capturing and interpreting 1 hour of call center traffic?

If real-time speech recognition is made widely available, I'm in favor of pairing up the technology with pertinent security measures (or at least best practices) before full public release. Otherwise, we might as well all start learning hand gestures for our credit card numbers to send that information over video. =)