
Speech and text-based media


It has been fascinating to watch the evolution of media types over the last several years. We started off with a realtime speech communication mode (the phone call) and an asynchronous text-based communication mode (email). In addition, there was voicemail as an asynchronous messaging mechanism, but it was tightly contained within the speech box: you could only access voicemail from your phone, whether you were leaving a message or listening to one. Over time, this has changed in some ways. We now have unified messaging, where you can access your voicemail from your email, but you still have to listen to your messages, perhaps as a WAV or MP3 file. Unity provides the ability to pick up a voicemail message while it is being recorded and continue as a live call, thus escalating an asynchronous communication to a synchronous voice session, but once again, this is strictly within the voice context.

Meanwhile, text-based communication became hypertext-based communication, whether HTML or XML. It expanded from asynchronous email to synchronous instant messaging (IM). It also became more sophisticated: blogs, wikis, and newsgroups evolved as different structured forms of asynchronous communication, very different from the venerable email. IM expanded to conversations that act as "channels" you can subscribe to, thus extending what was a form of synchronous communication into the asynchronous space. Google Wave is, in a way, the next step in that direction, since it aims for the "grand unification" of all forms of synchronous and asynchronous text-based communication: IM, email, blogs, wikis, etc. Structured communication seems to have incorporated hypertext as a core element.

Where does this leave voice? Click-to-call and click-to-conference have been the traditional means of integrating voice and video communications into text-based media (such as a web page or an Outlook client) or business processes. However, click-to-call does not bridge content: the content of the voice or video conversation is lost in the text domain, and vice versa. Voice inherently has higher "information density" than text-based media, since it carries emotion, stress on particular words, and so on, which are difficult to preserve in the text domain. Emoticons are one example of how text-based media such as IM or email have tried to convey emotion. Speech-to-text (speech recognition) and text-to-speech bridge the two media in terms of content, even though emotion and stress are lost. The key issues for speech recognition have been accuracy and cost. Unity voicemail has some capabilities in this area, and it continues to be an area of interest for Cisco. Going forward, as the cost of providing these services drops and the accuracy increases, we expect to see more widespread deployment of both speech recognition and text-to-speech, because complex structured communications are based on hypertext and voice content needs to be bridged to this core.

To summarize, many of the walled gardens of communication are coming down as people seek a communication infrastructure that maps to their work style, rather than trying to work within the constraints of what is available in their enterprise. This applies both to synchronous vs. asynchronous communication and to text vs. voice/video communication. This is an interesting area that promises rich rewards in terms of the collaboration it enables and the productivity benefits that ensue.
