October 02, 2007

Mashups: Web Meets Telco

Joel Hughes, General Manager for Mobile Applications at NMS, is moderating the second to last session at Connect 2007 entitled "Mashups:  Web Meets Telco" with panelists:

Matt Gross, Director of Product Management, WHERE by uLocate Communications -- a GPS-based platform for location aware services.  They also add value by having negotiated deals with several US operators (Alltel and Sprint) and making these services available to 3rd parties.

Alan Quayle, Consultant, Business and Service Development -- talking about his consulting practice and referring to British Telecom's Web21c as a sample application he's been involved with.

Kevin Nethercott, President and COO, LignUp -- talking about their web platform to program PBX functions and do enterprise mashups, specifically mentioning Mashup Camps and application prototypes done in 15 minutes.

Sam Aparicio, CTO, Angel.com -- an IVR and call center solution that exposes a web API for control.  His example application is a mashup with Salesforce.com.

This real time blogging is new for me.  Perhaps the resulting notes will only be useful to me...  In any event, this is going to be more stream of consciousness...

First impression: there are several sources of telephony controlled by web interfaces, but they all cost money or are just sandboxes.  No one has mentioned APIs that let developers also participate in billing, or ways that web telephony can be available at or near free.  Surely one or the other is needed to see this industry take off.

For now, all the discussion is about enterprise developers.  Mashups are a new way for enterprises to develop their IVR and call center solutions.  So far, no examples of developers targeting consumers.  Alan has mentioned that BT's API has also been used for Salesforce.com integration.

Sam just alluded to a Verizon API.  I'm not sure what he's referring to.  I'll try and ask afterward...

Now Alan is discussing cost.  When location data cost 25 pence per inquiry, no one used it.  No one has a good suggestion of how these kinds of services should be priced.  What about an application that needs two database dips, an international phone call and two international SMSs?  General agreement that this is an issue, but no specific suggestions or pointers to solutions.

The popular APIs to mash up with (besides Salesforce.com) are Google Maps, Flickr, Facebook and Amazon (for storage), at least today.

Now Joel has explicitly asked how things are priced.  Amazon is priced by the GByte per month and Salesforce.com is by user per month.  Enterprise telecom groups either buy the platform or pay per transaction until they buy the platform.  Their issues are reliability and minimizing costs.

Matt Gross says the value uLocate brings to Sprint is handling the long tail of developers that Sprint can't deal with directly.

Alan says BT's Web21c has helped expose issues and challenges.  They are still experimenting.  As far as location goes, the pricing is still too steep.  Alan expects that Google and others will note the GPS coordinates of every cell site, retrieve that from the handset and skip interfacing with the operator.  Operators will have to price their location info appropriately or they will be bypassed.  Consumers do seem willing to pay for safety-related location-based services.

Relation of mashups to IMS.  North American operators are adopting IMS to control VoIP telephony, but IMS is just the base for connection control.  The broader service delivery platform is less standard and yet that's what's required as the platform that supports web services.  SIP-based applications are going a lot slower than web APIs.  In the end it's applications that matter.  No one on this panel cares about IMS except as an underlying layer that supports their web APIs.

A question from the audience: how do you see the handset interfaces w.r.t. mashups?  Kevin is focused on voice telephony.  He doesn't see the mobile handset as the application UI for now.  In the long term, he expects browser-based user interfaces, but for now, he's doing voice.  Alan seems to agree.  Handsets are so different that all you can rely on is basic voice and SMS.  Sam complains that even with the so-called open API phones, you can't actually get to deep features on the handset.  This crew is focused on voice.

May 15, 2007

Packet-based processing -- DSP vendors still don't get it

I arrived at the Communications Developer Conference this afternoon in time to attend a talk by Dr. Huan-Yu Su of Mindspeed Technologies.  His talk had the curious title of Semiconductor Design Solutions for IMS and indeed it touched on multiple subjects and many layers, from IMS service objectives down to the evolution of semiconductor processes.  But the interesting stuff was DSP related.  Mindspeed makes DSP chips and DSP software, which today might best be characterized as signal processing systems on silicon (SoC).

But, despite PowerPoint bullets about packet-based systems, when Dr. Su showed a timing diagram of how individual DSP cores were able to process multiple algorithms on many channels, he showed a TDM system.  Each media stream got a time slice in a 20 ms scheduling cycle.  This means each incoming flow has to hit a jitter buffer and be queued up for its slice of the TDM cycle.  When I questioned this, Dr. Su replied that it was a hard problem and so far they had addressed it by shortening the 20 ms scheduling cycle to 5 ms.

None of this is to discredit Dr. Su, who is a smart guy and gave an interesting presentation.  I have seen equivalent approaches from Texas Instruments and Freescale Semiconductor.  All of the DSP vendors are driven by the TDM-to-VoIP gateway application where 8 kHz (and 20 ms) operation is the norm and completely acceptable.  No one has addressed low latency packet processing.

It's not that hard -- here is the answer!

Scheduling a DSP to handle packets, whose arrival is statistical in nature, with minimum latency while guaranteeing that all work gets done, is a lot like scheduling packet transmission over a fixed capacity link with QoS guarantees.  There was a ton of academic and practical work done on this subject as part of ATM switch development in the 90s.  Much of it is directly applicable to scheduling DSP processing.

Those who are really interested might read Leap Forward Virtual Clock: A New Fair Queuing Scheme with Guaranteed Delay and Throughput Fairness by Suri, Varghese and Chandranmenon in the proceedings of INFOCOM '97. Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies. This is not the only relevant paper but it is one I am familiar with in some detail.  Quoting from the abstract:

We describe an efficient fair queuing scheme, Leap Forward Virtual Clock, that provides end-to-end delay bounds similar to WFQ, along with throughput fairness. Our scheme can be implemented with a worst-case time O(log log N) per packet (inclusive of sorting costs), which improves upon all previously known schemes that guarantee delay and throughput fairness similar to WFQ.

At light load, packets are scheduled immediately and experience minimum latency.  Under heavy load, packets are queued but get processed in a time that matches their allocated average arrival rate.  So at heavy processor load, individual flows experience some additional delay but they also have their jitter smoothed out (thus minimizing the required depth of the jitter buffer at the ultimate destination).
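To make the scheduling idea concrete, here is a minimal sketch of the classic Virtual Clock discipline that Leap Forward Virtual Clock builds on.  It is not the algorithm from the paper (there is no leap-forward step and no O(log log N) priority structure), and the class name, the per-flow "rate" and the "work" units are my own assumptions for illustration.  Each packet is stamped with max(now, the flow's clock) + work / rate, and the DSP always services the smallest stamp next.

```python
import heapq


class VirtualClockScheduler:
    """Simplified Virtual Clock scheduler (illustrative sketch only)."""

    def __init__(self):
        self.rates = {}    # flow id -> reserved rate (work units per ms, assumed)
        self.clocks = {}   # flow id -> that flow's virtual clock
        self.queue = []    # min-heap of (stamp, sequence, flow id, work)
        self.sequence = 0  # tie-breaker so equal stamps stay in arrival order

    def add_flow(self, flow_id, rate):
        self.rates[flow_id] = rate
        self.clocks[flow_id] = 0.0

    def enqueue(self, flow_id, now, work):
        # A flow that sends faster than its reserved rate pushes its own
        # stamps into the future, so it cannot crowd out other flows.
        clock = max(now, self.clocks[flow_id]) + work / self.rates[flow_id]
        self.clocks[flow_id] = clock
        heapq.heappush(self.queue, (clock, self.sequence, flow_id, work))
        self.sequence += 1

    def dequeue(self):
        # Returns (flow id, work, stamp) for the next packet, or None if idle.
        if not self.queue:
            return None
        stamp, _, flow_id, work = heapq.heappop(self.queue)
        return flow_id, work, stamp
```

With a single packet waiting, dequeue() hands it over at once, which is the light-load, minimum-latency case.  If flow "a" (rate 0.5) bursts three packets at time 0 and flow "b" (rate 0.4) sends one, the stamps are 2, 4 and 6 for "a" and 2.5 for "b", so "b" is served second rather than waiting behind the whole burst.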

March 30, 2007

Wideband audio makes for great mobile telephony

At the VON conference last week I had an interesting conversation with Karine Lachapelle of VoiceAge.  Besides her interest in the status of my ride-on model railroad :-), we discussed our mutual interest in wideband speech coding and Karine mentioned a trial of wideband audio on mobile phones in Germany last spring.

I had completely missed this, but luckily she sent me some references which I read on the plane flying back from CTIA.

Background (skip this if you’re already up on wideband audio)

From its 19th century beginnings, telephony has provided a very limited representation of human speech.  For the first 80+ years, the main problem was affordable microphone technology.  Today, relatively high quality microphones can be included in consumer products at little cost.  Unfortunately, digital telephony standards (basic coding and 64 kbps circuits) were frozen in the 1960s, based on the voice quality of 1920s microphones.  As a result, traditional telephony doesn’t pass sounds above 3.2 kHz, making it hard to distinguish “s” or “z” sounds (which contain energy up to 6 kHz and beyond) or tell letters like “c”, “d”, “e”, “g” apart.

The advent of VoIP brought the potential for dramatic improvements in voice quality, and indeed, I spoke repeatedly about wideband audio at early VON conferences, but to no avail.  Early VoIP companies tried to duplicate “toll quality” telephone speech, rather than leapfrogging the telecom industry.  That finally changed with the advent of Skype, where PC-to-PC calls can sound like you’re in the same room.  And, with Skype's growing adoption, more and more people are realizing how good voice communications can be.

Mobile Voice Quality

With the advent of mobile telephony, radio capacity was a bottleneck so mobile systems used aggressive speech compression, even at some sacrifice in voice quality.  Consumers put up with this as mobility was such an advantage.  With 3G, radio capacity is still an issue; however, advertised data capacities significantly exceed what’s required for voice, so there is new flexibility to consider better quality speech coding.  Furthermore, speech coding algorithms have improved to the point where wideband audio requires only a little more radio capacity than today’s narrowband coders.

But while mobile telephony uses (potentially variable rate) packet transmissions on the radio links, calls are still switched using 64 kbps circuits.  To the extent mobile operators are trialing, or even deploying, IMS technology, it’s for new applications.  No mobile operator has converted their basic voice telephony to IMS yet.

Luckily there’s a kludge that’s been standardized for several years, is available and is at least partially deployed.  It’s called TFO, tandem free operation.  TFO provides a way to tunnel lower rate signals through traditional 64 kbps circuit switches.  Indeed, TFO provides a viable way to connect two mobile handsets anywhere in the world provided the handsets use the same speech coder and the networks support TFO.

Ericsson’s trial with T-Mobile in Germany

The most interesting article Karine pointed out is by Berkehammar et al. in the Ericsson Review No. 3, 2006.  The article describes a consumer trial (150 users) on the T-Mobile network in Germany in April and May 2006.  The trial used AMR-WB (Adaptive Multi-Rate, WideBand) speech coding (versus the normal or “narrowband” AMR-NB used in most GSM networks today).

The short summary: 

More than 70% of the users perceived a distinct improvement in voice quality — they found that they could more easily place and complete calls in noisy environments, and reported that the improved voice quality created a greater sense of privacy, discretion and comfort.

If you add in those that merely found it good or “nice to have,” approval ratings were over 90%.

When do we get it?

The limiting factor for deployment will be handsets.  Wideband mobile connections only work if both parties have wideband handsets, otherwise the call reverts to traditional mobile performance.  Ericsson claims their "terminal platform" will support AMR-WB in 1Q07, however that's Ericsson, not Sony-Ericsson — the people who actually make handsets commercially.

So the good news is Ericsson is promoting wideband telephony and T-Mobile Germany has seen its positive impact on the subscriber experience.  But there is no committed date for commercial handsets or commercial service.  It seems service providers continue to rely on mobility to make up for marginal voice quality, at least for the immediate future.

Feel free to talk this up with anyone who'll listen.  Voice may not be "sexy" but wideband would do more for most mobile subscribers than any of the new "data" services getting all the attention.

September 03, 2006

Internet-scale database - Google's Bigtable

There's a very interesting preprint of a paper from Google Research due to be presented at OSDI '06 in November.  It's Bigtable: A Distributed Storage System for Structured Data by Chang et al. (PDF here). 

Bigtable is a database that runs on top of the Google File System.  I first heard of it in the May '06 interview of Google's Jeff Dean by O'Reilly's Radar. Like the databases supporting other major Internet applications, it's not a straight relational database.  This paper gives the details for those of you interested in such things.

It also includes the following gee whiz items:

  • The Google web crawl creates two tables totaling 850 terabytes of data.
  • Google Analytics apparently keeps all clicks, forever.  It's currently using two tables totaling 220 terabytes.
    Hmm..., if they really keep everything forever, this could end up larger than their web crawl.
  • The maps for Google Earth occupy a mere 70 terabytes.

September 02, 2006

Large Speech Corpora

I didn’t attend SpeechTek this year and I haven’t had much involvement with speech technology in the past 12 months, but this post from Google Research in August 2006 brings me back to the topic I was discussing in August 2005.  Google is offering researchers some interesting data.

We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.

Over one trillion words of running text!  Of course this is a text corpus, useful to an automatic speech recognizer as a ranked set of likely word sequences, but only to the extent people’s speech patterns match our written language…
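For readers who haven't worked with this kind of data, the counts Google describes are simple n-gram tallies with a frequency cutoff.  The sketch below is a toy, in-memory version; the function name and parameters are mine, and it says nothing about how Google actually processed a trillion words (which certainly did not fit in one Python dictionary).

```python
from collections import Counter


def ngram_counts(tokens, n=5, min_count=40):
    """Count every n-word sequence and keep those seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in counts.items() if count >= min_count}
```

A recognizer's language model would then turn such counts into probabilities for the next word given the previous four.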

What’s really needed is a large body of annotated speech recordings, millions of hours of them.  In my presentation at SpeechTek last year, I suggested one way such a speech corpus could be accumulated.  On a free video storage site, provide machine transcriptions of the audio associated with user-created videos.  Offer users the ability to search for specific events in their videos based on words in the sound track and also offer them a simple interface to correct the machine transcription.  Useful search is the incentive; a human-verified transcription is the byproduct.  I recognize these annotations would not be "professional", but it’s more productive for researchers to focus on automating quality measures for millions of amateur annotations than to produce another 1000 hours of professional transcription.

When I prepared my 2005 presentation, I hadn’t heard of YouTube or Google Video — two sites that appear to exactly match what’s needed.  I doubt YouTube has the resources or can spare the attention to tackle this, but Google has the resources and the researchers — after all they did attract Kai-Fu Lee last year.

August 04, 2005

More Interesting Items from SpeechTEK

I'm back in Boston now looking at some notes (on a napkin, no less) and a few business cards... In no particular order...

On Monday morning in the AVIOS track, Yoon Kim, CEO of Novauris, gave a presentation on their technology.  Novauris appears to have recruited core members from the old Dragon Systems team (both US & UK).  The interesting things I heard were their approach to recognizing long free-form speech inputs (more than a few seconds) by combining HMM acoustic and phonetic models with text search technology.  They also couple information forward and backward, for example using acoustic decoder error models to select multiple data sets to feed to later stages.

On Monday afternoon, Mark Clements, Professor at Georgia Tech and co-founder of Nexidia, gave an interesting talk on language identification. Not surprisingly, the front end of a language identification system is similar to that of a speaker-independent speech recognition system using HMM processing to identify acoustic and phonetic elements in the unknown audio sample. If you were to separately test an unknown speech sample against models trained for English and for French you could rapidly determine if the phonetic elements in the unknown speech were those of an English speaker or a French speaker.  Of course, if there were 20 or 100 potential languages, it would be computationally expensive to run each speech sample against 20 or 100 different recognizers.

Here's the trick I hadn't thought about. If you look at the errors when running French, German and Chinese samples against a recognizer trained on English models, you get distinctive error signatures. These signatures, from the English-trained recognizer, are a very strong indicator as to whether the sample is French, German or Chinese. Using this approach, the Nexidia team has built a language identifier which uses only 5 to 7 language specific models to identify samples in 20 or more languages with very high accuracy.
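As a rough illustration of the error-signature idea (and only that: this is not Nexidia's method, which wasn't described at that level of detail), imagine tallying which kinds of mistakes the English-trained recognizer makes on a sample and comparing that tally against a reference profile for each candidate language.  The error labels, the cosine-similarity comparison and all function names below are assumptions of mine.

```python
import math


def normalize(histogram):
    """Turn raw error counts into proportions that sum to 1."""
    total = sum(histogram.values())
    return {label: count / total for label, count in histogram.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(label, 0.0) for label, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(sample_errors, signatures):
    """Return the language whose reference signature best matches the sample."""
    sample = normalize(sample_errors)
    return max(signatures, key=lambda language: cosine(sample, signatures[language]))
```

Here sample_errors would be a dict of error-type counts from one utterance, and signatures a dict mapping each language to its typical error profile, learned from labeled samples.  The point is that one recognizer pass produces evidence about many languages at once.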

Finally, from my Tuesday evening discussion with Skip Cave of Intervoice, comes a recommendation to check out Language Computer Corporation's open domain question answering software that extracts answers to free form questions from heterogeneous text databases.  As described by Skip (whose opinions I respect!), there was some amount of domain tweaking required (to deal with specific corporate knowledge areas, like HR or sales management), but since then the LCC software has been producing real, useful answers to arbitrary questions from technical and non-technical people.

 

 

Large Speech Corpora - Dan Bricklin's Insights

Just when I thought I'd finished with this topic (point 3 in my presentation on Monday, revisited in yesterday's post), I see an interesting post on O'Reilly's Radar this morning, summarizing a paper from Dan Bricklin in 2001, The Cornucopia of the Commons.  Here's the summary from O'Reilly:

The insight, which Dan outlined in his paper, The Cornucopia of the Commons, is as follows: There are three ways to create a collective work: 1. Pay people. 2. Get volunteers. 3. Architect your product in such a way that people create collective value by pursuing their individual self-interest. By way of example, Yahoo! built their directory using method 1. Many open source projects as well as shared content projects like Wikipedia use method 2. But many of the great successes of the Internet age have discovered method 3.

What I've been proposing as the way to accumulate large speech corpora is to pursue Dan Bricklin's method 3, i.e. provide ways for people to contribute while pursuing their individual self-interest.

August 03, 2005

Large Speech Corpora

In my presentation on Monday I suggested we could build large (millions of hours) public speech corpora if we thought of the problem from an Internet and community software point of view.  I explained this with allusions to Google's use of links to compute page-rank, Amazon's success with volunteer reviewers and Wikipedia's community-created encyclopedia. And I went so far as to suggest one specific approach based on videos from camcorder owners.  Look back at my earlier blog entry for details.

Last night I had a very enjoyable discussion with Skip Cave, Chief Scientist at Intervoice. Skip had missed my presentation on Monday, so at one point I ran over what I'd said. When I described my camcorder approach, Skip immediately suggested a podcasting approach might be an even faster way to generate a large, accurately transcribed speech corpus.

Skip suggested offering free services for podcasters including the ability to have a machine transcription generated for each audio file in their podcasting feed. Combine the transcriptions with a wiki-like user interface so the podcaster, or any listener/reader who views the transcription, can easily flag, and optionally correct, any errors in the machine transcription. Having a machine transcription would be a major benefit to podcasters and their audience as it would facilitate rapidly locating specific subjects in an audio file. And given an easy way to correct transcription errors, the podcaster and/or their audience would likely do the necessary editing. If the user interface software noted how much of the file was examined by anyone who was motivated enough to make a correction, it would be possible to flag which machine transcribed content had been "proof read". As an extra safety, one could require transcriptions be checked by several different users before being judged correct.
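The bookkeeping for that "proof read" flag could be quite small.  Here is one possible sketch, assuming the transcript is split into numbered segments and the user interface reports which segments each user had on screen; the class and method names are hypothetical, not part of any existing service.

```python
class TranscriptReview:
    """Track which transcript segments have been examined by enough users."""

    def __init__(self, num_segments, required_reviewers=3):
        self.num_segments = num_segments
        self.required_reviewers = required_reviewers
        self.reviewers = {}  # segment index -> set of user ids who examined it

    def record_review(self, user_id, first_segment, last_segment):
        # Called when a user has examined (and possibly corrected) a range
        # of segments; repeat visits by the same user count only once.
        for segment in range(first_segment, last_segment + 1):
            if 0 <= segment < self.num_segments:
                self.reviewers.setdefault(segment, set()).add(user_id)

    def verified_segments(self):
        # Segments checked by at least the required number of distinct users.
        return [
            segment
            for segment in range(self.num_segments)
            if len(self.reviewers.get(segment, ())) >= self.required_reviewers
        ]

    def coverage(self):
        # Fraction of the transcript that counts as proof read.
        if self.num_segments == 0:
            return 0.0
        return len(self.verified_segments()) / self.num_segments
```

Only the verified segments would be released into the corpus; the rest stay flagged as raw machine output.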

While there are fewer podcasters than camcorder users today, the podcasting community is growing far more rapidly. Also podcasters and their audiences are a lot more Internet-savvy, so Skip is onto something. If we're seeking millions of hours of correctly transcribed speech, enlisting help from podcasters and their audiences could get us there more rapidly than working with amateur videos taken by camcorder owners.

In any event, what's important is the concept.  Think of interesting community projects where a large, evolving speech corpus is a byproduct of something else that participants value. Your suggestions are encouraged. Use the comment form below or take your idea to the venue of your choice.

August 02, 2005

Disruptive Trends & Speech Technology

As promised, here's what I attempted to convey in my presentation at SpeechTEK yesterday.  The PowerPoint for my presentation should be available soon, however on Susan Berkley's advice I didn't actually use my (mostly text) presentation but instead just talked.

The speech industry has had 20-25 years of continuous improvement and, throughout the 80s and the 90s, a continuous stream of new start-ups bringing the latest from academia to the "real world." In recent years, that's changed. The dominant trend has been business consolidation.  Have we run out of interesting new things (that are also useful)?  Maybe in the short term, but I argue, not in the longer term (say 2 - 5 years)!

I cited trends in mobile communications, silicon evolution and the Internet that suggest we're poised for a new round of progress. These trends have three main impacts which I treated in order from moderately significant to very significant.

The first has to do with speech recognition over the telephone.  Here there are two encouraging trends.  First, the advent of Skype is making people aware of wideband audio and the fact that telephony speech doesn't have to sound so bad. As most of my audience was already aware, traditional telephone speech is missing the high frequencies making it hard for anyone, human or machine, to tell the difference between a spoken "c", "z" or "e". Skype is showing the world that telephony doesn't have to sound bad.  With audio over IP, telephony can be HiFi.  Of course a telephony revolution may take a decade or more, but awareness is the first step.

A trend with the potential for more immediate impact is the advent of mobile handsets that support simultaneous voice and data.  Until recently mobile handsets that supported data forced you to choose either voice or data -- you couldn't use your data connection while you were talking.  Now that simultaneous voice and data is becoming possible, we have the opportunity to deploy the Aurora Distributed Speech Recognition (DSR) technology that was standardized by ETSI some years ago.  If you are not familiar with DSR, it provides a way to extract the acoustic parameters needed for speech recognition using software on a mobile handset and then send those parameters to a recognition server over a data path independent of the normal voice path.  This optimizes for battery life on the handset while avoiding the speech coding degradations imposed by normal mobile phone technology. Up to this point, DSR has not been widely deployed because there was no way to include the DSR data in a normal mobile phone call.  With emerging mobile phones (supporting simultaneous voice & data) we have the potential to include DSR (over mobile data) with any normal voice call. This approach is possible today and will be increasingly viable as these new mobile phones are deployed over the next few years.

My second major point had to do with algorithms.  The potentially disruptive trends here are the emergence of multiple CPU cores per chip and the development of supercomputers built from hundreds or thousands of commodity computers.  For 30 years the speech recognition industry has leveraged increasing CPU clock speeds.  Yes, we've also gone from 16-bit to 32-bit to 64-bit CPUs, but the dominant trend has been increasing clock speed. However, clock speed increases will play a smaller role in the future as silicon evolution becomes dominated by multi-core approaches.  Over the next 3-5 years, we'll see 2, 4, 8 and even 16 Intel CPU cores per chip, with only modest increases in clock speeds.

Going forward, the speech industry can take advantage of these trends but, at a minimum, we need to rewrite existing software to leverage parallel processing. Ideally, this transition will foster significant new algorithmic approaches whether they are relatively specific changes like those suggested by Shinozaki & Furui or the outcome of major research efforts like those suggested by Jim Baker under the rubric "Extreme Speech Recognition".

My third and final point emerges from community projects and social software efforts which leverage the extremely low transaction costs that are possible with the Internet. I referred to phenomena like open source software, Wikipedia and the reviews posted on Amazon. In another vein, Google mines information (web links), that hundreds of millions of web sites have posted for their own purposes, to compute the values (the "page rank") of pages on the Internet.

Aside: In 1937 the economist Ronald Coase (rhymes with hose) explained the emergence of firms -- corporations and the like that aggregate services "in house" in order to reduce the transaction costs of acquiring similar services in the market place.  In the emerging studies of the Internet's impact on social and business structures, there is a very interesting paper by Yochai Benkler entitled "Coase's Penguin" in the Yale Law Journal.  "Penguin" in this case is the Linux mascot as Benkler examines the nature of cooperative projects that leverage the Internet to drive transaction costs so low that non-monetary issues predominate.

How does this relate to speech recognition? Further progress in speech requires access to large speech corpora.  Today, there are private organizations that have audio data they can't share due to privacy reasons or won't share for business reasons. Publicly available speech corpora include thousands of hours of speech -- perhaps ten or twenty thousand -- but not the millions or tens of millions of hours in multiple different languages that will be needed to leverage the massively parallel speech engines of the future.  But we could obtain tens of millions of hours of public, annotated speech data, if we think about the problem in new ways.  I gave one possible example...

Some of you may be familiar with Flickr, a website which hosts photographs for people. While you can make your postings private, more than 80% of the posted photographs are made public. Consider if you had an equivalent web site for camcorder videos...  There may be fewer camcorders than cameras, but there are still tens of millions of camcorders in use and with each video there is a sound track -- typically people speaking.  Suppose you provided machine transcriptions for the audio associated with all these videos.  That could be a benefit to the amateur who made the video if it improved their ability to search within the video.  If you also provided a really simple user interface, you could get users to flag transcription errors and, in many cases, correct errors. What if we matched the Flickr growth rate?  (The following is from Google Answers):

A June, 2005 news report citing a “company spokesman” states that Flickr has 775,000 registered users and 19.5 million photos and a 30 percent monthly growth rate.

The month before Yahoo! acquired Flickr, Stewart Butterfield, Flickr’s CEO, stated in an interview that Flickr had 270,000 users and 4 million photos.

Even without that kind of growth rate it's not unreasonable to think of acquiring millions or tens of millions of hours of speech recordings, with user-corrected machine transcriptions, over a period of a few years.  What's needed is a new way of thinking about the issue -- one that draws on emerging trends in community projects facilitated by near zero transaction costs.
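To put a number on what a 30 percent monthly growth rate means, here is a back-of-the-envelope sketch.  The starting and target hour counts in the note after the code are invented purely for illustration; only the 30 percent figure comes from the report quoted above.

```python
import math


def months_to_reach(start, target, monthly_growth):
    """Whole months of compound growth needed to go from start to target."""
    if start >= target:
        return 0
    return math.ceil(math.log(target / start) / math.log(1 + monthly_growth))


def annual_multiple(monthly_growth):
    """How many times larger the collection is after twelve months."""
    return (1 + monthly_growth) ** 12
```

At 30 percent a month a collection grows roughly 23-fold in a year, so a hypothetical site starting with 10,000 hours of audio would pass 1,000,000 hours in 18 months.  At a much more modest 10 percent a month the same jump takes 49 months, which is still within "a few years".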

Finally, I closed with a variant of my usual upbeat view of the future, i.e. the spread of mobile phones and the Internet is having a dramatic positive impact on mankind; speech is the most natural user interface; and growth in the underlying technologies (per Moore's Law) supports continued improvement in speech recognition performance.  So, speech technology will remain an exciting field to be working in -- one that will undoubtedly generate multiple new rounds of excitement (and new companies) in the years ahead.

July 28, 2005

Speaking at SpeechTek on Monday

Next week I'll be at SpeechTek in New York City and, on Monday afternoon, I'll be speaking on the Emerging Technologies panel that's scheduled for 3:30 pm to 5:30 pm as part of the AVIOS-sponsored Advanced Speech Technologies Symposium organized by Bill Scholz. The other panelists are David Pearce from Motorola Labs (UK), Mark Clements, Professor, Georgia Institute of Technology and Juan Gilbert, Associate Professor, Auburn University.

I'm an outlier in this crowd.  The other panelists are all serious speech technology researchers.  I've long been interested in speech technologies, I've used speech technologies for many years (at NMS we provide multi-vendor speech technology with many of our platforms) and I've been friends with a variety of serious speech technology researchers over the past twenty plus years.  But I'm not a serious speech researcher myself.

Apparently I gave a rousing presentation on "Future Platforms for Speech" at an AVIOS conference some years ago which Tom Schalk and Bill Scholz still remember.  In any event, Tom called to recruit me a few months back.  Hopefully, I've come up with some relevant points for Monday's presentation.  If you are attending SpeechTek, please come by and participate (listen, throw fruit, whatever...).

If you will be far from NYC on Monday, I'll try to provide a synopsis here in my blog shortly after the event.


Copyright 2007 NMS Communications
