Microsoft Oxford (Bing) - Speech Recognition

Update: Microsoft Oxford is now called Bing Speech API

Project Oxford exposes a rich API which paves the way for many interesting business opportunities. The following are my observations after kicking the tires on the “Speech Recognition” feature.

My first stop was the code sample library. I found the library comprehensive, with many code samples for the different features that the API exposes. It even supports platforms other than Windows. At the time of writing this post, a speech recognition code sample was available for download on GitHub here. I recommend using it, as it is well crafted and approachable. The UI (a WPF app) is rudimentary but good enough for a jump start. To use the API, a key is required; it can be generated on the Oxford website. The UI has a button that helps you obtain such a key and even serializes the key for later use. Nice!

Next, I ran the code sample with the provided audio samples. The Speech API got the job done with 100% accuracy. Impressive.

My next step was to try it with my own audio samples. I wondered what the easiest way to test it would be and settled on one of the podcasts I listen to as input. I chose Security Now by Steve Gibson, as all of its episodes are transcribed by a human. The problem I ran into was that the sample code supports only WAV files, but the podcast was in MP3 format. For other file types (like MP3), a code change is needed to first send a SpeechAudioFormat descriptor that describes the layout and format via DataRecognitionClient's sendAudioFormat() method. For me, it was just faster to convert the file to WAV format with the almighty FFmpeg.

Unfortunately, the sample code failed to process my audio sample. That might be due to the free tier's limit of 20 API calls per minute and 5,000 per month, which is not enough to transcribe a full-length podcast (unless you increase the buffer size).

I also noticed that my CPU usage spiked. Process Explorer confirmed that the IDE, in debug mode, was the process behind the sudden increase in resource consumption. I am not sure why, as Microsoft’s servers are doing the heavy lifting here. Right?
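For anyone who wants to script the MP3-to-WAV conversion mentioned above, here is a minimal Python sketch that shells out to FFmpeg. The target format (16 kHz, mono, 16-bit PCM) is my assumption of a typical speech recognition input, not something the sample code documents, and the file name is hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that converts `src` to a mono 16-bit PCM WAV file.

    The 16 kHz mono target is an assumption, not a documented API requirement.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input file (for example an MP3)
        "-ac", "1",               # downmix to one audio channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM, the usual WAV payload
        dst,
    ]


def convert_to_wav(src: str, dst: Optional[str] = None) -> str:
    """Run ffmpeg and return the path of the resulting WAV file."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    target = dst or str(Path(src).with_suffix(".wav"))
    subprocess.run(build_ffmpeg_command(src, target), check=True)
    return target


if __name__ == "__main__":
    # Hypothetical file name; this only prints the command, it does not run it.
    print(" ".join(build_ffmpeg_command("episode.mp3", "episode.wav")))
```

Calling convert_to_wav("episode.mp3") would write episode.wav next to the input, provided FFmpeg is installed.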

With the previous approach failing, I ended up using a short WAV sample. I used a voicemail from the bank, so the voice was clear and without any accent (accented speech can be harder to recognize). Any English speaker would be able to understand it easily. This time, I was able to process the audio file successfully. Unfortunately, I observed low accuracy compared to the stock audio samples. The result is presented below.

Keep in mind:

I am very new to this project/API, so I might not be using it to its full potential.

I used the free tier. This might have limited my testing capabilities.

This project is still in preview.

Original message (Input):

“Your client card was used at a location that is being investigated for card compromise or card copying. It is important that your card will be replaced immediately for your protection. You may visit the nearest branch at your earliest convenience for a card replacement. Thank you for choosing xxx. If you would like to repeat this message press 2”

Transcribed message by the Speech API (Output):

--- Start speech recognition using long wav file with LongDictation mode in en-us language ----

--truncated for brevity--

--- OnDataDictationResponseReceivedHandler ---

********* Final n-BEST Results *********

[0] Confidence=None, Text="Your client card has been patient but if you could just get a hard copy and put your projection."

--truncated for brevity--

--- OnDataDictationResponseReceivedHandler ---

********* Final n-BEST Results *********

[0] Confidence=None, Text="Barbiche ebranch convenient."

--- OnDataDictationResponseReceivedHandler ---

********* Final n-BEST Results *********

[0] Confidence=None, Text="Thank you for choosing our be seeing if you would like to repeat this message to."
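To put a number on "low accuracy", one common measure is word error rate (WER): the word-level edit distance between the original message and the transcript, divided by the number of words in the original. The sketch below is my own illustration, not something the Speech API provides, and the tokenization (lowercasing and dropping punctuation) is a simplifying assumption.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep only word-like tokens (simplified normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Last sentence pair from the post: original message vs. API output.
    original = "Thank you for choosing xxx. If you would like to repeat this message press 2"
    transcribed = "Thank you for choosing our be seeing if you would like to repeat this message to."
    print(f"WER: {word_error_rate(original, transcribed):.2f}")
```

A WER of 0 means a perfect transcript. Values can exceed 1 when the transcript contains many extra words.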
