I’ve been heads-down with Tempo AI, so I haven’t been able to write much. There is clearly a lot of fervor around Google Glass. Everyone I know wants to try it, and with KP, Google Ventures, and A16Z starting a Google Glass venture fund, and with some other buddies starting a Google Glass incubator called Stained Glass Labs, you know we have reached a tipping point!
My selfish interest in Glass is how we can use it with Tempo to be even more anticipatory and contextual, but my curious interest is actually related to voice. There are significant misunderstandings about how voice-to-text works. The technical component is known as the ASR (automatic speech recognizer). For a machine to accurately translate voice to text, it needs access to millions of samples of human voice to improve the statistical models it’s built upon. This is why each time you speak into your phone, TV, or computer, those utterances (as they are technically called) are sent to a server, stored, and often manually listened to and transcribed to continue improving the algorithm. On this note, I believe Apple recently said Siri stores your voice samples for two years, probably for this purpose.
The big challenge is the input. An utterance over the voice network (e.g., a 1-800 number), one over your phone application on the data network, and one over your car’s Bluetooth are all different: different microphone qualities and different sorts of background noise. This is why Microsoft was rumored for many years to have the better ASR for voice calls (because of Tellme) while Google was better for digital (voice via an application on the data network) because of Android (and both, by the way, behind market leader Nuance). This potentially means you need millions of utterances for each unique microphone, setting, network, device, etc. It’s a lot of work, and it has taken significant engineering investment to get even to where we are today.
To make voice-to-text better, applications often couple it with other fuzzy technologies. For example, Siri is rumored to post-correct the voice-to-text output using NLP (natural language processing). This means that if the output of your utterance is “What is the park,” Siri might post-correct it to “Where is the park,” recognizing that the first output was grammatically unlikely. This is in part why Siri was such a technological achievement: it could take garbage in and still return a meaningful result.
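To make the post-correction idea concrete, here is a minimal sketch of rescoring an ASR’s candidate transcripts with a tiny bigram language model. Everything here is invented for illustration (the five-sentence "corpus," the scoring scheme); a real system would use a vastly larger language model, but the principle is the same: prefer the hypothesis whose word sequence looks most like real language.

```python
# Hypothetical sketch: pick the most "grammatical" ASR hypothesis by
# rescoring candidates against bigram counts from a toy corpus.
from collections import Counter

# Toy corpus standing in for millions of real utterances.
CORPUS = [
    "where is the park",
    "where is the station",
    "where is the nearest cafe",
    "what is the time",
    "what is the weather",
]

def train_bigrams(sentences):
    """Count adjacent word pairs, with <s> marking sentence start."""
    counts = Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
    return counts

def score(sentence, bigrams):
    """Sum of bigram counts: higher means the word sequence is more
    typical of the corpus (no smoothing, for brevity)."""
    words = ["<s>"] + sentence.split()
    return sum(bigrams[(a, b)] for a, b in zip(words, words[1:]))

def post_correct(candidates, bigrams):
    """Return the ASR hypothesis the language model likes best."""
    return max(candidates, key=lambda c: score(c, bigrams))

bigrams = train_bigrams(CORPUS)
# The ASR heard something ambiguous; these are its two best hypotheses.
print(post_correct(["what is the park", "where is the park"], bigrams))
# → where is the park
```

With this corpus, “where is the park” outscores “what is the park” because “where is …” questions about places are more common, which is exactly the kind of correction described above.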
With Glass, Google is collecting a new set of utterances that has never previously existed. The microphone is on your face, and this will yield millions of new utterances to build upon. But what gets me really excited is that Glass rests on your face, so Google could potentially improve the voice-to-text by coupling it with the accelerometer. As you move your mouth, the Glass moves ever so slightly. I don’t know whether this movement is significant enough to be measured or is just noise, or whether it would need to be measured separately while sitting, walking, running, and so on. But if there is enough variance to derive patterns, Google could effectively use the accelerometer in Glass to post-correct the voice-to-text. It’s the equivalent of a machine “lip-reading,” and it could potentially be even more accurate than the voice-to-text itself.
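The “lip-reading” fusion could be sketched as a simple rescoring step. Everything below is an assumption for illustration: the noise-floor threshold, the per-hypothesis motion scores (which in reality would come from a model trained on Glass sensor data), and the blending weight. The one idea it does capture from the paragraph above is the open question of significance: only trust the accelerometer when the utterance produced more variance than resting sensor noise.

```python
# Hypothetical sketch of fusing ASR confidence with an accelerometer-based
# "lip-reading" score. The motion scores and threshold are invented.
import statistics

NOISE_FLOOR = 0.02  # assumed threshold: below this, jaw motion is indistinguishable from sensor noise

def motion_is_usable(accel_samples, floor=NOISE_FLOOR):
    """Trust the accelerometer only if the utterance moved the frame
    more than the sensor's resting noise would."""
    return statistics.pvariance(accel_samples) > floor

def rescore(hypotheses, asr_conf, motion_score, accel_samples, alpha=0.7):
    """Blend ASR confidence with the motion model's score per hypothesis;
    fall back to pure ASR when the motion signal is just noise."""
    if not motion_is_usable(accel_samples):
        return max(hypotheses, key=lambda h: asr_conf[h])
    return max(
        hypotheses,
        key=lambda h: alpha * asr_conf[h] + (1 - alpha) * motion_score[h],
    )

# Invented numbers: the ASR slightly prefers "what," but the jaw-motion
# pattern matches "where" much better.
hyps = ["what is the park", "where is the park"]
asr = {"what is the park": 0.55, "where is the park": 0.52}
motion = {"what is the park": 0.30, "where is the park": 0.90}
trace = [0.0, 0.31, -0.28, 0.25, -0.30]  # jaw movement well above the noise floor
print(rescore(hyps, asr, motion, trace))
# → where is the park
```

The fallback branch is the interesting design choice: if the frame barely moved (say, the wearer spoke through nearly closed lips), the motion evidence is worthless and the system should quietly revert to plain voice-to-text rather than let noise override it.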