Since the late 1960s when we first heard HAL in 2001: A Space Odyssey, many have expected a day when we could converse with computers like we do with another person. And by now most of us have encountered a voice-powered virtual assistant, whether Apple’s Siri, Amazon’s Alexa, Google Assistant, Microsoft’s Cortana, or Samsung’s Bixby. Even Facebook is rumored to be getting into voice.
There was a lot of excitement following the release of Amazon’s Echo in late 2014, which introduced the Alexa voice app. But after all of this time, the technology remains primitive and doesn’t really do what we want it to.
An industry legend agrees. “These systems will not be human anytime soon,” says Adam Cheyer, who co-founded both Siri and Viv, a newer voice platform that Samsung recently acquired.
Not that we don’t need them. Screens will continue to get smaller, intelligent systems will surround us more and more, and the number of devices and applications we need to interact with will grow. Voice may turn out to be the only viable interface for such a world. Much as the graphical user interface (GUI) on the Macintosh and, later, Windows revolutionized how we interact with PCs, voice almost certainly will transform computing, eventually.
But the way we talk to voice interfaces today is clumsy, stilted, and different for each application. When we try to accomplish a specific task, we often end up frustrated. What we need is a universal voice interface that taps into a unified set of intelligent data that resides in the cloud in the background. While most voice systems today can accurately translate speech to recognizable text, we still await machines that can understand natural and conversational language, infer a speaker’s intent, and factor in contextual information to deliver a useful response.
It wasn’t supposed to be this way. Cheyer, who began working on speech back in 1996 at SRI International in Menlo Park, says he “imagined a future when we could talk to computers even before I saw a web browser.” Two key companies emerged out of SRI. Nuance focused on turning speech into text, crystallizing the first essential component for virtual assistants. Later, Cheyer began work on Siri’s intelligent voice application inside SRI, layering it on top of Nuance’s software foundation.
Siri became its own company, and the app launched on iOS in February 2010, generating instant buzz. Siri was snapped up by Apple itself months later. Then in the fall of 2011, Steve Jobs stood on stage and announced that Apple would build Siri into the iPhone.
The future for virtual assistants seemed bright. But in fact, explains Cheyer, Apple’s built-in version of Siri had fewer features than the app that had launched in the App Store. That, in turn, had fewer features than the company’s prototype, which itself had fewer than the product the company promised to build when it raised money in 2007.
“When we launched Siri, it did 20 to 25 things well—stocks, weather, time, even restaurant reservations,” says Cheyer. But the service quickly found it had to aggressively manage expectations. “The biggest gap is that consumers don’t know what assistants can do,” Cheyer continues. “People will ask something that they think is reasonable, and it fails and then fails again. That results in users being afraid to explore and sticking to what they know. Users hate to feel stupid.” Even though voice-powered devices have proliferated since those early Siri days, the problem remains. Research firm VoiceLabs finds that now only 3 percent of users who try a new voice application are still using it a week later.
While we are seeing rapid advancements in artificial intelligence and computing power, today’s voice assistants exist in a fragmented landscape that prevents them from taking full advantage of what computers and cloud-based intelligence can do. Each system taps into its own data set. Amazon, Google, and others have each created their own artificial intelligence, and rely on developers to build specific applications or ‘skills.’ Each in effect lives on a little isolated island. Alexa alone has more than 10,000 skills, and they aren’t interoperable and do not share data.
But to achieve what users expect and to create contextually relevant responses, voice systems will need to access multiple skills at once, along with public, subscription, and personal data. A system will need to understand what it should do based on the request, the user’s past behavior, and the context. Today, typically, you have to invoke an app and then tell it what to do, for example: “Alexa, play Complicated by Avril Lavigne on Spotify.” Do you really want to have to use one app to find a restaurant, another to make a reservation, and a third one to send an invitation to your friend?
One company addressing this challenge is SoundHound Inc., which has been around since 2005. Its voice recognition software is used to voice-enable third-party applications, and it also offers its own app to compete with those from Google, Amazon, and Microsoft. Unlike the other assistants, SoundHound works on any operating system or hardware and can access information from multiple partners at once. Data from Uber, Yelp, and OpenTable can be combined to respond to a complex query. You can say things like “Hound, find me an Italian restaurant within three blocks that’s open until midnight and has an outdoor patio and then make a reservation for four at 10 p.m.” Says SoundHound Vice President Katie McMahon: “We realized early on that we would have to concentrate all of our energy in one interface that you could speak to naturally, conversationally, and that was context aware.”
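For readers who think in code, here is a rough sketch of what answering that kind of compound request involves: one step that filters listings and a second that books a table, chained behind a single entry point. Everything in it is an assumption made for illustration. The restaurant data is invented, and `find_restaurants`, `book_table`, and `handle_request` are hypothetical names, not SoundHound’s or any partner’s actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restaurant:
    name: str
    cuisine: str
    blocks_away: int
    closes_at_hour: int  # 24-hour clock; 24 means midnight
    has_patio: bool


# Invented listings standing in for what a directory partner might return.
LISTINGS = [
    Restaurant("Trattoria Uno", "italian", 2, 24, True),
    Restaurant("Pasta Presto", "italian", 5, 24, True),
    Restaurant("Sushi Corner", "japanese", 1, 22, False),
    Restaurant("Nonna's", "italian", 3, 22, True),
]


def find_restaurants(cuisine, max_blocks, open_until_hour, patio_required):
    """Hypothetical 'search' skill: keep only listings that meet every constraint."""
    return [
        r for r in LISTINGS
        if r.cuisine == cuisine
        and r.blocks_away <= max_blocks
        and r.closes_at_hour >= open_until_hour
        and (r.has_patio or not patio_required)
    ]


def book_table(restaurant, party_size, hour):
    """Hypothetical 'reservation' skill: pretend to book and return a confirmation."""
    return f"Reserved a table for {party_size} at {restaurant.name} at {hour}:00"


def handle_request(cuisine, max_blocks, open_until_hour, patio_required,
                   party_size, hour):
    """Chain the two skills so the user makes one request instead of two."""
    matches = find_restaurants(cuisine, max_blocks, open_until_hour, patio_required)
    if not matches:
        return "No matching restaurant found"
    closest = min(matches, key=lambda r: r.blocks_away)
    return book_table(closest, party_size, hour)


# Italian, within three blocks, open until midnight, patio, four people at 10 p.m.
print(handle_request("italian", 3, 24, True, 4, 22))
```

The hard part in a real assistant is everything this sketch skips: turning the spoken sentence into those six parameters, and getting separate companies to expose their data to one interface.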
The world is calling out for a universal language that allows us to interact with machines via voice, just as we already have one in the more primitive domain of the phone or tablet. Today, for instance, the ‘pinch’ has become the universal gesture to zoom in or out of a picture. “There will need to be standardization across voice platforms,” says Mark Webster, co-founder of Sayspring, a platform that helps companies design voice prototypes without coding. “Right now, it’s common for different platforms, like Alexa or Google Assistant, to require different ways of speaking to indicate the same intent.” Such a common language would need to extend beyond platforms into individual applications.
“It’s likely there will be a different way of speaking to systems over people,” Webster continues. “My wife started by saying ‘Alexa, what’s the weather today?’ and now just says ‘Alexa, weather’ as she knows she’ll get the same response. It’s more efficient not to speak to Alexa like a person.”
While the giants all seem dedicated to developing the all-encompassing assistant, others believe that voice-enabling applications within a specific domain may be more viable for now. Voysis, a Dublin-based startup, just raised $8 million to help retailers and e-commerce companies let consumers shop by voice. Says Voysis CEO Peter Cahill: “The narrower the application, the better the experience you can provide.”
We still have a long way to go. We await a single ubiquitous assistant that we talk to in one way, and that can move with us from home to car to our mobile phone, regardless of what company made each of them or what operating system they have. That almost certainly will require decoupling the hardware from the software so users can put their virtual assistant on any device. The big software companies will probably resist that, until some adept startup forces their hand.
So for now, I’ll just use the voice-enabled smart speaker in my living room for playing music, setting timers, and tormenting our Yorkie, Sadie. OK Google, Bark Like a Dog.
Josh Kampel is Techonomy’s president and chief dog whisperer.