By any measure, voice-activated technology is changing the digital landscape. About 43% of the adult U.S. population regularly uses voice search, while one in four Americans uses a smart speaker at home to control the music they listen to or tweak the thermostat with a word. Voice commerce, or vCommerce, is poised to skyrocket from roughly $2 billion in 2019 to more than $40 billion in 2022, as consumers grow more comfortable using voice commands to make purchases online and the Internet of Things helps integrate voice technology more deeply into automobiles, medical devices and other critical machines that are part of daily life.
Speech recognition capabilities have been around for decades, but the hyper-adoption of speech technology by consumers is relatively new, a phenomenon that can be traced back to the advent of the iPhone. In 2008 Google launched an iPhone app called Google Voice Search, which let users query the search engine by voice. By offloading the app’s critical processing functions to its cloud data centers, Google was able to unleash the massive amounts of computing power necessary to process and decipher speech. The resulting paradigm shift in performance significantly boosted the adoption of voice to power Google web searches.
But Apple stole the spotlight in 2011 with the integration of Siri, a feature of the iPhone 4S. Siri’s arrival coincided with significant advances in automated speech recognition, brought about by the application of artificial intelligence and machine learning to the central problem: how to get machines to understand the mechanics of spoken language, a field known as natural language processing. Despite a comprehension rate of only about 83%, Siri spoke to humans in a way machines never had before, opening the door for the first generation of digital assistants.
The current generation of voice technology follows a line of gradually increasing adoption and enthusiasm. Google made voice recognition work. Siri made it sexy. And Alexa brought it home.
In 2014 Amazon introduced the Echo smart speaker, which, according to CEO Jeff Bezos, was inspired by the conversational computer in Star Trek. It featured a digital assistant named Alexa, and the unit represented a critical shift in voice-assisted platforms: an AI-assisted home appliance designed to scale across other devices. Unlike Apple, which had restricted third-party app development for Siri, Amazon’s open API strategy embraced third-party vendors, enhancing Alexa’s capabilities while extending her connection with consumers to every room in the house. “Amazon has managed to expand Alexa’s role in people’s lives to be the place consumers go to program their day,” observes Rachel Reed, senior innovation manager at Dotdash Meredith.
This sticky connection to consumer habits is extremely powerful. According to a recent study by Google, 72% of individuals who own a voice-activated speaker rely on it as part of their daily routine. The main players in digital assistants—Apple, Google, Amazon, Microsoft and Samsung—are optimizing their platforms and services around four core user behaviors: gaining knowledge, getting help with productivity, managing other devices and buying things.
While the voice assistant installed base—a measure of adoption—is smaller on smart speakers than it is on smartphones, smart speakers are outpacing smartphone growth, with a 70% jump in sales between 2018 and 2019. Today Amazon controls about two-thirds of that market, with Alexa serving as the leading home automation hub; Google and Apple are also strong contenders. Samsung’s Bixby is being incorporated into the company’s smart appliances, including televisions and refrigerators. Microsoft has decided instead to channel its efforts into voice-based business applications, integrating the Cortana digital assistant into Office 365.
Like Alexa, both Google Assistant and Siri go beyond their own built-in productivity functions and give users access to thousands of third-party voice apps. (Alexa dominates with over 100,000.) Siri and Google Assistant call them “actions,” while Alexa calls them “skills.” Either way, users can link them together to set up “routines,” which enable multiple tasks using a single command.
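For developers, a skill or action is essentially a small cloud-hosted program that answers the assistant’s requests. The sketch below, assuming Amazon’s ask-sdk-core library for Python, shows the general shape of a minimal custom Alexa skill; the intent name, invocation behavior and response text are hypothetical placeholders, not a real published skill.

```python
# A minimal sketch of a custom Alexa skill backend using Amazon's
# ask-sdk-core library for Python. The intent and response text are
# hypothetical; a real skill also needs an interaction model (intents
# and sample utterances) configured in the Alexa developer console.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type, is_intent_name


class LaunchRequestHandler(AbstractRequestHandler):
    """Runs when the user opens the skill by name."""

    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        speech = "Welcome back. Would you like today's headlines?"
        return (
            handler_input.response_builder
            .speak(speech)
            .ask(speech)  # keep the session open for a follow-up answer
            .response
        )


class HeadlinesIntentHandler(AbstractRequestHandler):
    """Handles a hypothetical 'GetHeadlinesIntent' from the interaction model."""

    def can_handle(self, handler_input):
        return is_intent_name("GetHeadlinesIntent")(handler_input)

    def handle(self, handler_input):
        speech = "Here are this morning's top three stories."
        return handler_input.response_builder.speak(speech).response


# Wire the handlers into a skill and expose it as an AWS Lambda entry point.
sb = SkillBuilder()
sb.add_request_handler(LaunchRequestHandler())
sb.add_request_handler(HeadlinesIntentHandler())
lambda_handler = sb.lambda_handler()
```

Google Assistant actions follow a similar pattern, with a webhook responding to structured requests from the assistant.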
For publishers, understanding and making use of these capabilities presents a unique opportunity to become part of their audiences’ daily routine. The Flash Briefing skill on Alexa and the Narrative News action for Google Assistant, for instance, allow the publisher to deliver preprogrammed updates as part of the user’s news roundup. “By publishing short-form daily content on these platforms,” Reed explains, “brands can connect to consumers and weave their way into daily content consumption, becoming a consistent source of news, information, advice or inspiration. By reducing the number of commands required to activate multiple skills, routines can also improve retention, which is often an issue in the voice space.”
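To make the publishing side concrete: a Flash Briefing skill is driven by a simple JSON or RSS feed that the publisher hosts and refreshes each day. The sketch below assembles one feed item in Python, following the field names in Amazon’s documented Flash Briefing feed format; the titles, text and URLs are placeholders.

```python
# A sketch of a single Flash Briefing feed item, serialized to the JSON
# format Amazon documents for the Flash Briefing skill API. The title,
# text and URLs are placeholders, not a real feed.
import json
import uuid
from datetime import datetime, timezone

item = {
    "uid": f"urn:uuid:{uuid.uuid4()}",  # unique identifier per update
    "updateDate": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.0Z"),
    "titleText": "Your two-minute morning briefing",
    "mainText": "Here are the three stories our editors are watching today.",
    "streamUrl": "https://example.com/audio/briefing.mp3",  # optional audio version
    "redirectionUrl": "https://example.com/briefings/today",
}

# Publishers host this JSON at a public URL and register it when
# configuring the Flash Briefing skill in the Alexa developer console.
print(json.dumps(item, indent=2))
```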
Since the earliest voice interfaces, human-like conversation has been the North Star guiding developers, and the field of “conversational AI” is starting to deliver on the promise of human parity: understanding what’s being said about as well as a human does. One of the aims of conversational AI is to handle conversations involving multiple back-and-forth exchanges, in which the program continuously learns from previous voice-based interactions.
Google has claimed that Meena, its multi-domain chatbot or conversational agent, has reached near human parity and is able to address virtually anything its human counterpart wants to discuss (which is what makes it “multi-domain”). By putting conversation at the center of a new skills kit, Amazon’s Alexa Conversations leverages a deep learning-based approach that facilitates multi-turn conversations within a single skill, with the goal of eventually bringing multiple skills into a single conversation.
As Meena and Alexa hone their conversation skills, one of the next challenges will be interoperability. Today, devices that work with Alexa rarely speak to devices that work with Google Assistant, and hardly any of them speak to Siri. Interoperability would mean that a single, predominant connectivity standard would work seamlessly across the entire ecosystem of smart devices, regardless of manufacturer or device function. Project Connected Home over IP, recently introduced by Amazon, Google and Apple, will focus on home automation systems but should also serve as a blueprint for interoperability across the rest of the Internet of Things.
“This is much bigger than the rise of smart speakers, and it’s not just technology for technology’s sake,” observes Reed. “Voice is replacing the way we interact with our homes, our workplaces and with each other. And it makes sense—speaking is a much more natural form of communication than typing, texting or swiping. Soon, the ability to converse with all of the technology that surrounds us will become the norm.”
Ultimately, the goal of all technology is to make us more efficient: to let us do more with fewer resources, or to squeeze more into our increasingly busy lives. For years we’ve let our fingers do the talking. But when you consider that the average person can talk nearly four times faster than they can type, you can understand how voice speaks to the one thing we all value most: our time.