The summary
Written for a business audience, this book has two distinct sections. The first provides a gentle, integrated primer on voice technologies, such as automatic speech recognition (ASR), speech to text (STT), text to speech (TTS), voice cloning and natural language processing (NLP), and links these to the human needs fulfilled by voice technology. The second is essentially an extended pitch deck. Unabashedly techno-optimist in outlook, it seeks to grow the market for voice technologies by encouraging the reader to examine their own organisation’s operations for voice technology use cases, and provides a detailed guide to the user research and interface design steps needed to implement a voice technology program.
This is unsurprising, given [a:Tobias Dengel|27212062|Tobias Dengel|https://s.gr-assets.com/assets/nophoto/user/u_50x66-632230dc9882b4352d753eedf9396530.png] is the CEO of WillowTree, an AI and digital product consulting company recently acquired for USD 1.2 billion by TELUS International – a firm which focuses on gathering training data for AI applications. His expertise in human-computer interaction (HCI) and user-centred design (UCD) is evident in the first half of the book, where voice technologies are continually grounded in user tasks and experiences. In the second, his experience shows in the methods advocated for exploring voice use cases, with a focus on human-centred design methods such as journey mapping. Co-author [a:Karl Weber|213951|Karl Weber|https://images.gr-assets.com/authors/1337042384p2/213951.jpg] is an editor; his collaboration with Dengel makes the text readily approachable and succinct; terms unfamiliar to the lay reader are well explained, and the use of acronyms is minimal.
The book draws heavily on examples from industry to highlight key claims; however, some of these are now dated. The Almond assistant from the Stanford Open Virtual Assistant Lab (OVAL) was renamed Genie in 2021, but has not seen any active development for over two years, and the research group has pivoted to working primarily in the large language model (LLM) space. The Open Voice Network’s initiatives on trustworthy voice assistants have now been folded into the umbrella of the Linux Foundation. This is perhaps unavoidable in such a fast-moving space.
Part One – Aligning the use of voice technology to the human need for communication
Each chapter in the first half of the book details a particular human need that is met by voice technology.
The Prologue paints a picture of the transformative power of voice tech, showing how it has been used to help people with physical impairments communicate again – using speech, the most natural form of communication.
The Introduction makes a bolder claim – that voice is a technological revolution – akin to the internet or to the mobile phone: nascent, latent, reaching a tipping point of “ubiquity and popularity” that we should all be prepared for lest it catch us unawares. While acknowledging that voice tech is currently limited in application, and harbours a panoply of challenges, the authors hand-wave these away, pointing to the rapid advances being made across the vibrant voice tech ecosystem – inhabited by companies such as ReadSpeaker, SoundHound, Cerence and others. The sizeable investments made in voice are given as evidence for the technological revolution, but differentiated from over-hyped failures such as blockchain and the metaverse in that voice “fulfills basic human needs”, which are articulated in subsequent chapters.
Speed makes the case for “even marginal improvements in speed/efficiency” when designing user interfaces, highlighting examples such as search engines and online shopping websites to reinforce the point that speaking to machines is often quicker than typing to them. It imagines a world where the keyboard is eschewed in favour of the microphone as the primary mode of data input, because this is faster – and time is money. The physical toll of such a change – can you imagine speaking for the same amount of time you type? – is left unexamined. I wonder what Mica Endsley or other human factors scholars would make of this claim.
The next chapter demonstrates how voice technology meets the need of Safety – by being available to assist when the user is physically incapacitated. One claim made in this chapter is particularly contentious: that having a voice assistant in the cockpit would “prevent crashes and save lives”. While communication between plane and tower is definitely a contributory factor in many incidents, there is no discussion here of the complexity introduced by voice assistants. Imagine, for example, the utterance “engine one out!” being mis-transcribed as “engine won naught!”. Sure, the language model can be weighted for cockpit utterances, but mis-transcription is still rife, even in state-of-the-art systems (Whisper, for example, has a 9.3% Word Error Rate as tested on Common Voice 15).
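To make the mis-transcription point concrete, here is a minimal sketch (my own, not the book’s) of how Word Error Rate, the metric behind figures like Whisper’s 9.3%, is usually computed: a word-level edit distance between the reference and the hypothesis, divided by the number of words in the reference.

```python
# A toy Word Error Rate (WER) calculation: word-level Levenshtein distance
# divided by the number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # substitution or exact match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Two of the three words are wrong, so the WER is roughly 0.67
print(wer("engine one out", "engine won naught"))
```

Even a seemingly modest 9.3% WER means roughly one word in every ten or eleven is wrong, which matters a great deal when the words are safety-critical.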
Knowledge makes the case for voice technology as an interface to the world’s information. Rather than having data at your fingertips, it’s now available on the tip of your tongue – overcoming the limitations of screen real estate. Dengel and Weber also make the case here for voice where users are not computer literate: you don’t have to know how to use a computer to ask a question of a voice assistant. What is not well explained here is that access to knowledge is mediated through millions of APIs – and curating or synthesising their results requires additional capabilities. The potential for commercialisation to skew results in a particular way (such as booking sites preferencing those providers that pay them the most) is left unaddressed. This chapter also touches on voice technology as one of many anticipatory systems – using audio feature detection to infer that an event is about to happen, and respond. What isn’t covered is the downside of this form of machine surveillance, covered well by researchers Joel Stern, Sean Dockray and James Parker in their Machine Listening: Exposed collaboration.
In the chapter on Inclusion, the authors make the case for voice technology building “a more inclusive society”, pointing to advancements in screen readers, speech to text and smart hearing aids as mechanisms that help in “…liberating and empowering individuals who have too long been excluded from mainstream society…”. The challenge of machine translation for the world’s 7,100 spoken languages is also addressed, and inequities in the availability of tooling for under-resourced languages, along with the existing Anglo-centrism of the tech sector, are quite rightly highlighted. Kathleen Siminyu’s work with Common Voice’s East Africa project, which is providing speech data and tools for Kiswahili, gets a mention, which delighted me; however, when I chatted with her, she was unaware of having been featured. Absent was any argument for addressing the lack of investment in low-yield languages – languages whose speakers are not “profitable”. This is likely to remain the purview of NGOs and governments for the foreseeable future, lamentably.
Engagement makes the case for voice technology making life “more creative, entertaining and enjoyable”, pointing to radio and television as earlier emerging technologies whose sheer enjoyability drove adoption. Dengel and Weber speculate about what might happen to voice actors in a time of synthesised voices, seeing both the economic reality of the cost of live narration and, counter-intuitively, the increasing value of human voices in a soundscape saturated by synthetic speech. They go on to link voice tech to the metaverse and to virtual reality, showing how it is a necessary building block in “multi-modal” experiences. Again, there was no concomitant discussion of the ethics of synthetic speech – and importantly, of how “synthetification” – the growing shift to synthetic media – shapes power relations, labour relations and who profits.
The chapter on Transformation ties voice technology to “fundamental changes to business models”, through mechanisms such as biometric voice identification and the aggregation of services into a streamlined, personal offering. It covers the move from click-through rate (CTR) in screen advertising to say-through rate (STR) for voice-enabled advertising; again, however, it does not explore the ethical or societal issues such changes might bring. I’m reminded here of [a:Joseph Turow|41401|Joseph Turow|https://s.gr-assets.com/assets/nophoto/user/u_50x66-632230dc9882b4352d753eedf9396530.png]'s excellent [b:The Voice Catchers: How Marketers Listen In to Exploit Your Feelings, Your Privacy, and Your Wallet|55457694|The Voice Catchers How Marketers Listen In to Exploit Your Feelings, Your Privacy, and Your Wallet|Joseph Turow|https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1605358687l/55457694.SY75.jpg|86483898] – and how voice is being used as a mechanism to target advertising. The chapter goes one step further, exploring the use of vocal biomarkers in health – but again, without the attendant discussion of unintended consequences. Who stands to benefit if a disease can be diagnosed simply through speaking?
Part Two – A program of work for implementing voice technology use cases within the enterprise
Part Two of The Sound of the Future moves from explicating use cases for voice technology to encouraging the reader to implement them, with attendant advice on strategies for doing so.
The chapter on Falling Barriers traces the recent history of voice assistants like Siri and Alexa, positing that what people really want is something more akin to an “all-purpose valet”. This leads into a discussion of technology breakthroughs and the factors which incentivise them, using the COVID-19 pandemic as a case in point – where hands-free, remote interaction provided by voice-enabled devices helped practitioners avoid infection. Here, I would have enjoyed more grounding in the various theories of innovation; however, this book is clearly aimed at a business, rather than academic, audience. The chapter goes on to outline the key layers of the voice technology stack, such as automatic speech recognition (ASR), natural language processing (NLP) and conversational AI, providing a précis of the current state of the art of each and the barriers that remain. The paradigm of “multi-modal interaction” is then introduced, situating voice technologies alongside haptics and visual interfaces as a constellation of interfaces that collectively are shifting how we sense and respond to our cyber-physical world. User trust in voice technologies is then introduced as another barrier which must be overcome to ease widespread adoption, citing in particular the Trustmark Initiative from the Open Voice Network as a signal that this barrier is falling. The chapter concludes with an overview of how Dengel sees the trajectory of development in voice technology: from automation, to business process redesign, to the transformation of business models.
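As a rough illustration of how those layers fit together (my own sketch, not the book’s; the function bodies are hypothetical stand-ins for whatever ASR, NLU and dialogue components an organisation actually uses), consider a toy pipeline:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    transcript: str   # what the ASR layer heard
    intent: str       # what the NLP/NLU layer understood
    response: str     # what the conversational AI layer decided to say

def transcribe(audio: bytes) -> str:
    """ASR layer: audio in, text out (stand-in for a real model such as Whisper)."""
    return "what is my account balance"  # hard-coded for illustration

def understand(transcript: str) -> str:
    """NLP/NLU layer: map free text onto a structured intent (toy rules only)."""
    return "check_balance" if "balance" in transcript else "fallback"

def respond(intent: str) -> str:
    """Conversational AI layer: choose the next thing to say or do."""
    replies = {
        "check_balance": "Your balance is twelve pounds fifty.",
        "fallback": "Sorry, could you rephrase that?",
    }
    return replies[intent]

def handle(audio: bytes) -> Turn:
    transcript = transcribe(audio)
    intent = understand(transcript)
    return Turn(transcript, intent, respond(intent))

print(handle(b"\x00\x01"))  # each layer is a separate, swappable component
```

The point of the layering is that each component can be improved or replaced independently, and each carries its own remaining barriers, which is how the chapter treats them.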
Making voice an integral part of your existing business systems encourages the reader to “seize the opportunities” voice technologies present, by first identifying places where voice technology could be integrated into existing business systems. The authors provide a helpful list of six principles for assessing whether an interaction is well suited to voice integration, and go on to use examples from industry to highlight how these principles are applied.
In the Training voice tools to understand your world chapter, the authors cover a problem that has long faced voice technology practitioners – the domain-specific nature of spoken language. The utterance (spoken phrase) “twelve fifty” has very different meanings in different contexts – it could mean twelve pounds fifty, 12.50pm, 1250g and so on. The advice here is for organisations to identify the “friction points” their customers face, using tools such as journey mapping to better understand those contexts. The chapter goes on to advocate for prototyping of voice technology tools, using UX methods to elicit feedback to guide iterative development, and ensure that the intent of the user – the task the user wants to perform – is matched by the system. The concepts of error flow handling and conversational repair mechanisms are covered here too – essentially serving as a primer on voice user interface design.
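To make this context-dependence concrete, here is a minimal sketch (mine, not the book’s) of resolving the utterance “twelve fifty” against a domain hint; the domain names and mappings are entirely hypothetical.

```python
def resolve_twelve_fifty(domain: str) -> str:
    """Normalise the ambiguous utterance 'twelve fifty' using a domain hint.

    The same spoken phrase resolves to a price, a time or a weight,
    depending on what the surrounding application already knows.
    """
    interpretations = {
        "retail_checkout": "£12.50",   # a price
        "calendar": "12.50pm",         # a time
        "kitchen_scales": "1250 g",    # a weight
    }
    # With no domain hint, the safe move is conversational repair:
    # ask a clarifying question rather than guessing.
    return interpretations.get(
        domain, "ask: 'Did you mean a price, a time or a weight?'"
    )

for domain in ("retail_checkout", "calendar", "kitchen_scales", "unknown"):
    print(f"{domain}: {resolve_twelve_fifty(domain)}")
```

In a real system the domain hint would come from the journey context the authors describe, rather than a hard-coded dictionary.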
Designing and redesigning the multimodal user experience makes the case for voice technologies as part of an omni-channel digital user experience, highlighting voice’s place in an overall brand experience. It discusses how voice can be used to augment and reinforce other digital channels, such as text-based chatbots or graphical user interfaces. Thankfully, there is little hype about the metaverse, which, given its current white elephant status in industry, would have detracted from the argument being made: that voice technology is transformative not on its own, but in concert with other technologies. The chapter includes advice on how to plan an iterative voice user experience (VUX) design process, and, also pleasingly, highlights the need for inter-disciplinary teams and executive support.
The concluding paragraphs reiterate the argument that “successful new technology is about meeting basic human needs”, and that to be successful, companies must adopt voice – or face defeat in the marketplace.
The verdict
This book is helpful for businesses that are making their first forays into voice assistants, voice user experience (VUX) or conversational AI, in particular those coming to it from a product management or business analysis background. The use cases for voice are expansively surveyed, and applicable to many industries. However, the technical detail is too light for those needing a deeper guide to the pitfalls of voice technologies, such as accent bias in speech recognition, ambiguous named entity recognition in natural language processing, or the privacy dangers of voice cloning.
Moreover, while Dengel and Weber correctly identify the many threads of innovation that underpin the current state of the art in voice technology – hardware improvements, advances in deep learning and neural networks, and the availability of more speech data upon which machine learning models may be trained – they gloss over the many challenges in the space: the trust people need to have in assistants, the poor performance of voice technology for accented or disordered speech, the privacy and ethical challenges of requiring user data to be effective, and, above all, the question of who profits from speech data gathered from individual people.
Gesturing to previous technological path dependencies, they hold that
“… this is the story of any new technical wave. It takes years for entrepreneurs, designers, and engineers to shift their thinking to take full advantage of any new technological paradigm.” - Introduction
When I think of the precursors to today’s voice and speech technologies – the Audrey and the Shoebox, the Harpy and the Tangora, Dragon NaturallySpeaking, even as far back as Christian Gottlieb Kratzenstein’s work on synthesising human speech with the “vowel organ” – I can’t help but wonder: are voice technologies really a new technical wave? And in taking full advantage of this new technological paradigm, who is it that is taken advantage of?
If voice is the sound of the future, then we must have other conversations about what that future sounds like – and whose voices are heard.