The Future of Content Episode 25: Creating Voice Content

Key ideas

Voice content is everywhere: Interactive Voice Response systems and screen readers are voice content systems that we don’t consider “voice content.”

When we interact with digital content, we’re peppered with different notions of how to interact with that content from the perspective of a visual context, whether it’s a nav bar or breadcrumbs or a site map, or other links.

Creating content is a challenging task, and creating voice content is especially challenging.

Preston So is the Senior Director of Product Strategy at Oracle and the author of Voice Content and Usability. He focuses on how content is changing and adjusting to new user experiences, namely voice experiences and immersive experiences that challenge the notions we have about the web and content.

Voice interfaces are everywhere, even in places and in ways that we don’t realize. IVRs (interactive voice response systems) and screen readers are very compelling examples of voice content interfaces.

More than 35% of households in the U.S. have voice assistants (smart speakers, smart home systems, gaming and VR headsets, etc.), and the number is growing quickly. Preston believes that the rapid increase is due to a technological quickening that has sped up over the last 18 months since so many of us have been homebound due to the pandemic.

I think voice-enabled search is one of those things that can straddle both what we call transactional voice interactions, or task-led conversations, and informational interactions, or topic-led interactions.

Unlike other forms of content production and design, voice content producers often don’t control the means of production or delivery. Everything going through a device is based on the user’s preferences, which isn’t anything we have much control over as content producers. Not only is it difficult to manage a content consumer’s perception of the content, but we also don’t get to control the delivery mechanism.

Preston understands the challenges that come with trying to control every aspect of the content user’s experience. A particular challenge is the very limited subset of voices available when working with voice interfaces, which has implications for people who need to be able to hear someone who sounds like them. Listen to this week’s episode as Preston discusses how to overcome these and other user experience challenges, and be sure to check out his book, Voice Content and Usability!

Preston So

Preston So is the Senior Director of Product Strategy at Oracle and the author of Voice Content and Usability.

Links and important mentions

Stream episode 25 now, or subscribe on your favorite podcast platform below.

Episode transcript

Note: This transcript may contain some minor wording and formatting errors. Apologies in advance!

Todd Nienkerk: Welcome to The Future of Content, I’m your host, Todd Nienkerk. Every episode, we explore content—its creation, management, and distribution—by talking with people who make content possible. Our goal is to learn from diverse perspectives and industries to become better creators. The Future of Content is brought to you by Four Kitchens. We build digital content experiences for ambitious organizations.

Today, I’m joined by Preston So, Senior Director of Product Strategy at Oracle and author of Voice Content and Usability. And we’re here to talk about voice content. Welcome to The Future of Content, Preston.

Preston So: Hey, Todd, thanks so much for having me here today on The Future of Content. I’m a big fan of your show.

Todd Nienkerk: Oh, thank you so much. We’ve known each other for years and years. We go way back in the Drupal space and working at media companies and building websites for magazines and things like that. So we’ve been at the content space a while, you and I.

Preston So: Some would say too long.

Todd Nienkerk: Some! You and I both, and everybody who knows us. So we’re not actually here, though, to talk about all of that stuff. We’re here to talk about your new book published by A Book Apart: Voice Content and Usability. Before we get into that, though, I’m curious to hear a bit more about your background. You started off in this space as a developer, correct?

Preston So: Yeah, it’s interesting because my career has been very serpentine in that way. I’ve been on every single side of the equation when it comes to driving content, whether it’s content architecture, or content strategy, or content design. I started out actually as a web and print designer, so I futzed around with HTML and CSS like the best of us back in those days. But I was also involved in a lot of those, you know, early design studios on the web that did multimedia design. And I worked a lot on things like trifold design, magazine design—those sorts of things. So I have a little bit of a background in publishing.

But you are absolutely correct that the bulk of my career has been spent as a predominantly web technologist, working on every single aspect of the ways in which web content operates. But now these days, I’m much more interested—and I think a lot of us are very interested—in how content is modulating and adjusting to some of the new sorts of user experiences that we deal with today—namely voice experiences—immersive experiences that really challenge a lot of the foregoing notions that we’ve had about the web and content.

Todd Nienkerk: How prevalent are voice interactions these days, because that’s the kind of thing that, OK, as an expert in the space, I know the answer to that question. But as a layperson or somebody who doesn’t really know the numbers, it may seem like who are these people that are talking to their computers? And like, I don’t want Alexa in my house because I don’t want a corporate spy or whatever. Right. Like how prevalent, though, is voice content right now in technology?

Preston So: That’s a deceptively hard or deceptively easy—I’m never sure which is the right way to use that word—question, which is that there’s voice interfaces everywhere you look, to be quite honest. And I think this is one of those things that a lot of people forget: we actually have voice interfaces everywhere around us. It’s not just the Amazon Alexas that might be spying on us or Google Home devices that we use to play music. It’s also the interactive voice response, or IVR, systems that are used across every single phone hotline that we yell in frustration at whenever we have a flight delay or some sort of a hotel issue. And, believe it or not, one of the things that a lot of people forget as well is that the venerated screen reader in web accessibility is also a very compelling example of a voice interface. And it’s, as a matter of fact, probably the most interesting example of an out-of-the-box or off-the-shelf implementation of voice content, which is the subject of my book. However, these days, of course, I think a lot of us have that science fiction-rooted notion of what a voice interface is. Many of us saw the movie 2001: A Space Odyssey and remember HAL. Many of us are also Trekkies and remember Majel Barrett’s voice as the computer in Star Trek. And that form of voice assistant or voice interface is the most familiar today that a lot of us know.

And one thing that I will share is that I think one of the most interesting aspects of what McKinsey calls the technological quickening that’s been ongoing over the course of the past year-and-a-half as we’ve been homebound is the fact that the sales numbers for smart speakers, smart home systems, and also gaming headsets and VR headsets (you know, I wonder why) have all shot through the roof when it comes to demand across the market. And now today, according to some analysts that I completely forgot the name of, but I refer to them in my article for A List Apart, the number of American households that today have voice assistants is well over 35 percent and is growing very, very quickly. Now, me personally, I don’t have a voice assistant here at home. I use my mom’s, which is a Google Home. And, you know, I think a lot of that is because of, you know, of course, the privacy concerns and some of the other issues related to, you know, addiction to technology or dependence on technology that I think a lot of us are dealing with, especially these days.

Todd Nienkerk: Right. Something I’ve noticed or maybe overheard, or I don’t know if this is just a story that gets passed around, is the rise of voice-based search on mobile devices. So I see people who are, you know, they’re holding their phone up like it’s a plate, you know, kind of like perpendicular to the face. And they’re talking into what they think is the microphone, but it’s actually the speaker. And they’re either doing voice-to-text texting or they’re, you know, talking to Siri or whatever they’re doing, like voice-based search. How much of that do you consider to be part of, like, the realm of voice content as you approach it? Or is that really more of a transactional, like an interaction with a piece of technology as opposed to, quote, content?

Preston So: That really digs at, I think, some of the challenges around defining the term “content,” which I think is one of the big reasons, of course, you know, we have The Future of Content podcast. And the notion of content as being something that is primarily information really does present some nuances when it comes to how we define voice interactions and voice interfaces at large. One thing I will say is a lot of people have written about these sorts of taxonomies or classifications for voice interactions. And what I’ll share is, I think voice-enabled search is one of those things that can straddle both what we call transactional voice interactions, or what Amir Shevat in Designing Bots calls “task-led conversations,” and informational interactions, or what he calls “topic-led interactions.” Whether you’re conducting a search for a pizza topping or you’re conducting a search for some sort of ingredient, or you’re conducting a search, for example, for showtimes for Cruella, or you’re searching more importantly for information about Cruella, that is one of those things that really runs the gamut of some of the use cases that we can see with voice interfaces writ large.

I think where we’ve seen a lot of distinction emerge in the field is that the vast majority, the overwhelming majority of voice interfaces, especially voice assistants, that today we interact with are predominantly transactional, which means that our voice interfaces are not really interested in helping us learn about a topic or giving us information. They’re predominantly about completing a task on our behalf, things like playing a song or ordering a pizza with extra pineapple, which is my preference, or going and reserving a hotel room or checking a credit card balance.

There’s comparatively few voice interfaces, though, today, and I think this is still the case even in the last few years, that are rooted in the notion of serving information and serving content that is within the realm of what we traditionally consider content—namely these tracts of information that give us guidance as to how to live our lives or how to proceed.

Voice-enabled search, I think, is one of those interesting conduits, though, because you can definitely have an initial voice search that is searching for content. And then from there on, you kind of leaf through the voice content that emerges. With voice search for a lot of these other transactional interfaces, though, oftentimes you have successive voice searches that you have to complete in order to access, let’s say, the decision at the end that gets you to the goal that you’re trying to achieve. So with search, I think this is a really interesting topic, because with search, you’ve got this notion of all of these website pages that have all this information trapped inside there. How do you extricate that information in a searchable format that can now be emitted through a voice interface in a way that makes sense to a user whose only exposure to that content might have been through the website? And that poses a lot of interesting and vexing questions.

Todd Nienkerk: To what extent do you think voice content needs to have more context, such as not just the words by themselves and the syntax and the definition and the meaning that’s derived from just sort of plain text on a paper, but intent and tone and speed and all of the things that are communicated by human voice, or I should say human communication more broadly, that aren’t necessarily— it’s the problem of writing an email and being misunderstood, right? It’s that you impose your own interpretation of context and intent and tone and all of that on that word. To what extent is voice content uniquely faced with that challenge?

Preston So: There’s two angles that I want to approach this from, and I think it’s really rooted in the history of content up until now and the ways in which our voice interfaces are completely different from the written forms of content that we interact with. One thing that I think happened with the web, which is really interesting and of course a really great revolution in how we deliver and consume content, is the web is fundamentally very different from, let’s say, the microfilm archives or the tabloid pages that characterized the print medium and journalism for such a long time.

The web really revolutionized that because of these very core ideas of the web: links, hyperlinks, calls to action, buttons, form elements—all these things that now are replete with these characteristic traits of what we consider today to be “the web.” What that’s done, however, is it’s also inserted an overcontextualization, I would say, of our content in the web. And what I mean by that is, nowadays, whenever we interact with content that’s digital on the website or on, let’s say, a mobile device, we’re peppered with all of these different notions of how to interact with that content from the perspective of a visual context, whether that’s a nav bar or breadcrumbs or a site map that directly maps out the information architecture or even links that point to, let’s say, some content that’s referenced therein or links that allow us to go down a rabbit hole in Wikipedia. These are really interesting emblems and motifs of the web that really don’t apply to all forms of content. And the closest analogue, of course, to the hyperlink in print media is that “See page A3,” or a “See Metro section for the rest of this article.”

But in a voice environment and in an oral environment, a verbal setting, you lose a lot of those really core aspects of what makes the web great for content and also what makes the written medium great for content. And I’ll point back to what you were mentioning around some of the other things that are conveyed in some of the content that we hear as opposed to read. But when it comes to the ways in which we operate voice interfaces in particular, I think what’s really interesting is the fact that on Amazon Alexa, for example, or on Google Home, there’s really no way for you to underline a piece of text and color it blue. There’s really no way for you to draw a drop shadow around a piece of text and call it a button. The affordances and the ways in which we actually craft these sorts of elements that we’ve treated as gospel for such a long time on the web become very different, especially when it comes to calls to action, like “read more” and “learn more,” which are things you can’t necessarily do on an electronic device. But to go back to what you were saying, I think one of the really interesting things that a lot of people forget, and this is one of the big worries I have about the ways in which conversational interfaces have grown over time and matured nowadays. And this is very different from the way it was in the early ’90s or the early 2000s.

Nowadays, there are these frameworks available, these platforms that are provided through Google, for example, in the form of Dialogflow or some of these other really, really well-known platforms that are multichannel or omnichannel. And what I mean by that is you can create a conversational interface that then manifests as an Alexa skill or a Google Home application or a chatbot on your mobile device, or a WhatsApp messenger bot, or a Facebook Messenger bot. But this fundamentally papers over some of the really important nuances and distinctions that you talked about earlier, Todd, around the written form of how we consume content and the spoken form of how we consume content.

And it’s not just the fact that there are certain distinctions that we should all keep in mind, and that conversation designers do keep in mind, like the fact that the phrase “to whom it may concern” is something we write, but not something we say. And also the fact that the word “literally” is something that we say a lot more than we write in our text. But I think you alluded to a couple of things, right? And in linguistics, these are known as paralinguistic behaviors: these are things that convey meaning that are not encoded or codified within the text itself. And I think this is a realm of linguistics that really is underexplored in some ways, because a lot of the more formal elements of linguistics really focus on the ways in which different grammatical elements come together. But when we think about some of—

Todd Nienkerk: — the visible and definable as words and as punctuation and how you space out paragraphs and, you know, if you write an E.E. Cummings-style poem or all of that. But yet linguistics does not have an equivalent body of study related to the voice or verbal equivalent of that. Is that what you’re saying?

Preston So: Certainly in formal linguistics, you know, if you think about the traditional realm of Chomskyan linguistics, certainly that doesn’t exist in the ways that we might think because frankly, no one speaks the way that they write. We say “er,” or we have all these stammers, or interruptions, or what are called fillers in our speech. And not only that, but we also inflect the way that we speak in very, very different and very unexpected ways.

For example, when you have a sarcastic tone or you’re using, let’s say, a more sighing or wistful tone and you’re saying the same sentence to somebody, what does that actually indicate as far as the subtext? And I think this really indicates that one of the things that I find really interesting about voice is not only what Erika Hall refers to as the notion of content in the conversational setting being communicated through time rather than space, as we find in visual media. It’s also the fact that there’s this whole axis of sound that we now have complete and open access to, namely the fact that you can use sound gestures like klaxons, or you can really convey a lot of things like emotions through the ways that you actually encode some of these things in your voice interface. And that makes it a much more enriched and in some ways a much more multifaceted and nuanced way to build a conversational interface than some of the written conversational interfaces like chatbots, Slack bots, or Facebook Messenger bots.

Todd Nienkerk: So you just raised something that kind of like is giving me like the brain tingle. So prior to doing publishing and things like that, my background was in radio and audio production. And so you’d think that I would already know this and this would be just part of my daily understanding of communication. But what you just said about voice content and voice interactions using sound cues like klaxons or sound effects or whatever, to create an analog indicator or something analogous to a visual indicator, like a text with a red background that says “alert.” Right? Of course, radio, especially morning-drive radio, does this all the time—like this is their whole thing is they have sound beds and sound effects and they’re hitting buttons to play quotes from reality TV shows and all of that garbage. Right? You don’t hear that quite as much in public radio and you can take that or leave it. But that production style, right? Like those have different tones and contexts and audiences and things that they’re trying to communicate. And yet, to me, at least as somebody who has worked in that industry and knows how all of that functions, it has not yet occurred to me to apply the same principles to web or more sort of technology-based, quote, technology-based voice content. Is that something that you’re seeing people are actively starting to use now in voice content as we’re talking about it in this more like Apple, Google, website technology context?

Preston So: It’s a really interesting quandary because I think there’s several aspects of this that are both really favorable to the richness of how we experience the aural landscape today, but also some really big detriments. And that is, you know, first and foremost, the fact that a lot of these foundations that we rely on to build these voice interfaces actually lack the capabilities that we would expect to modulate, let’s say, between a sardonic tone or a more, you know, a more, let’s say, solid tone or stilted tone. And at the same time, there’s also not a whole lot of flexibility or customizability when it comes to introducing some of these things like sound gestures or air horns or things of that nature. I think in some ways, the ways in which these multinational corporations have really driven voice interfaces toward their quote unquote omnichannel frameworks is in some ways a bit of a disadvantage. Because when you have to serve multiple audiences at the same time, when you’ve got people who really want to do the whole tenet of omnichannel publishing, which is “create once, publish everywhere,” but apply that to their conversational interfaces, it really highlights the fact that a lot of these platforms and a lot of these ecosystems are really not yet equipped with the same richness that we can apply to podcasts or talk radio or, you know, let’s say Car Talk on NPR.

And one of the things that I think is really important for designers to think about as well, and this is one of those things that I mention in the book, is that in some ways the concentration of this control over the levers that govern how we can inflect not just the individual voices that we use, but also the soundscapes that surround voice content and provide these affordances for voice content, means that designers have a very limited palette or a very limited toolbox to use because they might want to create these really novel live audiences that don’t exist.

And it’s really difficult for a voice interface designer, let’s say, who is working independently, as a freelancer, to be able to create this really rich voice interface that might use and employ some of these really interesting things like klaxons or sound gestures. One thing that I think is really tough as well is I know you and I are both big proponents of the open web and open source technology, and open source is also one of those things that hasn’t really seen a huge amount of evolution in the voice world. And I think a lot of that is because of this concentration of power in the brains of those who are at Amazon or Google or Apple.

But there are certain standards that have been created for these things, like SSML, which is a standard for defining how synthesized speech should actually sound. There’s also VoiceXML, which is used specifically for creating voice interfaces and these sorts of affordances that you and I have been talking about. But neither of those things are in wide use by some of these major companies, all of which are taking the same approach they’ve taken with their mobile ecosystems, Android and iOS, for example, and defining these very clear lines in the sand where you have to use, for example, Alexa’s form of how we want you to build Alexa or you have to use Google’s approach and nothing else.

Todd Nienkerk: Right? Right. So you’re actually just almost answering this question here. So it seems that unlike most or maybe all other forms of content production and design, voice content producers often don’t control the means of production or delivery. And everything is going through a device and/or the user’s set preferences. So if somebody says, “I want the Australian woman to be the voice of Siri,” or, “I want the, you know, South African guy to be the voice of Siri,” well, now that will have a certain inflection and it will probably be more culturally contextual as a result, or one would hope. But the user set that preference. And if you are a voice content producer, you do not have as much control over that. Not only is it hard to manage a content consumer’s perception of what you do, but you don’t even get to control the delivery mechanism, right? So it sounds like these standards that are being produced, like this VoiceXML and, what was the technology that tries to handle, like, the tone or the different, like, sound effects and things like that?

Preston So: SSML. Yeah. Speech Synthesis Markup Language.

Todd Nienkerk: Yeah, got it. So there are these tools that are trying to provide that capability, but often the technology producers or the device manufacturers do not actually take advantage of this or build it into the systems.

Preston So: Yeah, and I think this digs at a couple of different problems, not only the fact that this really limits the power and the ability for designers to be able to do what it is that is their jobs, which is to really control all aspects and all dimensions of the user experience. There’s two angles that I think are really concerning about this.

The first is, of course, the easier answer to the question, which is that if you think about services like Amazon Polly, which allow you to translate a tract of written text or some prose that’s in a paragraph into various dialects, voices and various people that represent these voices, well, that’s not necessarily faithful to the ways in which we have conversation, as we alluded to earlier. I mean, you know, the fact that I just said, “I mean,” or we always say, “you know,” after we say a sentence—these are not things that we’re going to write out necessarily within the context of text that we write for a voice interface to read out. The more concerning thing, however, for me, and this is something I talk about at length in the book, Voice Content and Usability, is what this does for not only the experience of the user who is able to have this, as you said, cultural context of hearing somebody who might be a little bit closer to their own demographic. But it also changes the ways in which we perceive the voice interface itself.

And one really scary thing that I see happening very soon here is a lot of the issues that we see with these conglomerates around some of the issues of misinformation, political trust, as well as fact checking and automated racism or algorithmic oppression are things that will percolate into voice interfaces as well. And one very interesting example of this is the fact that it’s very hard to imagine being able to hear your own community—especially if you’re a member of a marginalized community—hearing your own community represented by that voice interface.

And if it’s always going to be the same voice over and over again, one that is part of this very limited subset of voices that are available to us as human beings that are working with voice interfaces, what does that do to our level of authority or our level of credibility or trust that we can foster through these voice interfaces? In some ways, that could actually intensify the mistrust or the lack of representation or equity that many of us feel when we’re interacting with someone like Alexa, generally speaking, if you’re using the default voice. And as we know very well, very, very few people actually veer from some of the defaults that we have. It is something that could have very big implications for those who need to be able to hear content by somebody who sounds like them.

Todd Nienkerk: Sadly, what you just said has not really occurred to me as deeply as it probably should have up until this point. I’m reminded of that video that got passed around a few years ago of somebody with darker skin trying to use an automated soap dispenser. And how that device just clearly was never tested on anybody with darker skin. It’s that simple, like film manufacturers when film was being invented and perfected and honed, you know, Kodak and Fujifilm and all the rest: who were the subjects of photography? Lighter-skinned people. Right? So you had these terrible outcomes where people with darker skin were not visible in the photograph, not because they’re invisible. And of course, there’s so much you can derive from both the literal and the metaphorical meaning of that. But it’s just that the chemical makeup of the film was not tuned to distinguish between certain tones of light, like it’s that simple. And so the same thing can be applied to voice. And the way that you describe, you know, algorithmic or automated oppression in voice content, I imagine it not only shows up as a lack of representation, but I’m going to just guess, and I’d love to know if you have some insight here, that it may actually show up in terms of lack of tuning the algorithm to understand what somebody is saying so that maybe people with accents can’t use certain devices because it just simply doesn’t understand what they’re saying.

Preston So: Absolutely. And this digs at, I think, some of the really interesting forms of problems that we see when it comes to representation in technology, you know, writ large. And I’ll illustrate this with a couple of different examples and kind of share, you know, a little bit of the fact that we have a long ways to go when it comes to really understanding the depth of human language and how that might apply to the voice context and voice technology context. Because, frankly, one of the things that I think is really important to note is, as Erika Hall, who I was lucky to have write the foreword for Voice Content and Usability, writes in Conversational Design, voice and conversation is not a new interface.

It’s the oldest interface, if you think about it, because if we go back in time 500 years ago, 1,500 years ago, and show some of these ancient people a computer keyboard or a computer mouse, they’re not going to know anything about some of these things in the same ways that we don’t really know these days how to work with an abacus. But in the same vein, if we were to speak to them in their own language, we could definitely carry on a conversation in Middle English 500 years ago or ancient Greek 1,500 years ago. And one of the things that I think is really important for us to think about as designers is we’re not just talking about some of these artificial forms of technology that today are replete with some of the problems that we’ve identified.

Those are almost easier problems to solve because visual interfaces, physical interfaces—those are things that we have manufactured from zero and from scratch. Voice interfaces, however, rely on this existing undercurrent in some ways of the ways in which we as humans utilize language and the ways in which we actually interact with one another, which is constantly shifting, very spontaneous, very organic, very extemporaneous things that really don’t have any relationship to an artificial or technical or mechanical underlying foundation.

And there’s a certain interesting problem that surfaces, and I’ll share just a couple of examples of this. The first is that, back when I was at Acquia, we had the opportunity to build the first voice interface for residents of the state of Georgia, which is a case study that I cover pretty much over the course of the entire book, Voice Content and Usability. And it was the first among Alexa interfaces that was content driven or information driven. Now, one of the goals of digital services in Georgia has always been to serve their content in accessible and equitable ways. They’ve been very, very big proponents of accessible web technologies for Georgia.gov, their website.

But also they came to us and they said, hey, let’s look at a voice interface, because we want to make sure to be able to serve Georgians who might not feel as comfortable using a screen reader, because screen readers are fundamentally still rooted in and biased towards a visually structured webpage, and also elderly Georgians who might not feel as comfortable moving a mouse or typing on a keyboard as they would having a conversation with somebody who’s at their local deli counter. So a lot of the things that we built in were really oriented towards helping these people who might not have the same access to technology that we have, and that the younger generations and digital natives of our time especially have and actually benefit from.

But there’s one really interesting example that I share in the book, which is that we built this interface that answers any question you might have about the Georgia state government, and that could be something like renewing your driver’s license, getting a fishing license, applying for a small business loan. There was this one search result that came up over and over again, this query that we couldn’t figure out that somebody was searching for. And one of the things that we did for Georgia was to actually put side-by-side the reports in the logs that they would get from their analytics on the website with the analytics for the Alexa device that we built.

There was this one result that kept on coming up over and over again. That was “Lawson’s” (L-A-W-S-O-N-apostrophe-S), and we were like, “What is this?” And it came up like 16 times, you know, and it was like, what is this? A brand name? A proper name? Who’s searching for this term? And I remember it like it was yesterday. We sat around a virtual table, racking our brains for what exactly this person was trying to say.

And as it turns out, one of the native Georgians in the room said, “Oh, I think it’s somebody with a Georgian southern drawl who was trying to say the word ‘license,’” as in driver’s license or nursing license, and couldn’t actually make themselves understood. So this is a really good example of how the foundations that we rely on can already predispose us to some of these really unfavorable and negative outcomes, because no matter how perfectly we built that voice interface, even coming to an inch short of perfection, sometimes even Alexa still has trouble understanding us and beating us at our own game of human conversation.

Todd Nienkerk: Right. And the sense of marginalization, to have to interact with an automated recording. Here you are trying to say “Lawson’s” (and I’m not making fun; I’m just trying to replicate the effect) and you have this machine tell you, “I’m sorry. I don’t understand what you’re saying,” over and over again. That must be infuriating. But you can pronounce the word “license” any number of ways. And how many examples are there even just in U.S. American English, from region to region, like “pecan,” right? That’s the feeling of that, and the frustration. These venues are already so frustrating, but then to be told by this very attuned, Midwest, you know, college-educated voice that you don’t know what you’re talking about is a really bad user experience.

Preston So: It really is. And it’s almost, in some ways, a somewhat dehumanizing or really oppressive experience for a lot of folks. Think about this from several angles, right? The first angle, of course, is: why are people investing in voice interfaces and voice content to begin with? And I think that, of course, gets to the very root of what our jobs are as content practitioners and as designers, which is: what is the purpose of this, and what is actually happening here at the root of why we’re building this voice interface?

And if you look at a lot of the corporations and a lot of the businesses who are adopting voice interfaces, it’s because ultimately they want to substitute or offload some of the work from those who are call center agents or customer service agents who are on the front lines of a lot of these organizations but stand to lose their jobs. And one of the things that I find really problematic about some of the ways in which a lot of the messaging has surfaced around voice interfaces is, oh, hey, we can just have Alexa or Google Home handle this for you. But what happens to the Filipino woman who is the call center person that I contacted to make my hotel reservation? What happens to, you know, this very friendly call center staffer in India who stands to lose his job because now Alexa can do his job much better, potentially, quote unquote?

Let’s say that he can. And I think there are a lot of interesting notions also of what happens from the user’s perspective, not just from the, let’s say, interface’s perspective or the business’s perspective, which is that the ways that we speak cannot be reduced down to this single mode of speech that is reflected in Alexa. And there’s a futurist called Mark Curtis who talks about the conversational singularity, which I love. It’s a compelling idea. It’s a wonderful idea, this notion where sometime in the future, just like the singularity that we’ll have with AI, quote unquote, we’ll get to a point where having a conversation with a voice interface will be indistinguishable from having a conversation like the one you and I are having right now, Todd.

But the big question is not just, well, who actually really wants to have that sort of experience? Because there is usability research that suggests that a lot of users actually prefer, when they’re having a conversation with a voice interface, to have a more rehearsed or practicable kind of conversation. The bigger problem, however, is: the conversational singularity is going to mean that for whom? Because I think one of the things that you and I know very well, living in the places that we do, New York City and Texas, is that there’s such a wide array of people who speak in a very, very different set of dimensions from the ones that we are used to.

And one example of this is, hey, if you’re going to be going to McAllen, Texas, for example, chances are that a lot of folks who are using these voice interfaces could switch between English and Spanish, or they might be having conversations that are bilingual and change midsentence. There are also so many other examples of this, where there are communities of color, for example, who use AAVE to communicate and might not actually hear that represented in some of the voice interfaces that they want to use.

And what does that do to the ways in which that intensifies and deepens and solidifies some of the biases that we have as humans? Since this is Pride Month, another example of this is, of course, the differences between queer and straight-passing modes of speech that so many nonbinary folks or queer folks have to deal with when it comes to interacting with other people in daily life. If we don’t actually encode or codify or represent or include some of these modes of speech and some of these distinctions that we have and some of these nuances in how we as humans actually use language, what is the purpose of voice content and what is the rationale for voice interfaces if not to really get rid of what makes human language so rich and so great in the first place?

Todd Nienkerk: Preston, I could not have ended this conversation more eloquently than you just did, so I won’t even try. And I’m so sad to say that we’re out of time and we have to go. I want to keep doing this forever. This is fascinating. Preston, thank you so much for joining us today. How can our listeners contact you or learn more about you and your most recent book, Voice Content and Usability?

Preston So: Thanks so much, Todd. And I’m sorry to be so longwinded. I wish we did have more time and I wish I kept my answers [short].

Todd Nienkerk: But also, I could listen to you all day.

Preston So: Oh, wonderful. Well, that’s very kind of you. Thank you. So my book is now available. It was just released on June 22, officially, from A Book Apart. That’s abookapart.com to get a copy of that book. Everything that Todd and I talked about over the course of this conversation is represented somewhere in that book to some degree, maybe not to this level of detail, but you can also find out more about some of the things that I work on. I work on basically those two extremes of the content world when it comes to digital content. That means content architectures and content technology, as well as content strategy and content design.

I write on both topics as a columnist for CMSWire. I also have writing in A List Apart about these topics. I also do some writing for Smashing Magazine, and you can find a lot more insights about my work and some of the writing I do on this topic at preston.so, my website. If you want to reach out to me, you can find me on Twitter @PrestonSo. You can also find me on LinkedIn at PrestonsSo. And also you can reach out to me at PrestonSo@oracle.com. Last but not least, I want to do a big shout-out and a big thank-you to Four Kitchens for sponsoring our upcoming conference coming up on July 14 and 15, Decoupled Days, the only nonprofit conference about the future of content architectures. It’s going to be the first-ever free edition of our conference that we’ve ever had in our history. Likely the only one. You can find out more at decoupleddays.com. July 14 and 15. And once again, if you want to see more about my book, check out abookapart.com or my website, preston.so.

Todd Nienkerk: Thank you very much, Preston, and thank you for the Decoupled Days shout-out as well. We’re very, very happy to be a part of it. And thank you for including us.

Well, dear listeners, I’d love to hear from you as well. What do you want to learn about the future of content? Feel free to send show ideas, suggestions or examples of the content you create. You can email me at future@fourkitchens.com. We’re also on Twitter @focpodcast.

To learn more about Four Kitchens and how we can help you create, manage, and distribute your digital content, visit fourkitchens.com. And finally, make sure to subscribe to The Future of Content so you don’t miss any new episodes. Until next time, keep creating content!
