[MUSIC] Speech is the fundamental underpinning of human social interaction. When we interact with other people, we spend most of the time talking. We talk to communicate things to other people, but talking is also just a general form of interaction that we use to pass the time and to build relationships with people, what we often call small talk. So any conversation has two components: a component of meaning, which is mostly carried by the words we say, and a social and emotional component, which is carried mostly by body language as well as tone of voice. In most of the following videos, I'm going to talk about the social component of body language. But I'd like to start by talking about the meaning component and language itself. If I want a virtual character to have a conversation with me, it has to do a lot of very complex things. First, it needs to know when I've stopped talking and it's time for it to start talking. Body language can help with that, but the easiest way is just to check the microphone volume and see if I'm making a noise or not. It then has to take the audio data of my speech and figure out what the words are. It has to take these words and figure out what they mean in this context. And then it has to come up with a response, another set of words that appropriately answers me. And finally, it has to take those words and generate audio of realistic and, ideally, appropriately expressive speech. These last two steps are the generation of speech, and they include a component of meaning, generating the words, and a component of expression, generating the audio. We can try to do all of this from scratch using an algorithm. Chatbots, pieces of software that generate dialogue, have improved massively over the last couple of years, and they can create some reasonably realistic conversations, at least in constrained domains. But it's still an incredibly difficult problem, and it is very easy for chatbots to get things wrong.
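To make the microphone-volume check mentioned above a bit more concrete, here is a minimal sketch in Python. It assumes the audio has already been split into short frames of samples between -1.0 and 1.0. The function names, the threshold of 0.02 and the three-quiet-frames rule are illustrative assumptions, not values from this lecture, and a real system would need tuning for the microphone and the room.

```python
import math


def rms(frame):
    """Root-mean-square volume of one frame of samples in the range [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(frames, threshold=0.02, quiet_frames_needed=3):
    """Return the index of the frame at which the speaker is judged to have
    stopped talking, or None if they have not started or have not finished.

    threshold and quiet_frames_needed are hypothetical tuning values.
    """
    has_spoken = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            # Loud frame: the player is talking, so reset the silence counter.
            has_spoken = True
            quiet_run = 0
        elif has_spoken:
            # Quiet frame after speech: count how long the silence lasts.
            quiet_run += 1
            if quiet_run >= quiet_frames_needed:
                return i
    return None
```

Requiring a few quiet frames in a row, rather than a single one, is one simple way to avoid treating a short pause between words as the end of the player's turn.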
Speech generation, going from words to audio, isn't easy either. There are a lot of good text-to-speech systems that can generate clear, understandable audio from words. But they lack the expression of a real human voice, and since tone of voice and expression are so important to the social aspects of conversation, having an unrealistic voice can ruin the social impact of an interaction. There have also been big advances in speech generation in the last couple of years, and I expect that what I just said will no longer be true in a few years' time. But doing realistic, expressive conversation in the algorithmic way can be really hard, if not impossible. It certainly isn't something I would recommend to a beginner. Luckily, there's an alternative. We don't need to generate all our speech from scratch. We can just record someone, an actor, performing the speech, and that will give us the audio we need. The quality's likely to be very high, particularly if you've got a good actor and a good microphone, and it will sound realistic and expressive. The downside is you'll only have a small number of possible responses your character can give. Generating speech from scratch allows you to respond appropriately to all sorts of different things. But if your audio is prerecorded, you'll only have a few tens of clips to use. In many situations, though, it's quite easy to make that work, since the nature of the dialogue will be constrained by the narrative of your experience. There are particular things you want the character to communicate, and you record audio for those. Now that leaves us with the other side of the conversation: what the real person is saying. Turning audio into words actually works quite well now. I often use dictation software to write the scripts for my video lectures. Speaking makes the text more natural than if you're always typing it.
And the dictation works pretty well. It makes a few mistakes, but really not that many, and it's probably good enough for interaction with a virtual character. The hard part is figuring out what those words mean and how the character should respond to them. As I mentioned just now, you could use chatbot software, and that might work, but there are a lot of difficulties, particularly if you want to write your own chatbot. There's also another, more artistic problem with allowing freeform dialogue with characters. If you're creating a narrative experience, a fiction, you want to have great dialogue, all of which is really in character. But the problem is that your players aren't scriptwriters and aren't actors. They aren't going to be able to improvise dialogue that will live up to the character that they're playing. So asking them to do so is putting a lot of pressure on them. And when they say things that aren't in character, it will make them feel less embodied in their character and more like themselves. The easiest way to get around this problem is to limit what your user can say to only a few phrases, so that you can be sure what they mean and you can be sure which responses you should give to them. Before your experience starts, you can explain to the player the words and phrases that they can say. The important thing is to be very clear about what will and won't be recognized. You don't want the player trying to say lots of things and the character failing to recognize them all. That can be very frustrating. An idea I particularly like is not to have players speak at all, but to use gestures. There are lots of gestures that we use to communicate with other people. In Western cultures, these include nodding our head to say yes, shaking it to say no, and pointing at things. Yes, no, and indicating objects by pointing can actually communicate quite a lot, as you'll know if you've tried shopping in a country where you don't speak the language.
I know this from my own experience: my uncle was unable to speak or use sign language due to his disability, but he's able to take a full part in family conversations using only simple gestures. Sure, it limits the input from players, but it also puts less pressure on them and focuses the interaction on the character's dialogue. You need to write dialogue where the character takes the initiative and the player is mostly responding to it. That way you save yourself the trouble of understanding comments from the player, and you also know the kind of thing the player's likely to say. If this dialogue is good, it will drive the interaction and feel light. Using standard gestures like nodding your head means it will feel right. The interaction will be plausible because we're interacting with the character in the way we interact with real people, even if it's a bit more limited. So speech interaction is really challenging, but it's still possible to create a compelling experience by using a set of prerecorded audio clips and limiting the possible interactions with the player. This forms the basis of an interaction with a virtual character, and it can be massively enhanced by realistic body language, as we'll see. [MUSIC]
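As a rough illustration of the approach summarized above, here is a small Python sketch of a dialogue where the character takes the initiative, every line is a prerecorded clip, and the player can only answer with a few recognized inputs such as a nod ("yes"), a head shake ("no") or pointing ("point"). Everything in it is hypothetical: the node names, the clip filenames and the `run_dialogue` helper are made up for the example, and the recognition of gestures or phrases is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueNode:
    clip: str  # filename of a prerecorded actor clip (hypothetical)
    # Maps a recognized player input to the name of the next node.
    responses: dict = field(default_factory=dict)


# A tiny, made-up shop conversation driven by the character.
DIALOGUE = {
    "greet": DialogueNode("greet_want_help.wav", {"yes": "offer", "no": "goodbye"}),
    "offer": DialogueNode("offer_point_at_item.wav", {"point": "confirm", "no": "goodbye"}),
    "confirm": DialogueNode("confirm_this_one.wav", {"yes": "goodbye", "no": "offer"}),
    "goodbye": DialogueNode("goodbye.wav"),
}


def run_dialogue(inputs, start="greet"):
    """Walk the dialogue for a sequence of player inputs and return the
    list of clips the character would play, in order."""
    node_name = start
    played = [DIALOGUE[node_name].clip]
    for player_input in inputs:
        node = DIALOGUE[node_name]
        if not node.responses:
            break  # the conversation has reached an ending node
        next_name = node.responses.get(player_input)
        if next_name is None:
            continue  # unrecognized input: the character waits on the same line
        node_name = next_name
        played.append(DIALOGUE[node_name].clip)
    return played
```

For example, `run_dialogue(["yes", "point", "yes"])` walks from the greeting through the offer and the confirmation to the goodbye clip. Because each node lists exactly which inputs it accepts, it is easy to tell the player up front what will and won't be recognized.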