Devlog #2 – Dialogue interpretation

Speech and language are complex. Natural Language Processing (NLP) has been a field of research since the 1950s – though people have been speaking for a lot longer than that.

Interpreting what someone means requires taking into account situational context, relationships, cultural specifics, previous conversations, ongoing events, knowledge, memories, and so on. People misunderstand or misinterpret each other all the time, or can’t find common ground to communicate on.

Speech is a key gameplay feature in EverWorld – the player can at any point speak to anyone about anything. This opens up a very wide possibility space of speech that needs to be handled from the first few minutes of the game. A player could say anything from ‘hi’ to ‘it’s great to meet you but could you please give me a pickaxe?’

Recent (2020s) developments in generative AI and Large Language Models (LLMs) have been hugely impressive and have set high player expectations for speech and dialogue in general. These work very differently to a classical NLP approach:

NLP

  1. Tag words with Part Of Speech Tags (e.g. a noun, verb, preposition etc)
  2. Parse dependencies – which words affect which other words
  3. Semantic labelling – add meaning to each word and identify named entities
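As a rough illustration, here’s a toy version of that pipeline. EverWorld itself is written in C#; this Python sketch, its tiny lexicon and its rules are all invented for illustration, not the real system:

```python
# Toy sketch of the classic NLP pipeline: POS tagging, then a very crude
# dependency/semantic pass. A real system would use a trained model or a
# much larger rule set -- the lexicon below is invented for this example.

POS_LEXICON = {
    "give": "VERB", "me": "PRON", "a": "DET", "pickaxe": "NOUN",
    "please": "INTJ",
}

def tag(sentence: str) -> list[tuple[str, str]]:
    """Step 1: tag each word with a part of speech (unknown words -> NOUN)."""
    return [(w, POS_LEXICON.get(w, "NOUN")) for w in sentence.lower().split()]

def find_objects(tagged: list[tuple[str, str]]) -> list[str]:
    """Steps 2-3, very roughly: treat nouns that follow a verb as its objects."""
    objects, seen_verb = [], False
    for word, pos in tagged:
        if pos == "VERB":
            seen_verb = True
        elif pos == "NOUN" and seen_verb:
            objects.append(word)
    return objects

tagged = tag("please give me a pickaxe")
print(tagged)                 # [('please', 'INTJ'), ('give', 'VERB'), ...]
print(find_objects(tagged))   # ['pickaxe']
```

Even this toy version shows the appeal for games: every step is deterministic, inspectable, and fast.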

Generative AI (LLMs)

  1. Convert words to tokens (parts of words that have no linguistic meaning – e.g. ‘happy’ could become ‘ha’ and ‘ppy’)
  2. Pass tokens through each ‘layer’ of the model, with ‘self-attention’ working out which words affect which other words (a little like dependency parsing)
  3. Calculate probabilities of which words should come next
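To make step 3 concrete, here’s a toy sketch of the output end of that loop – subword tokens in, a probability distribution over the next token out. The ‘model’ is just a hand-written softmax over made-up scores; nothing here is learned:

```python
import math

# Toy illustration of the LLM loop's final step: the model emits one score
# (logit) per vocabulary entry, and softmax turns scores into probabilities.
# The vocabulary and scores below are invented -- no real model is involved.

def softmax(logits: list[float]) -> list[float]:
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: 'happy' might be split into subword tokens with no meaning of their own.
tokens = ["ha", "ppy"]

# Step 3: scores for a tiny vocabulary of possible next tokens.
vocab = ["birthday", "hour", "cucumber"]
logits = [2.0, 1.0, -1.0]     # invented scores for illustration
probs = softmax(logits)

print(max(zip(probs, vocab)))  # the most likely continuation
```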

These are high level and generalised descriptions of the approaches and not intended to be exhaustive (or particularly scientific!). NLP can be logic / rule-based and also use machine learning models, whereas generative AI only uses machine learning models. Generative AI can be used to POS tag / parse dependencies and so on, but fundamentally works the same way whether being asked to respond to a piece of dialogue or to label it.

EverWorld characters have a few requirements for them to be able to fulfil the gameplay mechanic of free text conversation. Just thinking about how to interpret dialogue from the player, they need:

  • Access to a lot of game world data
  • A ‘dictionary’ of ‘intents’ – actual meaning behind the words (e.g. the intent for ‘hi’ is Greet)
  • A way to store and save what’s been said to them
  • Fast dialogue interpretation (sub 2 seconds)

Generally speaking, these requirements lend themselves better to a more ‘traditional’ NLP-style approach, which is what we use in EverWorld. We do POS tagging, match strings, identify semantic meaning (e.g. what does ‘it’ mean in a sentence), have a structured list of things that characters understand, and so on.

Access to game world data

Characters need to know everything from where they are, to what they’re feeling, how hungry they are, what they did recently, who they know, what they think of who they know, memories of game events – the list goes on. They also need to know what they don’t know – one individual character doesn’t know everything about the world, and doesn’t know anything about an event that took place on the other side of the map (unless they were told about it).

That’s a lot of structured information (structured meaning data that can be stored in a database or a set of programming objects). Generative AI is not great with structured information – it can handle some up to a point, but is much better with unstructured information. To respond to a single piece of dialogue, an EverWorld character needs access to, and a way to interact with, literally thousands of data points – and to do so performantly.

A lot of information about the game is stored in a SemanticTriple – a class with a subject, predicate and target (the target is also called the ‘object’ in linguistics, but object means something specific in programming). EverWorld uses an entity component system for character information, with each component storing most of its information in SemanticTriples – for example, character preferences are stored like this:

{thisEntity, dislikes, cucumber}

These SemanticTriples are searchable and, where performance requires it, are cached in dictionaries for faster lookups than lists allow. Some information is more easily stored as a custom object – like a ‘GameEvent’ or a ‘Memory’ – where a SemanticTriple isn’t sufficiently granular to capture the meaning.
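For the curious, here’s roughly what the triple-plus-cache idea looks like as code. EverWorld is written in C#; this Python sketch is illustrative only, and the TripleStore name and caching details are invented:

```python
from collections import defaultdict
from typing import NamedTuple

# Sketch of the SemanticTriple idea: a subject/predicate/target record,
# plus a dictionary cache so hot lookups don't scan the whole list.

class SemanticTriple(NamedTuple):
    subject: str
    predicate: str
    target: str   # 'object' in linguistics, renamed to avoid the programming term

class TripleStore:
    def __init__(self) -> None:
        self.triples: list[SemanticTriple] = []
        # Cache keyed on (subject, predicate) -> list of targets.
        self._by_subject_predicate: dict = defaultdict(list)

    def add(self, subject: str, predicate: str, target: str) -> None:
        self.triples.append(SemanticTriple(subject, predicate, target))
        self._by_subject_predicate[(subject, predicate)].append(target)

    def targets(self, subject: str, predicate: str) -> list[str]:
        # O(1) dictionary lookup instead of filtering the list each time.
        return self._by_subject_predicate[(subject, predicate)]

store = TripleStore()
store.add("thisEntity", "dislikes", "cucumber")
print(store.targets("thisEntity", "dislikes"))   # ['cucumber']
```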

Dictionary of intents

Characters need to understand the meaning behind words. There are tens (maybe hundreds) of ways of saying hello – but the meaning behind them all is largely the same: it’s a greeting. Characters might say something in a certain way depending on their personality or opinion (e.g. it’s clear by the way someone greets you if they don’t like you very much, or if they’re a generally grumpy person) – but at the end of the day characters need to know what the underlying intent is. They can also use those same intents to talk to each other, but we’ll cover that in a different post.

For any particular intent, EverWorld has a set of utterances that map to that intent. For ‘Greet’ that could be ‘hi’, ‘hey’, ‘hello’ and so on. That’s a good starting point for understanding what the player means, though beyond very basic conversation things get complex – what does ‘yes’ or ‘ok’ mean without context? For this reason, each Intent has its own set of requirements, which are used to weight and score each possible meaning behind a piece of dialogue. For an Intent like ‘agreeToTellAboutSelf’, the utterance ‘ok’ is valid – but that Intent also has a requirement that the dialogue immediately before it was ‘askToTellAboutSelf’. This lets the same utterance mean lots of different things in different contexts.

Intents themselves have swappable ‘parameters’ – for instance, the intent WhatDoYouThinkOfThisFood has any kind of food or ingredient as a parameter, which means a single intent can cover hundreds or thousands of different sentences. We also use ‘placeholders’ where multiple words could be used – for example ‘can I have a pickaxe’ and ‘could I have a pickaxe’ mean the same thing, and we use an interchangeable ‘token’ for the word can / could, again expanding how many utterances the game can identify for a particular intent.
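A sketch of the placeholder-expansion idea – note that the `{CAN}` / `{ITEM}` template syntax below is invented for illustration, not EverWorld’s actual format:

```python
from itertools import product

# Sketch of utterance 'placeholders': one template with interchangeable
# tokens expands into many concrete utterances. Word lists are invented.

PLACEHOLDERS = {
    "CAN": ["can", "could", "may"],
    "ITEM": ["pickaxe", "axe", "shovel"],
}

def expand(template: str) -> list[str]:
    """Expand every {NAME} slot in the template against its word list."""
    slots = [name for name in PLACEHOLDERS if "{%s}" % name in template]
    results = []
    for combo in product(*(PLACEHOLDERS[s] for s in slots)):
        utterance = template
        for name, word in zip(slots, combo):
            utterance = utterance.replace("{%s}" % name, word)
        results.append(utterance)
    return results

print(expand("{CAN} i have a {ITEM}"))
# 3 CAN words x 3 ITEM words = 9 utterances from one template
```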

Storing and saving dialogue

Characters need to remember what’s been said to them. They need to know if a player insulted them, or told them what their favourite food was, or whatever else. All of this data needs to be both readily accessible but also serialisable so it can be stored in a save game.

Most things that happen in EverWorld cause a ‘GameEvent’. This could be anything from someone opening a door, to harvesting a plant, to saying something. Each GameEvent could contain a single piece of dialogue, but that dialogue could contain several meanings – for example, ‘Hi, how are you? I haven’t seen you in ages’ actually has three different Intents – Greet, HowAreYou, and IHaventSeenYouInAWhile. All of these need to be stored, and characters have to choose which one to respond to (or at least which one to respond to first) – so each ‘said thing’ is stored separately, with a ‘notability’ value which indicates how important it is. This is pretty crucial for realistic conversation – if I were to say ‘hi, how are you, I went to space yesterday, what’s going on?’ – it’s pretty obvious which of those is more important to talk about than the others.
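A sketch of the ‘said thing’ plus notability idea – the class names echo this post, but the shapes and numbers are invented (and it’s Python, not the game’s C#):

```python
from dataclasses import dataclass, field

# Sketch: one piece of dialogue is split into separately stored intents,
# each with a 'notability' score so a character can pick what to respond
# to first. The notability values below are made up for illustration.

@dataclass
class SaidThing:
    intent: str
    notability: float   # higher = more worth responding to

@dataclass
class GameEvent:
    kind: str
    said_things: list[SaidThing] = field(default_factory=list)

    def most_notable(self) -> SaidThing:
        return max(self.said_things, key=lambda s: s.notability)

event = GameEvent("Dialogue", [
    SaidThing("Greet", 0.1),
    SaidThing("HowAreYou", 0.3),
    SaidThing("IWentToSpaceYesterday", 0.9),
])
print(event.most_notable().intent)   # the space trip wins
```

Because each SaidThing is a plain data object, the whole event serialises naturally into a save game.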

Fast dialogue interpretation

Characters need to respond in a reasonable amount of time. They need to both understand what the player has said, and also formulate their own response. Anything beyond a few seconds feels too long, so dialogue interpretation needs to be performant. EverWorld is a locally running single player game, which means no powerful servers in the background processing this kind of stuff.

Unfortunately dialogue necessitates the use of strings – which are not well known for their performance in C#! We go to great lengths to manage strings carefully, minimising memory allocation so the player feels like they’re talking to a living entity rather than a textbox. We use lots of techniques here – object pools, list pools, string builders, regex, multi-threading, caches, early exits – there’s no silver bullet for making it run fast.
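As one example of the caching side of this, here’s a sketch of memoising interpretation results behind a normalisation step, so repeated utterances skip the expensive parse entirely. The parse itself is a stand-in, and this is Python rather than the game’s C# – the principle is what matters:

```python
from functools import lru_cache

# Sketch: normalise the raw text first, then cache interpretation results,
# so 'Hi!', ' hi ' and 'HI' all resolve to one cached entry. The body of
# interpret() is a stand-in for a real tagging + intent-matching pipeline.

@lru_cache(maxsize=1024)
def interpret(normalised: str) -> str:
    # Imagine this is the expensive part: POS tagging, scoring intents, etc.
    return "Greet" if normalised in {"hi", "hey", "hello"} else "Unknown"

def interpret_dialogue(raw: str) -> str:
    # Normalise once so trivially different spellings hit the same cache key.
    normalised = raw.strip().lower().rstrip("!?.")
    return interpret(normalised)

print(interpret_dialogue("Hi!"))    # Greet (computed)
print(interpret_dialogue(" hi "))   # Greet (served from the cache)
```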

Generative AI + LLMs

We’ve experimented with LLMs for dialogue parsing specifically, and there might be some value here in the future as the technology improves. To start with, we’ve had to look only at SLMs (Small Language Models) that can fit in memory – quantised models of about 7B parameters need about 8GB of RAM, which is a lot off the bat. Smaller models (3–4B) can run in about 4GB of RAM but don’t provide great responses. The models themselves also need packaging with the app, which is not a massive issue but is a consideration.

We’ve found that language models aren’t consistent in how they interpret a particular piece of dialogue. When provided with a list of all possible intent names, they choose different ones each time for the same piece of dialogue, and don’t break dialogue up into sub-intents accurately. The limited context windows of language models also mean that providing previous conversation or lots of game world facts doesn’t work particularly well – language models tend to ignore or forget background information when too much of it is provided. Inference can also be slow, and language models will happily ‘hallucinate’ a new Intent that hasn’t been defined, which makes them difficult to use in a game like EverWorld.
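If we ever did go down this route, one obvious guard against hallucinated intents is to accept a model’s answer only when it names a defined Intent, and fall back to the rule-based result otherwise. A sketch – the model call below is faked, and none of this is EverWorld’s actual code:

```python
# Sketch of validating language-model output against the defined intents.
# fake_llm_classify() stands in for a real model call and deliberately
# 'hallucinates' an intent name that doesn't exist.

KNOWN_INTENTS = {"Greet", "HowAreYou", "AskForItem"}

def fake_llm_classify(dialogue: str) -> str:
    # Stand-in for a real model; imagine it sometimes invents intent names.
    return "CheerfulSalutation"   # not a defined Intent!

def rule_based_classify(dialogue: str) -> str:
    # Stand-in for the trusted rule-based pipeline.
    return "Greet" if dialogue.lower().startswith(("hi", "hey")) else "Unknown"

def classify(dialogue: str) -> str:
    candidate = fake_llm_classify(dialogue)
    if candidate in KNOWN_INTENTS:
        return candidate          # only accept intents that actually exist
    return rule_based_classify(dialogue)   # otherwise fall back to the rules

print(classify("hi there"))   # Greet -- the hallucinated intent was rejected
```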

We’ll keep an eye on it as the technology improves, but for now (and the foreseeable future), EverWorld will continue to use logic and rules for dialogue interpretation.