Dialogue State Architecture

The Dialogue State Architecture uses dialogue acts, rather than simple frame filling alone, to drive generation; it is currently used more in research.

NLU: extracts slot fillers from the user’s utterance, using ML
Dialogue State Tracker: maintains the current state of the dialogue
Dialogue policy: decides what to do next (think GUS’ policy: ask, fill, respond), but nowadays with more complex dynamics
NLG: generates the response
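
a minimal sketch of how the four components above could be wired together. Everything here is a toy assumption for illustration: the restaurant frame, the slot names, the keyword lookup standing in for an ML-based NLU, and the template NLG are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical frame for a toy restaurant domain.
REQUIRED_SLOTS = ("cuisine", "area")


@dataclass
class DialogueState:
    slots: dict = field(default_factory=dict)


def nlu(utterance: str) -> dict:
    """Toy NLU: keyword lookup standing in for an ML slot filler."""
    vocab = {"cuisine": ["french", "thai"], "area": ["north", "south"]}
    words = utterance.lower().split()
    return {slot: w for slot, values in vocab.items() for w in words if w in values}


def track(state: DialogueState, new_slots: dict) -> DialogueState:
    """Dialogue state tracker: merge newly extracted slots into the state."""
    state.slots.update(new_slots)
    return state


def policy(state: DialogueState) -> tuple:
    """GUS-like policy: ask for the first missing slot, otherwise respond."""
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return ("request", slot)
    return ("inform", dict(state.slots))


def nlg(act: tuple) -> str:
    """Template NLG: turn the chosen act into a sentence."""
    kind, arg = act
    if kind == "request":
        return f"What {arg} would you like?"
    return f"Searching for {arg['cuisine']} food in the {arg['area']}."


state = DialogueState()
for user_turn in ["I want French food", "somewhere in the north"]:
    state = track(state, nlu(user_turn))
    print(nlg(policy(state)))
```

the point of the split is that each stage can be swapped independently, e.g. replacing the keyword NLU with a tagger without touching the policy.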

dialogue acts

dialogue acts combine speech acts with underlying states

slot filling

we typically do this with BIO tagging on top of BERT, just like NER tagging, except that we tag for frame slots.

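the output at the sequence-level classification token (`[CLS]` in BERT, which sits at the start of the input rather than the end) may also work to classify domain + intent.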
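
to make the BIO scheme concrete, here is a small sketch that collapses per-token BIO tags into slot values. The tagger itself is not shown; the example sentence and the tag names (`DES`, `DEPTIME`) are made up for illustration, and the tags are written by hand rather than predicted by a model.

```python
def bio_to_slots(tokens, tags):
    """Collapse per-token BIO tags into a {slot_name: text} dict."""
    slots = {}
    current_slot, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span begins; close any span that is still open.
            if current_slot is not None:
                slots[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_slot is not None:
                slots[current_slot] = " ".join(current_tokens)
            current_slot, current_tokens = None, []
    if current_slot is not None:
        slots[current_slot] = " ".join(current_tokens)
    return slots


tokens = "I want to fly to San Francisco on Monday".split()
tags = ["O", "O", "O", "O", "O", "B-DES", "I-DES", "O", "B-DEPTIME"]
print(bio_to_slots(tokens, tags))
```

note that this sketch keeps only the last span if the same slot appears twice; a real system would need a policy for repeated slots.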

corrections are hard

folks sometimes use hyperarticulation (“exaggerated prosody”) when correcting, which trips up ASR

correction acts may need to be detected explicitly as a speech act.

dialogue policy

we can choose the next action conditioned on the last frame, the last agent utterance, and the last user utterance:

\begin{equation} \hat{A}_i = \arg\max_{A_i} P(A_i \mid F_{i-1}, A_{i-1}, U_{i-1}) \end{equation}

we can probably use a neural architecture to estimate this probability.
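
a rough sketch of the argmax policy. The action inventory and the hand-written `score_actions` function are hypothetical stand-ins for a learned model; only the final "softmax, then take the most probable action" step reflects the equation.

```python
import math

# Hypothetical action inventory.
ACTIONS = ["request_slot", "confirm", "inform"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_actions(frame, last_agent_act, last_user_act):
    """Stand-in for a learned scorer: one unnormalised score per action.

    This toy version ignores last_agent_act; a real model would condition on it.
    """
    missing = sum(value is None for value in frame.values())
    return [
        2.0 * missing,                               # request_slot
        1.5 if last_user_act == "correct" else 0.0,  # confirm
        3.0 if missing == 0 else -1.0,               # inform
    ]


def choose_action(frame, last_agent_act, last_user_act):
    """Return the argmax action and its probability under the toy scorer."""
    probs = softmax(score_actions(frame, last_agent_act, last_user_act))
    best = max(range(len(ACTIONS)), key=probs.__getitem__)
    return ACTIONS[best], probs[best]


action, prob = choose_action(
    {"cuisine": "french", "area": None}, "request_slot", "inform"
)
print(action, round(prob, 3))
```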

whether to confirm, based on ASR confidence (with thresholds \(\alpha < \beta < \gamma\)):

\(<\alpha\): reject
\(\geq \alpha\): confirm explicitly
\(\geq \beta\): confirm implicitly
\(\geq \gamma\): no need to confirm
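
the tiers above amount to a simple threshold ladder. A sketch, assuming the thresholds are ordered as alpha < beta < gamma; the default values below are arbitrary placeholders, not recommended settings.

```python
def confirmation_strategy(confidence, alpha=0.3, beta=0.6, gamma=0.9):
    """Map an ASR confidence score to a confirmation strategy.

    The default thresholds are placeholders chosen only for illustration.
    """
    if not alpha < beta < gamma:
        raise ValueError("thresholds must satisfy alpha < beta < gamma")
    if confidence < alpha:
        return "reject"
    if confidence < beta:
        return "confirm_explicitly"
    if confidence < gamma:
        return "confirm_implicitly"
    return "no_confirmation"


for c in (0.1, 0.4, 0.7, 0.95):
    print(c, confirmation_strategy(c))
```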

NLG

once the speech act is determined, we need to actually generate it: 1) choose which attributes to mention, 2) generate the utterance

We typically want to delexicalize the keywords (Henry serves French food => [restaurant] serves [cuisine] food), run the result through NLG, then rehydrate (relexicalize) the output with values from the frame.
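
a minimal sketch of the delexicalize / rehydrate round trip, using plain string replacement. The function names are made up for illustration, and the NLG model that would sit between the two steps is omitted.

```python
def delexicalize(utterance, frame):
    """Replace each slot value in the utterance with its [slot] placeholder."""
    for slot, value in frame.items():
        utterance = utterance.replace(value, f"[{slot}]")
    return utterance


def relexicalize(template, frame):
    """Fill [slot] placeholders in a template with values from the frame."""
    for slot, value in frame.items():
        template = template.replace(f"[{slot}]", value)
    return template


frame = {"restaurant": "Henry", "cuisine": "French"}
template = delexicalize("Henry serves French food", frame)
print(template)

# The same template can be rehydrated with a different (hypothetical) frame.
print(relexicalize(template, {"restaurant": "Luigi", "cuisine": "Italian"}))
```

naive string replacement breaks when one slot value is a substring of another word, so a real pipeline would more likely work from the token spans produced by the slot tagger.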