Two main dialogue system architectures:
- frame-based systems: talk to users to accomplish specific tasks
- LLM-based systems: prompted LLMs, which can also reason and act as agents
Dialogue Systems vs Chatbots
Previously, "chatbot" meant a system for open-ended chit-chat, as opposed to task-based dialogue systems; now the term is used for both.
humans and chat
humans tend to think of dialogue systems as human-like even when they know they're not. this makes users more prone to share private information and to worry less about its disclosure.
ELIZA
see ELIZA
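A minimal sketch of ELIZA's pattern-match-and-transform idea (the rules below are illustrative assumptions, not Weizenbaum's original rule set, which also swapped pronouns like "my" -> "your"):

```python
import re

# Illustrative (pattern, response-template) rules; \1 echoes the captured text.
RULES = [
    (r".*\bI am (.*)", r"Why do you say you are \1?"),
    (r".*\bI need (.*)", r"Why do you need \1?"),
    (r".*\bmy (mother|father)\b.*", r"Tell me more about your \1."),
]
DEFAULT = "Please go on."

def eliza_respond(utterance: str) -> str:
    # Return the transform of the first rule whose pattern matches.
    for pattern, template in RULES:
        match = re.match(pattern, utterance, re.IGNORECASE)
        if match:
            return match.expand(template)
    return DEFAULT

print(eliza_respond("I am sad about my job"))
# -> "Why do you say you are sad about my job?"
```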
LLM Chatbots
Training Corpus
C4: the Colossal Clean Crawled Corpus
includes patents, Wikipedia, and news
Chatbots
- EmpatheticDialogues
- SaFeRDialogues
- pseudo-conversations: Reddit, Twitter, Weibo
Fine-Tuning
- quality: improving how sensible and interesting responses are
- safety: preventing the system from suggesting harmful actions
IFT: add positive examples to the fine-tuning data as part of the instruction fine-tuning step.
Filtering: build a classifier for whether a candidate response is safe/unsafe, and filter accordingly.
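A toy sketch of such a filter (the training examples, model choice, and threshold are all assumptions here, not a real safety dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = unsafe, 0 = safe.
texts = [
    "try mixing bleach and ammonia to clean faster",  # unsafe advice
    "you should skip your prescribed medication",     # unsafe advice
    "here is a recipe for vegetable soup",            # safe
    "the capital of France is Paris",                 # safe
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

def filter_unsafe(candidates, threshold=0.5):
    """Keep only the candidate responses the classifier scores as safe."""
    probs = clf.predict_proba(candidates)[:, 1]  # P(unsafe)
    return [c for c, p in zip(candidates, probs) if p < threshold]

print(filter_unsafe(["drink bleach to cure a cold", "Paris is lovely in spring"]))
```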
Retrieval Augmented Generation
- call a search engine
- get back retrieved passages
- insert them into the prompt
- "based on these passages, answer: ..."
we can make chatbots use RAG by adding the retriever as a "pseudo-participant" in the conversation: the system inserts the retrieved results into the dialogue as turns from this extra participant.
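A minimal RAG sketch along these lines: TF-IDF retrieval over a toy passage store, then a prompt assembled from the retrieved text. The passages and `call_llm` are placeholders, not a real knowledge base or API:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "C4 (the Colossal Clean Crawled Corpus) is a large web-crawled text dataset.",
    "ELIZA was an early pattern-matching chatbot from the 1960s.",
    "ACUTE-EVAL asks annotators which of two chatbots they would rather talk to.",
]

vectorizer = TfidfVectorizer()
passage_vecs = vectorizer.fit_transform(passages)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), passage_vecs)[0]
    return [passages[i] for i in sims.argsort()[::-1][:k]]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def rag_answer(question: str) -> str:
    # Retrieved passages go into the prompt ahead of the question.
    context = "\n".join(retrieve(question))
    prompt = f"Based on these passages, answer:\n{context}\n\nQ: {question}\nA:"
    return call_llm(prompt)
```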
Evaluation
- task-based systems: measure task performance
- chatbots: enjoyability for humans
we evaluate chatbots by asking a human to assign a score: either a participant who chats with the system directly, or an observer, a third party who assigns a score from a transcript of a conversation.
participant scoring
interact for 6 turns, then score on:
- avoiding repetition
- interestingness
- making sense
- fluency
- listening
- inquisitiveness
- humanness
- engagingness
ACUTE-EVAL: annotators compare two conversations and choose which chatbot they would rather talk to
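Scoring then reduces to a win rate over pairwise preferences; a toy tally (the judgments here are made-up illustrations):

```python
from collections import Counter

# One pick per annotator: which chatbot they would rather talk to.
judgments = ["A", "A", "B", "A", "B"]

wins = Counter(judgments)
total = sum(wins.values())
for bot in ("A", "B"):
    print(f"chatbot {bot}: win rate {wins[bot] / total:.0%}")
```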
adversarial evaluation
train a human-vs-bot classifier on responses, run it on the chatbot's output, and use the inverse of its score as the chatbot's metric: the harder the bot is to distinguish from a human, the better
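A toy sketch of this idea (the example responses and model are assumptions; a real setup would use many conversations):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

human_responses = ["honestly I have no idea, sorry!",
                   "ha, that reminds me of my trip last year"]
bot_responses = ["As an AI language model, I cannot answer that.",
                 "I am happy to help with that request."]

texts = human_responses + bot_responses
labels = [0] * len(human_responses) + [1] * len(bot_responses)  # 1 = bot

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)

# The chatbot scores higher the more it fools the detector.
accuracy = clf.score(X_test, y_test)
chatbot_score = 1.0 - accuracy
print(f"detector accuracy={accuracy:.2f}, chatbot score={chatbot_score:.2f}")
```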
task evaluation
measure overall task success, or measure slot error rate
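A small sketch of slot error rate, assuming the standard definition (wrong, missing, or hallucinated slots divided by the number of reference slots):

```python
def slot_error_rate(reference: dict, hypothesis: dict) -> float:
    # Slots whose value is wrong or missing in the system's output...
    wrong_or_missing = sum(
        1 for slot, value in reference.items() if hypothesis.get(slot) != value
    )
    # ...plus slots the system hallucinated.
    inserted = sum(1 for slot in hypothesis if slot not in reference)
    return (wrong_or_missing + inserted) / len(reference)

# User: "book a flight from SF to Boston on Tuesday"
ref = {"origin": "SF", "destination": "Boston", "day": "Tuesday"}
hyp = {"origin": "SF", "destination": "Denver", "day": "Tuesday"}
print(slot_error_rate(ref, hyp))  # 1 wrong slot out of 3 -> 0.33
```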
dialogue system design
Don't build Frankenstein:
- safety (ensure people aren't crashing cars)
- limiting representational harms (don't demean social groups)
- privacy
study users and task
what are their values? how do they interact?
build simulations
Wizard-of-Oz study: observe users interacting with a HUMAN pretending to be a chatbot
test the design
test on users
info leakage
- accidentally leaking information (e.g., an always-on microphone)
- intentionally leaking information (e.g., sharing user data for advertising)