Big question: how do we align agents in an interactive, dynamic way (i.e. without instruction fine-tuning, which is hard)?
Sequentiality is hard:
- what is the context/motivation?
- how do you transfer across contexts?
- how do you plan?
Key Idea
Language is information that helps agents predict the future; interpreting instructions is world modeling.
- instead of instructions => actions (executor)
- instructions => updated belief (world model)
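To make the contrast concrete, here is a minimal sketch; the class and method names are hypothetical, not from the talk or the Dynalang code.

```python
from typing import Any, Dict

class Executor:
    """Instructions => actions: language is decoded straight into behavior."""
    def act(self, instruction: str, observation: Any) -> str:
        # a policy conditioned directly on the raw instruction text
        return "move_forward" if "forward" in instruction else "noop"

class BeliefAgent:
    """Instructions => updated belief: language is treated as evidence about
    the world, folded into a latent state the agent predicts (and acts) from."""
    def __init__(self) -> None:
        self.belief: Dict[str, Any] = {}

    def observe(self, instruction: str, observation: Any) -> None:
        # instructions update the belief the same way observations do
        self.belief["last_instruction"] = instruction
        self.belief["last_observation"] = observation

    def act(self) -> str:
        # actions are chosen against the future implied by the belief
        return "move_forward" if "forward" in self.belief.get("last_instruction", "") else "noop"
```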
The path from user intent to action shouldn't have an LLM's language representation in the middle as a bottleneck.
There is an underlying representation of the user's preferences; you have to use language to coax it out of them.
Dynalang
- build a model that takes vision + language as a joint input
- pass it through an auto-encoding representation
- have the world model predict the next encoded representation
Main Idea: modeling language/tokens/images as a joint latent representation over time.
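A minimal PyTorch sketch of this idea; the module names and dimensions are my own, and this is a simplified stand-in for the actual Dynalang architecture rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class JointWorldModel(nn.Module):
    def __init__(self, image_dim: int = 512, token_dim: int = 128, latent_dim: int = 256):
        super().__init__()
        # encode image features and language tokens into one joint latent
        self.encoder = nn.Linear(image_dim + token_dim, latent_dim)
        # recurrent dynamics: predict the next latent from the current one
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)
        # decode the latent back to the inputs (reconstruction targets)
        self.decoder = nn.Linear(latent_dim, image_dim + token_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def forward(self, image_feat, token_emb, prev_latent):
        joint = torch.cat([image_feat, token_emb], dim=-1)
        latent = torch.tanh(self.encoder(joint))          # auto-encoded joint representation
        next_latent = self.dynamics(latent, prev_latent)  # predicted next representation
        recon = self.decoder(next_latent)                 # reconstruction of next-step inputs
        reward = self.reward_head(next_latent)            # predicted reward
        return latent, next_latent, recon, reward
```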
Training objective:
- reconstruction loss against the future representation: using \(R_{i}\) to predict \(R_{i+1}\)
- predict the reward over time
- regularize?
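A sketch of how these three terms might be combined, reusing the JointWorldModel above; the specific loss functions, weights, and the regularizer are my own assumptions (the regularizer was left open in the talk).

```python
import torch
import torch.nn.functional as F

def world_model_loss(model, image_feat, token_emb, prev_latent,
                     next_image_feat, next_token_emb, reward_target):
    latent, next_latent, recon, reward_pred = model(image_feat, token_emb, prev_latent)

    # 1. reconstruction loss: use R_i to predict R_{i+1} (here, the next-step inputs)
    target = torch.cat([next_image_feat, next_token_emb], dim=-1)
    recon_loss = F.mse_loss(recon, target)

    # 2. predict the reward over time
    reward_loss = F.mse_loss(reward_pred.squeeze(-1), reward_target)

    # 3. regularizer on the latent (placeholder; exact form unspecified)
    reg_loss = latent.pow(2).mean()

    return recon_loss + reward_loss + 1e-3 * reg_loss
```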
Workflow
- take reward/preferences/behavior data
- use structure learning to infer the relationships between elements in the data (rough sketch below)
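A very rough sketch of this workflow; the record schema and the reward-weighted co-occurrence heuristic standing in for structure learning are my own illustrative assumptions, not from the talk.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Record:
    behavior: str      # what the agent/user did
    preference: str    # stated or inferred preference
    reward: float      # observed reward signal

def learn_structure(records):
    """Link elements that co-occur with high reward, as a crude stand-in
    for a real structure-learning step over the data."""
    edges = {}
    for r in records:
        for a, b in combinations([r.behavior, r.preference], 2):
            edges[(a, b)] = edges.get((a, b), 0.0) + r.reward
    return edges

records = [Record("pick_up_cup", "likes_tidy_desk", 1.0),
           Record("leave_cup", "likes_tidy_desk", -0.5)]
print(learn_structure(records))
```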