Agents that use language to act on behalf of another person or group.
Challenges
See Challenges of Language Model Agents
Methods
ReAct
See ReAct
Aguvis
Take the AgentNet dataset and fine-tune a vision LM (on top of a Qwen base model) to roll out the remaining sequence of actions given screenshots as input.
We can also add chain-of-thought on top to elicit more explicit reasoning.
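The trajectory-to-training-example step above can be sketched as follows; field names like `screenshot`, `thought`, and `action` are illustrative stand-ins, not the actual AgentNet schema:

```python
# Hypothetical sketch: turn an AgentNet-style trajectory into supervised
# fine-tuning examples for a vision LM. Keys are illustrative, not the
# actual dataset schema.

def make_sft_examples(trajectory):
    """Each step pairs the current screenshot with the next action to predict."""
    examples = []
    for step in trajectory["steps"]:
        examples.append({
            "image": step["screenshot"],          # pixel observation at this step
            "prompt": trajectory["instruction"],  # natural-language task
            # optional chain-of-thought before the action, if annotated
            "target": (step.get("thought", "") + " " + step["action"]).strip(),
        })
    return examples

traj = {
    "instruction": "Open the settings page",
    "steps": [
        {"screenshot": "s0.png", "action": "click(120, 44)"},
        {"screenshot": "s1.png", "thought": "The menu is open.", "action": "click(80, 200)"},
    ],
}
exs = make_sft_examples(traj)
```

With chain-of-thought annotations present, the target becomes "thought then action", so the model learns to verbalize reasoning before emitting the action.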
Formulations
OSWorld
A unified task setup and evaluation.
Motivation: Given language is a universal task specification, can we create a universal digital environment—with unified observation and action spaces?
spec
config
- initial state: how to setup, what to open, what files, etc.
- evaluator: an evaluation script for task being done
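A task config with these two pieces might look like the following; the keys here are hypothetical, in the spirit of OSWorld rather than its actual schema:

```python
# Illustrative task config (hypothetical keys, not the actual OSWorld schema):
# an initial-state setup plus an evaluator that checks task completion.

task_config = {
    "instruction": "Rename report.txt to final_report.txt",
    "initial_state": {
        "open": ["file_manager"],                         # apps to launch
        "files": {"~/Documents/report.txt": "draft contents"},
    },
    "evaluator": {
        # a script/check that decides whether the task was done
        "type": "file_exists",
        "path": "~/Documents/final_report.txt",
    },
}
```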
Observation
screen pixels, screenshots, etc.
Screenshot vs API Tradeoffs
- most websites/applications don't expose APIs
- API outputs are hard to verify quickly, whereas actual mouse actions are easy to verify and stop
action
mouse keyboard controls, move, PyAutoGui style
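A minimal sketch of such an action space, rendering abstract actions as the PyAutoGUI calls an executor would make (rendering to strings keeps this runnable without a display; a real agent would invoke `pyautogui` directly):

```python
# Sketch of a PyAutoGUI-style action space: abstract action dicts mapped to
# the library call an executor would issue.

def render_action(action):
    kind = action["type"]
    if kind == "move":
        return f"pyautogui.moveTo({action['x']}, {action['y']})"
    if kind == "click":
        return f"pyautogui.click({action['x']}, {action['y']})"
    if kind == "type":
        return f"pyautogui.write({action['text']!r})"
    if kind == "hotkey":
        return "pyautogui.hotkey(" + ", ".join(repr(k) for k in action["keys"]) + ")"
    raise ValueError(f"unknown action type: {kind}")

print(render_action({"type": "click", "x": 120, "y": 44}))
# → pyautogui.click(120, 44)
```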
dataset
369 computer-use tasks for evaluation
evals
Claude, for one, is still quite bad at computer use: roughly a 20% success rate versus humans' ~70%.
Interactive Agents
Big question: how do we align agents in an interactive, dynamic way (i.e. without instruction fine-tuning, which is hard)? Language is information that helps agents predict the future; following instructions is a form of world modeling
- instead of instructions => actions (executor)
- instructions => updated belief (world model)
User intent => action shouldn't have an LLM language representation in the middle as a bottleneck.
There is an underlying representation of the user's preferences; you have to use language to coax it out of them.
Dynalang
- build model that takes vision + language as a joint input
- pass it through an auto-encoding representation
- have the world model predict the next encoded representation
Main Idea: modeling language/tokens/images as a joint latent representation over time.
Training objective:
- reconstruction loss against the future representation: using \(R_{i}\) to predict \(R_{i+1}\)
- predict the reward over time
- regularize?
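The three terms above can be sketched on toy precomputed latents; the linear "world model", reward head, and an L2 stand-in for the regularizer are all illustrative assumptions, not the actual Dynalang architecture:

```python
import numpy as np

# Toy sketch of a Dynalang-style objective on precomputed joint latents.
# Shapes, the linear models, and the loss weight are illustrative only.

rng = np.random.default_rng(0)
T, D = 8, 16
z = rng.normal(size=(T, D))          # joint vision+language latents R_1..R_T
rewards = rng.normal(size=T)

W_dyn = rng.normal(size=(D, D)) * 0.1   # linear "world model" (illustrative)
w_rew = rng.normal(size=D) * 0.1        # linear reward head (illustrative)

z_pred = z[:-1] @ W_dyn                              # predict R_{i+1} from R_i
recon_loss = np.mean((z_pred - z[1:]) ** 2)          # future-representation loss
reward_loss = np.mean((z @ w_rew - rewards) ** 2)    # reward-prediction loss
reg_loss = np.mean(z ** 2)                           # L2 stand-in for a regularizer

total_loss = recon_loss + reward_loss + 0.1 * reg_loss
```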
Workflow
- take reward/preferences/behavior data
- use structure learning to recover the relationships between elements in the data
Evaluations
Computer Agent Arena
- an open-source platform for digital AI agents
- users can preference-rank different agent performances
workflow
- select OS environment (Windows, Ubuntu…) to create identical instances
- configure the computers' initial setup using preset scripts, or click through a custom setup (custom setups create a diversity of scenarios to aid generalization)
- we automatically generate interaction scenarios given a user task prompt
- finally, humans perform scoring:
- Correct or Not?
- Which one is Better?
- Safe or not?
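The pairwise "which one is better?" votes can be aggregated into a ranking; a minimal Elo-style update is one common way to do this (illustrative, not necessarily the arena's actual scoring method):

```python
# Minimal Elo-style aggregation of pairwise preference votes into agent
# ratings. This is a standard sketch, not the arena's actual method.

def elo_update(ratings, winner, loser, k=32):
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

ratings = {"agent_a": 1000.0, "agent_b": 1000.0}
for vote in ["agent_a", "agent_a", "agent_b"]:        # hypothetical votes
    other = "agent_b" if vote == "agent_a" else "agent_a"
    elo_update(ratings, winner=vote, loser=other)
```

The update is zero-sum, so the total rating mass is conserved and only relative preference moves the ranking.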
goals
- for eval: evaluate + rank agents
- training: data collection, RL, etc.