A talk by Tao Yu.
Notation
New Concepts
Important Results / Claims
- Challenges of Agent Data Collection
- Challenges of Agent Benchmarking
- Human-agent interaction collection procedure
Questions
Interesting Factoids
- Funny: according to Computer Agent Arena results, Claude Computer Use scores lower than regular Claude, apparently because Claude Computer Use overfit to Ubuntu