Visual Prediction Task
Given \(n\) frames of video \(x_{1}, …, x_{n}\), predict the \(T\) subsequent frames \(x_{n+1}, …, x_{n+T}\).
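To make the indexing concrete, here is a minimal sketch of how one clip could be split into the \(n\) context frames and the \(T\) target frames. The function name `split_context_target` and the `(frames, H, W, C)` array layout are assumptions made for illustration; the notes do not specify a data format.

```python
import numpy as np

def split_context_target(video, n, T):
    """Split a clip of shape (frames, H, W, C) into context and target frames.

    Hypothetical helper: the layout and name are illustrative assumptions.
    """
    if video.shape[0] < n + T:
        raise ValueError("video needs at least n + T frames")
    context = video[:n]       # x_1, ..., x_n (model input)
    target = video[n:n + T]   # x_{n+1}, ..., x_{n+T} (frames to predict)
    return context, target

# Toy clip: 10 frames of 4x4 RGB.
video = np.arange(10 * 4 * 4 * 3, dtype=np.float32).reshape(10, 4, 4, 3)
context, target = split_context_target(video, n=6, T=3)
print(context.shape, target.shape)
```

Any frames beyond \(n + T\) are simply ignored here; a real data loader would more likely slide this window along the clip.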
Action-Conditioned Prediction Network
Visual prediction task. Two inputs:
- taken action
- image
Predict the next \(t\) frames. The challenge: what is the action? Predicting the action is not easy.
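The two-input setup above can be sketched as a toy model that takes a frame and a discrete action and returns the next frame, then rolls this out for \(t\) steps. Everything here is an assumption for illustration: the class name `ToyActionConditionedPredictor`, the one-hot action encoding, and the single untrained linear layer stand in for whatever network is actually used.

```python
import numpy as np

class ToyActionConditionedPredictor:
    """Toy stand-in for an action-conditioned network (illustrative only)."""

    def __init__(self, frame_shape, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_shape = frame_shape
        self.num_actions = num_actions
        d = int(np.prod(frame_shape))
        # One linear map from [flattened frame, one-hot action] to the next frame.
        self.W = rng.normal(scale=0.01, size=(d + num_actions, d))

    def step(self, frame, action):
        """Predict the next frame from the current frame and the taken action."""
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        features = np.concatenate([frame.ravel(), one_hot])
        return (features @ self.W).reshape(self.frame_shape)

    def rollout(self, frame, actions):
        """Predict len(actions) frames by feeding each prediction back in."""
        frames = []
        for action in actions:
            frame = self.step(frame, action)
            frames.append(frame)
        return np.stack(frames)

model = ToyActionConditionedPredictor(frame_shape=(8, 8, 3), num_actions=4)
frame0 = np.zeros((8, 8, 3))
predicted = model.rollout(frame0, actions=[0, 2, 1])
print(predicted.shape)
```

The sketch assumes the actions are given. When they are not, as in unlabeled video, they would have to be inferred first, which is the difficulty noted above.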
RAFI
Conditional Flow Matching with Video Generation.
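As a rough sketch of the flow matching objective, the code below builds one training pair using the linear-interpolation path \(x_t = (1 - t)x_0 + t x_1\) with target velocity \(x_1 - x_0\). This is one common variant of conditional flow matching and is an assumption here, since the notes do not say which path or conditioning is used. The names `cfm_training_pair` and `cfm_loss` are hypothetical, and conditioning on the context frames is omitted.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Return the interpolated sample x_t and its target velocity.

    x0: noise samples, x1: target clips (same shape), t: one time per batch item.
    Assumes the linear path x_t = (1 - t) * x0 + t * x1.
    """
    # Reshape t so it broadcasts over every non-batch axis.
    t = np.reshape(t, (-1,) + (1,) * (x0.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0  # the velocity is constant along a straight-line path
    return x_t, u_t

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(2, 3, 4, 4, 3))  # 2 clips, each with T=3 frames of 4x4 RGB
x0 = rng.normal(size=x1.shape)         # Gaussian noise with the same shape
t = rng.uniform(size=2)                # one random time in [0, 1) per clip
x_t, u_t = cfm_training_pair(x0, x1, t)

# A real model would predict the velocity from (x_t, t, context frames);
# zeros are used here only as a placeholder prediction.
loss = cfm_loss(np.zeros_like(u_t), u_t)
print(x_t.shape, loss)
```

At sampling time, one would integrate the learned velocity field from noise at \(t = 0\) to a clip at \(t = 1\), for example with a few Euler steps.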