Houjun Liu

Video Generation with Learned Priors

visual prediction task

Given \(n\) frames of video \(x_{1}, …, x_{n}\), predict \(T\) subsequent frames \(x_{n+1}… x_{n+T}\).

Action Conditioned Prediction Network

Visual prediction task. Two inputs:

taken action
image

Predict the next \(t\) frame. Challenge! What is the action? Predicting the action is not super easy.

RAFI

Conditional Flow Matching with Video Generation.

https://arxiv.org/pdf/2406.14436