Compositional Video Prediction

Yufei Ye    Maneesh Singh    Abhinav Gupta*    Shubham Tulsiani*   

Carnegie Mellon University    Facebook AI Research    Verisk Analytics   

in ICCV 2019

Paper | Code | Poster | BibTeX


We present an approach for pixel-level future prediction given an input image of a scene. We observe that a scene is composed of distinct entities that undergo motion, and we present an approach that operationalizes this insight by implicitly predicting the future states of independent entities while reasoning about their interactions, and composing future video frames from the predicted states. We overcome the inherent multi-modality of the task using a global, trajectory-level latent random variable, and show that this allows us to sample more diverse and plausible futures than commonly used per-timestep latent variable models. We empirically validate our approach against alternate representation choices and ways of incorporating multi-modality. We examine two datasets, one comprising stacked objects that may fall and another containing videos of humans performing activities in a gym, and show that our approach enables realistic stochastic video prediction across these diverse settings.
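To make the trajectory-level latent concrete, the sketch below contrasts it with the per-timestep alternative mentioned above. This is a minimal illustration under assumed names and shapes (`predict_step`, `z_dim`, `T` are all hypothetical stand-ins, not the released implementation): a single global latent is sampled once and conditions every step of the rollout, so one sample commits to one coherent future.

```python
import torch

# Hypothetical sizes; illustrative only.
z_dim, T = 8, 10

def predict_step(state, z):
    # Stand-in for a learned one-step predictor conditioned on a latent.
    return state + 0.1 * z

# Trajectory-level latent: sampled once, held fixed for the whole rollout.
state = torch.zeros(z_dim)
z_global = torch.randn(z_dim)
for t in range(T):
    state = predict_step(state, z_global)

# Per-timestep latents (the common alternative): a fresh sample each step,
# so a single rollout mixes independent random choices across time.
state = torch.zeros(z_dim)
for t in range(T):
    state = predict_step(state, torch.randn(z_dim))
```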


Method Overview

Our model takes as input an image with known or detected locations of entities. Each entity is represented by its location and an implicit feature. Given the current entity representations and a sampled latent variable, our prediction module predicts the representations at the next time step. A learned decoder composes the predicted representations into an image depicting the predicted future. During training, a latent encoder module infers the distribution over the latent variables from the initial and final frames.
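The sketch below illustrates one such prediction step under assumptions of our own (module names, feature sizes, and the pairwise interaction design are illustrative; this is not the authors' released code): each entity is a location plus an implicit feature, entities exchange pairwise messages to account for interactions, and a single trajectory-level latent conditions every per-entity update.

```python
import torch
import torch.nn as nn

class EntityPredictor(nn.Module):
    """One-step predictor over per-entity (feature, location) states."""
    def __init__(self, feat_dim=64, loc_dim=4, z_dim=8):
        super().__init__()
        d = feat_dim + loc_dim
        # Pairwise interaction: message from entity j to entity i.
        self.edge = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        # Per-entity update, conditioned on the global latent z.
        self.node = nn.Sequential(nn.Linear(2 * d + z_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, feats, locs, z):
        # feats: (N, feat_dim), locs: (N, loc_dim), z: (z_dim,)
        x = torch.cat([feats, locs], dim=-1)             # (N, d)
        n = x.size(0)
        src = x.unsqueeze(1).expand(n, n, -1)            # all ordered pairs
        dst = x.unsqueeze(0).expand(n, n, -1)
        msgs = self.edge(torch.cat([src, dst], dim=-1))  # (N, N, d)
        mask = 1.0 - torch.eye(n).unsqueeze(-1)          # drop self-messages
        agg = (msgs * mask).sum(dim=1)                   # (N, d)
        z_rep = z.unsqueeze(0).expand(n, -1)
        out = self.node(torch.cat([x, agg, z_rep], dim=-1))
        f = feats.size(-1)
        return out[:, :f], out[:, f:]                    # next feats, next locs

# Rollout: the trajectory-level latent is sampled once and held fixed.
pred = EntityPredictor()
feats, locs = torch.randn(3, 64), torch.randn(3, 4)      # 3 entities
z = torch.randn(8)
for t in range(10):
    feats, locs = pred(feats, locs, z)
    # A learned decoder (omitted here) would compose (feats, locs)
    # into the pixels of the predicted frame at step t.
```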


Paper

arXiv, 2019.

Citation

Yufei Ye, Maneesh Singh, Abhinav Gupta, and Shubham Tulsiani.
"Compositional Video Prediction", in ICCV, 2019. Bibtex

Code



ShapeStacks Results

Results by Entity Predictors




Generalization to more blocks (trained with 3 blocks)






Visualization of five randomly sampled future predictions




Penn Action Results




Visualization of three randomly sampled future predictions




Acknowledgements

We would like to thank the members of the CMU Visual and Robot Learning group for fruitful discussion and helpful comments. This webpage template was borrowed from some GAN folks.