Affordance Diffusion: Synthesizing Hand-Object Interactions

CVPR 2023


Yufei Ye1*, Xueting Li2, Abhinav Gupta1, Shalini De Mello2, Stan Birchfield2, Jiaming Song2, Shubham Tulsiani1, Sifei Liu2
1Carnegie Mellon University, 2NVIDIA
(* Work done during an internship at NVIDIA)

Paper
Slides
Poster
Code (Coming Soon)

Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, texture transfer or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and perform surprisingly well on out-of-distribution in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation.

Narrated Video

Appearance at NVIDIA GTC


Stochastic HOI Synthesis


We show stochastic generations from our model on novel instances from the HOI4D dataset (below). The model generalizes surprisingly well to portable-sized objects from in-the-wild scenes. For example, the teaser image shows zero-shot generalization to the EPIC-KITCHENS dataset. Please refer to our paper for comparisons and more results.

Application: User Editing


The layout representation lets users edit and control the structure of the generated hand. We show HOI image synthesis while interpolating the layout representation (no temporal smoothing is applied).
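To make the interpolation concrete, here is a minimal sketch of blending two layouts. The `Layout` fields (center, size, approaching angle) are an assumption made for illustration; this page does not spell out the exact parameterization, and the function names are hypothetical. Positions and size are blended linearly, while the angle is blended along the shorter arc.

```python
import math
from dataclasses import dataclass


@dataclass
class Layout:
    """Hypothetical layout parameters (an assumption, not taken from this page)."""
    x: float      # hand center, normalized to [0, 1]
    y: float
    size: float   # hand size relative to the image
    theta: float  # approaching direction in radians


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def lerp_angle(a: float, b: float, t: float) -> float:
    """Interpolate angles along the shorter arc.

    The result is not wrapped back into [-pi, pi); it is only
    equivalent to the target angle modulo 2*pi.
    """
    diff = (b - a + math.pi) % (2.0 * math.pi) - math.pi
    return a + diff * t


def interpolate_layouts(start: Layout, end: Layout, steps: int) -> list:
    """Return `steps` layouts from `start` to `end`, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    frames = []
    for i in range(steps):
        t = i / (steps - 1)
        frames.append(
            Layout(
                x=lerp(start.x, end.x, t),
                y=lerp(start.y, end.y, t),
                size=lerp(start.size, end.size, t),
                theta=lerp_angle(start.theta, end.theta, t),
            )
        )
    return frames
```

Each interpolated layout would then be passed to the image synthesis step independently, which is consistent with the "no temporal smoothing" note above.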


Application: Heatmap Guided Synthesis


Heatmaps and keypoints have been a prevalent representation for visual affordance, and many great works predict them (here is an incomplete list: 1, 2, 3, 4). While they predict plausible 2D locations of interaction, we ask whether more information can be extracted on top of that, such as articulation and approaching direction.
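One plausible way to connect a heatmap to layout sampling is to draw the 2D interaction location from the heatmap and let the generative model fill in the rest. The sketch below covers only that first step and is an assumption for illustration, not the procedure used in the paper; `sample_locations` is a hypothetical helper. The heatmap is treated as an unnormalized probability map.

```python
import random


def sample_locations(heatmap, num_samples, seed=None):
    """Sample (x, y) cell centers, normalized to [0, 1], from a 2D heatmap.

    `heatmap` is a list of equal-length rows of non-negative weights.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    weights = [value for row in heatmap for value in row]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("heatmap must be non-negative with a positive sum")

    rng = random.Random(seed)
    # Draw flat cell indices with probability proportional to the weights.
    flat_indices = rng.choices(range(rows * cols), weights=weights, k=num_samples)

    locations = []
    for index in flat_indices:
        row, col = divmod(index, cols)
        locations.append(((col + 0.5) / cols, (row + 0.5) / rows))
    return locations
```

A sampled location could then fix the position of the hand layout, leaving properties such as articulation and approaching direction to be predicted.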

Application: Integrating into Scenes


Beyond object-centric prediction, we show that Affordance Diffusion can be applied to scenes. Given a cluttered scene, we detect each object and synthesize its interactions individually. Each object's layout scale is guided so that the hands appear at the same size when transferred back to the scene.
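The scale bookkeeping can be sketched as follows, assuming each detected object is cropped with a square box and resized to a fixed crop resolution. The names `SquareBox`, `crop_hand_size`, and `crop_to_scene` are hypothetical, and the square-crop setup is an assumption rather than a detail stated on this page. The idea is to pick the hand size inside each crop so that, after mapping back, all hands share one size in scene pixels.

```python
from dataclasses import dataclass


@dataclass
class SquareBox:
    """Square detection box in scene pixels (hypothetical helper type)."""
    x0: float    # left edge
    y0: float    # top edge
    side: float  # side length


def crop_hand_size(scene_hand_px: float, box: SquareBox, crop_res: int) -> float:
    """Hand size in crop pixels that yields `scene_hand_px` in the scene."""
    return scene_hand_px * crop_res / box.side


def crop_to_scene(x: float, y: float, size: float, box: SquareBox, crop_res: int):
    """Map a hand center and size from crop pixels back to scene pixels."""
    scale = box.side / crop_res
    return box.x0 + x * scale, box.y0 + y * scale, size * scale
```

For example, with a 256-pixel crop and a target hand size of 50 scene pixels, an object in a 100-pixel box needs a larger hand in its crop than an object in a 200-pixel box, since the smaller box is magnified more.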

BibTeX


@inproceedings{ye2023affordance,
  title={Affordance Diffusion: Synthesizing Hand-Object Interactions},
  author={Yufei Ye and Xueting Li and Abhinav Gupta and Shalini De Mello and Stan Birchfield and Jiaming Song and Shubham Tulsiani and Sifei Liu},
  year={2023},
  booktitle={CVPR},
}

Acknowledgement: This work is supported by an NVIDIA Graduate Fellowship to Yufei. The authors would like to thank Xiaolong Wang, Sudeep Dasari, Zekun Hao and Songwei Ge for helpful discussions.

Send feedback and questions to Yufei Ye. The website template is borrowed from SIREN.