Yufei Ye1*,
Xueting Li2,
Abhinav Gupta1,
Shalini De Mello2,
Stan Birchfield2,
Jiaming Song2,
Shubham Tulsiani1,
Sifei Liu2
1Carnegie Mellon University,
2NVIDIA
(* work done during an internship at NVIDIA)
Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation: synthesizing an entire image, transferring texture, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method generalizes better to novel objects and performs surprisingly well on out-of-distribution in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation.
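The two-step design above can be sketched as follows. This is a minimal illustration, not the actual implementation: the layout parameterization (a hypothetical center/size/orientation vector) and the stub networks are assumptions standing in for the paper's diffusion-based LayoutNet and ContentNet.

```python
import numpy as np

rng = np.random.default_rng(0)

class LayoutNet:
    """Stub: samples an articulation-agnostic hand-object layout.

    Hypothetical parameterization for illustration only:
    (center_x, center_y, hand_size, approach_angle).
    """
    def sample(self, object_image: np.ndarray) -> np.ndarray:
        h, w = object_image.shape[:2]
        cx, cy = rng.uniform(0, w), rng.uniform(0, h)
        size = rng.uniform(0.1, 0.5) * min(h, w)
        angle = rng.uniform(-np.pi, np.pi)
        return np.array([cx, cy, size, angle])

class ContentNet:
    """Stub: synthesizes an HOI image given the object image and a layout."""
    def synthesize(self, object_image: np.ndarray, layout: np.ndarray) -> np.ndarray:
        # A real implementation would run layout-conditioned diffusion in the
        # latent space of a pretrained model; here we return a placeholder
        # image of the same size as the input.
        return np.zeros_like(object_image)

def hallucinate_hoi(object_image: np.ndarray):
    """Run the two stages: sample a layout, then synthesize content on it."""
    layout = LayoutNet().sample(object_image)
    hoi_image = ContentNet().synthesize(object_image, layout)
    return hoi_image, layout
```

Factoring generation into layout-then-content lets the coarse "where and how large" decision be sampled separately from the articulated appearance that is rendered on top of it.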
We show stochastic generations of our model on novel instances from the HOI4D dataset (below). The model generalizes surprisingly well to portable-sized objects in in-the-wild scenes. For example, the teaser image shows zero-shot generalization to the EPIC-KITCHENS dataset. Please refer to our paper for comparisons and more results.
The layout representation allows users to edit and control the generated hand's structure. We show HOI image synthesis when interpolating the layout representation (no temporal smoothing).
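The interpolation behind this editing demo can be sketched in a few lines. This assumes the layout is a flat parameter vector that can be blended linearly; the paper's actual representation may handle some components (e.g., orientation) differently.

```python
import numpy as np

def interpolate_layouts(layout_a: np.ndarray, layout_b: np.ndarray, n_steps: int):
    """Linearly interpolate between two layout vectors.

    Each intermediate layout would be fed to the content-synthesis stage
    independently (hence no temporal smoothing in the resulting frames).
    """
    ts = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - t) * layout_a + t * layout_b for t in ts]
```

Because each interpolated layout is decoded independently, the sequence visualizes how the synthesized hand responds to continuous changes in the layout alone.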
Heatmaps and keypoints have been prevalent representations for visual affordance, and many great works predict them (here is an incomplete list: 1, 2, 3, 4). While they predict plausible 2D locations of interaction, we ask whether more information can be extracted on top of them, such as articulation and approaching direction.
Beyond object-centric prediction, we show that Affordance Diffusion can be applied to scenes. Given a cluttered scene, we detect each object and synthesize its interactions individually. Each object's layout scale is guided so that, when transferred back to the scene, the synthesized hands appear at a consistent size.
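The scale guidance described above can be illustrated with a small calculation. The function name and arguments are hypothetical; the idea is that each detected object is cropped and resized before synthesis, so the layout's hand size in crop coordinates must be scaled by the crop's zoom factor to land at the desired size in scene coordinates.

```python
def scene_layout_scale(box, crop_size, target_scene_size):
    """Hypothetical scale guidance for scene-level synthesis.

    box               -- (x0, y0, x1, y1) of the detected object in the scene
    crop_size         -- side length (pixels) the object crop is resized to
    target_scene_size -- desired hand size (pixels) back in the scene

    Returns the hand size (in crop pixels) to guide the layout toward.
    """
    x0, y0, x1, y1 = box
    # Zoom factor applied when resizing the object crop to crop_size.
    zoom = crop_size / max(x1 - x0, y1 - y0)
    # A hand of this size in the crop maps back to target_scene_size
    # pixels once the crop is pasted into the original scene.
    return target_scene_size * zoom
```

For example, a 100-pixel object box resized to a 200-pixel crop has a zoom of 2, so a hand meant to span 50 scene pixels should be guided toward 100 pixels in the crop.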