Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

Abstract

We tackle the task of reconstructing hand-object interactions from short video clips. Given an input video, our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape, as well as the time-varying motion and hand articulation. While the input video naturally provides some multi-view cues to guide 3D inference, these are insufficient on their own due to occlusions and limited viewpoint variations. To obtain accurate 3D, we augment the multi-view signals with generic data-driven priors to guide reconstruction. Specifically, we learn a diffusion network to model the conditional distribution of (geometric) renderings of objects conditioned on hand configuration and category label, and leverage it as a prior to guide the novel-view renderings of the reconstructed scene. We empirically evaluate our approach on egocentric videos across 6 object categories, and observe significant improvements over prior single-view and multi-view methods. Finally, we demonstrate our system's ability to reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person interactions.

Results

We compare our method (DiffHOI) with two other template-free baselines: iHOI (Ye et al., CVPR 22) and HHOR (Huang et al., SIGGRAPH Asia 22). In contrast to DiffHOI, iHOI makes per-frame predictions; HHOR reconstructs 3D from video clips but does not use any data-driven priors.

More comparisons on: HOI4D, in-the-wild clips.

More results of ours: in-the-wild clips.

More Discussion and Analysis:

Can a template-free method perform better than a template-based method?

We compare DiffHOI with a template-based method (HOMAN, Hasson et al.). Although DiffHOI leaves room to improve geometric detail, the template-based method struggles to place the object in the context of the hand, especially for challenging structures such as handles. Furthermore, HOMAN is sensitive to template selection.

Here are more comparisons.

What does the data-driven prior look like?

Given the geometry rendering of a hand (only surface normals are shown) and a text prompt, we visualize 4 different generations from the diffusion model (middle). Note that the left and middle columns share the same text condition, while the middle and right columns share the same hand condition. While the generated objects vary, they all appear plausible in the context of the text prompt and the hand pose (overlaid images are shown in the bottom row).

Here are more generations.

How does each learned prior help?

We analyze how the category and hand priors affect reconstruction by training separate diffusion models that condition on only one of them (Category Prior / Hand Prior). We further compare with distillation without the learned prior (No Prior). We find that the category prior helps object reconstruction, while the hand prior helps the hand-object relation. Furthermore, both are important for reconstructing challenging objects, e.g., bowls.

Here are more comparisons.

Which modality matters more for distillation?

We analyze how much each geometry modality (mask, normal, depth) contributes when distilling them into 3D shapes by setting its weight in the SDS loss to 0. We find that normals are the most crucial modality, followed by masks. Depth helps with hand-object alignment.
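
The modality ablation can be illustrated with a minimal sketch. It assumes a simplified score-distillation (SDS) gradient of the form weight * (predicted noise - injected noise), computed separately for each modality, and leaves out the timestep weighting and the backpropagation through the renderer. The names `DEFAULT_WEIGHTS`, `ablate`, and `weighted_sds_grad`, as well as the weight values, are hypothetical and are not taken from the DiffHOI code.

```python
from typing import Dict, List

# Hypothetical per-modality weights (assumption: the actual values are not
# given here). Setting a weight to 0 removes that modality from distillation.
DEFAULT_WEIGHTS: Dict[str, float] = {"mask": 1.0, "normal": 1.0, "depth": 1.0}


def ablate(weights: Dict[str, float], modality: str) -> Dict[str, float]:
    """Return a copy of `weights` with one modality switched off."""
    if modality not in weights:
        raise KeyError(f"unknown modality: {modality}")
    out = dict(weights)
    out[modality] = 0.0
    return out


def weighted_sds_grad(
    eps_pred: Dict[str, List[float]],
    eps: Dict[str, List[float]],
    weights: Dict[str, float],
) -> Dict[str, List[float]]:
    """Simplified SDS-style gradient per modality.

    For each modality, the gradient on the rendered values is
    weight * (predicted noise - injected noise). Timestep weighting and the
    renderer Jacobian are omitted in this sketch.
    """
    grads: Dict[str, List[float]] = {}
    for name, w in weights.items():
        grads[name] = [w * (p - e) for p, e in zip(eps_pred[name], eps[name])]
    return grads


# Toy usage: flattened "pixels" for each modality, with depth ablated.
eps_pred = {"mask": [1.0, 2.0], "normal": [0.5], "depth": [3.0]}
eps = {"mask": [0.5, 1.0], "normal": [0.25], "depth": [1.0]}
grads = weighted_sds_grad(eps_pred, eps, ablate(DEFAULT_WEIGHTS, "depth"))
```

With the depth weight set to 0, the depth channel contributes no gradient, while the mask and normal channels are unaffected; this mirrors the "set its weight to 0" ablation at toy scale.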

Here are more comparisons.