G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis

Yufei Ye¹, Abhinav Gupta¹, Kris Kitani¹,², Shubham Tulsiani¹
¹Carnegie Mellon University, ²Meta AI
CVPR 2024

Generative Hand-Object Prior

Method Overview

Hand-object interactions are represented as interaction grids within the diffusion model. An interaction grid concatenates the (latent) signed distance field of the object with the skeletal distance field of the hand. Given a noisy interaction grid and a text prompt, our diffusion model predicts a denoised grid. To extract the 3D shape of the HOI from the interaction grid, we use a decoder to decode the object latent code and run gradient descent on the hand field to recover the hand pose parameters.
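The grid construction and the hand-pose extraction step can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the released implementation: the helper names (`make_grid`, `skeletal_distance_field`, `interaction_grid`, `fit_joints`), the grid resolution, the choice of one squared-distance channel per joint, and the use of free 3D joint positions in place of articulated hand pose parameters.

```python
import numpy as np

def make_grid(res, half_extent=1.0):
    """Voxel-centre coordinates of a cube, shape (res, res, res, 3)."""
    axis = np.linspace(-half_extent, half_extent, res)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

def skeletal_distance_field(joints, coords):
    """One channel per joint: squared distance from each voxel to that joint.

    joints: (J, 3), coords: (R, R, R, 3) -> (J, R, R, R).
    (Assumed definition; the paper's exact field may differ.)
    """
    diff = coords[None] - joints[:, None, None, None, :]
    return np.sum(diff ** 2, axis=-1)

def interaction_grid(object_latent, joints, coords):
    """Concatenate object channels (C, R, R, R) with hand channels (J, R, R, R)."""
    hand = skeletal_distance_field(joints, coords)
    return np.concatenate([object_latent, hand], axis=0)

def fit_joints(hand_field, coords, steps=300, lr=0.1):
    """Recover joint positions from hand channels by gradient descent.

    Minimises the mean squared difference between the field rendered from
    the current joints and the given field. Joints start at the origin.
    """
    joints = np.zeros((hand_field.shape[0], 3))
    for _ in range(steps):
        diff = coords[None] - joints[:, None, None, None, :]   # (J, R, R, R, 3)
        residual = np.sum(diff ** 2, axis=-1) - hand_field      # (J, R, R, R)
        # d/d(joint) of ||p - joint||^2 is -2 (p - joint).
        grad = np.mean(2.0 * residual[..., None] * (-2.0 * diff), axis=(1, 2, 3))
        joints -= lr * grad
    return joints
```

In this toy setting, calling `fit_joints` on the hand channels of a grid built by `interaction_grid` recovers the joints that generated it. A real pipeline would presumably optimize the parameters of an articulated hand model instead of free joint positions.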

HOI Generations

Input	Output 0	Output 1	Output 2	Output 3	Output 4
power drill
spray
plate
wine glass

More generations.

Reconstructing Interaction Clips

Prior-Guided Reconstruction

We parameterize the HOI scene as an object implicit field, a hand pose, and their relative transformation (left). The scene parameters are optimized with respect to the score distillation sampling (SDS) loss on the extracted interaction grid and a reprojection loss (right).
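The SDS term can be illustrated with a toy NumPy sketch. It assumes the standard SDS formulation, in which the gradient with respect to the grid is a weighted difference between predicted and injected noise. The names `toy_denoiser`, `sds_grad`, and `optimize`, the weighting `1 - alpha_bar`, and the noise-level range are illustrative choices. The trained diffusion model is replaced by the closed-form noise predictor for a prior concentrated at a single grid `prior_mean`, so the sketch shows only the mechanics of the update.

```python
import numpy as np

def toy_denoiser(x_t, alpha_bar, prior_mean):
    """Stand-in for a trained diffusion model (assumption).

    Exact noise prediction when the data distribution is a point mass
    at prior_mean.
    """
    return (x_t - np.sqrt(alpha_bar) * prior_mean) / np.sqrt(1.0 - alpha_bar)

def sds_grad(x, prior_mean, rng):
    """One SDS gradient estimate with respect to the grid x."""
    alpha_bar = rng.uniform(0.05, 0.95)           # random noise level
    eps = rng.standard_normal(x.shape)            # injected noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_denoiser(x_t, alpha_bar, prior_mean)
    weight = 1.0 - alpha_bar                      # illustrative weighting
    return weight * (eps_hat - eps)

def optimize(x0, prior_mean, steps=300, lr=0.5, seed=0):
    """Gradient descent on the grid using SDS gradient estimates."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * sds_grad(x, prior_mean, rng)
    return x
```

With this toy prior, the update pulls the grid toward `prior_mean`. In a full reconstruction pipeline the gradient would instead be backpropagated through the grid extraction to the scene parameters and combined with the reprojection loss, neither of which is modeled here.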

Comparison with Baselines

We compare with three template-free baselines: DiffHOI, iHOI, and HHOR. In contrast to G-HOP (ours), DiffHOI guides reconstruction with a hand-conditioned image-based prior, and its reconstructed shapes are coarse; iHOI makes per-frame predictions that are not temporally consistent; HHOR does not use any data-driven prior and struggles to hallucinate unobserved regions.

Baseline comparisons and ablation study on HOI4D.
More in-the-wild clips.

Synthesizing Plausible Human Grasps

Prior-Guided Grasp Synthesis

We parameterize human grasps via hand articulation parameters and the relative hand-object transformation (left). These are optimized with respect to the SDS loss by converting the grasp (and the known object shape) to an interaction grid (right).
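The conversion from a grasp and a known shape to an interaction grid can be sketched as follows. This is a hedged illustration, not the authors' code: `rodrigues`, `grasp_to_grid`, and `sphere_sdf` are hypothetical helpers, the hand is reduced to joint positions in its own frame (a real system would obtain them from an articulated hand model), the transformation is assumed to map the hand frame to the object frame, and the object channel is a raw SDF, not a latent code.

```python
import numpy as np

def rodrigues(axis_angle):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def sphere_sdf(points, radius=0.3):
    """Signed distance to a sphere at the origin (stand-in for a known shape)."""
    return np.linalg.norm(points, axis=-1) - radius

def grasp_to_grid(hand_joints, axis_angle, translation, object_sdf,
                  res=16, half_extent=1.0):
    """Build an interaction grid in the object frame.

    hand_joints: (J, 3) joint positions in the hand frame.
    Returns the grid of shape (1 + J, res, res, res) and the joints
    expressed in the object frame.
    """
    R = rodrigues(np.asarray(axis_angle, dtype=float))
    joints_obj = hand_joints @ R.T + np.asarray(translation, dtype=float)
    axis = np.linspace(-half_extent, half_extent, res)
    coords = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    obj = object_sdf(coords)[None]                              # (1, R, R, R)
    diff = coords[None] - joints_obj[:, None, None, None, :]
    hand = np.sum(diff ** 2, axis=-1)                           # (J, R, R, R)
    return np.concatenate([obj, hand], axis=0), joints_obj
```

Because every step is differentiable, an autodiff framework could pass a gradient on the grid back to the rotation, the translation, and the hand parameters, which is what optimizing a grasp against a prior on the grid requires.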

Comparison with Baselines

We compare with a baseline, GraspTTA, and with ground-truth annotations (GT) on HO3D. In contrast to G-HOP (ours), GraspTTA synthesizes grasps that rely more on the fingertips than on the palm region; the GT grasps are plausible but contain reaching-out motion because they come from video sequences. Note that grasp generation is multi-modal, so a generation can differ from the GT.
We visualize the grasps synthesized by each method.

Input Object	GT	GraspTTA	G-HOP

Comparisons on ObMan and HO3D.
More diverse grasp generations by our method: link.

Bibtex

@inproceedings{ye2023ghop,
  author = {Ye, Yufei and Gupta, Abhinav and Kitani, Kris and Tulsiani, Shubham},
  title = {G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis},
  booktitle = {CVPR},
  year = {2024}
}