G-HOP: Generative Hand-Object Prior
for Interaction Reconstruction and Grasp Synthesis

Yufei Ye1, Abhinav Gupta1, Kris Kitani1,2, Shubham Tulsiani1
1Carnegie Mellon University, 2Meta AI
CVPR 2024


Generative Hand-Object Prior

Method Overview

Hand-object interactions (HOI) are represented as interaction grids within the diffusion model. An interaction grid concatenates the (latent) signed distance field of the object with the skeletal distance field of the hand. Given a noisy interaction grid and a text prompt, our diffusion model predicts a denoised grid. To extract the 3D shape of the HOI from the interaction grid, we use a decoder to decode the object latent code and run gradient descent on the hand field to recover the hand pose parameters.
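To make the grid layout concrete, here is a minimal NumPy sketch of how such a grid could be assembled. Everything in it is an illustrative assumption rather than the released implementation: the function names are hypothetical, the object channel is a raw sphere SDF instead of a learned latent field, the hand field is taken to be one channel of Euclidean distance per joint, and only three joints are used.

```python
import numpy as np

def make_grid(resolution, half_extent=1.0):
    """Return voxel-center coordinates with shape (R, R, R, 3)."""
    axis = np.linspace(-half_extent, half_extent, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([x, y, z], axis=-1)

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere; a stand-in for the object field."""
    return np.linalg.norm(points - center, axis=-1) - radius

def skeletal_distance_field(points, joints):
    """One channel per joint: distance from every voxel to that joint.

    Assumed definition; the exact field used by the authors may differ.
    """
    offsets = points[None, ...] - joints[:, None, None, None, :]
    return np.linalg.norm(offsets, axis=-1)  # (J, R, R, R)

def build_interaction_grid(object_field, hand_field):
    """Concatenate object and hand channels along the channel axis."""
    return np.concatenate([object_field[None, ...], hand_field], axis=0)

points = make_grid(resolution=9)
joints = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.5, 0.5, 0.0]])
object_field = sphere_sdf(points, center=np.zeros(3), radius=0.25)
hand_field = skeletal_distance_field(points, joints)
grid = build_interaction_grid(object_field, hand_field)
print(grid.shape)  # (4, 9, 9, 9): 1 object channel + 3 joint channels
```

Because every hand channel is a smooth function of a joint position, a field like this can be fitted back to pose parameters by gradient descent, which is the role the hand field plays in the extraction step described above.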

HOI Generations

Input | Output 0 | Output 1 | Output 2 | Output 3 | Output 4
power drill
wine glass
More generations.

Reconstructing Interaction Clips

Prior-Guided Reconstruction

We parameterize the HOI scene as an object implicit field, a hand pose, and their relative transformation (left). The scene parameters are optimized with respect to the score distillation sampling (SDS) loss on the extracted interaction grid and a reprojection loss (right).
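The SDS term can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' code: the noise schedule, the weighting `1 - alpha_bar`, and all function names are hypothetical, and the learned diffusion model is replaced by `toy_denoiser`, the exact noise predictor for a prior that places all its mass on a single grid `prior_mean`. The reprojection loss is omitted; in the full objective its gradient would be added to the one computed here.

```python
import numpy as np

def alpha_bar(t, num_steps=1000):
    """Cumulative signal level under an assumed linear-beta schedule."""
    betas = np.linspace(1e-4, 2e-2, num_steps)
    return np.cumprod(1.0 - betas)[t]

def toy_denoiser(noisy_grid, t, prior_mean):
    """Stand-in for the diffusion model: exact noise prediction when the
    prior is a point mass at prior_mean."""
    ab = alpha_bar(t)
    return (noisy_grid - np.sqrt(ab) * prior_mean) / np.sqrt(1.0 - ab)

def sds_gradient(grid, prior_mean, rng):
    """Noise the grid, predict the noise, return w(t) * (eps_hat - eps)."""
    t = rng.integers(20, 980)
    ab = alpha_bar(t)
    noise = rng.standard_normal(grid.shape)
    noisy = np.sqrt(ab) * grid + np.sqrt(1.0 - ab) * noise
    predicted = toy_denoiser(noisy, t, prior_mean)
    return (1.0 - ab) * (predicted - noise)

rng = np.random.default_rng(0)
prior_mean = rng.standard_normal((4, 8, 8, 8))  # grid favored by the toy prior
grid = np.zeros_like(prior_mean)                # current scene estimate
initial_error = np.linalg.norm(grid - prior_mean)

for _ in range(200):
    # In the real pipeline this gradient would be backpropagated to the
    # scene parameters; here the grid itself is updated directly.
    grid = grid - 1.0 * sds_gradient(grid, prior_mean, rng)

final_error = np.linalg.norm(grid - prior_mean)
```

With this toy prior the sampled noise cancels and every update pulls the grid toward `prior_mean`, which makes the direction of the SDS gradient easy to verify. With a learned prior the gradient is stochastic and only points toward likely interactions on average.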

Comparison with Baselines

We compare with three template-free baselines: DiffHOI, iHOI, and HHOR. In contrast to G-HOP (ours), DiffHOI guides reconstruction with a hand-conditioned image-based prior, and its reconstructed shapes are coarse; iHOI makes per-frame predictions that are not temporally consistent; HHOR does not use any data-driven prior and struggles to hallucinate unobserved regions.

Synthesizing Plausible Human Grasps

Prior-Guided Grasp Synthesis

We parameterize human grasps via hand articulation parameters and the relative hand-object transformation (left). These are optimized with respect to the SDS loss by converting the grasp (and the known object shape) to an interaction grid (right).
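The first step of that conversion, placing the hand in the object frame, can be sketched as follows. This is an assumption-laden illustration: the axis-angle rotation parameterization and the function names are hypothetical, and the joints are passed in directly instead of being produced from articulation parameters by a hand model.

```python
import numpy as np

def axis_angle_to_matrix(axis_angle):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    kx, ky, kz = axis_angle / angle
    K = np.array([[0.0, -kz, ky],
                  [kz, 0.0, -kx],
                  [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def hand_to_object_frame(joints_hand, axis_angle, translation):
    """Apply the relative hand-object transform to (J, 3) joint positions."""
    rotation = axis_angle_to_matrix(axis_angle)
    return joints_hand @ rotation.T + translation

# Hypothetical inputs: two joints in the hand frame, a 90-degree rotation
# about the z-axis, and a small offset along x.
joints_hand = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
axis_angle = np.array([0.0, 0.0, np.pi / 2.0])
translation = np.array([0.1, 0.0, 0.0])
joints_object = hand_to_object_frame(joints_hand, axis_angle, translation)
```

The transformed joints are what a hand distance field would be computed from, on the same voxel grid as the known object shape. Since every operation here is differentiable, a gradient defined on the grid can flow back to the rotation, the translation, and the articulation parameters.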

Comparison with Baselines

We compare with the GraspTTA baseline and with ground-truth annotations (GT) on HO3D. In contrast to G-HOP (ours), GraspTTA synthesizes grasps that rely more on the fingertips than on the palm region; GT grasps are plausible and contain reaching-out motion because they come from video sequences. Note that grasp generation is multi-modal, so generations can differ from GT.
We visualize the grasps synthesized by each method.
Input Object | GT | GraspTTA | G-HOP
Comparisons on ObMan and HO3D.
More diverse grasp generations by our method: link.


@inproceedings{ye2023ghop,
  author    = {Ye, Yufei and Gupta, Abhinav and Kitani, Kris and Tulsiani, Shubham},
  title     = {G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis},
  booktitle = {CVPR},
  year      = {2024}
}