Yufei Ye¹, Abhinav Gupta¹, Kris Kitani¹,², Shubham Tulsiani¹
¹Carnegie Mellon University, ²Meta AI
CVPR 2024
Hand-object interactions (HOI) are represented as interaction grids within the diffusion model. An interaction grid concatenates the (latent) signed distance field of the object with the skeletal distance field of the hand. Given a noisy interaction grid and a text prompt, our diffusion model predicts a denoised grid. To extract the 3D shape of the HOI from the interaction grid, we use a decoder to decode the object latent code and run gradient descent on the hand field to recover the hand pose parameters.
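The grid construction and the hand-recovery step can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the released implementation: the grid resolution, the number of latent channels, and the function names (`make_grid`, `skeletal_distance_field`, `build_interaction_grid`, `fit_joints`) are all hypothetical, the skeletal distance field is assumed to store one unsigned distance channel per joint, and free 3D joint positions stand in for the hand pose parameters that the actual method optimizes.

```python
import numpy as np

def make_grid(res=8, extent=1.0):
    """Return voxel-center coordinates of shape (res, res, res, 3)."""
    axis = np.linspace(-extent, extent, res)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)

def skeletal_distance_field(joints, grid):
    """Assumed form of the hand field: one channel per joint holding the
    distance from every voxel to that joint. Output: (J, res, res, res)."""
    diff = grid[None] - joints[:, None, None, None, :]
    return np.linalg.norm(diff, axis=-1)

def build_interaction_grid(object_latent, hand_field):
    """Concatenate object channels and hand channels along the channel axis."""
    return np.concatenate([object_latent, hand_field], axis=0)

def fit_joints(target_field, grid, steps=300, lr=0.5):
    """Recover joint positions by gradient descent on the squared error
    between the field induced by the current joints and the target field."""
    joints = np.zeros((target_field.shape[0], 3))
    for _ in range(steps):
        diff = joints[:, None, None, None, :] - grid[None]      # p - x
        dist = np.maximum(np.linalg.norm(diff, axis=-1), 1e-8)  # avoid /0
        residual = dist - target_field
        # d/dp mean_x (|p - x| - target)^2 = mean_x 2 * residual * (p - x) / |p - x|
        grad = (2.0 * residual[..., None] * diff / dist[..., None]).mean(axis=(1, 2, 3))
        joints -= lr * grad
    return joints
```

In this sketch, a call such as `fit_joints(interaction_grid[num_object_channels:], grid)` would play the role of the hand-extraction step, while the object channels would be passed to a separate decoder that is not modeled here.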
[Results gallery: for each input (power drill, spray, plate, wine glass), five generated samples are shown, Output 0 through Output 4.]
We parameterize the HOI scene as an object implicit field, a hand pose, and their relative transformation (left). The scene parameters are optimized with respect to the SDS loss on the extracted interaction grid and a reprojection loss (right).
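To make the interplay of the two losses concrete, here is a toy NumPy sketch of score distillation sampling (SDS) combined with a reprojection term. Everything in it is an assumption made for illustration: the scene parameters are just a few 3D points, the "diffusion prior" is a closed-form noise predictor for a distribution concentrated at a single point `prior_mean`, the camera is orthographic (it drops the z coordinate), and the weighting `1 - alpha_bar` is one common choice rather than the one confirmed for this work.

```python
import numpy as np

def toy_noise_predictor(x_t, alpha_bar, prior_mean):
    """Stand-in for the diffusion model: the exact noise estimate when
    the clean data is always `prior_mean` (a hypothetical toy prior)."""
    return (x_t - np.sqrt(alpha_bar) * prior_mean) / np.sqrt(1.0 - alpha_bar)

def sds_gradient(x, prior_mean, rng):
    """SDS gradient w.r.t. x: w(t) * (predicted noise - injected noise)."""
    alpha_bar = rng.uniform(0.02, 0.98)          # random noise level
    eps = rng.standard_normal(x.shape)           # injected noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_noise_predictor(x_t, alpha_bar, prior_mean)
    weight = 1.0 - alpha_bar                     # assumed weighting w(t)
    return weight * (eps_hat - eps)

def optimize_scene(observed_xy, prior_mean, steps=2000, lr=0.05,
                   reproj_weight=5.0, seed=0):
    """Gradient descent on SDS + reprojection for points theta of shape (N, 3)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(prior_mean)
    for _ in range(steps):
        grad = sds_gradient(theta, prior_mean, rng)
        # Reprojection loss: sum of squared 2D errors under orthographic projection.
        grad[:, :2] += reproj_weight * 2.0 * (theta[:, :2] - observed_xy)
        theta -= lr * grad
    return theta
```

In this toy, the reprojection term pins down the image-plane coordinates, while the depth, which the 2D observation cannot constrain, is filled in by the prior. That division of labor is the intuition the sketch is meant to convey.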
We parameterize human grasps via hand articulation parameters and the relative hand-object transformation (left). These parameters are optimized with respect to the SDS loss by converting the grasp (together with the known object shape) into an interaction grid (right).
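As a rough illustration of the relative hand-object transformation, the sketch below maps hand joints from the hand frame into the object frame with a rigid transform. It covers only this one step of the conversion and rests on assumptions: the rotation is taken to be an axis-angle vector, `joints_hand` stands in for the output of a hand model's forward kinematics, and both function names are hypothetical.

```python
import numpy as np

def axis_angle_to_matrix(axis_angle):
    """Rodrigues' formula: axis-angle vector (3,) to rotation matrix (3, 3)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-12:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # cross-product matrix of the unit axis
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def grasp_to_object_frame(joints_hand, axis_angle, translation):
    """Apply the relative hand-object transform to joints of shape (J, 3)."""
    rotation = axis_angle_to_matrix(axis_angle)
    return joints_hand @ rotation.T + translation
```

The transformed joints, expressed in the object frame, are what a hand distance field would then be computed from. Because every operation here is differentiable, gradients of a loss on the grid can flow back to the rotation, the translation, and the articulation.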
Input Object | GT | GraspTTA | G-HOP |
---|---|---|---|