We compare our method (DiffHOI) with two other template-free baselines: iHOI (Ye et al., CVPR 22) and HHOR (Huang et al., SIGGRAPH Asia 22). In contrast to DiffHOI, iHOI makes per-frame predictions, while HHOR reconstructs 3D from video clips but does not use any data-driven priors.
We compare DiffHOI with a template-based method (HOMAN, Hasson et al.). Although there is room to improve geometry details for DiffHOI, the template-based method struggles to place the object in the context of the hand, especially for challenging structures like handles. Furthermore, HOMAN is sensitive to the template selection.
Given the geometry rendering of the hand (showing only surface normals) and a text prompt, we visualize 4 different generations from the diffusion model (middle). Note that the left and middle columns share the same text condition, while the middle and right columns share the same hand condition. While the generated objects vary, they all appear plausible in the context of the text prompt and the hand pose (overlaid images shown in the bottom row).
We analyze how the category and hand priors affect reconstruction by training separate diffusion models that condition on only one of them (Category/Hand Prior). We further compare with distillation without any learned prior (No Prior). We find that the category prior helps object reconstruction, while the hand prior improves the hand-object relation. Furthermore, both are important for reconstructing challenging objects, e.g., bowls.
We analyze how much each geometry modality (mask, normal, depth) contributes during distillation to 3D shapes by setting its weight in the SDS loss to 0. We find that normal is the most crucial modality, followed by mask. Depth helps hand-object alignment.
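The ablation above can be sketched as a weighted sum over per-modality SDS terms, where zeroing one weight removes that modality's gradient signal. A minimal illustration follows; the function and weight names are assumptions for exposition, not the paper's implementation, and the per-modality loss values are placeholders standing in for the actual SDS gradients.

```python
# Hedged sketch: combining per-modality SDS losses with ablation weights.
# All names (total_sds_loss, the weight dict) are hypothetical.

def total_sds_loss(losses, weights):
    """Weighted sum of per-modality SDS losses.

    losses  : dict mapping modality name -> scalar loss value
    weights : dict mapping modality name -> weight (0 ablates the modality)
    """
    return sum(weights[m] * losses[m] for m in losses)

# Placeholder per-modality losses (real values would come from the
# diffusion model's score-distillation gradients on each rendering).
losses = {"mask": 0.8, "normal": 1.2, "depth": 0.5}

full_weights = {"mask": 1.0, "normal": 1.0, "depth": 1.0}
no_depth     = {"mask": 1.0, "normal": 1.0, "depth": 0.0}  # depth ablated

full_loss  = total_sds_loss(losses, full_weights)
abl_loss   = total_sds_loss(losses, no_depth)
```

Setting a modality's weight to 0 rather than removing its rendering keeps the rest of the pipeline unchanged, so differences in the reconstruction can be attributed to the missing supervision alone.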