TLDR: Given SLAMed egocentric videos, WHOLE jointly reconstructs coherent hand and object motion in world space by guiding a generative motion prior, unlike existing methods that predict hand or object poses separately.
Reconstruction Using the Generative Motion Prior. Given metric-SLAMed egocentric videos and the object template O, we alternate between a diffusion generation step and a guidance step to predict hand motion H, the object 6D trajectory T, and binary contact C as the final output x0. The diffusion model Dψ is conditioned on the object geometry and an approximate hand H from an off-the-shelf hand estimator to denoise the noisy parameters xn. The guidance step refines the denoised output by optimizing task-specific objectives g so that it aligns with video observations ŷ, such as 2D masks and contact. The contact labels Ĉ are produced automatically by prompting a VLM.
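The alternating generation/guidance loop can be sketched as follows. This is a minimal toy illustration, not the paper's exact formulation: the `denoise` and `guide` interfaces, the step size, and the linear re-noising schedule are all assumptions standing in for Dψ and the objectives g.

```python
import numpy as np

def guided_sampling(denoise, guide, x_N, n_steps, step_size=0.1):
    """Alternate a diffusion generation step with a guidance step.

    denoise(x, n) -> x0_hat : model's estimate of the clean sample x0
    guide(x0_hat) -> grad   : gradient of the task objectives g
    (toy linear schedule; the actual model and guidance are richer)
    """
    x = x_N
    x0_hat = x_N
    for n in range(n_steps, 0, -1):
        x0_hat = denoise(x, n)                        # generation: predict x0
        x0_hat = x0_hat - step_size * guide(x0_hat)   # guidance: refine x0
        alpha = (n - 1) / n_steps
        x = alpha * x + (1.0 - alpha) * x0_hat        # simplified re-noising
    return x0_hat

# Toy check: an identity "denoiser" plus a quadratic objective pulls the
# sample toward a target y (standing in for the video observations).
y = np.zeros(3)
out = guided_sampling(lambda x, n: x, lambda x0: 2.0 * (x0 - y),
                      x_N=np.full(3, 5.0), n_steps=50)
```

In the toy run, guidance steadily moves the denoised output toward the observation target, mirroring how the reprojection and contact objectives steer the generated motion toward the video evidence.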
Please refer to Section 4.2 for further analysis.
Blended Generation without Guidance: The pretrained diffusion model takes the object template and approximated hand poses as input and predicts binary contact labels (for both hands, visualized in red), object motion, and hand motion. While contact details can still be refined through post-processing, the model already provides a reasonable prior over hand-object dynamics, capturing how an object moves while held and how it behaves after being released. The two rows illustrate diverse samples, differing most notably in the timing of contact events and the resulting object trajectories.
Hand-Guided HOI Planner: Beyond reconstruction, our framework can directly synthesize diverse hand-object interaction motions from a coarse hand trajectory and picking/placing times (contact labels). The coarse hand trajectory conditions the diffusion process, while the contact labels are injected at each guidance step.
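A sketch of this injection mechanism (the function names and tensor layout are illustrative assumptions): the coarse trajectory enters as a condition to the denoiser, while the user-given contact labels simply overwrite the contact channel of the denoised output at every step.

```python
import numpy as np

def plan_hoi(denoise, coarse_traj, contacts, x_N, n_steps):
    """Hypothetical planner loop: condition the diffusion model on a
    coarse hand trajectory and inject given contact labels
    (picking/placing times) at each guidance step."""
    x = x_N.copy()
    for n in range(n_steps, 0, -1):
        x0_hat = denoise(x, n, cond=coarse_traj)  # conditioned generation
        x0_hat[:, -1] = contacts                  # inject contact labels
        alpha = (n - 1) / n_steps
        x = alpha * x + (1.0 - alpha) * x0_hat    # simplified re-noising
    return x0_hat

# Toy check: a "denoiser" that blends toward the condition still yields a
# motion whose contact channel matches the injected labels exactly.
T, D = 8, 4
cond = np.zeros((T, D))
labels = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
motion = plan_hoi(lambda x, n, cond: 0.5 * x + 0.5 * cond,
                  cond, labels, x_N=np.random.randn(T, D), n_steps=20)
```

Overwriting one channel at every step is a standard inpainting-style constraint: the remaining channels stay free for the prior to fill in, which is what yields diverse motions under the same contact timing.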
We identify two failure modes. First, the reconstructed object orientation is sometimes flipped, likely because mask-based reprojection guidance alone is not strong enough to correct the generation. Second, object poses can appear tilted, especially after the object is released by the hand: under a moving egocentric camera, even small errors in the estimated object pose can cause the resulting trajectory to misalign with the gravity direction in world space.
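One way to quantify the second failure mode (an assumed diagnostic for illustration, not part of the method) is the angle between the object's canonical up-axis, rotated by the estimated world-space orientation, and the world up direction opposite to gravity:

```python
import numpy as np

def gravity_tilt_deg(R_world_obj, up_local=(0.0, 0.0, 1.0)):
    """Tilt angle in degrees between the rotated up-axis and world up
    (opposite of gravity); 0 means the pose is gravity-aligned."""
    up_world = np.asarray(R_world_obj) @ np.asarray(up_local)
    cos = np.clip(up_world[2] / np.linalg.norm(up_world), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

tilt_identity = gravity_tilt_deg(np.eye(3))   # gravity-aligned pose
# A 90-degree rotation about the x-axis tips the up-axis fully sideways.
R_x90 = np.array([[1.0, 0.0,  0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0,  0.0]])
tilt_sideways = gravity_tilt_deg(R_x90)
```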