TLDR: Given SLAMed egocentric videos, WHOLE jointly reconstructs coherent hand and object motion in world space by guiding a generative motion prior, unlike existing methods that predict hand or object poses separately.
Reconstruction Using the Generative Motion Prior. Given metric-SLAMed egocentric videos and the object template O, we alternate between a diffusion generation step and a guidance step to predict hand motion H, the object 6D trajectory T, and binary contact C as the final output x0. The diffusion model Dψ is conditioned on the object geometry and an approximate hand H from an off-the-shelf hand estimator to denoise the noisy parameters xn. The guidance step refines the denoised output by optimizing task-specific objectives g so that it aligns with video observations ŷ, such as 2D masks and contact. The contact labels Ĉ are produced automatically by prompting a VLM.
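The alternating generation/guidance loop can be sketched as follows. This is a minimal toy illustration, not the paper's exact formulation: the `denoise` and `guide` interfaces, the step size, and the linear re-noising schedule are all assumptions standing in for Dψ and the objectives g.

```python
import numpy as np

def guided_sampling(denoise, guide, x_N, n_steps, step_size=0.1):
    """Alternate a diffusion generation step with a guidance step.

    denoise(x, n) -> x0_hat : model's estimate of the clean sample x0
    guide(x0_hat) -> grad   : gradient of the task objectives g
    (toy linear schedule; the actual model and guidance are richer)
    """
    x = x_N
    x0_hat = x_N
    for n in range(n_steps, 0, -1):
        x0_hat = denoise(x, n)                        # generation: predict x0
        x0_hat = x0_hat - step_size * guide(x0_hat)   # guidance: refine x0
        alpha = (n - 1) / n_steps
        x = alpha * x + (1.0 - alpha) * x0_hat        # simplified re-noising
    return x0_hat

# Toy check: an identity "denoiser" plus a quadratic objective pulls the
# sample toward a target y (standing in for the video observations).
y = np.zeros(3)
out = guided_sampling(lambda x, n: x, lambda x0: 2.0 * (x0 - y),
                      x_N=np.full(3, 5.0), n_steps=50)
```

In the toy run, guidance steadily moves the denoised output toward the observation target, mirroring how the reprojection and contact objectives steer the generated motion toward the video evidence.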
Please refer to Section 4.2 for further analysis.
Blended Generation without Guidance: The pretrained diffusion model takes the object template and approximated hand poses as input and predicts binary contact labels (for both hands, visualized in red), object motion, and hand motion. While contact details can still be refined through post-processing, the model already provides a reasonable prior over hand-object dynamics, capturing how an object moves while held and how it behaves after being released. The two rows illustrate diverse samples, differing most notably in the timing of contact events and the resulting object trajectories.
Hand-Guided HOI Planner: Beyond reconstruction, our framework can directly synthesize diverse hand-object interaction motions from a coarse hand trajectory and picking/placing times (contact labels). The coarse hand trajectory conditions the diffusion process, while the contact labels are injected at each guidance step.
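A sketch of this injection mechanism (the function names and tensor layout are illustrative assumptions): the coarse trajectory enters as a condition to the denoiser, while the user-given contact labels simply overwrite the contact channel of the denoised output at every step.

```python
import numpy as np

def plan_hoi(denoise, coarse_traj, contacts, x_N, n_steps):
    """Hypothetical planner loop: condition the diffusion model on a
    coarse hand trajectory and inject given contact labels
    (picking/placing times) at each guidance step."""
    x = x_N.copy()
    for n in range(n_steps, 0, -1):
        x0_hat = denoise(x, n, cond=coarse_traj)  # conditioned generation
        x0_hat[:, -1] = contacts                  # inject contact labels
        alpha = (n - 1) / n_steps
        x = alpha * x + (1.0 - alpha) * x0_hat    # simplified re-noising
    return x0_hat

# Toy check: a "denoiser" that blends toward the condition still yields a
# motion whose contact channel matches the injected labels exactly.
T, D = 8, 4
cond = np.zeros((T, D))
labels = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
motion = plan_hoi(lambda x, n, cond: 0.5 * x + 0.5 * cond,
                  cond, labels, x_N=np.random.randn(T, D), n_steps=20)
```

Overwriting one channel at every step is a standard inpainting-style constraint: the remaining channels stay free for the prior to fill in, which is what yields diverse motions under the same contact timing.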
We identify two failure modes. First, the reconstructed object orientation is sometimes flipped, likely because mask-based reprojection guidance alone is not strong enough to correct the generation. Second, object poses can appear tilted, especially after the object is released by the hand: under a moving egocentric camera, even small errors in the estimated object pose can cause the resulting trajectory to misalign with the gravity direction in world space.
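One way to quantify the second failure mode (an assumed diagnostic for illustration, not part of the method) is the angle between the object's canonical up-axis, rotated by the estimated world-space orientation, and the world up direction opposite to gravity:

```python
import numpy as np

def gravity_tilt_deg(R_world_obj, up_local=(0.0, 0.0, 1.0)):
    """Tilt angle in degrees between the rotated up-axis and world up
    (opposite of gravity); 0 means the pose is gravity-aligned."""
    up_world = np.asarray(R_world_obj) @ np.asarray(up_local)
    cos = np.clip(up_world[2] / np.linalg.norm(up_world), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

tilt_identity = gravity_tilt_deg(np.eye(3))   # gravity-aligned pose
# A 90-degree rotation about the x-axis tips the up-axis fully sideways.
R_x90 = np.array([[1.0, 0.0,  0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0,  0.0]])
tilt_sideways = gravity_tilt_deg(R_x90)
```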