Yufei Ye1,
Yao Feng2,
Omid Taheri2,
Haiwen Feng2,
Shubham Tulsiani1*,
Michael J. Black2*
1Carnegie Mellon University,
2Max Planck Institute for Intelligent Systems
* Equal Contribution
TLDR: Existing methods produce convincing reprojections, but their 4D trajectories are not plausible. HaPTIC reconstructs Hand Pose and 4D hand Trajectory in consistent global Coordinates while maintaining strong 2D alignment.
Overall pipeline (left): HaPTIC extends the image-based model HaMeR. HaPTIC takes in multiple frames at a time and passes them through image towers that share weights. Each image tower outputs MANO parameters in local coordinates, together with trajectory parameters that directly place the predicted local hand onto the global 4D trajectory.
Inside one image tower (right): The image tower is based on a transformer decoder. In each block, we add a cross-view self-attention layer (Cross-view SA) that fuses temporal information from other frames, and a cross-attention layer (Global CA) that attends to features of the original frames. Orange indicates the new components our method introduces compared to HaMeR.
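To make the block structure concrete, here is a minimal sketch of one image-tower decoder block, written as our own illustration rather than the released code: a standard self-attention/MLP block augmented with the two layers described above, cross-view self-attention across frames and cross-attention to the original frame features. Class name, token layout, and dimensions are all assumptions.

```python
import torch
import torch.nn as nn


class ImageTowerBlock(nn.Module):
    """Illustrative sketch (not the official implementation) of one decoder
    block with the two added layers: Cross-view SA and Global CA."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # New components relative to a HaMeR-style block (hypothetical names):
        self.cross_view_sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_ca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, tokens: torch.Tensor, frame_feats: torch.Tensor, T: int) -> torch.Tensor:
        # tokens:      (B*T, N, D) query tokens, one set per frame
        # frame_feats: (B*T, M, D) features of the original frames
        x = self.norms[0](tokens)
        tokens = tokens + self.self_attn(x, x, x)[0]  # per-frame self-attention
        # Cross-view SA: stack the T frames so tokens attend across time.
        BT, N, D = tokens.shape
        x = self.norms[1](tokens).reshape(BT // T, T * N, D)
        tokens = tokens + self.cross_view_sa(x, x, x)[0].reshape(BT, N, D)
        # Global CA: queries attend to the original frame features.
        x = self.norms[2](tokens)
        tokens = tokens + self.global_ca(x, frame_feats, frame_feats)[0]
        return tokens + self.mlp(self.norms[3](tokens))
```

For example, with a batch of 2 clips of 8 frames, `ImageTowerBlock(dim=32, heads=4)(tokens, frame_feats, T=8)` maps `(16, 5, 32)` query tokens and `(16, 49, 32)` frame features back to `(16, 5, 32)`.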
The de facto "lifting" method (Weak2Full) [19, 22, 27, 33, 46, 51] produces significant jitter and is sensitive to the assumed focal length.
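To see where this focal-length sensitivity comes from, here is a minimal sketch (our own illustration; function and variable names are assumptions) of the standard conversion from a weak-perspective camera (scale s, image-plane offsets tx, ty) to a full-perspective translation:

```python
import numpy as np


def weak_to_full_translation(s, tx, ty, focal, img_size):
    """Standard weak->full perspective lifting used by many hand/body
    regressors: depth is recovered as tz = 2 * focal / (s * img_size).
    Names and the epsilon are illustrative, not from any specific codebase."""
    tz = 2.0 * focal / (s * img_size + 1e-9)  # depth scales linearly with focal
    return np.array([tx, ty, tz])
```

Because tz is linear in the focal length, a wrong focal guess translates directly into a wrong global depth, and per-frame noise in the predicted scale s becomes jitter along the depth axis, which matches the behavior observed for Weak2Full.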
Metric depth estimation (ZoeDepth) predicts smoother results but struggles under occlusions.
Holistic whole-body estimation (WHAM) only works when the person is largely visible.
HaPTIC (Ours) predicts equally good 2D alignment while producing global trajectories that are more consistent with the ground truth (GT).
Please refer to Sections 4.1 and 4.2 for further analysis.
Can jittery trajectories from feed-forward methods be improved by test-time optimization?
We find that test-time optimization can make the predicted trajectory smoother, but it is much harder for it to correct the global motion itself. Overall, HaPTIC provides a better initialization for test-time optimization.
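To illustrate this trade-off, here is a hypothetical test-time refinement (a sketch of the general idea, not HaPTIC's or any baseline's actual optimizer; all names are ours): gradient steps on a data term that keeps the trajectory near the network prediction, plus an acceleration penalty that suppresses jitter. Such an objective can only redistribute the predicted motion locally; it has no signal to fix a globally wrong trajectory.

```python
import numpy as np


def smooth_trajectory(traj, weight=5.0, iters=500, lr=0.005):
    """Smooth a (T, 3) trajectory by gradient descent on
    ||x - traj||^2 + weight * sum_t ||x[t-1] - 2 x[t] + x[t+1]||^2.
    Hyperparameters are illustrative."""
    x = traj.copy()
    for _ in range(iters):
        g = 2.0 * (x - traj)                     # gradient of the data term
        acc = x[:-2] - 2 * x[1:-1] + x[2:]       # second differences
        g[:-2] += 2 * weight * acc               # gradient of the penalty,
        g[1:-1] -= 4 * weight * acc              # distributed over the three
        g[2:] += 2 * weight * acc                # entries each term touches
        x -= lr * g
    return x
```

Running this on a jittery trajectory reduces frame-to-frame acceleration while staying close to the input, which is exactly the behavior described above: smoother, but anchored to the original (possibly wrong) global motion.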