Predicting 4D Hand Trajectory from Monocular Videos

Yufei Ye1, Yao Feng2, Omid Taheri2, Haiwen Feng2, Shubham Tulsiani1*, Michael J. Black2*
1Carnegie Mellon University, 2Max Planck Institute for Intelligent Systems
* Equal Contribution

Paper Code

TLDR: Existing methods produce convincing 2D reprojections, but their 4D trajectories are not plausible. HaPTIC reconstructs Hand Pose and 4D hand Trajectory in consistent global Coordinates while maintaining strong 2D alignment.

Method Overview

Overall pipeline (left): HaPTIC extends the image-based model HaMeR. HaPTIC takes in multiple frames at a time and passes them through image towers that share weights. Each image tower outputs MANO parameters in local coordinates, along with trajectory parameters that directly place the predicted local hand into the global 4D trajectory.
Inside one image tower (right): The image tower is based on a transformer decoder. In each block, we add a cross-view self-attention layer (Cross-view SA) that fuses temporal information from the other frames, and a cross-attention layer (Global CA) that attends to features of the original full frames. Orange indicates the new components we introduce on top of HaMeR.
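A minimal PyTorch sketch of one such augmented decoder block appears below. It is an illustration under our own naming assumptions (HaPTICBlock, crop_feat, global_feat, num_frames), not the authors' released code; dimensions and the readout head are simplified.

import torch
import torch.nn as nn

class HaPTICBlock(nn.Module):
    """One decoder block as described above: a standard transformer decoder
    block augmented with cross-view self-attention over frames (Cross-view SA)
    and cross-attention to full-frame features (Global CA)."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view_sa = nn.MultiheadAttention(dim, heads, batch_first=True)  # new vs. HaMeR
        self.global_ca = nn.MultiheadAttention(dim, heads, batch_first=True)      # new vs. HaMeR
        self.crop_ca = nn.MultiheadAttention(dim, heads, batch_first=True)        # as in HaMeR
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(5))

    def forward(self, tok, crop_feat, global_feat, num_frames):
        # tok:         (B*T, Q, D) query tokens; one tower per frame, weights shared
        # crop_feat:   (B*T, N, D) hand-crop image features (HaMeR's input)
        # global_feat: (B*T, M, D) features of the original full frames
        bt, q, d = tok.shape
        b = bt // num_frames
        h = self.norm[0](tok)
        x = tok + self.self_attn(h, h, h)[0]
        # Cross-view SA: flatten the frames into one sequence so tokens attend across time
        h = self.norm[1](x).reshape(b, num_frames * q, d)
        x = x + self.cross_view_sa(h, h, h)[0].reshape(bt, q, d)
        # Global CA: attend to full-frame features, which carry global trajectory cues
        h = self.norm[2](x)
        x = x + self.global_ca(h, global_feat, global_feat)[0]
        # Cross-attention to hand-crop features, as in the original HaMeR block
        h = self.norm[3](x)
        x = x + self.crop_ca(h, crop_feat, crop_feat)[0]
        return x + self.mlp(self.norm[4](x))

Stacking such blocks and reading MANO and trajectory parameters off the final tokens mirrors the pipeline in the figure.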


Comparison with Baselines

The de facto "lifting" method (Weak2Full) [19, 22, 27, 33, 46, 51] produces significant jitter and is sensitive to focal length (see the sketch after this list).
Metric depth estimation (ZoeDepth) predicts smoother results but struggles under occlusions.
Holistic whole-body estimation (WHAM) only works when the person is largely visible.
HaPTIC (Ours) predicts equally good 2D alignment, and its global trajectories are more consistent with the ground truth (GT).
Please refer to Sections 4.1 and 4.2 for further analysis.
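To see why the lifting baseline is focal-length sensitive, below is a minimal sketch of the standard weak-to-full perspective conversion used by such pipelines; the function and variable names are our own. The recovered depth is t_z = 2f / (s * b) for a weak-perspective scale s predicted in a b x b crop, so per-frame jitter in s and any error in the assumed focal length f map directly into depth, and hence trajectory, error.

import numpy as np

def weak_to_full(s, tx, ty, crop_center, crop_size, focal, img_size):
    """Lift a weak-perspective crop camera (s, tx, ty) to a full-image translation."""
    w, h = img_size
    cx, cy = crop_center
    bs = s * crop_size
    tz = 2.0 * focal / bs                        # depth scales with the assumed focal length
    tx_full = tx + 2.0 * (cx - w / 2.0) / bs     # shift by the crop's offset from the principal point
    ty_full = ty + 2.0 * (cy - h / 2.0) / bs
    return np.array([tx_full, ty_full, tz])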


Test-Time Optimization

Can jittery trajectories from feed-forward methods be improved by test-time optimization?
We find that test-time optimization can make the predicted trajectory smoother, but it is much harder to correct the global motion itself. Overall, HaPTIC provides a better initialization for test-time optimization; a sketch follows below.
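Below is a minimal sketch of this kind of test-time optimization: per-frame global translations, initialized from a feed-forward prediction, are refined under a 2D reprojection term plus a temporal smoothness prior. The helper names, loss weights, and the pinhole project() are illustrative assumptions, not the exact objective used in the paper.

import torch

def project(points3d, focal, center):
    # Simple pinhole projection; points3d: (T, J, 3), center: (2,) tensor.
    xy = points3d[..., :2] / points3d[..., 2:].clamp(min=1e-6)
    return focal * xy + center

def refine_trajectory(joints_local, t_init, kpts2d, focal, center,
                      iters=200, w_smooth=10.0):
    t = t_init.clone().requires_grad_(True)   # (T, 3) per-frame global translations
    opt = torch.optim.Adam([t], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        joints_global = joints_local + t[:, None]        # place local hands in the world
        loss_2d = (project(joints_global, focal, center) - kpts2d).square().mean()
        loss_smooth = (t[1:] - t[:-1]).square().mean()   # removes jitter...
        # ...but a smoothness prior cannot fix a wrong overall motion,
        # which is why the initialization matters so much.
        (loss_2d + w_smooth * loss_smooth).backward()
        opt.step()
    return t.detach()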


More Results


Bibtex

@inproceedings{ye2024haptic,
  author    = {Ye, Yufei and Feng, Yao and Taheri, Omid and Feng, Haiwen and Tulsiani, Shubham and Black, Michael J.},
  title     = {Predicting 4D Hand Trajectory from Monocular Videos},
  booktitle = {arXiv},
  year      = {2024}
}

Acknowledgement: The authors would like to thank Georgios Pavlakos, Dandan Shan, and Soyong Shin for comparisons with the baselines HaMeR and WHAM. Yufei would like to thank Shashank Tripathi, Markos Diomataris, and Sai Kumar Dwivedi for fruitful discussions. Part of this work was done while Yufei was an intern at the Max Planck Institute. We also thank Ruihan Gao for proofreading.

Send feedback and questions to Yufei Ye. The website template is borrowed from SIREN.