Predicting 4D Hand Trajectory from Monocular Videos

Yufei Ye1, Yao Feng2, Omid Taheri2, Haiwen Feng2, Shubham Tulsiani1*, Michael J. Black2*
1Carnegie Mellon University, 2Max Planck Institute for Intelligent Systems
* Equal Contribution

Paper Code

TLDR: Existing methods produce convincing 2D reprojections, but their 4D trajectories are not plausible. HaPTIC reconstructs Hand Pose and 4D hand Trajectory in consistent global Coordinates while maintaining strong 2D alignment.

Method Overview

Overall pipeline (left): HaPTIC extends the image-based model HaMeR. HaPTIC takes in multiple frames at a time and passes them through image towers that share weights. Each image tower outputs MANO parameters in local coordinates, along with trajectory parameters that directly place the predicted local hand into the global 4D trajectory.
Inside one image tower (right): The image tower is based on a transformer decoder. In each block, we add a cross-view self-attention layer (Cross-view SA) that fuses temporal information from the other frames, and a cross-attention layer (Global CA) that attends to features of the original full frames. Orange indicates the new components we introduce on top of HaMeR.
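A minimal PyTorch sketch of one such augmented decoder block appears below. It is an illustration under our own naming assumptions (HaPTICBlock, crop_feat, global_feat, num_frames), not the authors' released code; dimensions and the readout head are simplified.

import torch
import torch.nn as nn

class HaPTICBlock(nn.Module):
    """One decoder block as described above: a standard transformer decoder
    block augmented with cross-view self-attention over frames (Cross-view SA)
    and cross-attention to full-frame features (Global CA)."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view_sa = nn.MultiheadAttention(dim, heads, batch_first=True)  # new vs. HaMeR
        self.global_ca = nn.MultiheadAttention(dim, heads, batch_first=True)      # new vs. HaMeR
        self.crop_ca = nn.MultiheadAttention(dim, heads, batch_first=True)        # as in HaMeR
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(5))

    def forward(self, tok, crop_feat, global_feat, num_frames):
        # tok:         (B*T, Q, D) query tokens; one tower per frame, weights shared
        # crop_feat:   (B*T, N, D) hand-crop image features (HaMeR's input)
        # global_feat: (B*T, M, D) features of the original full frames
        bt, q, d = tok.shape
        b = bt // num_frames
        h = self.norm[0](tok)
        x = tok + self.self_attn(h, h, h)[0]
        # Cross-view SA: flatten the frames into one sequence so tokens attend across time
        h = self.norm[1](x).reshape(b, num_frames * q, d)
        x = x + self.cross_view_sa(h, h, h)[0].reshape(bt, q, d)
        # Global CA: attend to full-frame features, which carry global trajectory cues
        h = self.norm[2](x)
        x = x + self.global_ca(h, global_feat, global_feat)[0]
        # Cross-attention to hand-crop features, as in the original HaMeR block
        h = self.norm[3](x)
        x = x + self.crop_ca(h, crop_feat, crop_feat)[0]
        return x + self.mlp(self.norm[4](x))

Stacking such blocks and reading MANO and trajectory parameters off the final tokens mirrors the pipeline in the figure.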


Comparison with Baselines

The de facto "lifting" method (Weak2Full) [19, 22, 27, 33, 46, 51] produces significant jitter and is sensitive to focal length (see the sketch after this list).
Metric depth estimation (ZoeDepth) predicts smoother results but struggles under occlusions.
Holistic whole-body estimation (WHAM) only works when the person is largely visible.
HaPTIC (Ours) predicts equally good 2D alignment, and its global trajectories are more consistent with the ground truth (GT).
Please refer to Sections 4.1 and 4.2 for further analysis.
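To see why the lifting baseline is focal-length sensitive, below is a minimal sketch of the standard weak-to-full perspective conversion used by such pipelines; the function and variable names are our own. The recovered depth is t_z = 2f / (s * b) for a weak-perspective scale s predicted in a b x b crop, so per-frame jitter in s and any error in the assumed focal length f map directly into depth, and hence trajectory, error.

import numpy as np

def weak_to_full(s, tx, ty, crop_center, crop_size, focal, img_size):
    """Lift a weak-perspective crop camera (s, tx, ty) to a full-image translation."""
    w, h = img_size
    cx, cy = crop_center
    bs = s * crop_size
    tz = 2.0 * focal / bs                        # depth scales with the assumed focal length
    tx_full = tx + 2.0 * (cx - w / 2.0) / bs     # shift by the crop's offset from the principal point
    ty_full = ty + 2.0 * (cy - h / 2.0) / bs
    return np.array([tx_full, ty_full, tz])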


Test-Time Optimization

Can jittery trajectories from feed-forward methods be improved by test-time optimization?
We find that test-time optimization can make the predicted trajectory smoother, but it is much harder to correct the global motion itself. Overall, HaPTIC provides a better initialization for test-time optimization; a sketch follows below.
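Below is a minimal sketch of this kind of test-time optimization: per-frame global translations, initialized from a feed-forward prediction, are refined under a 2D reprojection term plus a temporal smoothness prior. The helper names, loss weights, and the pinhole project() are illustrative assumptions, not the exact objective used in the paper.

import torch

def project(points3d, focal, center):
    # Simple pinhole projection; points3d: (T, J, 3), center: (2,) tensor.
    xy = points3d[..., :2] / points3d[..., 2:].clamp(min=1e-6)
    return focal * xy + center

def refine_trajectory(joints_local, t_init, kpts2d, focal, center,
                      iters=200, w_smooth=10.0):
    t = t_init.clone().requires_grad_(True)   # (T, 3) per-frame global translations
    opt = torch.optim.Adam([t], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        joints_global = joints_local + t[:, None]        # place local hands in the world
        loss_2d = (project(joints_global, focal, center) - kpts2d).square().mean()
        loss_smooth = (t[1:] - t[:-1]).square().mean()   # removes jitter...
        # ...but a smoothness prior cannot fix a wrong overall motion,
        # which is why the initialization matters so much.
        (loss_2d + w_smooth * loss_smooth).backward()
        opt.step()
    return t.detach()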


More Results


Bibtex

@inproceedings{ye2024haptic,
  author    = {Ye, Yufei and Feng, Yao and Taheri, Omid and Feng, Haiwen and Tulsiani, Shubham and Black, Michael J.},
  title     = {Predicting 4D Hand Trajectory from Monocular Videos},
  booktitle = {arXiv},
  year      = {2024}
}

Acknowledgement: The authors would like to thank Georgios Pavlakos, Dandan Shan, and Soyong Shin for comparisons with the baselines HaMeR and WHAM. Yufei would like to thank Shashank Tripathi, Markos Diomataris, and Sai Kumar Dwivedi for fruitful discussions. Part of this work was done while Yufei was an intern at the Max Planck Institute. We also thank Ruihan Gao for proofreading.

Send feedback and questions to Yufei Ye. The website template is borrowed from SIREN.