Our approach differs from two baselines (HO[Hasson et al., 19], GF[Karunratanakul et al. 20]) in three main aspects:
1) We formulate hand-held object reconstruction as conditional inference.
2) We additionally use pixel-aligned local features.
3) We predict object in a normalized wrist frame with articulation-aware positional encoding.