WACV · 2026

Enhancing Monocular 3D Hand Reconstruction
with Learned Texture Priors

Giorgos Karvounas1,3, Nikolaos Kyriazis1, Iason Oikonomidis1, Georgios Pavlakos2, Antonis A. Argyros1,3
1ICS-FORTH · 2University of Texas at Austin · 3University of Crete

We revisit texture not as a rendering detail, but as a dense, spatially grounded cue that actively supports monocular 3D hand pose and shape estimation. Our plug-and-play texture module learns from sparse UV observations and improves both accuracy and realism when integrated into HaMeR.

Monocular 3D Hand Reconstruction
Texture Priors
Differentiable Rendering
Transformers
Teaser: texture-guided monocular 3D hand reconstruction.

Abstract

We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment is an underused supervisory signal.

We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a dense alignment loss between predicted and observed hand appearances. Assuming a differentiable rendering pipeline and a mesh-based hand model with known topology, we back-project the textured hand onto the input and perform pixel-level alignment.

To isolate the value of texture-guided supervision, we augment HaMeR, a high-performing yet architecturally clean transformer for 3D hand pose estimation. Our system improves both accuracy and realism, demonstrating that appearance-guided alignment is a powerful, scalable signal for monocular hand reconstruction.

Method Overview

Texture module in UV space

The core of our approach is a texture model that operates directly on sparse UV-RGB observations. Given visible pixels from a monocular image, projected onto the hand mesh surface, we obtain a variable-length set of UV coordinates and colors. A transformer-based encoder attends to these pixel-level inputs, and a convolutional decoder upsamples the representation into a dense UV texture map.

  • Input: sparse set of UV-RGB samples from visible mesh regions.
  • Backbone: transformer encoder over irregular pixel tokens.
  • Output: coherent full-hand UV texture aligned with the mesh topology.
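The module described above can be sketched in PyTorch as follows. This is a minimal illustration of the idea, not the paper's implementation: all layer sizes, depths, and the learned-query readout from variable-length tokens to a latent UV grid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextureModule(nn.Module):
    """Sketch: a transformer encoder attends to a variable-length set of
    (u, v, r, g, b) samples; learned per-cell queries read the encoded tokens
    into a latent UV grid via cross-attention; a convolutional decoder
    upsamples that grid into a dense RGB UV texture. Sizes are illustrative."""

    def __init__(self, d_model=128, uv_res=64):
        super().__init__()
        self.grid = uv_res // 4                        # latent grid side (e.g. 16)
        self.embed = nn.Linear(5, d_model)             # (u, v, r, g, b) -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # one learned query per latent UV cell
        self.queries = nn.Parameter(torch.randn(1, self.grid ** 2, d_model))
        self.readout = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # convolutional decoder: latent grid -> dense RGB texture in [0, 1]
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(d_model, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, uv, rgb, pad_mask=None):
        # uv: (B, N, 2) coords in [0, 1]; rgb: (B, N, 3); N varies per image,
        # so shorter sets are zero-padded and flagged in pad_mask (B, N).
        tokens = self.embed(torch.cat([uv, rgb], dim=-1))
        tokens = self.encoder(tokens, src_key_padding_mask=pad_mask)
        q = self.queries.expand(uv.shape[0], -1, -1)
        lat, _ = self.readout(q, tokens, tokens, key_padding_mask=pad_mask)
        lat = lat.transpose(1, 2).reshape(-1, lat.shape[-1], self.grid, self.grid)
        return self.decoder(lat)                       # (B, 3, uv_res, uv_res)
```

The cross-attention readout is one simple way to map an irregular token set onto a fixed UV grid; it also makes the output resolution independent of the number of visible pixels.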

Texture-guided photometric supervision

We integrate the texture module into a standard image-to-mesh pipeline (HaMeR). During training, the predicted textured mesh is rendered back into the image via differentiable rendering, and we enforce a dense photometric consistency loss between the rendered hand and the observed RGB.

  • Render textured hand into the input view using differentiable rendering.
  • Compute pixel-wise photometric loss only on visible, hand-covered regions.
  • Backpropagate through both geometry and texture modules.
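The loss in the steps above can be sketched as a masked L1 penalty. This is an assumed form, not the paper's exact objective; `render_mask` is assumed to come from the differentiable rasterizer's coverage output, which restricts supervision to visible, hand-covered pixels.

```python
import torch

def photometric_loss(rendered_rgb, image_rgb, render_mask):
    """Sketch of the dense photometric term: L1 distance between the
    differentiably rendered, textured hand and the observed image,
    averaged over hand-covered pixels only. Illustrative, not the
    paper's exact loss."""
    # rendered_rgb, image_rgb: (B, 3, H, W); render_mask: (B, 1, H, W) in [0, 1]
    per_pixel = (rendered_rgb - image_rgb).abs().mean(dim=1, keepdim=True)
    masked = per_pixel * render_mask
    # normalize by covered area so the loss is independent of hand size
    return masked.sum() / render_mask.sum().clamp(min=1.0)
```

Because the rendered colors depend on both the predicted texture and the predicted mesh (through rasterization and UV mapping), gradients from this single term flow into the texture module and the geometry branch alike, which is what lets appearance supervision refine pose and shape.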

Contributions

  • Texture priors from sparse observations. We introduce the first framework that consolidates sparse, partial hand texture observations into a unified UV-space model for full texture reconstruction, trained without ground-truth textures or multiview studio data.
  • Pixel-level transformer for UV textures. Our texture module attends to pixel-perfect, variable-length UV-RGB inputs and predicts dense, coherent textures across diverse visibility patterns typical of in-the-wild imagery.
  • End-to-end photometric supervision. Leveraging differentiable rendering, we define dense photometric losses that supervise texture synthesis and, indirectly, refine hand geometry during training.
  • Plug-and-play integration with HaMeR. When added to a strong but clean baseline (HaMeR), our texture-guided supervision yields measurable improvements in both standard 3D metrics and visual alignment.

Results

Quantitative improvements

We evaluate on standard hand reconstruction benchmarks, comparing the original HaMeR baseline to our texture-guided variant. Texture priors consistently reduce joint and vertex errors, especially in challenging, highly articulated poses where appearance cues are crucial.

  • 3D Joint Error (MPJPE): ↓ vs. HaMeR baseline
  • Vertex Error (MPVPE): ↓ improved alignment
  • Photometric Consistency: ↑ sharper, cleaner hands

Beyond averaged metrics, improvements are particularly noticeable for partial views, self-occlusions, and motion-blurred frames where geometry-only supervision struggles.
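For reference, the joint and vertex errors reported above follow the standard definitions; a minimal sketch (standard metric code, not taken from the paper) is:

```python
import torch

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints. The same formula over mesh
    vertices gives MPVPE. Standard metric, not code from the paper."""
    # pred, gt: (B, J, 3)
    return (pred - gt).norm(dim=-1).mean()
```

Reported numbers in the literature are often Procrustes-aligned (PA-MPJPE), i.e. computed after a rigid alignment of prediction to ground truth; the sketch above shows only the unaligned distance.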

Qualitative comparisons

Qualitative comparison: baseline HaMeR vs. our texture-guided variant.
Left: input RGB. Middle: HaMeR mesh overlay. Right: our method. Texture-guided supervision resolves misalignments at fingertips and silhouettes and recovers more realistic appearance.
Gallery of reconstructed UV textures from in-the-wild images.
Gallery: reconstructed UV textures from in-the-wild images. The module consolidates partial observations into complete, high-fidelity textures despite occlusions and missing regions.


Citation

If you find this work useful in your research, please consider citing:

@inproceedings{Karvounas2026wacv,
  author = {Karvounas, Giorgos and Kyriazis, Nikolaos and Oikonomidis, Iason and Pavlakos, Georgios and Argyros, Antonis A.},
  title = {Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2026)},
  year = {2026},
  month = {March},
  address = {Tucson, Arizona, USA}
}