HRAvatar: High-Quality and Relightable Gaussian Head Avatar

Dongbin Zhang1,2, Yunfei Liu2, Lijian Lin2, Ye Zhu2, Kangjie Chen1, Minghan Qin1, Yu Li2†, Haoqian Wang1† († denotes corresponding author)
1Tsinghua Shenzhen International Graduate School, Tsinghua University 2International Digital Economy Academy (IDEA)
CVPR 2025

Video Demo


Abstract

Reconstructing animatable and high-quality 3D head avatars from monocular videos, especially with realistic relighting, is a valuable task. However, the limited information from single-view input, combined with complex head poses and facial movements, makes this task challenging. Previous methods achieve real-time performance by combining 3D Gaussian Splatting with a parametric head model, but the resulting head quality suffers from inaccurate face tracking and the limited expressiveness of the deformation model. These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. HRAvatar reduces tracking errors through end-to-end optimization and better captures individual facial deformations using learnable blendshapes and learnable linear blend skinning. Additionally, it decomposes head appearance into several physical properties and incorporates physically-based shading to account for environmental lighting. Extensive experiments demonstrate that HRAvatar not only reconstructs head avatars of superior quality but also achieves realistic visual effects under varying lighting conditions.
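As a rough illustration of the deformation idea above, the minimal PyTorch sketch below adds expression-dependent offsets from a learnable per-point blendshape basis and then maps the points into pose space with learnable linear blend skinning. The tensor names and the softmax parameterization of the skinning weights are illustrative assumptions, not HRAvatar's actual implementation.

import torch

def deform_to_pose_space(canonical_xyz,     # (N, 3) canonical Gaussian centers
                         expr_basis,        # (N, 3, E) learnable expression blendshapes
                         expr_coeffs,       # (E,)   expression/jaw coefficients for one frame
                         skin_logits,       # (N, J) learnable skinning weights (as logits)
                         joint_transforms): # (J, 4, 4) rigid transforms of the head joints
    # 1) Expression-dependent offsets from the learnable blendshape basis.
    offsets = torch.einsum('nde,e->nd', expr_basis, expr_coeffs)
    shaped = canonical_xyz + offsets

    # 2) Blend the joint transforms per point; a softmax keeps the learnable
    #    skinning weights positive and normalized (one possible parameterization).
    weights = skin_logits.softmax(dim=-1)                               # (N, J)
    blended = torch.einsum('nj,jab->nab', weights, joint_transforms)    # (N, 4, 4)

    # 3) Apply the blended rigid transform in homogeneous coordinates.
    homo = torch.cat([shaped, torch.ones_like(shaped[:, :1])], dim=-1)  # (N, 4)
    return torch.einsum('nab,nb->na', blended, homo)[:, :3]

Gaussian rotations and scales would be updated analogously with the rotational part of the blended transform; the sketch covers only the point centers.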

With monocular video input, HRAvatar reconstructs a high-quality, animatable 3D head avatar that enables realistic relighting effects and simple material editing.

Self-reenactment

Animation and relighting in a rotating environment

Relighting in a rotating environment

Material editing

Comparison in self-reenactment

Comparison in cross-reenactment

Comparison in novel view synthesis

Comparison in relighting

Method

Given a monocular video with unknown lighting, we first track the subject's fixed shape parameters and per-frame pose parameters through iterative optimization before training. Expression parameters and jaw poses are estimated by an expression encoder, which is optimized during training. With these parameters, we transform the Gaussian points into pose space using learnable linear blendshapes and linear blend skinning. We then render the Gaussian points to obtain albedo, roughness, reflectance, and normal maps. Finally, we compute pixel colors using physically-based shading with optimizable environment maps.
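For intuition about the final shading step, here is a hedged PyTorch sketch that combines per-pixel albedo, roughness, reflectance, and normal maps with an environment map sampled at a set of light directions. The Lambertian diffuse term and the Phong-style specular lobe (with Schlick's Fresnel, treating the reflectance map as F0) are deliberate simplifications for illustration; they do not reproduce the paper's exact BRDF or environment-map representation.

import torch

def shade_pixels(albedo, roughness, reflectance, normal, view_dir, env_dirs, env_radiance):
    # albedo, reflectance: (P, 3) per-pixel material maps (flattened pixels)
    # roughness:           (P, 1)
    # normal, view_dir:    (P, 3) unit vectors; view_dir points from surface to camera
    # env_dirs:            (L, 3) unit light directions sampled from the environment map
    # env_radiance:        (L, 3) radiance of the optimizable environment map per direction
    n_dot_l = (normal @ env_dirs.T).clamp(min=0.0)                     # (P, L)

    # Diffuse: Lambertian integration of incoming radiance against the cosine lobe.
    diffuse = albedo * (n_dot_l @ env_radiance) / env_dirs.shape[0]    # (P, 3)

    # Specular: a Phong-style lobe around the mirror reflection direction,
    # sharpened as roughness decreases and scaled by Schlick's Fresnel term.
    refl = 2.0 * (normal * view_dir).sum(-1, keepdim=True) * normal - view_dir
    r_dot_l = (refl @ env_dirs.T).clamp(min=0.0)                       # (P, L)
    shininess = 2.0 / roughness.clamp(min=1e-3) ** 2                   # (P, 1)
    lobe = r_dot_l ** shininess                                        # (P, L)
    n_dot_v = (normal * view_dir).sum(-1, keepdim=True).clamp(min=0.0)
    fresnel = reflectance + (1.0 - reflectance) * (1.0 - n_dot_v) ** 5
    specular = fresnel * (lobe @ env_radiance) / env_dirs.shape[0]     # (P, 3)

    return diffuse + specular

In a sketch like this, the environment radiance enters the pixel color linearly, so gradients flow to both the material maps and the environment map, which is the property that allows lighting to be optimized jointly with the avatar.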

BibTeX