GUAVA: Generalizable Upper Body 3D Gaussian Avatar

Dongbin Zhang1,2, Yunfei Liu2†, Lijian Lin2, Ye Zhu2, Yang Li1, Minghan Qin1, Yu Li2‡, Haoqian Wang1† († corresponding authors, ‡ project lead)
1Tsinghua Shenzhen International Graduate School, Tsinghua University; 2International Digital Economy Academy (IDEA)

From a single image, GUAVA reconstructs a 3D upper-body Gaussian avatar via feed-forward inference in under a second, enabling real-time expressive animation and novel view synthesis at 512×512 resolution.

Abstract

Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training a separate model for each identity, which is both complex and time-consuming. Furthermore, limited by SMPLX's expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method for it. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling to infer Ubody (upper-body) Gaussians from a single image, and the rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA substantially outperforms previous methods in rendering quality while offering significant speed improvements: reconstruction takes around 0.1 s, and animation and rendering run in real time.
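To make the projection-sampling idea concrete, below is a minimal PyTorch-style sketch of how per-vertex appearance features could be gathered by projecting template vertices into the source image and bilinearly sampling the encoder's feature map. The function name, pinhole camera model, and tensor layout are illustrative assumptions, not the released GUAVA implementation.

```python
import torch
import torch.nn.functional as F

def projection_sampling(feature_map, verts_cam, K):
    """Hypothetical helper: sample per-vertex features by projecting 3D points into the image.

    feature_map: (1, C, H, W) appearance feature map from the image encoder
    verts_cam:   (N, 3) template vertices in camera space
    K:           (3, 3) camera intrinsics
    """
    _, C, H, W = feature_map.shape
    # Pinhole projection to pixel coordinates.
    uv = (K @ verts_cam.T).T                       # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3]                    # perspective divide -> (N, 2) pixels
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, -1, 1, 2)                  # (1, N, 1, 2)
    # Bilinear sampling of the feature map at the projected locations.
    feats = F.grid_sample(feature_map, grid, align_corners=True)  # (1, C, N, 1)
    return feats.squeeze(-1).squeeze(0).T          # (N, C) per-vertex features
```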

Video Demo


Self-Reenactment

Cross-Reenactment

Novel View Synthesis

Comparison with 3D-Based Methods in Self-Reenactment

Comparison with 2D-Based Methods in Self-Reenactment

Comparison with 2D-Based Methods in Cross-Reenactment

Method

Given the source and target images, we first obtain the shape, expression, and pose parameters of the EHM template model through tracking. The source image is then passed through an image encoder to extract an appearance feature map. Using these features and the tracked EHM, one branch predicts the template Gaussians, and the other predicts the UV Gaussians. These are combined to form the Ubody Gaussians in canonical space, which are then deformed into pose space using the tracked parameters from the target image. Finally, a coarse feature map is rendered and refined by a neural refiner to produce the final image.
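The data flow above can be summarized in a short PyTorch-style sketch. Every sub-module and interface here (encoder, the two prediction branches, deformer, rasterizer, refiner) is a hypothetical placeholder standing in for the components described in the paragraph, not GUAVA's actual code.

```python
import torch
import torch.nn as nn

class UbodyAvatarSketch(nn.Module):
    """Illustrative forward pass mirroring the described pipeline.
    All sub-modules are hypothetical placeholders, not the released GUAVA code."""

    def __init__(self, encoder, template_branch, uv_branch, deformer, rasterizer, refiner):
        super().__init__()
        self.encoder = encoder                  # source image -> appearance feature map
        self.template_branch = template_branch  # features + tracked EHM -> template Gaussians
        self.uv_branch = uv_branch              # features + tracked EHM -> UV Gaussians
        self.deformer = deformer                # canonical Gaussians + target pose -> posed Gaussians
        self.rasterizer = rasterizer            # posed Gaussians -> coarse feature map
        self.refiner = refiner                  # coarse feature map -> final image

    def forward(self, src_img, src_ehm, tgt_ehm, tgt_cam):
        feat_map = self.encoder(src_img)                        # appearance features
        g_template = self.template_branch(feat_map, src_ehm)    # branch 1: template Gaussians
        g_uv = self.uv_branch(feat_map, src_ehm)                # branch 2: UV Gaussians
        ubody_canonical = torch.cat([g_template, g_uv], dim=1)  # Ubody Gaussians (canonical space)
        ubody_posed = self.deformer(ubody_canonical, tgt_ehm)   # deform with target pose/expression
        coarse = self.rasterizer(ubody_posed, tgt_cam)          # render coarse feature map
        return self.refiner(coarse)                             # neural refiner -> final image
```

The concatenation above assumes both Gaussian sets share a common per-point layout (position, opacity, scale, rotation, and appearance attributes); the actual attribute parameterization is a detail of the paper, not of this sketch.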

BibTeX