EMOSH - Project Page

Overview

Given a reference image and a driving video, EMOSH achieves high-fidelity, mesh-guided expressive human animation while disentangling expressive motion from body shape.

Abstract

High-fidelity, expressive, and controllable human animation is essential for content creation and digital avatar applications. However, existing methods face a dilemma between expressiveness and disentanglement. Mainstream 2D pose-conditioned approaches suffer from "motion-shape entanglement", which leaks the driving subject's body shape into the generated video. Conversely, methods relying on 3D priors (e.g., SMPL) achieve geometric disentanglement but struggle to capture facial expressions and complex gestures, resulting in rigid animations. To address this dilemma, we propose EMOSH, a novel framework for high-fidelity controllable human video generation. First, an Expressive Human Model (EHM) is introduced as the core control representation. By explicitly disentangling shape and pose parameters, it resolves the body shape leakage issue at its source. Alongside this, a robust motion tracker is designed to accurately estimate EHM parameters from video. Second, we propose a Coarse-to-Fine Hybrid Motion Injection strategy, enabling more fine-grained control over expressions and gestures. Furthermore, we introduce a Spatially-Aligned Conditioning mechanism to bridge the domain gap between training and inference, improving identity consistency. Extensive experiments demonstrate that EMOSH outperforms previous methods in both self-driven and cross-driven scenarios.

Video Demo

Self-Driven

Cross-Driven

Multi-Identity → Same Motion

Driving Motion

Multi-Identity → Same Motion (Dance)

Driving Motion (EHM Mesh)

Dynamic Zoom Camera Control

Comparison with State-of-the-Art

Videos shown for each case: EMOSH (Ours), Driving Video, HyperMotion, Wan-Animate.

Kid → Cross-Act A

Kid → Cross-Act B

Man → Cross-Act

Woman → Cross-Act

Ablation Studies

Self-Driven Ablation

Videos shown: Ground Truth, EMOSH (Ours), w/o Tracker, w/o Hybrid Motion.

Cross-Driven Ablation

Videos shown: EMOSH (Full), Driving Video, w/o Disentanglement, w/o Spatial Align.

Motion Tracking Comparison

Videos shown for each case: EMOSH Tracker (Ours), GUAVA Tracker, Original Video.

Case 1

Case 2

Method

The overall pipeline of EMOSH. Given a reference image and a driving video, we first track the EHM parameters. The motion features are injected in a coarse-to-fine manner through the Hybrid Motion Injection module. Spatially-Aligned Conditioning bridges the domain gap to ensure identity consistency.
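
To illustrate why separate shape and pose parameters help in the cross-driven setting, here is a minimal Python sketch. It is not the EMOSH implementation: the `EHMFrame` container, its field names, and the `cross_drive` helper are hypothetical, and the page does not specify the actual EHM parameterization. The sketch only shows the idea that the reference subject's shape can be kept fixed while pose and expression are taken from each tracked driving frame, so the driving subject's body shape never enters the control signal.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class EHMFrame:
    """Hypothetical per-frame parameter container; field names are assumptions."""
    shape: Tuple[float, ...]       # identity-specific body shape
    pose: Tuple[float, ...]        # body and hand articulation
    expression: Tuple[float, ...]  # facial expression


def cross_drive(reference: EHMFrame, driving: List[EHMFrame]) -> List[EHMFrame]:
    """Keep the reference shape; copy pose and expression from each driving frame."""
    return [replace(frame, shape=reference.shape) for frame in driving]


# Toy values, chosen only to make the swap visible.
reference = EHMFrame(shape=(0.5, -0.2), pose=(0.0, 0.0), expression=(0.0,))
driving = [
    EHMFrame(shape=(1.5, 0.9), pose=(0.1, 0.3), expression=(0.7,)),
    EHMFrame(shape=(1.5, 0.9), pose=(0.2, 0.4), expression=(0.1,)),
]
control = cross_drive(reference, driving)
```

In this toy setup every frame in `control` carries the reference shape, while pose and expression follow the driving sequence frame by frame.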

BibTeX

@inproceedings{zhang2026emosh,
  title={EMOSH: Expressive Motion and Shape Disentanglement for Human Animation},
  author={Zhang, Dongbin and Liu, Hao and Dai, Binquan and Chen, Kangjie and Wang, Chuming and Li, Chen and LYU, Jing and Wang, Haoqian},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}
