HiSTalk

Hierarchical Speech Feature-based Landmark Displacements for 3D Talking Head Animation

Demo Video

If video playback is slow or stutters, please download the demo video and play it locally for a smoother experience.

Comparisons with CodeTalker, FaceFormer, and DiffSpeaker.

Abstract

Speech-driven 3D talking head animation seeks to generate realistic, expressive facial motion directly from audio. Most existing methods use only single-scale (i.e., frame-level) speech features and directly regress full-face geometry, ignoring the influence of multi-level linguistic cues (phonemes, words, utterances) on facial motion, which often results in over-smoothed, unnatural movements.

To address this, we propose HiSTalk, a hierarchical framework comprising:

  • Coarse Motion Generator (CMG)
  • Fine Motion Refiner (FMR), comprising:
    • HSF2S Encoder
    • S2D Decoder

HiSTalk achieves precise lip-sync and expressive facial animation, outperforming state-of-the-art methods on VOCASET and BIWI.
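To make the multi-level idea concrete, here is a minimal, hedged sketch of pooling frame-level speech features into phoneme-, word-, and utterance-level summaries. The window sizes and feature dimensions below are illustrative placeholders, not HiSTalk's actual hierarchy or encoder.

```python
# Sketch: multi-scale pooling of frame-level speech features.
# Window sizes are assumptions for illustration, not the paper's values.
import numpy as np

def pool_windows(frames: np.ndarray, win: int) -> np.ndarray:
    """Average-pool (T, D) frame features over non-overlapping windows of length `win`."""
    T, D = frames.shape
    n = T // win
    return frames[: n * win].reshape(n, win, D).mean(axis=1)

def hierarchical_features(frames: np.ndarray) -> dict:
    return {
        "frame": frames,                                  # finest scale
        "phoneme": pool_windows(frames, 3),               # ~3 frames per phoneme (assumed)
        "word": pool_windows(frames, 12),                 # ~12 frames per word (assumed)
        "utterance": frames.mean(axis=0, keepdims=True),  # one global summary vector
    }

feats = hierarchical_features(np.random.randn(60, 64))
print({k: v.shape for k, v in feats.items()})
```

A learned encoder would replace the fixed average pooling with attention or alignment-aware aggregation, but the shape of the hierarchy is the same.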

Method Overview

HiSTalk Framework
  • CMG: captures global facial motion trajectories from audio.
  • FMR: refines local facial dynamics through hierarchical speech modeling.
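The coarse-to-fine composition above can be sketched as follows: a coarse full-face motion field is refined by sparse landmark displacements that are spread to all vertices through a sparse-to-dense mapping. The mapping here is a random, row-normalized matrix and all dimensions are illustrative; HiSTalk's S2D decoder is learned, not hand-built like this.

```python
# Sketch: coarse full-face motion + landmark displacements mapped sparse-to-dense.
# V, L, the weight matrix, and the magnitudes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
V, L = 500, 68                                    # dense vertices, sparse landmarks

coarse = rng.normal(size=(V, 3)) * 0.01           # CMG-style global motion offsets
landmark_disp = rng.normal(size=(L, 3)) * 0.005   # FMR-style local displacements

# Sparse-to-dense weights: each vertex as a convex combination of landmarks.
W = rng.random(size=(V, L))
W /= W.sum(axis=1, keepdims=True)

final = coarse + W @ landmark_disp                # per-vertex refined motion, (V, 3)
print(final.shape)
```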

Quantitative Results

Dataset         LVE ↓     FDD ↓     LRP ↑
VOCASET-Test    2.6715    3.3657    97.21%
BIWI-Test-A     3.7554    2.7129    91.92%
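For reference, LVE (lip vertex error) in this line of work is commonly computed as the maximal per-frame L2 error over lip-region vertices, averaged across frames. The sketch below assumes that common definition; the lip-vertex indices and tensor shapes are placeholders.

```python
# Sketch of a common LVE definition: max L2 error over lip vertices per frame,
# averaged over frames. Indices and shapes are illustrative assumptions.
import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx: np.ndarray) -> float:
    """pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of lip-region vertices."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]        # (T, |lip|, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)      # L2 error per lip vertex
    return float(per_vertex.max(axis=1).mean())     # max over lip, mean over time

rng = np.random.default_rng(1)
pred = rng.normal(size=(10, 100, 3))
gt = pred.copy()
gt[:, :5] += 0.01                                   # perturb a few "lip" vertices
print(lip_vertex_error(pred, gt, np.arange(5)))
```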

User Study

Users preferred HiSTalk over CodeTalker in lip-sync (62.2%) and realism (55.6%), and over DiffSpeaker in lip-sync (57.8%) and realism (64.4%).

Ablation Study

Removing either CMG or FMR significantly increases FDD on VOCASET-Test, confirming that both modules are necessary.