HiSTalk

Hierarchical Speech Feature-based Landmark Displacements for 3D Talking Head Animation

Demo Video

If video playback is slow or stutters, please download the demo video and play it locally for a smoother experience.

Comparisons with CodeTalker, FaceFormer, and DiffSpeaker.

Abstract

Speech-driven 3D talking head animation seeks to generate realistic, expressive facial motion directly from audio. Most existing methods use only single-scale (i.e., frame-level) speech features and directly regress full-face geometry, ignoring the influence of multi-level linguistic cues (phonemes, words, utterances) on facial motion, which often results in over-smoothed, unnatural movements.

To address this, we propose HiSTalk, a hierarchical framework comprising:

  • Coarse Motion Generator (CMG)
  • Fine Motion Refiner (FMR), comprising:
    • HSF2S Encoder
    • S2D Decoder

HiSTalk achieves precise lip-sync and expressive facial animation, outperforming state-of-the-art methods on VOCASET and BIWI.
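To make the multi-level idea concrete, here is a minimal, hedged sketch of pooling frame-level speech features into phoneme-, word-, and utterance-level summaries. The window sizes and feature dimensions below are illustrative placeholders, not HiSTalk's actual hierarchy or encoder.

```python
# Sketch: multi-scale pooling of frame-level speech features.
# Window sizes are assumptions for illustration, not the paper's values.
import numpy as np

def pool_windows(frames: np.ndarray, win: int) -> np.ndarray:
    """Average-pool (T, D) frame features over non-overlapping windows of length `win`."""
    T, D = frames.shape
    n = T // win
    return frames[: n * win].reshape(n, win, D).mean(axis=1)

def hierarchical_features(frames: np.ndarray) -> dict:
    return {
        "frame": frames,                                  # finest scale
        "phoneme": pool_windows(frames, 3),               # ~3 frames per phoneme (assumed)
        "word": pool_windows(frames, 12),                 # ~12 frames per word (assumed)
        "utterance": frames.mean(axis=0, keepdims=True),  # one global summary vector
    }

feats = hierarchical_features(np.random.randn(60, 64))
print({k: v.shape for k, v in feats.items()})
```

A learned encoder would replace the fixed average pooling with attention or alignment-aware aggregation, but the shape of the hierarchy is the same.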

Method Overview

HiSTalk Framework
  • CMG: captures global facial motion trajectories from audio.
  • FMR: refines local facial dynamics through hierarchical speech modeling.
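The coarse-to-fine composition above can be sketched as follows: a coarse full-face motion field is refined by sparse landmark displacements that are spread to all vertices through a sparse-to-dense mapping. The mapping here is a random, row-normalized matrix and all dimensions are illustrative; HiSTalk's S2D decoder is learned, not hand-built like this.

```python
# Sketch: coarse full-face motion + landmark displacements mapped sparse-to-dense.
# V, L, the weight matrix, and the magnitudes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
V, L = 500, 68                                    # dense vertices, sparse landmarks

coarse = rng.normal(size=(V, 3)) * 0.01           # CMG-style global motion offsets
landmark_disp = rng.normal(size=(L, 3)) * 0.005   # FMR-style local displacements

# Sparse-to-dense weights: each vertex as a convex combination of landmarks.
W = rng.random(size=(V, L))
W /= W.sum(axis=1, keepdims=True)

final = coarse + W @ landmark_disp                # per-vertex refined motion, (V, 3)
print(final.shape)
```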

Quantitative Results

Dataset         LVE ↓     FDD ↓     LRP ↑
VOCASET-Test    2.6715    3.3657    97.21%
BIWI-Test-A     3.7554    2.7129    91.92%
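For reference, LVE (lip vertex error) in this line of work is commonly computed as the maximal per-frame L2 error over lip-region vertices, averaged across frames. The sketch below assumes that common definition; the lip-vertex indices and tensor shapes are placeholders.

```python
# Sketch of a common LVE definition: max L2 error over lip vertices per frame,
# averaged over frames. Indices and shapes are illustrative assumptions.
import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx: np.ndarray) -> float:
    """pred, gt: (T, V, 3) vertex sequences; lip_idx: indices of lip-region vertices."""
    diff = pred[:, lip_idx] - gt[:, lip_idx]        # (T, |lip|, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)      # L2 error per lip vertex
    return float(per_vertex.max(axis=1).mean())     # max over lip, mean over time

rng = np.random.default_rng(1)
pred = rng.normal(size=(10, 100, 3))
gt = pred.copy()
gt[:, :5] += 0.01                                   # perturb a few "lip" vertices
print(lip_vertex_error(pred, gt, np.arange(5)))
```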

User Study

Users preferred HiSTalk over CodeTalker in lip-sync (62.2%) and realism (55.6%), and over DiffSpeaker in lip-sync (57.8%) and realism (64.4%).

Ablation Study

Removing either CMG or FMR significantly increases FDD on VOCASET-Test, confirming that both modules are necessary.