HiSTalk

Hierarchical Speech Feature-based Landmark Displacements for 3D Talking Head Animation

Demo video showcasing comparisons with CodeTalker, FaceFormer, and DiffusionTalker.

Abstract

Speech-driven 3D talking head animation seeks to generate realistic, expressive facial motions directly from audio. Most existing methods use only single-scale (i.e., frame-level) speech features and directly regress full-face geometry. They ignore the influence of multi-level cues (phonemes, words, utterances) on facial motion, which often results in over-smoothed, unnatural movements. To address this, we propose HiSTalk, a hierarchical framework with two components. The Coarse Motion Generator (CMG) captures global facial trajectories via a Transformer on speech embeddings. The Fine Motion Refiner (FMR) has two stages: HSF2S encodes frame-, phoneme-, word-, and utterance-level features into weighted sparse landmark displacements using a squeeze-and-excitation gating mechanism, and S2D lifts these offsets into dense 3D deformation fields via a multi-branch attention-fusion Transformer decoder. By fusing coarse guidance with fine-grained refinements, HiSTalk achieves precise lip-sync and rich expressiveness, outperforming state-of-the-art methods on VOCASET and BIWI.

Method Overview

HiSTalk Pipeline
  • Coarse Motion Generator (CMG): captures global facial displacements from audio via a Wav2Vec 2.0 encoder and a Transformer decoder.
  • Fine Motion Refiner (FMR):
    • HSF2S Encoder: extracts hierarchical speech features (frame, phoneme, word, utterance) using SpeechFormer++ blocks and applies squeeze-and-excitation gating to compute sparse landmark offsets.
    • S2D Decoder: fuses sparse offsets into dense deformation fields with multi-branch attention-fusion Transformer decoders.
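
The squeeze-and-excitation gating in the HSF2S Encoder can be illustrated with a minimal NumPy sketch. It assumes the four feature levels (frame, phoneme, word, utterance) have already been aligned to a common (frames, channels) shape, that each level is squeezed to a single scalar, and that the excitation is a tiny two-layer network. The function name `se_gate`, the weight shapes, and the hidden size are hypothetical and are not taken from the HiSTalk implementation.

```python
import numpy as np

def se_gate(level_feats, w1, w2):
    """Weight hierarchical feature levels with squeeze-and-excitation gates.

    level_feats: (L, T, D) array, one (T, D) feature map per level (assumed layout).
    w1: (L, H) and w2: (H, L) excitation weights (hypothetical shapes).
    Returns the fused (T, D) features and the per-level gates of shape (L,).
    """
    # Squeeze: summarize each level by its global average.
    squeezed = level_feats.mean(axis=(1, 2))            # (L,)
    # Excitation: bottleneck with ReLU, then sigmoid to get gates in (0, 1).
    hidden = np.maximum(squeezed @ w1, 0.0)             # (H,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))        # (L,)
    # Re-weight every level by its gate and sum over levels.
    fused = (gates[:, None, None] * level_feats).sum(axis=0)  # (T, D)
    return fused, gates

rng = np.random.default_rng(0)
L, T, D, H = 4, 10, 8, 2          # 4 levels: frame, phoneme, word, utterance
feats = rng.standard_normal((L, T, D))
w1 = rng.standard_normal((L, H)) * 0.1
w2 = rng.standard_normal((H, L)) * 0.1
fused, gates = se_gate(feats, w1, w2)
```

In this sketch the fused features would still need a projection head to become sparse landmark offsets; that step is omitted because the page does not describe it.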

Quantitative Results

Dataset         LVE ↓     FDD ↓     LRP ↑
VOCASET-Test    2.6715    3.3657    97.21%
BIWI-Test-A     3.7554    2.7129    91.92%
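
For readers unfamiliar with LVE (lip vertex error), the sketch below shows one common way the metric is computed in prior speech-driven animation work: the maximum error over lip vertices in each frame, averaged over all frames. This page does not define the metric, so the formulation is an assumption; implementations also differ on whether the per-vertex distance is squared and on the unit scaling. The function name `lip_vertex_error` and the lip index list are hypothetical.

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """Mean over frames of the maximal lip-vertex L2 error.

    pred, gt: (T, V, 3) arrays of predicted and ground-truth vertex positions.
    lip_idx: indices of the lip-region vertices (hypothetical; dataset-specific).
    """
    diff = pred[:, lip_idx] - gt[:, lip_idx]        # (T, n_lip, 3)
    per_vertex = np.linalg.norm(diff, axis=-1)      # (T, n_lip), plain L2 (assumed)
    return per_vertex.max(axis=1).mean()            # max over lips, mean over frames

# Toy example: 2 frames, 5 vertices, one lip vertex displaced in frame 0.
gt = np.zeros((2, 5, 3))
pred = np.zeros((2, 5, 3))
pred[0, 1] = [3.0, 4.0, 0.0]                        # distance 5 from ground truth
lve = lip_vertex_error(pred, gt, lip_idx=[1, 2])
```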

User Study

Users preferred HiSTalk over CodeTalker in lip-sync (62.2%) and realism (55.6%), and over DiffSpeaker in lip-sync (57.8%) and realism (64.4%).

Ablation Study

Removing CMG or FMR components increases FDD by up to 25.3% on VOCASET-Test, confirming each module’s necessity.