DelayGSE: A Generative Speech Enhancement Framework with Delayed Text-Aware Conditioning

Accepted at INTERSPEECH 2026
Xin Yuan¹, Junling Lv¹, Zezhou Xu², Xingjun Tan¹, Liangliang Li¹, Yanqiang Lei¹*
¹ Guangzhou Shiyuan Electronic Technology Company Limited, Guangzhou, China
² Shanghai University of Finance and Economics, Shanghai, China

Abstract

Recent generative speech enhancement methods based on language and diffusion models achieve strong perceptual quality but are more susceptible than discriminative approaches to speech-like hallucinations under low SNRs and transient noise. We propose DelayGSE, a text-aware generative speech enhancement framework built on a multi-codebook language model for denoising, dereverberation, and audio super-resolution. DelayGSE conditions on noisy-speech STFT features and Whisper encoder representations, and models multiple discrete codebooks in a delayed manner to stabilize generation. A text-aware mechanism suppresses hallucinations, while an importance-aware codebook weighting strategy balances perceptual fidelity and semantic consistency. Experiments demonstrate state-of-the-art performance, with ablations showing effective hallucination suppression and a 15.8% relative word error rate reduction. Audio samples are available at: https://delaygse.github.io/.
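The abstract states that the codebooks are modeled "in a delayed manner" but this page does not spell out the exact pattern. The sketch below illustrates one common choice, the delay pattern popularized by MusicGen (reference 1), where codebook k is shifted right by k steps so that coarser codebooks are predicted before finer ones at each frame. Whether DelayGSE uses exactly this offset scheme is an assumption; the function names and the `PAD` placeholder are hypothetical.

```python
from typing import List, Optional

# Hypothetical placeholder for positions where a codebook has no token yet.
PAD: Optional[int] = None


def apply_delay(codes: List[List[int]]) -> List[List[Optional[int]]]:
    """Shift codebook k right by k steps (assumed MusicGen-style delay).

    codes[k][t] is the token of codebook k at frame t.
    The result has num_frames + num_codebooks - 1 steps per codebook.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [PAD] * k
        right = [PAD] * (total_steps - k - num_frames)
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay(delayed: List[List[Optional[int]]], num_frames: int) -> List[List[Optional[int]]]:
    """Remove the per-codebook shift to recover frame-aligned tokens."""
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Toy example: 3 codebooks, 3 frames.
codes = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay(codes)
restored = undo_delay(delayed, num_frames=3)

for row in delayed:
    print(row)
```

With this layout, a model that emits one step at a time produces the first codebook of frame t together with the second codebook of frame t-1, and so on, which is one way to keep a single autoregressive pass over all codebooks.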

Audio Samples

Ablation Study of DelayGSE Variants Compared with Representative Baseline Models

DGSE-EW: DelayGSE with equal-weight training, where all RVQ codebooks share the same cross-entropy (CE) loss weight.

DGSE-IW: DelayGSE with importance-weighted training, using the scheme in Section 2.5 of the paper.

DGSE-IW+T: DelayGSE with importance-weighted training and delayed text-aware supervision.

DGSE-IW+TG: DelayGSE with importance-weighted training and delayed text-aware supervision, with inference conditioned on the ground-truth text.
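The difference between the equal-weight and importance-weighted variants can be sketched as a weighted average of per-codebook cross-entropy losses. This is a minimal illustration only: the actual weighting scheme is defined in Section 2.5 of the paper and is not reproduced on this page, so the weight values below are hypothetical, as are the function names. For brevity the sketch scores a single frame per codebook.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one prediction, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def weighted_codebook_loss(
    logits_per_codebook: List[Sequence[float]],
    targets: List[int],
    weights: List[float],
) -> float:
    """Weighted average of per-codebook CE losses (weights are normalized)."""
    if not (len(logits_per_codebook) == len(targets) == len(weights)):
        raise ValueError("need one target and one weight per codebook")
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_codebook, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy logits for 3 codebooks with a vocabulary of 3 tokens each.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
targets = [0, 1, 2]

equal_weights = [1.0, 1.0, 1.0]       # DGSE-EW style: every codebook counts the same
importance_weights = [4.0, 2.0, 1.0]  # hypothetical: emphasize earlier (coarser) codebooks

loss_ew = weighted_codebook_loss(logits, targets, equal_weights)
loss_iw = weighted_codebook_loss(logits, targets, importance_weights)
print(f"equal-weight loss: {loss_ew:.4f}")
print(f"importance-weighted loss: {loss_iw:.4f}")
```

In this toy case the first codebook is predicted best, so shifting weight toward it lowers the combined loss; during training, the same reweighting would shift gradient emphasis toward whichever codebooks receive the larger weights.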

CHIME3 (F05_440C020C_CAF.CH1) : "The rise in that category in July was led by increased orders for aircraft and parts, nonelectrical machinery, lumber and furniture."
Noise Input DelayGSE-EW DelayGSE-IW DelayGSE-IW+T DelayGSE-IW+TG
GAN-based Storm FlowSE LLaSE-G1
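Internal (Exp3) : "我在右侧五米处,一 糖比鸡蛋还贵,咖啡架对沙发来说太高了。船在陡峭的礁石上被撞得四分五裂,我们试图把硬币放回原处但没有成功。" (English: "I am five meters to the right; sugar is more expensive than eggs; the coffee stand is too high for the sofa. The ship was smashed to pieces on the steep reef; we tried to put the coin back in place but did not succeed.")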
Noise Input DelayGSE-EW DelayGSE-IW DelayGSE-IW+T DelayGSE-IW+TG
GAN-based Storm FlowSE LLaSE-G1

Benchmark comparison of open-source generative models and DelayGSE.

* DelayGSE trained with importance-weighted learning achieves the highest perceptual quality, so the DelayGSE audio in this section is generated by DelayGSE-IW. The audio in the ablation study is provided for comparing the different DelayGSE variants.

「DNS Challenge」

DNS Challenge (Without Reverb | clnsp327_air_conditioner_371242_0_snr7_tl-21_fileid_8) : "Logical order. Non-profit organizations have frequent fundraisers. The most recent geological survey found seismic activity. Corrie attacked the project with extra." (transcribed using Whisper)
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram
DNS Challenge (With Reverb | clnsp81_car_74675_2_snr18_tl-22_fileid_9) : "Doctor was in the ambulance with the patient. Puree some fruit before preparing the skewers. It's not easy to create illuminating." (transcribed using Whisper)
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram
DNS Challenge (Real Recording | ms_realrec_speakerphone_Senja_munching-01_SurfaceBook) : "She had your dark suit and greasy wash water all year." (transcribed using Whisper)
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram

「URGENT 2025」

URGENT 2025 (English) : "That wall in the living room is white. There is one more piece of bread in the pantry. The store closes at 8pm tonight."
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram
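URGENT 2025 (Chinese) : "谢拉维斯塔庄园是位于美国亚利桑那州科奇斯县的一个非建制地区。" (English: "Sierra Vista Estates is an unincorporated area in Cochise County, Arizona, United States.")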
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram

「VCTK-DEMAND」

VCTK-DEMAND (p232_015) : "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain."
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram
VCTK-DEMAND (p232_013) : "Some have accepted it as a miracle without physical explanation."
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram

「Internal」

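Internal (Exp1): "床前明月光,疑是地上霜,举头望明月,低头思故乡。你就是那个爱打篮球的人,总理对任何事情都要刨根问底。" (English: "Moonlight shines before my bed; I take it for frost on the ground. I raise my head to gaze at the bright moon, then lower it and think of home. You are the one who loves playing basketball; the premier gets to the bottom of everything.")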
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram
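Internal (Exp2): "你就是那个爱打篮球的人,总理对任何事情都要刨根问底,渐渐的他还真就睡着了,这身衣服就像被大雨淋过似的。" (English: "You are the one who loves playing basketball; the premier gets to the bottom of everything; gradually he really did fall asleep; these clothes look as if they had been soaked by heavy rain.")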
Noise Input GAN-based Storm FlowSE LLaSE-G1 DelayGSE-IW (ours)
Noise Input Spectrogram
GAN-based Spectrogram
Storm Spectrogram
FlowSE Spectrogram
LLaSE-G1 Spectrogram
DelayGSE-IW Spectrogram

References

  1. Jade Copet et al. MusicGen: Simple and controllable music generation. In NeurIPS, 2023. https://arxiv.org/pdf/2306.05284
  2. Alexandre Défossez et al. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint, 2024. https://arxiv.org/pdf/2410.00037
  3. Rithesh Kumar et al. DAC: High-fidelity audio compression with improved RVQGAN. In NeurIPS, 2023. https://github.com/descriptinc/descript-audio-codec
  4. Yoach Lacombe et al. Parler-TTS. GitHub repository, 2024. https://github.com/huggingface/parler-tts
  5. Boyi Kang et al. LLaSE-G1. In ACL, 2025. https://github.com/Kevin-naticl/LLaSE-G1
  6. Lemercier et al. StoRM. In TASLP, 2023. https://github.com/sp-uhh/storm
  7. Ziqian Wang et al. FlowSE. In Interspeech, 2025. https://github.com/Honee-W/FlowSE