DelayGSE: A Generative Speech Enhancement Framework with Delayed Text-Aware Conditioning

Accepted at INTERSPEECH 2026

Xin Yuan¹, Junling Lv¹, Zezhou Xu², Xingjun Tan¹, Liangliang Li¹, Yanqiang Lei¹*

¹Guangzhou Shiyuan Electronic Technology Company Limited, Guangzhou, China
²Shanghai University of Finance and Economics, Shanghai, China

Abstract

Recent generative speech enhancement methods based on language and diffusion models achieve strong perceptual quality but are more susceptible than discriminative approaches to speech-like hallucinations under low SNRs and transient noise. We propose DelayGSE, a text-aware generative speech enhancement framework built on a multi-codebook language model for denoising, dereverberation, and audio super-resolution. DelayGSE conditions on noisy-speech STFT features and Whisper encoder representations, and models multiple discrete codebooks in a delayed manner to stabilize generation. A text-aware mechanism suppresses hallucinations, while an importance-aware codebook weighting strategy balances perceptual fidelity and semantic consistency. Experiments demonstrate state-of-the-art performance, with ablations showing effective hallucination suppression and a 15.8% relative word error rate reduction. Audio samples are available at: https://delaygse.github.io/.
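For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate (WER) and a relative WER reduction are typically computed. It assumes whitespace tokenization and plain word-level edit distance; the function names, the example sentences, and the 0.20 and 0.17 values are hypothetical illustrations, not results or tooling from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the WER drops relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six reference words.
wer = word_error_rate(
    "the store closes at eight tonight",
    "the store close at tonight",
)
# Hypothetical WER values, chosen only to illustrate the arithmetic.
reduction = relative_reduction(0.20, 0.17)
```

A relative reduction is measured against the baseline WER, so the same absolute improvement counts for more when the baseline is already low.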

DelayGSE framework. An STFT acoustic encoder and a pretrained Whisper semantic encoder condition an autoregressive Transformer that predicts RVQ tokens. We employ (i) codebook-level delay (stride 1, with lower codebooks conditioning higher ones) and (ii) text-first delayed text-aware generation (text tokens are predicted first and acoustic tokens are delayed by k) to suppress hallucinations. The illustration shows a 4-codebook example for clarity.
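The codebook-level delay in the caption can be sketched as a simple shift of the RVQ token grid, in the style of the delay pattern used by MusicGen. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `PAD` placeholder id, and the toy token values are all hypothetical.

```python
import numpy as np

PAD = -1  # hypothetical placeholder id for positions that hold no token


def apply_codebook_delay(tokens: np.ndarray, stride: int = 1, pad: int = PAD) -> np.ndarray:
    """Shift codebook q to the right by q * stride steps.

    tokens has shape (num_codebooks, num_frames). After the shift, the token of
    codebook q for frame t sits at step t + q * stride, so when the model predicts
    it, the lower codebooks of the same frame have already been generated.
    """
    num_codebooks, num_frames = tokens.shape
    total_steps = num_frames + (num_codebooks - 1) * stride
    delayed = np.full((num_codebooks, total_steps), pad, dtype=tokens.dtype)
    for q in range(num_codebooks):
        offset = q * stride
        delayed[q, offset:offset + num_frames] = tokens[q]
    return delayed


def undo_codebook_delay(delayed: np.ndarray, num_frames: int, stride: int = 1) -> np.ndarray:
    """Invert apply_codebook_delay and recover the frame-aligned token grid."""
    num_codebooks = delayed.shape[0]
    rows = [delayed[q, q * stride:q * stride + num_frames] for q in range(num_codebooks)]
    return np.stack(rows)


# Toy 4-codebook example with 5 frames; the token ids are arbitrary.
tokens = np.arange(4 * 5).reshape(4, 5)
delayed = apply_codebook_delay(tokens)          # shape (4, 8)
restored = undo_codebook_delay(delayed, num_frames=5)
```

Under the same assumptions, the text-first variant would add a further constant offset k to every acoustic row so that text tokens come first; that offset is not shown here.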

Illustration of how the importance of each RVQ codebook layer is estimated from perceptual quality, intelligibility, and speaker similarity metrics; the resulting importance estimates are then used to weight the LLM training loss (CB denotes codebook).
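One way to picture importance-weighted training is a per-codebook cross-entropy whose terms are scaled by normalized importance scores. The sketch below is an assumption-laden illustration, not the paper's scheme: it assumes the importance scores are non-negative and are normalized to sum to one, and the function names and score values are hypothetical.

```python
import numpy as np


def importance_to_weights(importance) -> np.ndarray:
    """Normalize non-negative per-codebook importance scores to sum to one."""
    scores = np.asarray(importance, dtype=np.float64)
    if np.any(scores < 0) or scores.sum() == 0:
        raise ValueError("importance scores must be non-negative and not all zero")
    return scores / scores.sum()


def weighted_codebook_ce(logits: np.ndarray, targets: np.ndarray, weights: np.ndarray) -> float:
    """Weighted sum of per-codebook mean cross-entropy.

    logits:  (num_codebooks, num_steps, vocab_size) unnormalized scores
    targets: (num_codebooks, num_steps) integer token ids
    weights: (num_codebooks,) loss weight of each codebook
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target token at every position.
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)
    per_codebook = nll.mean(axis=1)
    return float((weights * per_codebook).sum())


# Hypothetical importance scores for 4 codebooks (not values from the paper).
weights = importance_to_weights([4.0, 2.0, 1.0, 1.0])

# Uniform logits over a vocabulary of 8 give a cross-entropy of log(8) per codebook.
logits = np.zeros((4, 6, 8))
targets = np.zeros((4, 6), dtype=np.int64)
loss = weighted_codebook_ce(logits, targets, weights)
```

Setting all importance scores equal recovers the equal-weight baseline up to a constant factor, which makes the two training variants easy to compare in one code path.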

Audio Samples

Ablation Study of DelayGSE Variants Compared with Representative Baseline Models

• DGSE-EW: DelayGSE with equal-weight training, where all RVQ codebooks share equal cross-entropy (CE) loss weights.

• DGSE-IW: DelayGSE with importance-weighted training, using the scheme in Section 2.5 of the paper.

• DGSE-IW+T: DelayGSE with importance-weighted training and delayed text-aware supervision.

• DGSE-IW+TG: DelayGSE with importance-weighted training and delayed text-aware supervision, with inference conditioned on ground-truth text.

Noise Input	DelayGSE-EW	DelayGSE-IW	DelayGSE-IW+T	DelayGSE-IW+TG
CHIME3 (F05_440C020C_CAF.CH1) : "The rise in that category in July was led by increased orders for aircraft and parts, nonelectrical machinery, lumber and furniture."

	GAN-based	Storm	FlowSE	LLaSE-G1

Noise Input	DelayGSE-EW	DelayGSE-IW	DelayGSE-IW+T	DelayGSE-IW+TG
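Internal (Exp3) : "我在右侧五米处，一磅糖比鸡蛋还贵，咖啡架对沙发来说太高了。船在陡峭的礁石上被撞得四分五裂，我们试图把硬币放回原处但没有成功。" (English: "I am five meters to the right; a pound of sugar costs more than eggs; the coffee stand is too high for the sofa. The ship was smashed to pieces on the steep reef; we tried to put the coin back in place but did not succeed.")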

	GAN-based	Storm	FlowSE	LLaSE-G1

Benchmark comparison of open-source generative models and DelayGSE.

「DNS Challenge」

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
DNS Challenge (Without Reverb | clnsp327_air_conditioner_371242_0_snr7_tl-21_fileid_8) : "Logical order. Non-profit organizations have frequent fundraisers. The most recent geological survey found seismic activity. Corrie attacked the project with extra." (transcribed using Whisper)

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
DNS Challenge (With Reverb | clnsp81_car_74675_2_snr18_tl-22_fileid_9) : "Doctor was in the ambulance with the patient. Puree some fruit before preparing the skewers. It's not easy to create illuminating." (transcribed using Whisper)

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
DNS Challenge (Real Recording | ms_realrec_speakerphone_Senja_munching-01_SurfaceBook) : "She had your dark suit and greasy wash water all year." (transcribed using Whisper)

「URGENT 2025」

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
URGENT 2025 (English) : "That wall in the living room is white. There is one more piece of bread in the pantry. The store closes at 8pm tonight."

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
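URGENT 2025 (Chinese) : "谢拉维斯塔庄园是位于美国亚利桑那州科奇斯县的一个非建制地区。" (English: "Sierra Vista Estates is an unincorporated area in Cochise County, Arizona, United States.")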

「VCTK-DEMAND」

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
VCTK-DEMAND (p232_015) : "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain."

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
VCTK-DEMAND (p232_013) : "Some have accepted it as a miracle without physical explanation."

「Internal」

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
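Internal (Exp1): "床前明月光，疑是地上霜，举头望明月，低头思故乡。你就是那个爱打篮球的人，总理对任何事情都要刨根问底。" (English: "Bright moonlight before my bed, I take it for frost on the ground; I raise my head to gaze at the bright moon, and lower it thinking of home. You are that person who loves playing basketball; the premier gets to the bottom of everything.")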

Noise Input	GAN-based	Storm	FlowSE	LLaSE-G1	DelayGSE-IW (ours)
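Internal (Exp2): "你就是那个爱打篮球的人，总理对任何事情都要刨根问底，渐渐的他还真就睡着了，这身衣服就像被大雨淋过似的。" (English: "You are that person who loves playing basketball; the premier gets to the bottom of everything; gradually he really did fall asleep; these clothes look as if they had been drenched by heavy rain.")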

References

Jade Copet et al. MusicGen: Simple and controllable music generation. In NeurIPS, 2023, https://arxiv.org/pdf/2306.05284.
Alexandre Défossez et al. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint, 2024, https://arxiv.org/pdf/2410.00037.
Rithesh Kumar et al. DAC: High-fidelity audio compression with improved RVQGAN. In NeurIPS, 2023, https://github.com/descriptinc/descript-audio-codec.
Yoach Lacombe et al. Parler-TTS. GitHub repository, 2024, https://github.com/huggingface/parler-tts.
Boyi Kang et al. LLaSE-G1. In ACL, 2025, https://github.com/Kevin-naticl/LLaSE-G1.
Jean-Marie Lemercier et al. StoRM. In TASLP, 2023, https://github.com/sp-uhh/storm.
Ziqian Wang et al. FlowSE. In Interspeech, 2025, https://github.com/Honee-W/FlowSE.