Recent generative speech enhancement methods based on language and diffusion models achieve strong perceptual quality but are more susceptible than discriminative approaches to speech-like hallucinations under low SNRs and transient noise. We propose DelayGSE, a text-aware generative speech enhancement framework built on a multi-codebook language model for denoising, dereverberation, and audio super-resolution. DelayGSE conditions on noisy-speech STFT features and Whisper encoder representations, and models multiple discrete codebooks in a delayed manner to stabilize generation. A text-aware mechanism suppresses hallucinations, while an importance-aware codebook weighting strategy balances perceptual fidelity and semantic consistency. Experiments demonstrate state-of-the-art performance, with ablations showing effective hallucination suppression and a 15.8% relative word error rate reduction. Audio samples are available at: https://delaygse.github.io/.
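The abstract says only that the codebooks are modeled "in a delayed manner". One common form of that idea is a one-step-per-codebook offset, sketched below so the shape of the token grid is concrete. The offset rule, the `PAD` value, and the function names `apply_delay` and `undo_delay` are illustrative assumptions, not details taken from DelayGSE.

```python
from typing import List

PAD = -1  # hypothetical padding id for positions that hold no token


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps (assumed delay rule).

    codes[k][t] is the token of codebook k at frame t.
    The result has num_frames + num_codebooks - 1 columns.
    """
    num_codebooks = len(codes)
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k                          # codebook k starts k steps late
        right = [pad] * (num_codebooks - 1 - k)   # pad so all rows share one length
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay by slicing each row back to its original frames."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay(codes)
# delayed == [[1, 2, 3, -1, -1], [-1, 4, 5, 6, -1], [-1, -1, 7, 8, 9]]
```

With this layout, the model at each step predicts codebook 0 for the current frame and codebook k for the frame k steps earlier, so coarser codebooks are already available when finer ones are generated.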
Ablation Study of DelayGSE Variants Compared with Representative Baseline Models
• DGSE-EW: DelayGSE with equal-weight training, where all RVQ codebooks share the same CE loss weight.
• DGSE-IW: DelayGSE with importance-weighted training, using the scheme in Section 2.5.
• DGSE-IW+T: DelayGSE with importance-weighted training and delayed text-aware supervision.
• DGSE-IW+TG: DelayGSE with importance-weighted training and delayed text-aware supervision, with inference conditioned on the ground-truth text.
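The difference between equal-weight and importance-weighted training can be sketched as a weighted sum of per-codebook CE losses. This is a minimal illustration for a single frame: the weights below are made up, the normalization by the weight sum is an assumption, and the actual scheme is the one in Section 2.5 of the paper, which is not reproduced on this page.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """CE of one prediction, computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def weighted_codebook_loss(
    logits: Sequence[Sequence[float]],  # logits[k]: scores over the vocabulary of codebook k
    targets: Sequence[int],             # targets[k]: ground-truth token of codebook k
    weights: Sequence[float],           # weights[k]: loss weight of codebook k
) -> float:
    """Weighted average of per-codebook CE losses (normalization is an assumption)."""
    total = sum(w * cross_entropy(l, t) for w, l, t in zip(weights, logits, targets))
    return total / sum(weights)


logits = [[0.0, 0.0], [10.0, 0.0]]  # codebook 0 is uncertain, codebook 1 is confident
targets = [0, 0]

equal = weighted_codebook_loss(logits, targets, [1.0, 1.0])       # EW-style weights
importance = weighted_codebook_loss(logits, targets, [3.0, 1.0])  # hypothetical IW-style weights
```

Here the hypothetical importance weights emphasize the first codebook, so an error there contributes more to the loss than under equal weighting.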

CHIME3 (F05_440C020C_CAF.CH1): "The rise in that category in July was led by increased orders for aircraft and parts, nonelectrical machinery, lumber and furniture."

| Noise Input | DelayGSE-EW | DelayGSE-IW | DelayGSE-IW+T | DelayGSE-IW+TG |
|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] |

| GAN-based | Storm | FlowSE | LLaSE-G1 |
|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] |

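Internal (Exp3): "我在右侧五米处,一磅糖比鸡蛋还贵,咖啡架对沙发来说太高了。船在陡峭的礁石上被撞得四分五裂,我们试图把硬币放回原处但没有成功。" (English: "I am five meters to the right. A pound of sugar costs more than eggs. The coffee stand is too high for the couch. The ship was smashed to pieces on the steep rocks. We tried to put the coin back but failed.")

| Noise Input | DelayGSE-EW | DelayGSE-IW | DelayGSE-IW+T | DelayGSE-IW+TG |
|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] |

| GAN-based | Storm | FlowSE | LLaSE-G1 |
|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] |
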
Benchmark comparison of open-source generative models and DelayGSE.
DNS Challenge (Without Reverb, clnsp327_air_conditioner_371242_0_snr7_tl-21_fileid_8): "Logical order. Non-profit organizations have frequent fundraisers. The most recent geological survey found seismic activity. Corrie attacked the project with extra." (transcribed using Whisper)

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

DNS Challenge (With Reverb, clnsp81_car_74675_2_snr18_tl-22_fileid_9): "Doctor was in the ambulance with the patient. Puree some fruit before preparing the skewers. It's not easy to create illuminating." (transcribed using Whisper)

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

DNS Challenge (Real Recording, ms_realrec_speakerphone_Senja_munching-01_SurfaceBook): "She had your dark suit and greasy wash water all year." (transcribed using Whisper)

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

URGENT 2025 (English): "That wall in the living room is white. There is one more piece of bread in the pantry. The store closes at 8pm tonight."

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

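URGENT 2025 (Chinese): "谢拉维斯塔庄园是位于美国亚利桑那州科奇斯县的一个非建制地区。" (English: "Sierra Vista Estates is an unincorporated area in Cochise County, Arizona, United States.")

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
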
VCTK-DEMAND (p232_015): "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain."

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

VCTK-DEMAND (p232_013): "Some have accepted it as a miracle without physical explanation."

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |

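Internal (Exp1): "床前明月光,疑是地上霜,举头望明月,低头思故乡。你就是那个爱打篮球的人,总理对任何事情都要刨根问底。" (English: "Bright moonlight before my bed, I take it for frost on the ground. I raise my head to gaze at the bright moon, and lower it thinking of home. You are the one who loves playing basketball. The premier insists on getting to the bottom of everything.")

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |
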
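Internal (Exp2): "你就是那个爱打篮球的人,总理对任何事情都要刨根问底,渐渐的他还真就睡着了,这身衣服就像被大雨淋过似的。" (English: "You are the one who loves playing basketball. The premier insists on getting to the bottom of everything. Gradually he really did fall asleep. These clothes look as if they had been soaked by heavy rain.")

| Noise Input | GAN-based | Storm | FlowSE | LLaSE-G1 | DelayGSE-IW (ours) |
|---|---|---|---|---|---|
| [audio] | [audio] | [audio] | [audio] | [audio] | [audio] |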