Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhihan Li

Tsinghua University

Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency

May 28, 2025

Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu

Abstract:Neural speech codecs excel in reconstructing clean speech signals; however, their efficacy in complex acoustic environments and downstream signal processing tasks remains underexplored. In this study, we introduce a novel benchmark named Environment-Resilient Speech Codec Benchmark (ERSB) to systematically evaluate whether neural speech codecs are environment-resilient. Specifically, we assess two key capabilities: (1) robust reconstruction, which measures the preservation of both speech and non-speech acoustic details, and (2) downstream task consistency, which ensures minimal deviation in downstream signal processing tasks when using reconstructed speech instead of the original. Our comprehensive experiments reveal that complex acoustic environments significantly degrade signal reconstruction and downstream task consistency. This work highlights the limitations of current speech codecs and raises a future direction that improves them for greater environmental resilience.

* Initial Upload

Via

Access Paper or Ask Questions

Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

May 22, 2025

Hanglei Zhang, Yiwei Guo, Zhihan Li, Xiang Hao, Xie Chen, Kai Yu

Abstract:Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR not optimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility, and maintains competitive performance even at lower frame rates. Our approach is promising for the integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

Recent Advances in Discrete Speech Tokens: A Review

Feb 10, 2025

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu

Figure 1 for Recent Advances in Discrete Speech Tokens: A Review

Figure 2 for Recent Advances in Discrete Speech Tokens: A Review

Figure 3 for Recent Advances in Discrete Speech Tokens: A Review

Figure 4 for Recent Advances in Discrete Speech Tokens: A Review

Abstract:The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.

* 26 pages, 8 figures, 3 tables. Work in progress

Via

Access Paper or Ask Questions

Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Dec 22, 2024

Hankun Wang, Haoran Wang, Yiwei Guo, Zhihan Li, Chenpeng Du, Xie Chen, Kai Yu

Figure 1 for Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Figure 2 for Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Figure 3 for Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Figure 4 for Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective

Abstract:Although text-based large language models exhibit human-level writing ability and remarkable intelligence, speech language models (SLMs) still struggle to generate semantically coherent outputs. There are several potential reasons for this performance degradation: (A) speech tokens mainly provide phonetic information rather than semantic information, (B) the length of speech sequences is much longer than that of text sequences, and (C) paralinguistic information, such as prosody, introduces additional complexity and variability. In this paper, we explore the influence of three key factors separately by transiting the modality from text to speech in an evolving manner. Our findings reveal that the impact of the three factors varies. Factor A has a relatively minor impact, factor B influences syntactical and semantic modeling more obviously, and factor C exerts the most significant impact, particularly in the basic lexical modeling. Based on these findings, we provide insights into the unique challenges of training SLMs and highlight pathways to develop more effective end-to-end SLMs.

Via

Access Paper or Ask Questions

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Oct 21, 2024

Yiwei Guo, Zhihan Li, Chenpeng Du, Hankun Wang, Xie Chen, Kai Yu

Abstract:Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a three-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. By reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and smaller vocabulary size than baselines. The 25Hz version of LSCodec also achieves the lowest bitrate (0.25kbps) of codecs so far with decent quality. Voice conversion evaluations prove the satisfactory speaker disentanglement of LSCodec, and ablation study further verifies the effectiveness of the proposed training framework.

* 5 pages, 2 figures, 4 tables. Submitted to ICASSP 2025. Demo page: https://cantabile-kwok.github.io/LSCodec/

Via

Access Paper or Ask Questions

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Sep 03, 2024

Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

Figure 1 for vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Figure 2 for vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Figure 3 for vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Figure 4 for vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Abstract:We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech, and treat VC as a prompted vocoding task. To amend the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes the WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Also, no supervised data is required for vec2wav 2.0 to be effectively trained. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines to a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects made by the proposed techniques. Moreover, vec2wav 2.0 achieves competitive cross-lingual VC even only trained on monolingual corpus. Thus, vec2wav 2.0 shows timbre can potentially be manipulated only by speech token vocoders, pushing the frontiers of VC and speech synthesis.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

Revisiting VAE for Unsupervised Time Series Anomaly Detection: A Frequency Perspective

Feb 05, 2024

Zexin Wang, Changhua Pei, Minghua Ma, Xin Wang, Zhihan Li, Dan Pei, Saravan Rajmohan, Dongmei Zhang, Qingwei Lin, Haiming Zhang(+2 more)

Abstract:Time series Anomaly Detection (AD) plays a crucial role for web systems. Various web systems rely on time series data to monitor and identify anomalies in real time, as well as to initiate diagnosis and remediation procedures. Variational Autoencoders (VAEs) have gained popularity in recent decades due to their superior de-noising capabilities, which are useful for anomaly detection. However, our study reveals that VAE-based methods face challenges in capturing long-periodic heterogeneous patterns and detailed short-periodic trends simultaneously. To address these challenges, we propose Frequency-enhanced Conditional Variational Autoencoder (FCVAE), a novel unsupervised AD method for univariate time series. To ensure an accurate AD, FCVAE exploits an innovative approach to concurrently integrate both the global and local frequency features into the condition of Conditional Variational Autoencoder (CVAE) to significantly increase the accuracy of reconstructing the normal data. Together with a carefully designed "target attention" mechanism, our approach allows the model to pick the most useful information from the frequency domain for better short-periodic trend construction. Our FCVAE has been evaluated on public datasets and a large-scale cloud system, and the results demonstrate that it outperforms state-of-the-art methods. This confirms the practical applicability of our approach in addressing the limitations of current VAE-based anomaly detection models.

* WWW 2024

Via

Access Paper or Ask Questions

Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Aug 25, 2023

Haotian Si, Changhua Pei, Zhihan Li, Yadong Zhao, Jingjing Li, Haiming Zhang, Zulong Diao, Jianhui Li, Gaogang Xie, Dan Pei

Figure 1 for Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Figure 2 for Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Figure 3 for Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Figure 4 for Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

Abstract:Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics' regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics' regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.

* 11 pages, ESEC/FSE industry track 2023

Via

Access Paper or Ask Questions

Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance

Mar 06, 2023

Bixing Yan, Shaoling Chen, Yuxuan He, Zhihan Li

Figure 1 for Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance

Figure 2 for Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance

Figure 3 for Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance

Figure 4 for Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in Finance

Abstract:Natural language understanding(NLU) is challenging for finance due to the lack of annotated data and the specialized language in that domain. As a result, researchers have proposed to use pre-trained language model and multi-task learning to learn robust representations. However, aggressive fine-tuning often causes over-fitting and multi-task learning may favor tasks with significantly larger amounts data, etc. To address these problems, in this paper, we investigate model-agnostic meta-learning algorithm(MAML) in low-resource financial NLU tasks. Our contribution includes: 1. we explore the performance of MAML method with multiple types of tasks: GLUE datasets, SNLI, Sci-Tail and Financial PhraseBank; 2. we study the performance of MAML method with multiple single-type tasks: a real scenario stock price prediction problem with twitter text data. Our models achieve the state-of-the-art performance according to the experimental results, which demonstrate that our method can adapt fast and well to low-resource situations.

* 13 pages, 6 figures, 8 tables

Via

Access Paper or Ask Questions

Feature Space Singularity for Out-of-Distribution Detection

Dec 16, 2020

Haiwen Huang, Zhihan Li, Lulu Wang, Sishuo Chen, Bin Dong, Xinyu Zhou

Figure 1 for Feature Space Singularity for Out-of-Distribution Detection

Figure 2 for Feature Space Singularity for Out-of-Distribution Detection

Figure 3 for Feature Space Singularity for Out-of-Distribution Detection

Figure 4 for Feature Space Singularity for Out-of-Distribution Detection

Abstract:Out-of-Distribution (OoD) detection is important for building safe artificial intelligence systems. However, current OoD detection methods still cannot meet the performance requirements for practical deployment. In this paper, we propose a simple yet effective algorithm based on a novel observation: in a trained neural network, OoD samples with bounded norms well concentrate in the feature space. We call the center of OoD features the Feature Space Singularity (FSS), and denote the distance of a sample feature to FSS as FSSD. Then, OoD samples can be identified by taking a threshold on the FSSD. Our analysis of the phenomenon reveals why our algorithm works. We demonstrate that our algorithm achieves state-of-the-art performance on various OoD detection benchmarks. Besides, FSSD also enjoys robustness to slight corruption in test data and can be further enhanced by ensembling. These make FSSD a promising algorithm to be employed in real world. We release our code at \url{https://github.com/megvii-research/FSSD_OoD_Detection}.

* 10 pages, 7 figures

Via

Access Paper or Ask Questions