Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shiliang Zhang

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Dec 23, 2023

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen

Figure 1 for emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Figure 2 for emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Figure 3 for emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Figure 4 for emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Abstract:We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining utterance-level loss and frame-level loss during pre-training. emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models by only training linear layers for the speech emotion recognition task on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements among 10 different languages of speech emotion recognition datasets. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model in various emotion-related tasks, filling a gap in the field.

* Code, checkpoints, and extracted features are available at https://github.com/ddlBoJack/emotion2vec

Via

Access Paper or Ask Questions

Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

Dec 19, 2023

Lingyun Zuo, Keyu An, Shiliang Zhang, Zhijie Yan

Abstract:In a speech recognition system, voice activity detection (VAD) is a crucial frontend module. Addressing the issues of poor noise robustness in traditional binary VAD systems based on DFSMN, the paper further proposes semantic VAD based on multi-task learning with improved models for real-time and offline systems, to meet specific application requirements. Evaluations on internal datasets show that, compared to the real-time VAD system based on DFSMN, the real-time semantic VAD system based on RWKV achieves relative decreases in CER of 7.0\%, DCF of 26.1\% and relative improvement in NRR of 19.2\%. Similarly, when compared to the offline VAD system based on DFSMN, the offline VAD system based on SAN-M demonstrates relative decreases in CER of 4.4\%, DCF of 18.6\% and relative improvement in NRR of 3.5\%.

Via

Access Paper or Ask Questions

Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Dec 14, 2023

Fan Yu, Haoxu Wang, Ziyang Ma, Shiliang Zhang

Figure 1 for Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Figure 2 for Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Figure 3 for Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Figure 4 for Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Abstract:Recently audio-visual speech recognition (AVSR), which better leverages video modality as additional information to extend automatic speech recognition (ASR), has shown promising results in complex acoustic environments. However, there is still substantial space to improve as complex computation of visual modules and ineffective fusion of audio-visual modalities. To eliminate these drawbacks, we propose a down-up sampling-based AVSR model (Hourglass-AVSR) to enjoy high efficiency and performance, whose time length is scaled during the intermediate processing, resembling an hourglass. Firstly, we propose a context and residual aware video upsampling approach to improve the recognition performance, which utilizes contextual information from visual representations and captures residual information between adjacent video frames. Secondly, we introduce a visual-audio alignment approach during the upsampling by explicitly incorporating boundary constraint loss. Besides, we propose a cross-layer attention fusion to capture the modality dependencies within each visual encoder layer. Experiments conducted on the MISP-AVSR dataset reveal that our proposed Hourglass-AVSR model outperforms ASR model by 12.9% and 20.8% relative concatenated minimum permutation character error rate (cpCER) reduction on far-field and middle-field test sets, respectively. Moreover, compared to other state-of-the-art AVSR models, our model exhibits the highest improvement in cpCER for the visual module. Furthermore, on the benefit of our down-up sampling approach, Hourglass-AVSR model reduces 54.2% overall computation costs with minor performance degradation.

* Accepted by ICASSP 2024

Via

Access Paper or Ask Questions

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Nov 14, 2023

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, Jingren Zhou

Figure 1 for Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Figure 2 for Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Figure 3 for Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Figure 4 for Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Abstract:Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

* The code is released at https://github.com/QwenLM/Qwen-Audio

Via

Access Paper or Ask Questions

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

Nov 08, 2023

Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang

Figure 1 for Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

Figure 2 for Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

Figure 3 for Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

Figure 4 for Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

Abstract:Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on speech tasks. These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary. Then they train a single decoder-only Transformer on a mixture of speech tasks. Specifically, all these models utilize Loss Masking on the input speech tokens for the ASR task, which means that these models do not explicitly model the dependency between the speech tokens. In this paper, we attempt to model the sequence of speech tokens in an autoregressive manner like text. However, we find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over Loss Masking. Therefore, we propose a novel approach denoted Smoothed Label Distillation (SLD), which introduces a KL divergence loss with smoothed labels on the input speech tokens to effectively model speech tokens. Experiments demonstrate that our SLD approach alleviates the limitations of the cross-entropy loss and consistently outperforms Loss Masking for decoder-only Transformer based ASR using different speech discretization methods.

* 5 pages, submitted to ICASSP 2024

Via

Access Paper or Ask Questions

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Oct 11, 2023

Jiaming Wang, Zhihao Du, Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma(+5 more)

Figure 1 for LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Figure 2 for LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Figure 3 for LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Figure 4 for LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Abstract:Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks. However, there has been limited research on applying similar frameworks to audio tasks. Previously proposed large language models for audio tasks either lack sufficient quantitative evaluations, or are limited to tasks for recognizing and understanding audio content, or significantly underperform existing state-of-the-art (SOTA) models. In this paper, we propose LauraGPT, a unified GPT model for audio recognition, understanding, and generation. LauraGPT is a versatile language model that can process both audio and text inputs and generate outputs in either modalities. It can perform a wide range of tasks related to content, semantics, paralinguistics, and audio-signal analysis. Some of its noteworthy tasks include automatic speech recognition, speech-to-text translation, text-to-speech synthesis, machine translation, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding. To achieve this goal, we use a combination of continuous and discrete features for audio. We encode input audio into continuous representations using an audio encoder and decode output audio from discrete codec codes. We then fine-tune a large decoder-only Transformer-based language model on multiple audio-to-text, text-to-audio, audio-to-audio, and text-to-text tasks using a supervised multitask learning approach. Extensive experiments show that LauraGPT achieves competitive or superior performance compared to existing SOTA models on various audio processing benchmarks.

* 10 pages, under review

Via

Access Paper or Ask Questions

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Oct 07, 2023

Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

Figure 1 for SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Figure 2 for SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Figure 3 for SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Figure 4 for SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Abstract:Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR).Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model Paraformer as an acoustic model in the SA-ASR model.Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.

Via

Access Paper or Ask Questions

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Oct 01, 2023

Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang

Figure 1 for Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Figure 2 for Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Figure 3 for Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Figure 4 for Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Abstract:Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in many vision-language tasks. Nevertheless, most MLLMs still lack the Referential Comprehension (RC) ability to identify a specific object or area in images, limiting their application in fine-grained perception tasks. This paper proposes a novel method to enhance the RC capability for MLLMs. Our model represents the referring object in the image using the coordinates of its bounding box and converts the coordinates into texts in a specific format. This allows the model to treat the coordinates as natural language. Moreover, we construct the instruction tuning dataset with various designed RC tasks at a low cost by unleashing the potential of annotations in existing datasets. To further boost the RC ability of the model, we propose a self-consistent bootstrapping method that extends dense object annotations of a dataset into high-quality referring-expression-bounding-box pairs. The model is trained end-to-end with a parameter-efficient tuning framework that allows both modalities to benefit from multi-modal instruction tuning. This framework requires fewer trainable parameters and less training data. Experimental results on conventional vision-language and RC tasks demonstrate the superior performance of our method. For instance, our model exhibits a 12.0% absolute accuracy improvement over Instruct-BLIP on VSR and surpasses Kosmos-2 by 24.7% on RefCOCO_val under zero-shot settings. We also attain the top position on the leaderboard of MMBench. The models, datasets, and codes are publicly available at https://github.com/SY-Xuan/Pink

Via

Access Paper or Ask Questions

Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

Sep 26, 2023

Keyu An, Shiliang Zhang

Figure 1 for Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

Figure 2 for Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

Figure 3 for Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

Figure 4 for Exploring RWKV for Memory Efficient and Low Latency Streaming ASR

Abstract:Recently, self-attention-based transformers and conformers have been introduced as alternatives to RNNs for ASR acoustic modeling. Nevertheless, the full-sequence attention mechanism is non-streamable and computationally expensive, thus requiring modifications, such as chunking and caching, for efficient streaming ASR. In this paper, we propose to apply RWKV, a variant of linear attention transformer, to streaming ASR. RWKV combines the superior performance of transformers and the inference efficiency of RNNs, which is well-suited for streaming ASR scenarios where the budget for latency and memory is restricted. Experiments on varying scales (100h - 10000h) demonstrate that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve comparable to or even better accuracy compared with chunk conformer transducer, with minimal latency and inference memory cost.

* submitted to ICASSP 2024

Via

Access Paper or Ask Questions

The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Sep 24, 2023

Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu(+4 more)

Figure 1 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 2 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 3 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Figure 4 for The second multi-channel multi-party meeting transcription challenge 2.0): A benchmark for speaker-attributed ASR

Abstract:With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0) held in ASRU2023 particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" at typical meeting scenario. We particularly established two sub-tracks. 1) The fixed training condition sub-track, where the training data is constrained to predetermined datasets, but participants can use any open-source pre-trained model. 2) The open training condition sub-track, which allows for the use of all available data and models. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR.

* 8 pages, Accepted by ASRU2023

Via

Access Paper or Ask Questions