Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guoqing Zhao

Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Mar 19, 2024

Zhigang Chen, Benjia Zhou, Jun Li, Jun Wan, Zhen Lei, Ning Jiang, Quan Lu, Guoqing Zhao

Figure 1 for Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Figure 2 for Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Figure 3 for Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Figure 4 for Factorized Learning Assisted with Large Language Model for Gloss-free Sign Language Translation

Abstract:Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.

* Accepted by LREC-COLING-2024

Via

Access Paper or Ask Questions

SELM: Speech Enhancement Using Discrete Tokens and Language Models

Dec 15, 2023

Ziqian Wang, Xinfa Zhu, Zihan Zhang, YuanJun Lv, Ning Jiang, Guoqing Zhao, Lei Xie

Figure 1 for SELM: Speech Enhancement Using Discrete Tokens and Language Models

Figure 2 for SELM: Speech Enhancement Using Discrete Tokens and Language Models

Figure 3 for SELM: Speech Enhancement Using Discrete Tokens and Language Models

Figure 4 for SELM: Speech Enhancement Using Discrete Tokens and Language Models

Abstract:Language models (LMs) have shown superior performances in various speech generation tasks recently, demonstrating their powerful ability for semantic context modeling. Given the intrinsic similarity between speech generation and speech enhancement, harnessing semantic information holds potential advantages for speech enhancement tasks. In light of this, we propose SELM, a novel paradigm for speech enhancement, which integrates discrete tokens and leverages language models. SELM comprises three stages: encoding, modeling, and decoding. We transform continuous waveform signals into discrete tokens using pre-trained self-supervised learning (SSL) models and a k-means tokenizer. Language models then capture comprehensive contextual information within these tokens. Finally, a detokenizer and HiFi-GAN restore them into enhanced speech. Experimental results demonstrate that SELM achieves comparable performance in objective metrics alongside superior results in subjective perception. Our demos are available https://honee-w.github.io/SELM/.

* 5 pages, 2 figures. ACCEPTED BY ICASSP 2024

Via

Access Paper or Ask Questions

Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Oct 26, 2023

Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

Figure 1 for Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Figure 2 for Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Figure 3 for Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Figure 4 for Multi-Speaker Expressive Speech Synthesis via Semi-supervised Contrastive Learning

Abstract:This paper aims to build an expressive TTS system for multi-speakers, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, we construct positive-negative sample pairs at both utterance and category (such as emotion-happy or style-poet or speaker A) levels and leverage contrastive learning to better extract disentangled style, emotion, and speaker representations from speech. Furthermore, we introduce a semi-supervised training strategy to the proposed approach to effectively leverage multi-domain data, including style-labeled data, emotion-labeled data, and unlabeled data. We integrate the learned representations into an improved VITS model, enabling it to synthesize expressive speech with diverse styles and emotions for a target speaker. Experiments on multi-domain data demonstrate the good design of our model.

* 5 pages, 3 figures

Via

Access Paper or Ask Questions

Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Oct 07, 2023

Ze Li, Yuke Lin, Ning Jiang, Xiaoyi Qin, Guoqing Zhao, Haiying Wu, Ming Li

Figure 1 for Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Figure 2 for Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Figure 3 for Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Figure 4 for Multi-objective Progressive Clustering for Semi-supervised Domain Adaptation in Speaker Verification

Abstract:Utilizing the pseudo-labeling algorithm with large-scale unlabeled data becomes crucial for semi-supervised domain adaptation in speaker verification tasks. In this paper, we propose a novel pseudo-labeling method named Multi-objective Progressive Clustering (MoPC), specifically designed for semi-supervised domain adaptation. Firstly, we utilize limited labeled data from the target domain to derive domain-specific descriptors based on multiple distinct objectives, namely within-graph denoising, intra-class denoising and inter-class denoising. Then, the Infomap algorithm is adopted for embedding clustering, and the descriptors are leveraged to further refine the target domain's pseudo-labels. Moreover, to further improve the quality of pseudo labels, we introduce the subcenter-purification and progressive-merging strategy for label denoising. Our proposed MoPC method achieves 4.95% EER and ranked the 1$^{st}$ place on the evaluation set of VoxSRC 2023 track 3. We also conduct additional experiments on the FFSVC dataset and yield promising results.

Via

Access Paper or Ask Questions

Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Sep 25, 2023

Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li

Figure 1 for Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Figure 2 for Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Figure 3 for Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Figure 4 for Haha-Pod: An Attempt for Laughter-based Non-Verbal Speaker Verification

Abstract:It is widely acknowledged that discriminative representation for speaker verification can be extracted from verbal speech. However, how much speaker information that non-verbal vocalization carries is still a puzzle. This paper explores speaker verification based on the most ubiquitous form of non-verbal voice, laughter. First, we use a semi-automatic pipeline to collect a new Haha-Pod dataset from open-source podcast media. The dataset contains over 240 speakers' laughter clips with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Considering Haha-Pod as a test set, two trials (S2L-Eval) are designed to verify the speaker's identity through laugh sounds. Experimental results demonstrate that our method can significantly improve the performance of the S2L-Eval test set with only a minor degradation on the VoxCeleb1 test set. The Haha-Pod dataset is open to access on https://drive.google.com/file/d/1J-HBRTsm_yWrcbkXupy-tiWRt5gE2LzG/view?usp=drive_link.

* accepted by ASRU 2023

Via

Access Paper or Ask Questions

VoxBlink: X-Large Speaker Verification Dataset on Camera

Aug 23, 2023

Yuke Lin, Xiaoyi Qin, Ming Cheng, Ning Jiang, Guoqing Zhao, Ming Li

Figure 1 for VoxBlink: X-Large Speaker Verification Dataset on Camera

Figure 2 for VoxBlink: X-Large Speaker Verification Dataset on Camera

Figure 3 for VoxBlink: X-Large Speaker Verification Dataset on Camera

Figure 4 for VoxBlink: X-Large Speaker Verification Dataset on Camera

Abstract:In this paper, we contribute a novel and extensive dataset for speaker verification, which contains noisy 38k identities/1.45M utterances (VoxBlink) and relatively cleaned 18k identities/1.02M (VoxBlink-Clean) utterances for training. Firstly, we accumulate a 60K+ users' list with their avatars and download their short videos on YouTube. We then established an automatic and scalable pipeline to extract relevant speech and video segments from these videos. To our knowledge, the VoxBlink dataset is one of the largest speaker recognition datasets available. Secondly, we conduct a series of experiments based on different backbones trained on a mix of the VoxCeleb2 and the VoxBlink-Clean. Our findings highlight a notable performance improvement, ranging from 13% to 30%, across different backbone architectures upon integrating our dataset for training. The dataset will be made publicly available shortly.

* submit to ICASSP2024

Via

Access Paper or Ask Questions

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Aug 17, 2023

Ming Cheng, Weiqing Wang, Xiaoyi Qin, Yuke Lin, Ning Jiang, Guoqing Zhao, Ming Li

Figure 1 for The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 2 for The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 3 for The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 4 for The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Abstract:This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity detection, clustering-based diarization, overlapped speech detection, and target-speaker voice activity detection, where each procedure has a fused output from 3 sub-models. Finally, we fuse different clustering-based and TSVAD-based diarization systems using DOVER-Lap and achieve the 4.30% diarization error rate (DER), which ranks first place on track 4 of the challenge leaderboard.

Via

Access Paper or Ask Questions

The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Aug 17, 2023

Ze Li, Yuke Lin, Xiaoyi Qin, Ning Jiang, Guoqing Zhao, Ming Li

Figure 1 for The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 2 for The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 3 for The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Figure 4 for The DKU-MSXF Speaker Verification System for the VoxCeleb Speaker Recognition Challenge 2023

Abstract:This paper is the system description of the DKU-MSXF System for the track1, track2 and track3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). For Track 1, we utilize a network structure based on ResNet for training. By constructing a cross-age QMF training set, we achieve a substantial improvement in system performance. For Track 2, we inherite the pre-trained model from Track 1 and conducte mixed training by incorporating the VoxBlink-clean dataset. In comparison to Track 1, the models incorporating VoxBlink-clean data exhibit a performance improvement by more than 10% relatively. For Track3, the semi-supervised domain adaptation task, a novel pseudo-labeling method based on triple thresholds and sub-center purification is adopted to make domain adaptation. The final submission achieves mDCF of 0.1243 in task1, mDCF of 0.1165 in Track 2 and EER of 4.952% in Track 3.

* arXiv admin note: text overlap with arXiv:2210.05092

Via

Access Paper or Ask Questions

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Jul 10, 2023

Kun Song, Yi lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

Figure 1 for The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Figure 2 for The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Figure 3 for The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Figure 4 for The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

Abstract:This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve the robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker timbre and linguistic content disentanglement. Based on the two-stage framework, pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre in the source English speech to the translated Chinese speech. Experimental results show that our system has high translation accuracy, speech naturalness, sound quality, and speaker similarity. Moreover, it shows good robustness to multi-source data.

* IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

Via

Access Paper or Ask Questions

TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding

May 29, 2023

Zichen Liu, Xuyuan Liu, Yanlong Wen, Guoqing Zhao, Fen Xia, Xiaojie Yuan

Figure 1 for TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding

Figure 2 for TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding

Figure 3 for TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding

Figure 4 for TreeMAN: Tree-enhanced Multimodal Attention Network for ICD Coding

Abstract:ICD coding is designed to assign the disease codes to electronic health records (EHRs) upon discharge, which is crucial for billing and clinical statistics. In an attempt to improve the effectiveness and efficiency of manual coding, many methods have been proposed to automatically predict ICD codes from clinical notes. However, most previous works ignore the decisive information contained in structured medical data in EHRs, which is hard to be captured from the noisy clinical notes. In this paper, we propose a Tree-enhanced Multimodal Attention Network (TreeMAN) to fuse tabular features and textual features into multimodal representations by enhancing the text representations with tree-based features via the attention mechanism. Tree-based features are constructed according to decision trees learned from structured multimodal medical data, which capture the decisive information about ICD coding. We can apply the same multi-label classifier from previous text models to the multimodal representations to predict ICD codes. Experiments on two MIMIC datasets show that our method outperforms prior state-of-the-art ICD coding approaches. The code is available at https://github.com/liu-zichen/TreeMAN.

Via

Access Paper or Ask Questions