Xiangdong Su

Improving CTC-AED model with integrated-CTC and auxiliary loss regularization

Aug 15, 2023
Daobin Zhu, Xiangdong Su, Hongbin Zhang

Connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) joint training has been widely applied in automatic speech recognition (ASR). Unlike most hybrid models that calculate the CTC and AED losses separately, our proposed integrated-CTC utilizes the attention mechanism of AED to guide the output of CTC. In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP). We achieve dimensional consistency by adaptively affine-transforming the attention results to match the dimensions of CTC. To accelerate model convergence and improve accuracy, we introduce auxiliary loss regularization. Experimental results demonstrate that the DAL method performs better in attention rescoring, while the PMP method excels in CTC prefix beam search and greedy search.
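
Below is a minimal sketch of the two fusion methods named above, DAL and PMP, assuming frame-aligned logit tensors; the tensor shapes, the affine projection, and the renormalization step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

T, ctc_dim, aed_dim = 50, 4233, 256  # frames, CTC vocab size, attention width (assumed)

# "Adaptively affine-transforming" step: project the attention results to
# the CTC logit dimension so the two streams can be fused element-wise.
affine = nn.Linear(aed_dim, ctc_dim)

ctc_logits = torch.randn(T, ctc_dim)   # per-frame CTC logits
aed_hidden = torch.randn(T, aed_dim)   # attention results (assumed frame-aligned)
aed_logits = affine(aed_hidden)

# DAL: direct addition of the logits from the two streams.
dal_logits = ctc_logits + aed_logits

# PMP: for each frame and token, preserve whichever stream assigns the
# higher probability (one plausible reading of "preserving the maximum
# probability"), then renormalize to keep a valid distribution.
pmp_probs = torch.maximum(ctc_logits.softmax(-1), aed_logits.softmax(-1))
pmp_probs = pmp_probs / pmp_probs.sum(-1, keepdim=True)
```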

TransERR: Translation-based Knowledge Graph Completion via Efficient Relation Rotation

Jun 26, 2023
Jiang Li, Xiangdong Su

This paper presents a translation-based knowledge graph completion method via efficient relation rotation (TransERR), a straightforward yet effective alternative to traditional translation-based knowledge graph completion models. Unlike previous translation-based models, TransERR encodes knowledge graphs in the hypercomplex-valued space, giving it a higher degree of translation freedom when mining latent information between the head and tail entities. To further minimize the translation distance, TransERR adaptively rotates the head entity and the tail entity with their corresponding unit quaternions, which are learnable during model training. Experiments on 7 benchmark datasets validate the effectiveness and generalization of TransERR. The results also indicate that TransERR can better encode large-scale datasets with fewer parameters than previous translation-based models. Our code is available at: \url{https://github.com/dellixx/TransERR}.
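
A minimal sketch of the quaternion rotation at the core of the idea described above; the exact scoring function is given in the paper, so the distance form below is an assumption for illustration.

```python
import torch

def hamilton_product(q, p):
    """Hamilton product of quaternions stored as (..., 4) = (a, b, c, d)."""
    a1, b1, c1, d1 = q.unbind(-1)
    a2, b2, c2, d2 = p.unbind(-1)
    return torch.stack((
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ), dim=-1)

def transerr_score(h, t, r, q_h, q_t):
    """h, t, r: (batch, k, 4) quaternion embeddings; q_h, q_t: learnable rotations."""
    # Normalize so q_h and q_t act as pure rotations (unit quaternions).
    q_h = q_h / q_h.norm(dim=-1, keepdim=True)
    q_t = q_t / q_t.norm(dim=-1, keepdim=True)
    h_rot = hamilton_product(h, q_h)   # rotate the head entity
    t_rot = hamilton_product(t, q_t)   # rotate the tail entity
    # Translation-style distance between rotated head + relation and rotated tail.
    return -(h_rot + r - t_rot).norm(p=2, dim=-1).sum(dim=-1)
```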

Coarse-to-Fine Recursive Speech Separation for Unknown Number of Speakers

Mar 30, 2022
Zhenhao Jin, Xiang Hao, Xiangdong Su

The vast majority of speech separation methods assume that the number of speakers is known in advance, and are therefore specific to that number. By contrast, a more realistic and challenging task is to separate a mixture in which the number of speakers is unknown. This paper formulates speech separation with an unknown number of speakers as a multi-pass source extraction problem and proposes a coarse-to-fine recursive speech separation method. The method comprises two stages: recursive cue extraction and target speaker extraction. The recursive cue extraction stage determines how many iterations need to be performed and outputs a coarse cue speech by monitoring statistics of the mixture. As the number of recursive iterations increases, distortion accumulates in the extracted speech and the remainder. Therefore, in the second stage, we use a target speaker extraction network to extract a fine speech based on the coarse target cue and the original distortionless mixture. Experiments show that the proposed method achieved state-of-the-art performance on the WSJ0 dataset across different numbers of speakers. Furthermore, it generalizes well to unseen, larger numbers of speakers.
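
The two-stage recursion reads naturally as a loop. Here is a control-flow sketch under stated assumptions: cue_extractor, speaker_extractor, and stop_condition are hypothetical stand-ins for the paper's networks and stopping statistics.

```python
def recursive_separation(mixture, cue_extractor, speaker_extractor,
                         stop_condition, max_speakers=10):
    separated, residual = [], mixture
    for _ in range(max_speakers):
        # Stage 1: recursive cue extraction. Monitor statistics of the
        # residual to decide whether any speaker remains, then pull out a
        # coarse cue speech (distortion accumulates here over iterations).
        if stop_condition(residual):
            break
        coarse_cue, residual = cue_extractor(residual)
        # Stage 2: target speaker extraction. Re-extract a fine speech from
        # the ORIGINAL distortionless mixture, conditioned on the coarse cue.
        fine_speech = speaker_extractor(mixture, coarse_cue)
        separated.append(fine_speech)
    return separated
```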

An Edge Information and Mask Shrinking Based Image Inpainting Approach

Jun 11, 2020
Huali Xu, Xiangdong Su, Meng Wang, Xiang Hao, Guanglai Gao

In the image inpainting task, the ability to repair both high-frequency and low-frequency information in the missing regions has a substantial influence on the quality of the restored image. However, existing inpainting methods usually fail to consider both simultaneously. To solve this problem, this paper proposes an edge information and mask shrinking based image inpainting approach, which consists of two models. The first is an edge generation model that generates complete edge information from the damaged image; the second is an image completion model that fills the missing regions using the generated edge information and the valid contents of the damaged image. A mask shrinking strategy is employed in the image completion model to track the areas to be repaired. The proposed approach is evaluated qualitatively and quantitatively on the Places2 dataset. The results show that our approach outperforms state-of-the-art methods.
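
A schematic of the two-model pipeline with the mask shrinking loop; edge_model, completion_step, and the shrink rule here are illustrative assumptions rather than the authors' architecture.

```python
import numpy as np

def inpaint(image, mask, edge_model, completion_step, max_iters=20):
    # Model 1: edge generation -- predict complete edge information
    # (high-frequency structure) for the damaged image.
    edges = edge_model(image, mask)
    # Model 2: image completion with mask shrinking -- each pass fills a
    # band of missing pixels from the generated edges plus currently valid
    # content, then shrinks the mask so the repaired ring becomes context.
    for _ in range(max_iters):
        if not mask.any():
            break
        image, filled = completion_step(image, mask, edges)
        mask = np.logical_and(mask, np.logical_not(filled))
    return image
```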

* Accepted by ICME 2020 

SNR-based teachers-student technique for speech enhancement

May 29, 2020
Xiang Hao, Xiangdong Su, Zhiyu Wang, Qiang Zhang, Huali Xu, Guanglai Gao

It is very challenging for speech enhancement methods to achieve robust performance under both high and low signal-to-noise ratios (SNRs) simultaneously. In this paper, we propose a method that integrates an SNR-based teachers-student technique with a time-domain U-Net to deal with this problem. Specifically, the method consists of multiple teacher models and a student model. We first train the teacher models under multiple small, non-overlapping SNR ranges so that each performs speech enhancement well within its specific range. Then, we choose different teacher models to supervise the training of the student model according to the SNR of the training data. Eventually, the student model can perform speech enhancement under both high and low SNR. To evaluate the proposed method, we constructed a dataset with SNRs ranging from -20 dB to 20 dB based on a public dataset. We experimentally analyzed the effectiveness of the SNR-based teachers-student technique and compared the proposed method with several state-of-the-art methods.
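
The teacher-selection rule implied above can be sketched as follows; the SNR ranges, the L1 distillation loss, and the model stubs are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Small, non-overlapping SNR ranges, one pre-trained teacher per range.
SNR_RANGES = [(-20, -10), (-10, 0), (0, 10), (10, 20)]

def pick_teacher(snr_db, teachers):
    # SNRs below or above the covered span fall back to the boundary teachers.
    for (lo, hi), teacher in zip(SNR_RANGES, teachers):
        if snr_db < hi:
            return teacher
    return teachers[-1]

def distillation_loss(student, teachers, noisy_wave, snr_db):
    # Supervise the student with the teacher trained on this sample's SNR.
    teacher = pick_teacher(snr_db, teachers)
    with torch.no_grad():
        target = teacher(noisy_wave)     # teacher's enhanced waveform
    return F.l1_loss(student(noisy_wave), target)
```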

* Accepted to 2020 IEEE International Conference on Multimedia and Expo (ICME 2020) 

Sub-band Knowledge Distillation Framework for Speech Enhancement

May 29, 2020
Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li

In single-channel speech enhancement, methods based on full-band spectral features have been widely studied, whereas only a few methods pay attention to non-full-band spectral features. In this paper, we explore a knowledge distillation framework based on sub-band spectral mapping for single-channel speech enhancement. Specifically, we divide the full frequency band into multiple sub-bands and pre-train an elite-level sub-band enhancement model (teacher model) for each sub-band. These teacher models are dedicated to processing their own sub-bands. Next, under the teacher models' guidance, we train a general sub-band enhancement model (student model) that works for all sub-bands. Without increasing the number of model parameters or the computational complexity, the student model's performance is further improved. To evaluate the proposed method, we conducted extensive experiments on an open-source dataset. The final experimental results show that the guidance from the elite-level teacher models dramatically improves the student model's performance, which exceeds that of the full-band model while employing fewer parameters.
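
A compact sketch of the sub-band distillation setup: split the spectrogram into sub-bands, run one elite teacher per band, and distill a single student across all bands. The band count and the MSE loss are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def split_subbands(spec, num_bands=4):
    # spec: (frames, freq_bins) magnitude spectrogram -> list of sub-bands.
    return torch.chunk(spec, num_bands, dim=-1)

def subband_distillation_loss(student, teachers, noisy_spec, num_bands=4):
    # One general student serves every sub-band; each elite teacher only
    # ever sees the sub-band it was pre-trained on.
    loss = 0.0
    bands = split_subbands(noisy_spec, num_bands)
    for band, teacher in zip(bands, teachers):
        with torch.no_grad():
            target = teacher(band)       # elite per-band enhancement target
        loss = loss + F.mse_loss(student(band), target)
    return loss / num_bands
```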
