Wei Chu

An Improved Single Step Non-autoregressive Transformer for Automatic Speech Recognition

Jul 22, 2021
Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan

Non-autoregressive mechanisms can significantly decrease inference time for speech transformers, especially when the single step variant is applied. Previous work on CTC alignment-based single step non-autoregressive transformer (CASS-NAT) has shown a large real time factor (RTF) improvement over autoregressive transformers (AT). In this work, we propose several methods to improve the accuracy of the end-to-end CASS-NAT, followed by performance analyses. First, convolution augmented self-attention blocks are applied to both the encoder and decoder modules. Second, we propose to expand the trigger mask (acoustic boundary) for each token to increase the robustness of CTC alignments. In addition, iterated loss functions are used to enhance the gradient update of low-layer parameters. Without using an external language model, the WERs of the improved CASS-NAT, when using the three methods, are 3.1%/7.2% on Librispeech test clean/other sets and the CER is 5.4% on the Aishell1 test set, achieving a 7%~21% relative WER/CER improvement. For the analyses, we plot attention weight distributions in the decoders to visualize the relationships between token-level acoustic embeddings. When the acoustic embeddings are visualized, we find that they have a similar behavior to word embeddings, which explains why the improved CASS-NAT performs similarly to AT.
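The trigger-mask expansion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact expansion rule, so the symmetric widening by a fixed number of frames below is an assumption, and the names `trigger_mask` and `expand` are hypothetical.

```python
def trigger_mask(spans, num_frames, expand=0):
    """Build one boolean row per token marking the encoder frames it may attend to.

    spans: list of (start, end) frame indices per token, end exclusive.
    expand: number of frames added on each side of every span (assumed rule).
    """
    mask = []
    for start, end in spans:
        lo = max(0, start - expand)          # clip at the first frame
        hi = min(num_frames, end + expand)   # clip at the last frame
        mask.append([lo <= t < hi for t in range(num_frames)])
    return mask


# Two tokens over ten frames, taken from a hypothetical CTC alignment.
spans = [(1, 3), (5, 8)]
strict = trigger_mask(spans, num_frames=10)
widened = trigger_mask(spans, num_frames=10, expand=1)
```

With `expand=1` each token sees one extra frame on either side, so a boundary that the CTC alignment places slightly too early or too late still falls inside the mask.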

* Accepted to Interspeech 2021

CBNetV2: A Composite Backbone Network Architecture for Object Detection

Jul 12, 2021
Tingting Liang, Xiaojie Chu, Yudong Liu, Yongtao Wang, Zhi Tang, Wei Chu, Jingdong Chen, Haibin Ling

Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to better train existing open-sourced pre-trained backbones under the pre-training fine-tuning protocol. In particular, the CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to more efficiently perform object detection. We also propose a better training strategy with assistant supervision for CBNet-based detectors. CBNetV2 has strong generalization capabilities for different backbones and head designs of the detector architecture. Without additional pre-training, CBNetV2 can be adapted to various backbones, including manual-based and NAS-based, as well as CNN-based and Transformer-based ones. Experiments provide strong evidence showing that composite backbones are more efficient, effective, and resource-friendly than wider and deeper networks. CBNetV2 is compatible with the head designs of most mainstream detectors, including one-stage and two-stage detectors, as well as anchor-based and anchor-free-based ones, and significantly improves their performance by more than 3.0% AP over the baseline on COCO. Particularly, under the single-model and single-scale testing protocol, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev, which is significantly better than the state-of-the-art result (i.e., 57.7% box AP and 50.2% mask AP). Code is available at https://github.com/VDIGPKU/CBNetV2.

* 11 pages, 8 figures

Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System

Jun 18, 2021
Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan

This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German. Approximately 5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children. For training on the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances. To explore the use of the untranscribed data, various approaches are implemented and combined to incrementally improve the system performance. First, bidirectional autoregressive predictive coding (Bi-APC) is used to learn initial parameters for acoustic modelling using the provided untranscribed data. Second, incremental semi-supervised learning is further used to iteratively generate pseudo-transcribed data. Third, different data augmentation schemes are used at different training stages to increase the variability and size of the training data. Finally, a recurrent neural network language model (RNNLM) is used for rescoring. Our system achieves a word error rate (WER) of 39.68% on the evaluation data, an approximately 12% relative improvement over the official baseline (45.21%).
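The relative improvement quoted above can be checked from the two WER figures in the abstract. The sketch below assumes the usual definition, (baseline - system) / baseline; the function name is illustrative.

```python
def relative_improvement(baseline_wer, system_wer):
    """Fraction by which the system reduces the error rate relative to the baseline."""
    return (baseline_wer - system_wer) / baseline_wer


# WER values taken from the abstract: official baseline 45.21%, SPAPL system 39.68%.
rel = relative_improvement(45.21, 39.68)
print(f"{rel:.1%}")  # prints 12.2%
```

The absolute reduction is 5.53 WER points, which is about 12% of the baseline error rate.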

* Accepted to INTERSPEECH 2021

CMUA-Watermark: A Cross-Model Universal Adversarial Watermark for Combating Deepfakes

May 23, 2021
Hao Huang, Yongtao Wang, Zhaoyu Chen, Yuheng Li, Zhi Tang, Wei Chu, Jingdong Chen, Weisi Lin, Kai-Kuang Ma

Malicious application of deepfakes (i.e., technologies that can generate target faces or face attributes) has posed a huge threat to our society. The fake multimedia content generated by deepfake models can harm the reputation and even threaten the property of the person who has been impersonated. Fortunately, the adversarial watermark could be used for combating deepfake models, leading them to generate distorted images. The existing methods require an individual training process for every facial image to generate the adversarial watermark against a specific deepfake model, which is extremely inefficient. To address this problem, we propose a universal adversarial attack method on deepfake models, to generate a Cross-Model Universal Adversarial Watermark (CMUA-Watermark) that can protect thousands of facial images from multiple deepfake models. Specifically, we first propose a cross-model universal attack pipeline by attacking multiple deepfake models and combining gradients from these models iteratively. Then we introduce a batch-based method to alleviate the conflict of adversarial watermarks generated by different facial images. Finally, we design a more reasonable and comprehensive evaluation method for evaluating the effectiveness of the adversarial watermark. Experimental results demonstrate that the proposed CMUA-Watermark can effectively distort the fake facial images generated by deepfake models and successfully protect facial images from deepfakes in real scenes.

* 9 pages, 9 figures

PairRE: Knowledge Graph Embeddings via Paired Relation Vectors

Nov 07, 2020
Linlin Chao, Jianshan He, Taifeng Wang, Wei Chu

Distance-based knowledge graph embedding methods show promising results on the link prediction task, on which two topics have been widely studied: one is the ability to handle complex relations, such as N-to-1, 1-to-N and N-to-N; the other is to encode various relation patterns, such as symmetry/antisymmetry. However, the existing methods fail to solve these two problems at the same time, which leads to unsatisfactory results. To mitigate this problem, we propose PairRE, a model with improved expressiveness and low computational requirement. PairRE represents each relation with paired vectors, where these paired vectors project the two connected entities to relation-specific locations. Beyond its ability to solve the aforementioned two problems, PairRE is well suited to representing subrelations, as it can capture both the similarities and differences of subrelations effectively. Given simple constraints on relation representations, PairRE can be the first model that is capable of encoding symmetry/antisymmetry, inverse, composition and subrelation relations. Experiments on link prediction benchmarks show PairRE can achieve either state-of-the-art or highly competitive performances. In addition, PairRE has shown encouraging results for encoding subrelations.
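A minimal sketch of a paired-vector distance score may help make the idea concrete. The abstract does not give the scoring function, so the form used here (negative L1 distance between the head scaled by one relation vector and the tail scaled by the other, with unit-norm entity vectors) is an assumption for illustration; all names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (assumed constraint on entity embeddings)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pair_score(head, tail, r_head, r_tail):
    """Assumed score: -|| head * r_head - tail * r_tail ||_1, products taken elementwise."""
    h, t = l2_normalize(head), l2_normalize(tail)
    return -sum(abs(hi * rh - ti * rt) for hi, ti, rh, rt in zip(h, t, r_head, r_tail))


h, t = [1.0, 2.0, 2.0], [2.0, 0.0, 1.0]

# Identical paired vectors make the score independent of argument order (a symmetric relation).
r = [0.5, -1.0, 2.0]
sym_ht, sym_th = pair_score(h, t, r, r), pair_score(t, h, r, r)

# Different paired vectors let the two directions score differently.
asym_ht = pair_score(h, t, [1.0, 1.0, 1.0], [2.0, 0.5, 1.0])
asym_th = pair_score(t, h, [1.0, 1.0, 1.0], [2.0, 0.5, 1.0])
```

Under this assumed form, tying the two relation vectors yields a symmetric relation, while untying them allows direction-dependent scores, which is one way paired vectors can encode relation patterns.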

CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition

Oct 28, 2020
Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao

We propose a CTC alignment-based single step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains the information of (a) the number of tokens for decoder input, and (b) the time span of acoustics for each token. This information is used to extract an acoustic representation for each token in parallel, referred to as the token-level acoustic embedding, which substitutes for the word embedding in the autoregressive transformer (AT) to achieve parallel generation in the decoder. During inference, an error-based alignment sampling method is proposed to be applied to the CTC output space, reducing the WER while retaining the parallelism. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on the Librispeech test clean/other datasets without an external LM, and a CER of 5.8% on the Aishell1 Mandarin corpus. Compared to the AT baseline, the CASS-NAT has a performance reduction on WER, but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of WER without LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.
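To illustrate how a frame-level CTC alignment carries both the token count and each token's time span, here is a minimal sketch. It assumes blank label id 0 and drops blank frames rather than assigning them to a neighbouring token; the paper may treat blank frames differently, and the function name is hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed id of the CTC blank label


def alignment_to_spans(alignment, blank=BLANK):
    """Collapse a frame-level alignment into (token, start, end) spans, end exclusive.

    Consecutive repeats of a label form one token; blank runs separate tokens.
    """
    spans = []
    frame = 0
    for label, run in groupby(alignment):
        length = len(list(run))
        if label != blank:
            spans.append((label, frame, frame + length))
        frame += length
    return spans


# Ten frames of a hypothetical alignment over token ids 7, 3 and 5.
alignment = [0, 7, 7, 0, 0, 3, 3, 3, 5, 0]
spans = alignment_to_spans(alignment)
num_tokens = len(spans)
print(num_tokens, spans)  # prints 3 [(7, 1, 3), (3, 5, 8), (5, 8, 9)]
```

The number of spans gives the decoder input length, and each span gives the frames from which a token-level acoustic representation could be extracted in parallel.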

* Submitted to ICASSP 2021

Question Directed Graph Attention Network for Numerical Reasoning over Text

Sep 16, 2020
Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, Wei Chu

Numerical reasoning over texts, such as addition, subtraction, sorting and counting, is a challenging machine reading comprehension task, since it requires both natural language understanding and arithmetic computation. To address this challenge, we propose a heterogeneous graph representation for the context of the passage and question needed for such reasoning, and design a question directed graph attention network to drive multi-step numerical reasoning over this context graph.

* Accepted at EMNLP 2020
