Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dan Su

Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

May 31, 2021
Dan Su, Tiezheng Yu, Pascale Fung

Figure 1 for Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

Figure 2 for Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

Figure 3 for Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

Figure 4 for Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance

Query focused summarization (QFS) models aim to generate summaries from source documents that can answer the given query. Most previous work on QFS only considers the query relevance criterion when producing the summary. However, studying the effect of answer relevance in the summary generating process is also important. In this paper, we propose QFS-BART, a model that incorporates the explicit answer relevance of the source documents given the query via a question answering model, to generate coherent and answer-related summaries. Furthermore, our model can take advantage of large pre-trained models which improve the summarization performance significantly. Empirical results on the Debatepedia dataset show that the proposed model achieves the new state-of-the-art performance.

* The two authors contribute equally. Accepted as a short paper in Findings of ACL 2021

Via

Access Paper or Ask Questions

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

May 28, 2021
Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng

Figure 1 for DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Figure 2 for DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Figure 3 for DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion

Singing voice conversion (SVC) is one promising technique which can enrich the way of human-computer interaction by endowing a computer the ability to produce high-fidelity and expressive singing voice. In this paper, we propose DiffSVC, an SVC system based on denoising diffusion probabilistic model. DiffSVC uses phonetic posteriorgrams (PPGs) as content features. A denoising module is trained in DiffSVC, which takes destroyed mel spectrogram produced by the diffusion/forward process and its corresponding step information as input to predict the added Gaussian noise. We use PPGs, fundamental frequency features and loudness features as auxiliary input to assist the denoising process. Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.

* Preprint. 8 pages, 2 figures and 1 table

Via

Access Paper or Ask Questions

Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

May 13, 2021
Yan Xu, Etsuko Ishii, Zihan Liu, Genta Indra Winata, Dan Su, Andrea Madotto, Pascale Fung

Figure 1 for Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

Figure 2 for Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

Figure 3 for Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

Figure 4 for Retrieval-Free Knowledge-Grounded Dialogue Response Generation with Adapters

To diversify and enrich generated dialogue responses, knowledge-grounded dialogue has been investigated in recent years. Despite the success of the existing methods, they mainly follow the paradigm of retrieving the relevant sentences over a large corpus and augment the dialogues with explicit extra information, which is time- and resource-consuming. In this paper, we propose KnowExpert, an end-to-end framework to bypass the retrieval process by injecting prior knowledge into the pre-trained language models with lightweight adapters. To the best of our knowledge, this is the first attempt to tackle this task relying solely on a generation-based approach. Experimental results show that KnowExpert performs comparably with the retrieval-based baselines, demonstrating the potential of our proposed direction.

Via

Access Paper or Ask Questions

Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

May 08, 2021
Liqiang He, Shulin Feng, Dan Su, Dong Yu

Figure 1 for Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

Figure 2 for Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

Figure 3 for Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

Figure 4 for Latency-Controlled Neural Architecture Search for Streaming Speech Recognition

Recently, neural architecture search (NAS) has attracted much attention and has been explored for automatic speech recognition (ASR). Our prior work has shown promising results compared with hand-designed neural networks. In this work, we focus on streaming ASR scenarios and propose the latency-controlled NAS for acoustic modeling. First, based on the vanilla neural architecture, normal cells are altered to be causal cells, in order to control the total latency of the neural network. Second, a revised operation space with a smaller receptive field is proposed to generate the final architecture with low latency. Extensive experiments show that: 1) Based on the proposed neural architecture, the neural networks with a medium latency of 550ms (millisecond) and a low latency of 190ms can be learned in the vanilla and revised operation space respectively. 2) For the low latency setting, the evaluation network can achieve more than 19\% (average on the four test sets) relative improvements compared with the hybrid CLDNN baseline, on a 10k-hour large-scale dataset. Additional 11\% relative improvements can be achieved if the latency of the neural network is relaxed to the medium latency setting.

* Submitted to INTERSPEECH 2021

Via

Access Paper or Ask Questions

SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

May 07, 2021
Zhao You, Shulin Feng, Dan Su, Dong Yu

Figure 1 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 2 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 3 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Figure 4 for SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts

Recently, Mixture of Experts (MoE) based Transformer has shown promising results in many domains. This is largely due to the following advantages of this architecture: firstly, MoE based Transformer can increase model capacity without computational cost increasing both at training and inference time. Besides, MoE based Transformer is a dynamic network which can adapt to the varying complexity of input instances in realworld applications. In this work, we explore the MoE based model for speech recognition, named SpeechMoE. To further control the sparsity of router activation and improve the diversity of gate values, we propose a sparsity L1 loss and a mean importance loss respectively. In addition, a new router architecture is used in SpeechMoE which can simultaneously utilize the information from a shared embedding network and the hierarchical representation of different MoE layers. Experimental results show that SpeechMoE can achieve lower character error rate (CER) with comparable computation cost than traditional static networks, providing 7.0%-23.0% relative CER improvements on four evaluation datasets.

* 5 pages, 2 figures. Submitted to Interspeech 2021

Via

Access Paper or Ask Questions

TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Mar 31, 2021
Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu

Figure 1 for TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Figure 2 for TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Figure 3 for TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

Figure 4 for TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

In this paper, we exploit the effective way to leverage contextual information to improve the speech dereverberation performance in real-world reverberant environments. We propose a temporal-contextual attention approach on the deep neural network (DNN) for environment-aware speech dereverberation, which can adaptively attend to the contextual information. More specifically, a FullBand based Temporal Attention approach (FTA) is proposed, which models the correlations between the fullband information of the context frames. In addition, considering the difference between the attenuation of high frequency bands and low frequency bands (high frequency bands attenuate faster than low frequency bands) in the room impulse response (RIR), we also propose a SubBand based Temporal Attention approach (STA). In order to guide the network to be more aware of the reverberant environments, we jointly optimize the dereverberation network and the reverberation time (RT60) estimator in a multi-task manner. Our experimental results indicate that the proposed method outperforms our previously proposed reverberation-time-aware DNN and the learned attention weights are fully physical consistent. We also report a preliminary yet promising dereverberation and recognition experiment on real test data.

* Submitted to Interspeech 2021

Via

Access Paper or Ask Questions

Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

Mar 08, 2021
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

Figure 1 for Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

Figure 2 for Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

Figure 3 for Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

Figure 4 for Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

One of the leading single-channel speech separation (SS) models is based on a TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass-shape, namely Sandglasset, which advances the state-of-the-art (SOTA) SS performance at significantly smaller model size and computational cost. Forward along each block inside Sandglasset, the temporal granularity of the features gradually becomes coarser until reaching half of the network blocks, and then successively turns finer towards the raw signal level. We also unfold that residual connections between features with the same granularity are critical for preserving information after passing through the bottleneck layer. Experiments show our Sandglasset with only 2.3M parameters has achieved the best results on two benchmark SS datasets -- WSJ0-2mix and WSJ0-3mix, where the SI-SNRi scores have been improved by absolute 0.8 dB and 2.4 dB, respectively, comparing to the prior SOTA results.

* Accepted in ICASSP 2021

Via

Access Paper or Ask Questions

Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

Mar 02, 2021
Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

Figure 1 for Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

Figure 2 for Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

Figure 3 for Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

Figure 4 for Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect

We study the cocktail party problem and propose a novel attention network called Tune-In, abbreviated for training under negative environments with interference. It firstly learns two separate spaces of speaker-knowledge and speech-stimuli based on a shared feature space, where a new block structure is designed as the building block for all spaces, and then cooperatively solves different tasks. Between the two spaces, information is cast towards each other via a novel cross- and dual-attention mechanism, mimicking the bottom-up and top-down processes of a human's cocktail party effect. It turns out that substantially discriminative and generalizable speaker representations can be learnt in severely interfered conditions via our self-supervised training. The experimental results verify this seeming paradox. The learnt speaker embedding has superior discriminative power than a standard speaker verification method; meanwhile, Tune-In achieves remarkably better speech separation performances in terms of SI-SNRi and SDRi consistently in all test modes, and especially at lower memory and computational consumption, than state-of-the-art benchmark systems.

* Accepted in AAAI 2021

Via

Access Paper or Ask Questions

Contrastive Separative Coding for Self-supervised Representation Learning

Mar 01, 2021
Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

Figure 1 for Contrastive Separative Coding for Self-supervised Representation Learning

Figure 2 for Contrastive Separative Coding for Self-supervised Representation Learning

To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating the target signal from contrastive interfering signals. First, a multi-task separative encoder is built to extract shared separable and discriminative embedding; secondly, we propose a powerful cross-attention mechanism performed over speaker representations across various interfering conditions, allowing the model to focus on and globally aggregate the most critical information to answer the "query" (current bottom-up embedding) while paying less attention to interfering, noisy, or irrelevant parts; lastly, we form a new probabilistic contrastive loss which estimates and maximizes the mutual information between the representations and the global speaker vector. While most prior unsupervised methods have focused on predicting the future, neighboring, or missing samples, we take a different perspective of predicting the interfered samples. Moreover, our contrastive separative loss is free from negative sampling. The experiment demonstrates that our approach can learn useful representations achieving a strong speaker verification performance in adverse conditions.

* Accepted in ICASSP 2021

Via

Access Paper or Ask Questions