Xuancheng Ren

Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

May 15, 2021

Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

Abstract:Skip connection is a widely used technique to improve the performance and convergence of deep neural networks. It is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value of one. In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the depth of the models. This can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection. Inspired by these findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.

* Accepted by COLING2020 (The 28th International Conference on Computational Linguistics (COLING 2020))
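
The three variants the abstract contrasts can be illustrated with a small sketch. This is not the authors' implementation: the sublayer `sublayer`, the scale value 2.0, the depth of 20, and the exact recursive form `y = LN(x + y)` are all assumptions chosen for illustration, and the layer normalization here has no learned gain or bias.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and (nearly) unit variance.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def sublayer(v):
    # Hypothetical stand-in for the non-linear transformation F(x).
    return [math.tanh(0.5 * a) for a in v]

def plain_skip(x, scale=1.0):
    # scale * x + F(x); scale = 1.0 is the ordinary skip connection.
    return [scale * a + f for a, f in zip(x, sublayer(x))]

def skip_with_ln(x, scale=1.0):
    # LN(scale * x + F(x)): normalization keeps the output magnitude bounded.
    return layer_norm(plain_skip(x, scale))

def recursive_skip_with_ln(x, steps=2):
    # Assumed recursive form: start from F(x), then repeat y = LN(x + y).
    y = sublayer(x)
    for _ in range(steps):
        y = layer_norm([a + b for a, b in zip(x, y)])
    return y

def stack(block, x, depth):
    # Apply the same block `depth` times, mimicking a deep network.
    for _ in range(depth):
        x = block(x)
    return x

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [1.0, -2.0, 0.5, 3.0]
print(l2_norm(stack(lambda v: plain_skip(v, 2.0), x, 20)))    # grows with depth
print(l2_norm(stack(lambda v: skip_with_ln(v, 2.0), x, 20)))  # stays bounded
```

In this toy setting the forward activations of the scaled plain skip connection grow geometrically with depth, while the normalized variants stay bounded. The abstract's claim concerns gradients, which this forward-only sketch does not compute.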

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

Mar 29, 2021

Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, Bin He

Abstract:Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise the awareness of such a critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning.

* NAACL-HLT 2021, Long Paper

Multi-View Feature Representation for Dialogue Generation with Bidirectional Distillation

Feb 22, 2021

Shaoxiong Feng, Xuancheng Ren, Kan Li, Xu Sun

Abstract:Neural dialogue models suffer from low-quality responses when interacted with in practice, demonstrating difficulty in generalization beyond training data. Recently, knowledge distillation has been used to successfully regularize the student by transferring knowledge from the teacher. However, the teacher and the student are trained on the same dataset and tend to learn similar feature representations, whereas the most general knowledge should be found through differences. The finding of general knowledge is further hindered by the unidirectional distillation, as the student should obey the teacher and may discard some knowledge that is truly general but refuted by the teacher. To this end, we propose a novel training framework, where the learning of general knowledge is more in line with the idea of reaching consensus, i.e., finding common knowledge that is beneficial across all of the different datasets through diversified learning partners. Concretely, the training task is divided into a group of subtasks with the same number of students. Each student assigned to one subtask is not only optimized on the allocated subtask but also imitates the multi-view feature representation aggregated from other students (i.e., student peers), which induces students to capture common knowledge among different subtasks and alleviates the over-fitting of students on the allocated subtasks. To further enhance generalization, we extend the unidirectional distillation to bidirectional distillation, which encourages the student and its student peers to co-evolve by exchanging complementary knowledge with each other. Empirical results and analysis demonstrate that our training framework effectively improves the model generalization without sacrificing training efficiency.

* Accepted by AAAI 2021; Camera ready version

Accelerating Pre-trained Language Models via Calibrated Cascade

Dec 29, 2020

Lei Li, Yankai Lin, Shuhuai Ren, Deli Chen, Xuancheng Ren, Peng Li, Jie Zhou, Xu Sun

Abstract:Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by exiting at a shallow layer without passing through the entire model. In this paper, we analyze the working mechanism of dynamic early exiting and find it cannot achieve a satisfying trade-off between inference speed and performance. On one hand, the PLMs' representations in shallow layers are not sufficient for accurate prediction. On the other hand, the internal off-ramps cannot provide reliable exiting decisions. To remedy this, we instead propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner. To obtain more reliable model selection, we further devise a difficulty-aware objective, encouraging the model's output class probability to reflect the real difficulty of each instance. Extensive experimental results demonstrate the superiority of our proposal over strong baseline models for PLM acceleration, including both dynamic early exiting and knowledge distillation methods.

* Preprint
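
The cascading selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not CascadeBERT itself: `small_model` and `large_model` are hypothetical toy classifiers, the confidence measure (maximum class probability) and the threshold 0.9 are assumptions, and the difficulty-aware training objective is not shown.

```python
def cascade_predict(models, x, threshold=0.9):
    """Try complete models from smallest to largest.

    Returns (predicted_label, index_of_model_used). A model's answer is
    accepted when its top class probability reaches `threshold`; the last
    model's answer is always accepted.
    """
    if not models:
        raise ValueError("at least one model is required")
    for i, model in enumerate(models):
        probs = model(x)
        confidence = max(probs)
        if confidence >= threshold or i == len(models) - 1:
            return probs.index(confidence), i

def small_model(text):
    # Hypothetical cheap model: confident only on short inputs.
    return [0.95, 0.05] if len(text) < 10 else [0.55, 0.45]

def large_model(text):
    # Hypothetical expensive model: always fairly confident.
    return [0.10, 0.90]

print(cascade_predict([small_model, large_model], "short"))
print(cascade_predict([small_model, large_model], "a much longer input"))
```

Easy inputs stop at the small model, so the large model's cost is paid only when the small model is unsure. This is why the reliability of the confidence score matters for the speed and accuracy trade-off.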

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations

Oct 30, 2020

Fuli Luo, Pengcheng Yang, Shicheng Li, Xuancheng Ren, Xu Sun

Abstract:Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to a discrepancy between the pre-training and fine-tuning stages. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise-invariant sequence representations. The proposed CAPT encourages consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable to both language and vision-language tasks, and obtains surprisingly consistent improvement, including a 0.6% absolute gain on GLUE benchmarks and a 0.8% absolute increment on $\text{NLVR}^2$.

* Corrected typos
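
One common way to encourage consistency between an original sequence and its corrupted version is an instance-wise contrastive (InfoNCE-style) loss. The sketch below assumes that form, with cosine similarity and a temperature of 0.1. The abstract does not specify the exact CAPT objective, so treat the function `info_nce` and its parameters as illustrative only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(originals, corrupted, temperature=0.1):
    """Mean contrastive loss over a batch.

    originals[i] and corrupted[i] form the positive pair; every other
    corrupted[j] in the batch serves as a negative for originals[i].
    """
    losses = []
    for i, o in enumerate(originals):
        logits = [cosine(o, c) / temperature for c in corrupted]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

originals = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # corrupted views close to their originals
misaligned = [[0.1, 0.9], [0.9, 0.1]]  # corrupted views closer to the wrong original
print(info_nce(originals, aligned), info_nce(originals, misaligned))
```

The loss is low when each representation is closest to its own corrupted view and high otherwise, which is the noise-invariance pressure the abstract describes.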

Regularizing Dialogue Generation by Imitating Implicit Scenarios

Oct 06, 2020

Shaoxiong Feng, Xuancheng Ren, Hongshen Chen, Bin Sun, Kan Li, Xu Sun

Abstract:Human dialogues are scenario-based and appropriate responses generally relate to the latent context knowledge entailed by the specific scenario. To enable responses that are more meaningful and context-specific, we propose to improve generative dialogue systems from the scenario perspective, where both dialogue history and future conversation are taken into account to implicitly reconstruct the scenario knowledge. More importantly, the conversation scenarios are further internalized using an imitation learning framework, where the conventional dialogue model that has no access to future conversations is effectively regularized by transferring the scenario knowledge contained in hierarchical supervising signals from the scenario-based dialogue model, so that the future conversation is not required in actual inference. Extensive evaluations show that our approach significantly outperforms state-of-the-art baselines on diversity and relevance, and expresses scenario-specific knowledge.

* Accepted by EMNLP 2020 (long paper)

Collaborative Group Learning

Sep 16, 2020

Shaoxiong Feng, Hongshen Chen, Xuancheng Ren, Zhuoye Ding, Kan Li, Xu Sun

Abstract:Collaborative learning has successfully applied knowledge transfer to guiding a pool of small student networks towards robust local minima. However, previous approaches typically struggle with drastically aggravated student homogenization and rapidly growing computational complexity when the number of students rises. In this paper, we propose Collaborative Group Learning, an efficient framework that aims to maximize student population without sacrificing generalization performance and computational efficiency. First, each student is established by randomly routing on a modular neural network, which is not only parameter-efficient but also facilitates flexible knowledge communication between students due to random levels of representation sharing and branching. Second, to resist homogenization and further reduce the computational cost, students first compose diverse feature sets by exploiting the inductive bias from sub-sets of training data, and then aggregate and distill supplementary knowledge by choosing a random sub-group of students at each time step. Empirical evaluations on both image and text tasks indicate that our method significantly outperforms various state-of-the-art collaborative approaches whilst enhancing computational efficiency.

Exploring the Vulnerability of Deep Neural Networks: A Study of Parameter Corruption

Jun 10, 2020

Xu Sun, Zhiyuan Zhang, Xuancheng Ren, Ruixuan Luo, Liangyou Li

Abstract:We argue that the vulnerability of model parameters is of crucial value to the study of model robustness and generalization but little research has been devoted to understanding this matter. In this work, we propose an indicator to measure the robustness of neural network parameters by exploiting their vulnerability via parameter corruption. The proposed indicator describes the maximum loss variation in the non-trivial worst-case scenario under parameter corruption. For practical purposes, we give a gradient-based estimation, which is far more effective than random corruption trials that can hardly induce the worst accuracy degradation. Equipped with theoretical support and empirical validation, we are able to systematically investigate the robustness of different model parameters and reveal vulnerability of deep neural networks that has been rarely paid attention to before. Moreover, we can enhance the models accordingly with the proposed adversarial corruption-resistant training, which not only improves the parameter robustness but also translates into accuracy elevation.
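
The contrast between gradient-based and random parameter corruption can be illustrated on a toy problem. The sketch below assumes an L2-norm budget `eps` on the corruption and uses the first-order idea that the loss grows fastest along the gradient direction. The quadratic `loss`, the parameter values, and the budget are hypothetical, and the paper's actual indicator and constraints may differ.

```python
import math
import random

def loss(w):
    # Hypothetical toy loss standing in for a trained model's loss.
    return 0.5 * (3.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [3.0 * w[0], 0.1 * w[1]]

def gradient_corruption(w, eps):
    # First-order worst case under an L2 budget: step of length eps along
    # the gradient. Assumes the gradient at w is non-zero.
    g = grad(w)
    g_norm = math.sqrt(sum(a * a for a in g))
    return [eps * a / g_norm for a in g]

def random_corruption(w, eps, rng):
    # Uniformly random direction with the same L2 length eps.
    d = [rng.gauss(0.0, 1.0) for _ in w]
    d_norm = math.sqrt(sum(a * a for a in d))
    return [eps * a / d_norm for a in d]

def loss_increase(w, delta):
    # Change in loss after adding the corruption delta to the parameters.
    return loss([a + b for a, b in zip(w, delta)]) - loss(w)

w, eps = [1.0, 1.0], 0.1
rng = random.Random(0)
random_trials = [loss_increase(w, random_corruption(w, eps, rng)) for _ in range(200)]
print(loss_increase(w, gradient_corruption(w, eps)), max(random_trials))
```

Even in two dimensions the gradient-guided corruption matches or exceeds the best of many random trials. In high-dimensional networks random directions are far less likely to align with the gradient, which is consistent with the abstract's remark that random trials can hardly induce the worst degradation.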

Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning

Jun 03, 2020

Fenglin Liu, Xuancheng Ren, Guangxiang Zhao, Xu Sun, Liangyou Li

Abstract:In sequence-to-sequence learning, the attention mechanism has been a great success in bridging the information between the encoder and the decoder. However, it is often overlooked that the decoder obtains only a single view of the source sequences, i.e., the representations generated by the last encoder layer. Although those representations are supposed to be a comprehensive, global view of source sequences, such practice keeps the decoders from concrete, fine-grained source information generated by other encoder layers. In this work, we propose to encourage the decoder to take full advantage of the multi-level source representations for layer-wise cross-view decoding. Concretely, different views of the source sequences are presented to different decoder layers and multiple strategies are explored to route the source representations. In particular, the granularity consistent attention (GCA) strategy proves the most efficient and effective in the experiments on the neural machine translation task, surpassing the previous state-of-the-art architecture on three benchmark datasets.

* Achieves state-of-the-art BLEU scores on WMT14 EN-DE, EN-FR, and IWSLT DE-EN datasets

Exploring and Distilling Cross-Modal Information for Image Captioning

Mar 15, 2020

Fenglin Liu, Xuancheng Ren, Yuanxin Liu, Kai Lei, Xu Sun

Abstract:Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet there is still great difficulty for the current methods to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. Based on the Transformer, to perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our Transformer-based model achieves a CIDEr score of 129.3 in offline COCO evaluation on the COCO testing set with remarkable efficiency in terms of accuracy, speed, and parameter budget.

* Accepted by IJCAI 2019
