Hai Zhao

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

Forging Multiple Training Objectives for Pre-trained Language Models via Meta-Learning

Oct 19, 2022
Hongqiu Wu, Ruixue Ding, Hai Zhao, Boli Chen, Pengjun Xie, Fei Huang, Min Zhang

Multiple pre-training objectives compensate for the limited understanding capability of single-objective language modeling, serving the ultimate purpose of pre-trained language models (PrLMs): generalizing well across a wide range of scenarios. However, learning multiple training objectives in a single model is challenging because their relative significance is unknown and they may conflict with one another. Empirical studies have shown that the current practice of sampling objectives in an ad-hoc manual setting leaves the learned language representation far from the desired optimum. We therefore propose MOMETAS, a novel adaptive sampler based on meta-learning that learns a latent sampling pattern over arbitrary pre-training objectives. The design is lightweight and adds negligible training overhead. To validate our approach, we adopt five objectives and conduct continual pre-training with BERT-base and BERT-large models, where MOMETAS shows consistent performance gains over rule-based sampling strategies on 14 natural language processing tasks.
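The abstract leaves the meta-learning details to the paper, so the following PyTorch snippet is only a hypothetical sketch of an adaptive objective sampler: a categorical distribution over objectives is sampled at each step and nudged by a scalar reward, such as the drop in a held-out meta-validation loss. The class name, reward definition, and REINFORCE-style update are assumptions, not MOMETAS's exact algorithm.

```python
# Hypothetical sketch of an adaptive objective sampler (assumed names; not the
# paper's exact MOMETAS algorithm).
import torch

class AdaptiveObjectiveSampler:
    def __init__(self, num_objectives: int, lr: float = 0.1):
        self.logits = torch.zeros(num_objectives, requires_grad=True)
        self.opt = torch.optim.Adam([self.logits], lr=lr)

    def sample(self) -> int:
        # pick the objective for the next pre-training step
        probs = torch.softmax(self.logits, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()

    def update(self, objective_idx: int, reward: float) -> None:
        # REINFORCE-style update: raise the probability of objectives
        # whose training steps improved the meta-validation loss
        probs = torch.softmax(self.logits, dim=-1)
        loss = -reward * torch.log(probs[objective_idx])
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```

In use, each pre-training step would draw an objective with sample(), run one gradient update with it, and then call update() with the observed reward.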

* EMNLP 2022 (findings) 

Sentence Representation Learning with Generative Objective rather than Contrastive Objective

Oct 16, 2022
Bohong Wu, Hai Zhao

Though offering impressive contextualized token-level representations, current pre-trained language models pay less attention to accurately acquiring sentence-level representations during their self-supervised pre-training. Meanwhile, the contrastive objectives that dominate current sentence representation learning offer little linguistic interpretability and no performance guarantee on downstream semantic tasks. We instead propose a novel generative self-supervised learning objective based on phrase reconstruction. To overcome the drawbacks of previous generative methods, we carefully model intra-sentence structure by breaking each sentence down into its important phrases. Empirical studies show that our generative learning delivers strong performance improvements and outperforms the current state-of-the-art contrastive methods not only on the STS benchmarks but also on downstream semantic retrieval and reranking tasks. Our code is available at https://github.com/chengzhipanpan/PaSeR.
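As a rough illustration of a phrase-reconstruction objective, the sketch below masks the tokens inside pre-extracted phrase spans and trains a masked-LM head to restore only those tokens. The function name, the assumption that `encoder` returns per-token vocabulary logits, and the use of an external phrase extractor for `phrase_spans` are illustrative choices, not the released PaSeR implementation.

```python
# Minimal sketch of a phrase-reconstruction loss (not the released PaSeR code).
import torch
import torch.nn.functional as F

def phrase_reconstruction_loss(encoder, input_ids, phrase_spans, mask_token_id):
    labels = input_ids.clone()
    masked = input_ids.clone()
    in_phrase = torch.zeros_like(input_ids, dtype=torch.bool)
    for start, end in phrase_spans:
        masked[:, start:end] = mask_token_id      # hide every token in the phrase
        in_phrase[:, start:end] = True
    labels[~in_phrase] = -100                     # reconstruct phrase tokens only
    logits = encoder(masked)                      # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```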

* Accepted by the Main Conference of EMNLP 2022, long paper. arXiv admin note: substantial text overlap with arXiv:2204.09358 

Towards End-to-End Open Conversational Machine Reading

Oct 13, 2022
Sizhe Zhou, Siru Ouyang, Zhuosheng Zhang, Hai Zhao

In the open-retrieval conversational machine reading (OR-CMR) task, machines must perform multi-turn question answering given a dialogue history and a textual knowledge base. Existing works generally use two independent modules for the problem's two successive sub-tasks: hard-label decision making, followed by question generation aided by various entailment reasoning methods. Such cascaded modeling is vulnerable to error propagation and prevents the two sub-tasks from being optimized consistently. In this work, we instead model OR-CMR as a unified text-to-text task in a fully end-to-end style. Experiments on the OR-ShARC dataset show that our end-to-end framework improves both sub-tasks by a large margin, achieving new state-of-the-art results. Further ablation studies show that our framework generalizes to different backbone models.
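To make the unified text-to-text formulation concrete, here is an illustrative sketch with a generic T5 backbone from Hugging Face Transformers: the dialogue history and retrieved rule text are flattened into one source string, and a single decoder emits either a decision or a follow-up question. The prompt layout and target strings are assumptions, not necessarily the paper's exact input format.

```python
# Illustrative text-to-text formulation with a generic T5 backbone.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

source = ("dialogue: Do I qualify for the benefit? | I work part time. "
          "rule: You qualify if you work fewer than 16 hours a week.")

# One decoder produces either a decision or a clarifying question,
# so both sub-tasks share the same parameters and training signal.
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```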

* 10 pages, 2 figures, 10 tables 

Task Compass: Scaling Multi-task Pre-training with Task Prefix

Oct 12, 2022
Zhuosheng Zhang, Shuohang Wang, Yichong Xu, Yuwei Fang, Wenhao Yu, Yang Liu, Hai Zhao, Chenguang Zhu, Michael Zeng

Leveraging task-aware annotated data as supervised signals to assist self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle this challenge, we propose a task-prefix-guided multi-task pre-training framework that explores the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but also work as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. Code is available at https://github.com/cooelf/CompassMTL.
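A minimal sketch of the task-prefix idea is shown below: each supervised example is tagged with a short prefix naming its source task before being fed to the shared model. The prefix strings and formatting are invented for this example and are not CompassMTL's actual setup.

```python
# Toy illustration of task-prefix tagging (assumed prefix format).
def add_task_prefix(text: str, task_name: str) -> str:
    # a short prefix tells the shared model which supervised task
    # the example belongs to, so it can learn task relationships
    return f"[{task_name}] {text}"

mixed_batch = [
    add_task_prefix("The movie was surprisingly good.", "sentiment"),
    add_task_prefix("A man plays a guitar. </s> Someone is making music.", "nli"),
    add_task_prefix("Where was the treaty signed?", "qa"),
]
```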

* Findings of EMNLP 2022 

Instance Regularization for Discriminative Language Model Pre-training

Oct 11, 2022
Zhuosheng Zhang, Hai Zhao, Ming Zhou

Discriminative pre-trained language models (PrLMs) can be generalized as denoising auto-encoders that work in two stages, ennoising and denoising. First, an ennoising process corrupts texts with arbitrary noising functions to construct training instances. Then, a denoising language model is trained to restore the corrupted tokens. Existing studies have made progress by optimizing independent strategies for either ennoising or denoising, but they treat training instances equally throughout training, paying little attention to the individual contribution of each instance. To model explicit signals of instance contribution, this work proposes estimating the complexity of restoring the original sentence from its corrupted version during language model pre-training. The estimation combines the corruption degree from the ennoising data construction process and the prediction confidence from the denoising counterpart. Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness. Code is publicly available at https://github.com/cooelf/InstanceReg.
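The sketch below, under stated assumptions, combines the two signals mentioned in the abstract, corruption degree and denoising confidence, into per-instance weights for the pre-training loss. The exact difficulty estimate and normalization in the paper may differ; the tensor shapes and the softmax weighting here are illustrative.

```python
# Rough sketch of instance-level weighting for a denoising PrLM (illustrative,
# not the paper's exact formulation).
import torch

def instance_weights(corruption_ratio, gold_log_probs, corrupted_mask):
    # corruption_ratio: (batch,)      fraction of corrupted tokens per instance
    # gold_log_probs:   (batch, seq)  model log-probability of each gold token
    # corrupted_mask:   (batch, seq)  1.0 where a token was corrupted, else 0.0
    confidence = (gold_log_probs * corrupted_mask).sum(-1) \
                 / corrupted_mask.sum(-1).clamp(min=1.0)
    difficulty = corruption_ratio - confidence   # more corruption, less confidence -> harder
    return torch.softmax(difficulty, dim=0)      # per-batch weights for the denoising loss
```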

* Accepted to EMNLP 2022 

Semantic-Preserving Adversarial Code Comprehension

Sep 12, 2022
Yiyang Li, Hongqiu Wu, Hai Zhao

Building on the tremendous success of pre-trained language models (PrLMs) for source code comprehension tasks, the current literature studies either ways to further improve the performance (generalization) of PrLMs or their robustness against adversarial attacks. However, these works compromise on the trade-off between the two aspects, and none of them considers improving both sides in an effective and practical way. To fill this gap, we propose Semantic-Preserving Adversarial Code Embeddings (SPACE), which finds worst-case semantic-preserving attacks while forcing the model to predict the correct labels under these worst cases. Experiments and analysis demonstrate that SPACE stays robust against state-of-the-art attacks while boosting the performance of PrLMs for code.
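A simplified, single-step sketch of embedding-space adversarial training in the spirit of SPACE is given below: a small perturbation on the input embeddings is pushed in the gradient direction that maximizes the loss, and the model is then trained to predict correctly under that perturbation. The norm constraint, step size, and the assumption that the model accepts `inputs_embeds` (as Hugging Face encoders do) are illustrative, not the paper's exact procedure.

```python
# Simplified single-step sketch of embedding-space adversarial training.
import torch

def adversarial_loss(model, embeds, labels, loss_fn, epsilon=1e-2):
    delta = torch.zeros_like(embeds, requires_grad=True)
    clean_loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
    grad, = torch.autograd.grad(clean_loss, delta)
    # move the perturbation in the direction that most increases the loss,
    # keeping it small so the code's semantics are (approximately) preserved
    delta = epsilon * grad / (grad.norm() + 1e-12)
    adv_loss = loss_fn(model(inputs_embeds=embeds + delta.detach()), labels)
    return adv_loss   # minimized jointly with the clean loss
```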

* Accepted by COLING 2022 

Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning

Aug 23, 2022
Letian Peng, Zuchao Li, Hai Zhao

Commonsense reasoning is an appealing topic in natural language processing (NLP) because it plays a fundamental role in supporting the human-like behavior of NLP systems. With large-scale language models as the backbone, unsupervised pre-training on massive corpora shows the potential to capture commonsense knowledge. Current pre-trained language model (PLM)-based reasoning follows the traditional practice of using the perplexity metric. However, commonsense reasoning requires more than probability evaluation, which is biased by word frequency. This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC). In detail, it works on PLMs pre-trained with the Replaced Token Detection (RTD) objective of ELECTRA, whose corruption-detection objective reflects confidence in contextual integrity, a signal more relevant to commonsense reasoning than raw probability. Our proposed method boosts zero-shot performance on two commonsense reasoning benchmarks and seven further commonsense question-answering datasets. Our analysis shows that pre-endowed commonsense knowledge, especially for RTD-based PLMs, is essential for downstream reasoning.
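As a hedged sketch of an NRC-style score, the snippet below runs ELECTRA's replaced-token-detection discriminator over a candidate sentence and averages the per-token confidence that each token is original (not replaced); the candidate whose filled-in sentence receives the highest score would be selected. Averaging is an assumption here and may differ from the paper's exact aggregation.

```python
# Sketch of an NRC-style score using ELECTRA's replaced-token-detection head.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

tok = ElectraTokenizer.from_pretrained("google/electra-base-discriminator")
rtd = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")

def nrc_score(sentence: str) -> float:
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = rtd(**inputs).logits       # positive logits mean "looks replaced"
    p_original = torch.sigmoid(-logits)     # confidence that each token is original
    return p_original.mean().item()

print(nrc_score("He opened the umbrella because it started to rain."))
```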

Learning Better Masking for Better Language Model Pre-training

Aug 23, 2022
Dongjie Yang, Zhuosheng Zhang, Hai Zhao

Masked Language Modeling (MLM) has been widely used as the denoising objective in pre-training language models (PrLMs). Existing PrLMs commonly adopt a random-token masking strategy in which a fixed masking ratio is applied and different contents are masked with equal probability throughout training. However, the model's pre-training status changes as training goes on, and a fixed masking setting may affect it in complex ways. In this paper, we show that such time-invariant MLM settings for the masking ratio and masked content are unlikely to deliver an optimal outcome, which motivates us to explore the influence of time-variant MLM settings. We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content across training stages, improving pre-training efficiency and effectiveness as verified on downstream tasks. Our work is a pioneering study of time-variant masking strategies for both ratio and content, and it offers a better understanding of how the masking ratio and masked content influence MLM pre-training.
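As a toy example of a time-variant setting, the function below linearly decays the masking ratio as pre-training progresses. The specific schedule shape and the start/end ratios are placeholders, not the schedules proposed in the paper.

```python
# Toy time-variant masking schedule (placeholder shape and ratios).
def masking_ratio(step: int, total_steps: int,
                  start: float = 0.30, end: float = 0.15) -> float:
    # decay the masking ratio as pre-training progresses, so later
    # stages see less heavily corrupted inputs
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

print(masking_ratio(0, 100_000), masking_ratio(100_000, 100_000))  # 0.30 0.15
```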

Adversarial Self-Attention for Language Understanding

Jun 25, 2022
Hongqiu Wu, Hai Zhao

An ideal language system aims for high generalization and robustness when adapting to diverse scenarios. Unfortunately, recent pre-trained language models (PrLMs), the current great hope of the field, do little more than stack excessive parameters onto the over-parameterized Transformer architecture to achieve higher performance. This paper therefore proposes the Adversarial Self-Attention mechanism (ASA), which adversarially reconstructs the Transformer attention and facilitates model training under contaminated model structures, coupled with a fast and simple implementation for better PrLM building. We conduct a comprehensive evaluation across a wide range of tasks in both the pre-training and fine-tuning stages. For pre-training, ASA yields remarkable performance gains compared to regular training over longer periods. For fine-tuning, ASA-empowered models consistently outperform naive models by a large margin in terms of both generalization and robustness.
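The sketch below is a very rough, hypothetical rendering of the adversarial idea: an additive bias on the self-attention scores is moved in the direction that most increases the training loss, and the model is then run under that perturbed attention structure. ASA's actual construction of the adversarial structure is more involved than this single FGSM-style step.

```python
# Very rough, hypothetical sketch of adversarially biasing self-attention.
import torch
import torch.nn.functional as F

def adversarially_biased_attention(q, k, v, downstream_loss_fn, epsilon=1.0):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    bias = torch.zeros_like(scores, requires_grad=True)
    loss = downstream_loss_fn(F.softmax(scores + bias, dim=-1) @ v)
    grad, = torch.autograd.grad(loss, bias)
    # push attention toward the structure that hurts the model most,
    # then let the model learn to stay accurate under that structure
    adv_bias = epsilon * grad.sign()
    return F.softmax(scores + adv_bias, dim=-1) @ v
```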

Generative or Contrastive? Phrase Reconstruction for Better Sentence Representation Learning

Apr 20, 2022
Bohong Wu, Hai Zhao

Though offering impressive contextualized token-level representations, current pre-trained language models pay less attention to acquiring sentence-level representations during their self-supervised pre-training. If self-supervised learning is divided into two subcategories, generative and contrastive, most existing studies show that sentence representation learning benefits more from contrastive methods than from generative ones. However, contrastive learning is not well compatible with common token-level generative self-supervised learning and does not guarantee good performance on downstream semantic retrieval tasks. To alleviate these drawbacks, we instead propose a novel generative self-supervised learning objective based on phrase reconstruction. Empirical studies show that our generative learning yields sufficiently powerful sentence representations, achieving performance on Semantic Textual Similarity (STS) tasks on par with contrastive learning. Furthermore, in the unsupervised setting, our generative method outperforms the previous state-of-the-art SimCSE on downstream semantic retrieval benchmarks.
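Independently of whether the encoder is trained generatively or contrastively, sentence representations are typically scored on STS and retrieval tasks with cosine similarity between pooled embeddings; the small helper below shows that evaluation step, assuming a hypothetical `encode` function that returns a fixed-size sentence vector.

```python
# Generic STS / retrieval scoring helper; `encode` is a hypothetical function
# returning a fixed-size sentence vector (e.g. mean-pooled hidden states).
import torch.nn.functional as F

def sts_score(encode, sentence_a: str, sentence_b: str) -> float:
    a, b = encode(sentence_a), encode(sentence_b)
    return F.cosine_similarity(a, b, dim=-1).item()
```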

* Preprint 