Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lemao Liu

DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Oct 30, 2023

Jiahao Xu, Wei Shao, Lihui Chen, Lemao Liu

Figure 1 for DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Figure 2 for DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Figure 3 for DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Figure 4 for DistillCSE: Distilled Contrastive Learning for Sentence Embeddings

Abstract:This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation. The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation. However, the vanilla DistillCSE through the standard implementation of knowledge distillation only achieves marginal improvements due to severe overfitting. The further quantitative analyses demonstrate the reason that the standard knowledge distillation exhibits a relatively large variance of the teacher model's logits due to the essence of contrastive learning. To mitigate the issue induced by high variance, this paper accordingly proposed two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization and the averaging logits from multiple teacher components. Experiments on standard benchmarks demonstrate that the proposed DistillCSE outperforms many strong baseline methods and yields a new state-of-the-art performance.

Via

Access Paper or Ask Questions

Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Oct 24, 2023

Xingyu Chen, Lemao Liu, Guoping Huang, Zhirui Zhang, Mingming Yang, Shuming Shi, Rui Wang

Figure 1 for Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Figure 2 for Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Figure 3 for Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Figure 4 for Rethinking Word-Level Auto-Completion in Computer-Aided Translation

Abstract:Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted Translation. It aims at providing word-level auto-completion suggestions for human translators. While previous studies have primarily focused on designing complex model architectures, this paper takes a different perspective by rethinking the fundamental question: what kind of words are good auto-completions? We introduce a measurable criterion to answer this question and discover that existing WLAC models often fail to meet this criterion. Building upon this observation, we propose an effective approach to enhance WLAC performance by promoting adherence to the criterion. Notably, the proposed approach is general and can be applied to various encoder-based architectures. Through extensive experiments, we demonstrate that our approach outperforms the top-performing system submitted to the WLAC shared tasks in WMT2022, while utilizing significantly smaller model sizes.

* EMNLP2023

Via

Access Paper or Ask Questions

On Synthetic Data for Back Translation

Oct 20, 2023

Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang, Shuming Shi, Lihui Chen, Lemao Liu

Figure 1 for On Synthetic Data for Back Translation

Figure 2 for On Synthetic Data for Back Translation

Figure 3 for On Synthetic Data for Back Translation

Figure 4 for On Synthetic Data for Back Translation

Abstract:Back translation (BT) is one of the most significant technologies in NMT research fields. Existing attempts on BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model but seldom work studies the role of synthetic data in the performance of BT. This motivates us to ask a fundamental question: {\em what kind of synthetic data contributes to BT performance?} Through both theoretical and empirical studies, we identify two key factors on synthetic data controlling the back-translation NMT performance, which are quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data to better trade off both factors so as to yield a better performance for BT. We run extensive experiments on WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam and sampling based methods for data generation), which proves the effectiveness of our proposed methods.

Via

Access Paper or Ask Questions

Towards General Error Diagnosis via Behavioral Testing in Machine Translation

Oct 20, 2023

Junjie Wu, Lemao Liu, Dit-Yan Yeung

Abstract:Behavioral testing offers a crucial means of diagnosing linguistic errors and assessing capabilities of NLP models. However, applying behavioral testing to machine translation (MT) systems is challenging as it generally requires human efforts to craft references for evaluating the translation quality of such systems on newly generated test cases. Existing works in behavioral testing of MT systems circumvent this by evaluating translation quality without references, but this restricts diagnosis to specific types of errors, such as incorrect translation of single numeric or currency words. In order to diagnose general errors, this paper proposes a new Bilingual Translation Pair Generation based Behavior Testing (BTPGBT) framework for conducting behavioral testing of MT systems. The core idea of BTPGBT is to employ a novel bilingual translation pair generation (BTPG) approach that automates the construction of high-quality test cases and their pseudoreferences. Experimental results on various MT systems demonstrate that BTPGBT could provide comprehensive and accurate behavioral testing results for general error diagnosis, which further leads to several insightful findings. Our code and data are available at https: //github.com/wujunjie1998/BTPGBT.

* 15 pages, 2 figures, accepted by Findings of EMNLP 2023

Via

Access Paper or Ask Questions

IMTLab: An Open-Source Platform for Building, Evaluating, and Diagnosing Interactive Machine Translation Systems

Oct 17, 2023

Xu Huang, Zhirui Zhang, Ruize Gao, Yichao Du, Lemao Liu, Gouping Huang, Shuming Shi, Jiajun Chen, Shujian Huang

Abstract:We present IMTLab, an open-source end-to-end interactive machine translation (IMT) system platform that enables researchers to quickly build IMT systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. IMTLab treats the whole interactive translation process as a task-oriented dialogue with a human-in-the-loop setting, in which human interventions can be explicitly incorporated to produce high-quality, error-free translations. To this end, a general communication interface is designed to support the flexible IMT architectures and user policies. Based on the proposed design, we construct a simulated and real interactive environment to achieve end-to-end evaluation and leverage the framework to systematically evaluate previous IMT systems. Our simulated and manual experiments show that the prefix-constrained decoding approach still gains the lowest editing cost in the end-to-end evaluation, while BiTIIMT achieves comparable editing cost with a better interactive experience.

* Accepted by EMNLP2023

Via

Access Paper or Ask Questions

Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Oct 16, 2023

Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, Yixuan Su

Figure 1 for Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Figure 2 for Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Figure 3 for Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Figure 4 for Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective

Abstract:There are a number of diverging hypotheses about the neural text degeneration problem, i.e., generating repetitive and dull loops, which makes this problem both interesting and confusing. In this work, we aim to advance our understanding by presenting a straightforward and fundamental explanation from the data perspective. Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data. Subsequent experiments also demonstrate that by selectively dropping out the attention to repetitive words in training data, degeneration can be significantly minimized. Furthermore, our empirical analysis illustrates that prior works addressing the degeneration issue from various standpoints, such as the high-inflow words, the likelihood objective, and the self-reinforcement phenomenon, can be interpreted by one simple explanation. That is, penalizing the repetitions in training data is a common and fundamental factor for their effectiveness. Moreover, our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.

* Accepted to NeurIPS 2023

Via

Access Paper or Ask Questions

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

Sep 19, 2023

Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi

Abstract:Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

* work in progress. https://textbind.github.io/

Via

Access Paper or Ask Questions

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Sep 03, 2023

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen(+5 more)

Figure 1 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 2 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 3 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 4 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Abstract:While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

* work in progress; 32 pages

Via

Access Paper or Ask Questions

Rethinking Translation Memory Augmented Neural Machine Translation

Jun 12, 2023

Hongkun Hao, Guoping Huang, Lemao Liu, Zhirui Zhang, Shuming Shi, Rui Wang

Figure 1 for Rethinking Translation Memory Augmented Neural Machine Translation

Figure 2 for Rethinking Translation Memory Augmented Neural Machine Translation

Figure 3 for Rethinking Translation Memory Augmented Neural Machine Translation

Figure 4 for Rethinking Translation Memory Augmented Neural Machine Translation

Abstract:This paper rethinks translation memory augmented neural machine translation (TM-augmented NMT) from two perspectives, i.e., a probabilistic view of retrieval and the variance-bias decomposition principle. The finding demonstrates that TM-augmented NMT is good at the ability of fitting data (i.e., lower bias) but is more sensitive to the fluctuations in the training data (i.e., higher variance), which provides an explanation to a recently reported contradictory phenomenon on the same translation task: TM-augmented NMT substantially advances vanilla NMT under the high-resource scenario whereas it fails under the low-resource scenario. Then we propose a simple yet effective TM-augmented NMT model to promote the variance and address the contradictory phenomenon. Extensive experiments show that the proposed TM-augmented NMT achieves consistent gains over both conventional NMT and existing TM-augmented NMT under two variance-preferable (low-resource and plug-and-play) scenarios as well as the high-resource scenario.

* 15 pages, 2 figures, accepted by ACL2023 findings

Via

Access Paper or Ask Questions

Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model

Jun 04, 2023

Lingfeng Shen, Haiyun Jiang, Lemao Liu, Shuming Shi

Abstract:Sentence embedding is one of the most fundamental tasks in Natural Language Processing and plays an important role in various tasks. The recent breakthrough in sentence embedding is achieved by pre-trained language models (PLMs). Despite its success, an embedded vector (Sen2Vec) representing a point estimate does not naturally express uncertainty in a taskagnostic way. This paper thereby proposes an efficient framework on probabilistic sentence embedding (Sen2Pro) from PLMs, and it represents a sentence as a probability density distribution in an embedding space to reflect both model uncertainty and data uncertainty (i.e., many-to-one nature) in the sentence representation. The proposed framework performs in a plug-and-play way without retraining PLMs anymore, and it is easy to implement and generally applied on top of any PLM. The superiority of Sen2Pro over Sen2Vec has been theoretically verified and practically illustrated on different NLP tasks.

* Accepted to ACL2023 workshop Rep4NLP

Via

Access Paper or Ask Questions