Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shuming Shi

RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Oct 11, 2023

Yue Zhang, Leyang Cui, Enbo Zhao, Wei Bi, Shuming Shi

Figure 1 for RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Figure 2 for RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Figure 3 for RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Figure 4 for RobustGEC: Robust Grammatical Error Correction Against Subtle Context Perturbation

Abstract:Grammatical Error Correction (GEC) systems play a vital role in assisting people with their daily writing tasks. However, users may sometimes come across a GEC system that initially performs well but fails to correct errors when the inputs are slightly modified. To ensure an ideal user experience, a reliable GEC system should have the ability to provide consistent and accurate suggestions when encountering irrelevant context perturbations, which we refer to as context robustness. In this paper, we introduce RobustGEC, a benchmark designed to evaluate the context robustness of GEC systems. RobustGEC comprises 5,000 GEC cases, each with one original error-correct sentence pair and five variants carefully devised by human annotators. Utilizing RobustGEC, we reveal that state-of-the-art GEC systems still lack sufficient robustness against context perturbations. In addition, we propose a simple yet effective method for remitting this issue.

* Accepted to EMNLP 2023 (main conference, long paper)

Via

Access Paper or Ask Questions

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

Sep 19, 2023

Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi

Abstract:Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

* work in progress. https://textbind.github.io/

Via

Access Paper or Ask Questions

A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Sep 17, 2023

Yi Chen, Haiyun Jiang, Wei Bi, Rui Wang, Longyue Wang, Shuming Shi, Ruifeng Xu

Figure 1 for A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Figure 2 for A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Figure 3 for A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Figure 4 for A Benchmark for Text Expansion: Datasets, Metrics, and Baselines

Abstract:This work presents a new task of Text Expansion (TE), which aims to insert fine-grained modifiers into proper locations of the plain text to concretize or vivify human writings. Different from existing insertion-based writing assistance tasks, TE requires the model to be more flexible in both locating and generation, and also more cautious in keeping basic semantics. We leverage four complementary approaches to construct a dataset with 12 million automatically generated instances and 2K human-annotated references for both English and Chinese. To facilitate automatic evaluation, we design various metrics from multiple perspectives. In particular, we propose Info-Gain to effectively measure the informativeness of expansions, which is an important quality dimension in TE. On top of a pre-trained text-infilling model, we build both pipelined and joint Locate&Infill models, which demonstrate the superiority over the Text2Text baselines, especially in expansion informativeness. Experiments verify the feasibility of the TE task and point out potential directions for future research toward better automatic text expansion.

Via

Access Paper or Ask Questions

TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design

Sep 11, 2023

Yongrui Chen, Haiyun Jiang, Xinting Huang, Shuming Shi, Guilin Qi

Abstract:High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited by unrealistic manual labeling costs or by the hallucination of relying solely on LLM generation. To address the problems, this paper presents a scalable method to automatically collect high-quality instructional adaptation data by training language models to automatically design tasks based on human-written texts. Intuitively, human-written text helps to help the model attenuate illusions during the generation of tasks. Unlike instruction back-translation-based methods that directly take the given text as a response, we require the model to generate the \textit{instruction}, \textit{input}, and \textit{output} simultaneously to filter the noise. The results of the automated and manual evaluation experiments demonstrate the quality of our dataset.

* Work in progress

Via

Access Paper or Ask Questions

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Sep 03, 2023

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen(+5 more)

Figure 1 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 2 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 3 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Figure 4 for Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Abstract:While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

* work in progress; 32 pages

Via

Access Paper or Ask Questions

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Aug 12, 2023

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu

Figure 1 for GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Figure 2 for GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Figure 3 for GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Figure 4 for GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

Abstract:Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.

* 13 pages, 4 figures, 9 tables

Via

Access Paper or Ask Questions

Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Jul 22, 2023

Longyue Wang, Zefeng Du, Donghuai Liu, Deng Cai, Dian Yu, Haiyun Jiang, Yan Wang, Leyang Cui, Shuming Shi, Zhaopeng Tu

Figure 1 for Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Figure 2 for Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Figure 3 for Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Figure 4 for Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Abstract:Modeling discourse -- the linguistic phenomena that go beyond individual sentences, is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on the evaluation of inter-sentence properties and overlook critical discourse phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena (e.g. cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge. We totally evaluate 20 general-, in-domain and commercial models based on Transformer, advanced pretraining architectures and large language models (LLMs). Our results show (1) the challenge and necessity of our evaluation benchmark; (2) fine-grained pretraining based on literary document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field: https://github.com/longyuewangdcu/Disco-Bench.

* Zhaopeng Tu is the corresponding author

Via

Access Paper or Ask Questions

On the Cultural Gap in Text-to-Image Generation

Jul 06, 2023

Bingshuai Liu, Longyue Wang, Chenyang Lyu, Yong Zhang, Jinsong Su, Shuming Shi, Zhaopeng Tu

Figure 1 for On the Cultural Gap in Text-to-Image Generation

Figure 2 for On the Cultural Gap in Text-to-Image Generation

Figure 3 for On the Cultural Gap in Text-to-Image Generation

Figure 4 for On the Cultural Gap in Text-to-Image Generation

Abstract:One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data, which signifies the disparity in generated image quality when the cultural elements of the input text are rarely collected in the training set. Although various T2I models have shown impressive but arbitrary examples, there is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. To bridge the gap, we propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture. By analyzing the flawed images generated by the Stable Diffusion model on the C3 benchmark, we find that the model often fails to generate certain cultural objects. Accordingly, we propose a novel multi-modal metric that considers object-text alignment to filter the fine-tuning data in the target culture, which is used to fine-tune a T2I model to improve cross-cultural generation. Experimental results show that our multi-modal metric provides stronger data selection performance on the C3 benchmark than existing metrics, in which the object-text alignment is crucial. We release the benchmark, data, code, and generated images to facilitate future research on culturally diverse T2I generation (https://github.com/longyuewangdcu/C3-Bench).

* Equal contribution: Bingshuai Liu and Longyue Wang. Work done while Bingshuai Liu and Chengyang Lyu were interning at Tencent AI Lab. Zhaopeng Tu is the corresponding author

Via

Access Paper or Ask Questions

SkillNet-X: A Multilingual Multitask Model with Sparsely Activated Skills

Jun 28, 2023

Zhangyin Feng, Yong Dai, Fan Zhang, Duyu Tang, Xiaocheng Feng, Shuangzhi Wu, Bing Qin, Yunbo Cao, Shuming Shi

Abstract:Traditional multitask learning methods basically can only exploit common knowledge in task- or language-wise, which lose either cross-language or cross-task knowledge. This paper proposes a general multilingual multitask model, named SkillNet-X, which enables a single model to tackle many different tasks from different languages. To this end, we define several language-specific skills and task-specific skills, each of which corresponds to a skill module. SkillNet-X sparsely activates parts of the skill modules which are relevant either to the target task or the target language. Acting as knowledge transit hubs, skill modules are capable of absorbing task-related knowledge and language-related knowledge consecutively. Based on Transformer, we modify the multi-head attention layer and the feed forward network layer to accommodate skill modules. We evaluate SkillNet-X on eleven natural language understanding datasets in four languages. Results show that SkillNet-X performs better than task-specific baselines and two multitask learning baselines (i.e., dense joint model and Mixture-of-Experts model). Furthermore, skill pre-training further improves the performance of SkillNet-X on almost all datasets. To investigate the generalization of our model, we conduct experiments on two new tasks and find that SkillNet-X significantly outperforms baselines.

Via

Access Paper or Ask Questions

Explicit Syntactic Guidance for Neural Text Generation

Jun 25, 2023

Yafu Li, Leyang Cui, Jianhao Yan, Yongjing Yin, Wei Bi, Shuming Shi, Yue Zhang

Figure 1 for Explicit Syntactic Guidance for Neural Text Generation

Figure 2 for Explicit Syntactic Guidance for Neural Text Generation

Figure 3 for Explicit Syntactic Guidance for Neural Text Generation

Figure 4 for Explicit Syntactic Guidance for Neural Text Generation

Abstract:Most existing text generation models follow the sequence-to-sequence paradigm. Generative Grammar suggests that humans generate natural language texts by learning language grammar. We propose a syntax-guided generation schema, which generates the sequence guided by a constituency parse tree in a top-down direction. The decoding process can be decomposed into two parts: (1) predicting the infilling texts for each constituent in the lexicalized syntax context given the source sentence; (2) mapping and expanding each constituent to construct the next-level syntax context. Accordingly, we propose a structural beam search method to find possible syntax structures hierarchically. Experiments on paraphrase generation and machine translation show that the proposed method outperforms autoregressive baselines, while also demonstrating effectiveness in terms of interpretability, controllability, and diversity.

* ACL 2023

Via

Access Paper or Ask Questions