Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shujian Huang

Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation

Feb 28, 2025

Xiang Geng, Zhejian Lai, Jiajun Chen, Hao Yang, Shujian Huang

Abstract:Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to the data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce ADSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. ADSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of word-level labels. ADSE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Specially, we underscore that the translation model can not annotate translations of itself accurately. Extensive experiments demonstrate that ADSQE outperforms SOTA baselines like COMET in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks.

Via

Access Paper or Ask Questions

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Feb 21, 2025

Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch

Figure 1 for Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Figure 2 for Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Figure 3 for Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Figure 4 for Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

Abstract:Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.

Via

Access Paper or Ask Questions

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Feb 11, 2025

Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan

Abstract:Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models(LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.

Via

Access Paper or Ask Questions

Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Jan 21, 2025

Wei Zou, Shujian Huang, Jiajun Chen

Figure 1 for Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Figure 2 for Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Figure 3 for Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Figure 4 for Extend Adversarial Policy Against Neural Machine Translation via Unknown Token

Abstract:Generating adversarial examples contributes to mainstream neural machine translation~(NMT) robustness. However, popular adversarial policies are apt for fixed tokenization, hindering its efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning~(RL), we propose the `DexChar policy' that introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.

* CCMT 2024
* accepted by CCMT 2024()

Via

Access Paper or Ask Questions

SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

Jan 07, 2025

Yuchun Fan, Yongyu Mu, Yilin Wang, Lei Huang, Junhao Ruan, Bei Li, Tong Xiao, Shujian Huang, Xiaocheng Feng, Jingbo Zhu

Figure 1 for SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

Figure 2 for SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

Figure 3 for SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

Figure 4 for SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment

Abstract:Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter and two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational resource computing and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our findings that the representation learning of languages is merely conducted in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs, achieving superior average performance than all strong baselines across 10 languages. Meanwhile, SLAM only involves one training stage, reducing training time by 4.1-11.9 compared to the two-stage method.

* Accepted by COLING 2025 (Oral)

Via

Access Paper or Ask Questions

"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Dec 26, 2024

Jiawei Yu, Xiang Geng, Yuang Li, Mengxin Ren, Wei Tang, Jiahuan Li, Zhibin Lan, Min Zhang, Hao Yang, Shujian Huang(+1 more)

Figure 1 for "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Figure 2 for "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Figure 3 for "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Figure 4 for "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Abstract:Spoken named entity recognition (NER) aims to identify named entities from speech, playing an important role in speech processing. New named entities appear every day, however, annotating their Spoken NER data is costly. In this paper, we demonstrate that existing Spoken NER systems perform poorly when dealing with previously unseen named entities. To tackle this challenge, we propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs. Specifically, we first use a large language model (LLM) to generate sentences from the sampled named entities and then use a text-to-speech (TTS) system to generate the speech. Furthermore, we introduce a noise metric to filter out noisy data. To evaluate our approach, we release a novel Spoken NER benchmark along with a corresponding NED containing 8,853 entities. Experiment results show that our method achieves state-of-the-art (SOTA) performance in the in-domain, zero-shot domain adaptation, and fully zero-shot settings. Our data will be available at https://github.com/DeepLearnXMU/HeardU.

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Dec 19, 2024

Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang

Figure 1 for Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Figure 2 for Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Figure 3 for Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Figure 4 for Self-Evolution Knowledge Distillation for LLM-based Machine Translation

Abstract:Knowledge distillation (KD) has shown great promise in transferring knowledge from larger teacher models to smaller student models. However, existing KD strategies for large language models often minimize output distributions between student and teacher models indiscriminately for each token. This overlooks the imbalanced nature of tokens and their varying transfer difficulties. In response, we propose a distillation strategy called Self-Evolution KD. The core of this approach involves dynamically integrating teacher distribution and one-hot distribution of ground truth into the student distribution as prior knowledge, which promotes the distillation process. It adjusts the ratio of prior knowledge based on token learning difficulty, fully leveraging the teacher model's potential. Experimental results show our method brings an average improvement of approximately 1.4 SacreBLEU points across four translation directions in the WMT22 test sets. Further analysis indicates that the improvement comes from better knowledge transfer from teachers, confirming our hypothesis.

* COLING 2025

Via

Access Paper or Ask Questions

DPLM-2: A Multimodal Diffusion Protein Language Model

Oct 17, 2024

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu

Figure 1 for DPLM-2: A Multimodal Diffusion Protein Language Model

Figure 2 for DPLM-2: A Multimodal Diffusion Protein Language Model

Figure 3 for DPLM-2: A Multimodal Diffusion Protein Language Model

Figure 4 for DPLM-2: A Multimodal Diffusion Protein Language Model

Abstract:Proteins are essential macromolecules defined by their amino acid sequences, which determine their three-dimensional structures and, consequently, their functions in all living organisms. Therefore, generative protein modeling necessitates a multimodal approach to simultaneously model, understand, and generate both sequences and structures. However, existing methods typically use separate models for each modality, limiting their ability to capture the intricate relationships between sequence and structure. This results in suboptimal performance in tasks that requires joint understanding and generation of both modalities. In this paper, we introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. To enable structural learning with the language model, 3D coordinates are converted to discrete tokens using a lookup-free quantization-based tokenizer. By training on both experimental and high-quality synthetic structures, DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. We also implement an efficient warm-up strategy to exploit the connection between large-scale evolutionary data and structural inductive biases from pre-trained sequence-based protein language models. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures eliminating the need for a two-stage generation approach. Moreover, DPLM-2 demonstrates competitive performance in various conditional generation tasks, including folding, inverse folding, and scaffolding with multimodal motif inputs, as well as providing structure-aware representations for predictive tasks.

Via

Access Paper or Ask Questions

Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Oct 07, 2024

Jiahuan Li, Yiqing Cao, Shujian Huang, Jiajun Chen

Figure 1 for Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Figure 2 for Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Figure 3 for Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Figure 4 for Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge

Abstract:Having been trained on massive pretraining data, large language models have shown excellent performance on many knowledge-intensive tasks. However, pretraining data tends to contain misleading and even conflicting information, and it is intriguing to understand how LLMs handle these noisy data during training. In this study, we systematically analyze LLMs' learning preferences for data with conflicting knowledge. We find that pretrained LLMs establish learning preferences similar to humans, i.e., preferences towards formal texts and texts with fewer spelling errors, resulting in faster learning and more favorable treatment of knowledge in data with such features when facing conflicts. This finding is generalizable across models and languages and is more evident in larger models. An in-depth analysis reveals that LLMs tend to trust data with features that signify consistency with the majority of data, and it is possible to instill new preferences and erase old ones by manipulating the degree of consistency with the majority data.

* accepted by EMNLP 2024, main conference

Via

Access Paper or Ask Questions

MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Aug 21, 2024

Hao Zhou, Zhijun Wang, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Weihua Luo, Jiajun Chen

Figure 1 for MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Figure 2 for MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Figure 3 for MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Figure 4 for MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Abstract:Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of the ability of original languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of post-pretraining, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing original parameters preserves original language knowledge while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at https://github.com/zjwang21/MoE-LPR.git.

Via

Access Paper or Ask Questions