Yang Feng

Alibaba Group

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

Nov 03, 2024

Langlin Huang, Mengyu Bu, Yang Feng

Abstract: Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages, enabling broad language scalability. However, byte-level tokenization results in sequences that are hard to interpret due to limited semantic information per byte. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. Nevertheless, variations in encoding rules across languages necessitate an adaptive approach for effective contextualization. To this end, we propose Adaptive MultiScale-Headed Attention (Ada-MSHA), adaptively selecting and mixing attention heads, which are treated as contextualization experts. This enhances the flexibility of contextualization scales and improves the potential to discover a better strategy than previous methods. Experimental results show that our method outperforms existing methods without extensive manual adjustment of hyper-parameters and surpasses subword-based models with fewer parameters on the Ted-59 dataset. Our code is available at https://github.com/ictnlp/MoCE.
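The byte-level tokenization described in this abstract can be illustrated with a minimal sketch. It assumes UTF-8 as the concrete encoding, and the helper names `byte_tokens` and `bytes_per_char` are hypothetical, not taken from the MoCE code. It shows why no input is ever out of vocabulary (every token is an integer in 0-255) and why the number of bytes per character varies across scripts.

```python
def byte_tokens(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer token (0-255) per byte."""
    return list(text.encode("utf-8"))


def bytes_per_char(text: str) -> list[tuple[str, int]]:
    """Report how many UTF-8 bytes each character occupies."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    # Escapes are used so the characters are unambiguous:
    # U+00E9 (e with acute), U+4E2D (CJK ideograph), U+1F600 (emoji).
    for sample in ["cat", "\u00e9", "\u4e2d", "\U0001F600"]:
        print(repr(sample), byte_tokens(sample))
    print(bytes_per_char("a\u00e9\u4e2d\U0001F600"))
```

A Latin letter takes one byte while a CJK character takes three, so a fixed-width local context covers different amounts of text depending on the language, which is the variation the abstract refers to.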

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Sep 10, 2024

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng

Abstract: Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.

* Preprint. Project: https://github.com/ictnlp/LLaMA-Omni

Agent-SiMT: Agent-assisted Simultaneous Machine Translation with Large Language Models

Jun 12, 2024

Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Simultaneous Machine Translation (SiMT) generates target translations while reading the source sentence. It relies on a policy to determine the optimal timing for reading sentences and generating translations. Existing SiMT methods generally adopt the traditional Transformer architecture, which concurrently determines the policy and generates translations. While they excel at determining policies, their translation performance is suboptimal. Conversely, Large Language Models (LLMs), trained on extensive corpora, possess superior generation capabilities, but it is difficult for them to acquire a translation policy through the training methods of SiMT. Therefore, we introduce Agent-SiMT, a framework combining the strengths of LLMs and traditional SiMT methods. Agent-SiMT contains the policy-decision agent and the translation agent. The policy-decision agent is managed by a SiMT model, which determines the translation policy using the partial source sentence and translation. The translation agent, leveraging an LLM, generates the translation based on the partial source sentence. The two agents collaborate to accomplish SiMT. Experiments demonstrate that Agent-SiMT attains state-of-the-art performance.

* 18 pages, 8 figures, 7 tables. v2 of arXiv:2402.13036
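The split between a policy that decides when to read and a component that generates the translation, as described in the abstract above, can be sketched as a generic read/write loop. This is an illustrative sketch only: `simultaneous_decode`, `wait_k`, and `toy_step` are hypothetical names, the wait-k rule is a well-known baseline policy used here as a stand-in (it is not Agent-SiMT's learned policy), and the toy translation step simply uppercases the aligned source token in place of an LLM.

```python
from typing import Callable, List, Tuple

Policy = Callable[[int, int], str]               # (num_read, num_written) -> "READ" | "WRITE"
Step = Callable[[List[str], List[str]], str]     # (source_prefix, target_prefix) -> next token


def simultaneous_decode(source: List[str], policy: Policy, translate_step: Step,
                        max_len: int) -> Tuple[List[str], List[str]]:
    """Interleave READ and WRITE actions until max_len target tokens are produced."""
    num_read, target, actions = 0, [], []
    while len(target) < max_len:
        # READ only while source tokens remain; otherwise the loop must WRITE.
        if num_read < len(source) and policy(num_read, len(target)) == "READ":
            num_read += 1
            actions.append("R")
        else:
            target.append(translate_step(source[:num_read], target))
            actions.append("W")
    return target, actions


def wait_k(k: int) -> Policy:
    """Stand-in policy: stay k source tokens ahead of the target."""
    def policy(num_read: int, num_written: int) -> str:
        return "READ" if num_read < k + num_written else "WRITE"
    return policy


def toy_step(source_prefix: List[str], target_prefix: List[str]) -> str:
    """Toy 'translation': uppercase the source token aligned with the next target slot."""
    return source_prefix[len(target_prefix)].upper()


if __name__ == "__main__":
    out, acts = simultaneous_decode(["a", "b", "c", "d"], wait_k(2), toy_step, max_len=4)
    print(out, "".join(acts))
```

Keeping `policy` and `translate_step` as separate callables mirrors the two-agent division in the abstract: either one can be swapped without touching the loop.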

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Jun 11, 2024

Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when parallel speech data is available, ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed. When there is no parallel speech data, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.

* ACL 2024 main conference. Project Page: https://ictnlp.github.io/ComSpeech-Site/

CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Jun 11, 2024

Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng

Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to 26.81× decoding speedup.

* ACL 2024 Findings

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Jun 11, 2024

Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.

* ACL 2024; Code and demos are at https://github.com/ictnlp/NAST-S2x
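The abstract above mentions that the decoder may emit blank or repeated tokens that are resolved by CTC decoding. A minimal sketch of the standard CTC collapse rule follows (merge consecutive repeats, then drop blanks). The function name `ctc_collapse` and the blank symbol `_` are assumptions for illustration, and this is not the NAST-S2X implementation.

```python
from itertools import groupby
from typing import List

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frames: List[str], blank: str = BLANK) -> List[str]:
    """Apply the CTC collapse rule: merge consecutive repeats, then remove blanks."""
    merged = [token for token, _ in groupby(frames)]
    return [token for token in merged if token != blank]


if __name__ == "__main__":
    # A blank between the two "l" runs keeps them as separate output tokens.
    print("".join(ctc_collapse(list("hh_e_ll_lo"))))
    # A chunk that emits only blanks contributes no output tokens.
    print(ctc_collapse(["_", "_", "_"]))
```

Under this rule, a chunk whose positions are all blank produces nothing, which is one way a chunk-level decoder can postpone output until more speech has arrived.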

Decoder-only Streaming Transformer for Simultaneous Translation

Jun 06, 2024

Shoutao Guo, Shaolei Zhang, Yang Feng

Abstract: Simultaneous Machine Translation (SiMT) generates translation while reading source tokens, essentially producing the target prefix based on the source prefix. To achieve good performance, it leverages the relationship between source and target prefixes to extract a policy to guide the generation of translations. Although existing SiMT methods primarily focus on the Encoder-Decoder architecture, we explore the potential of Decoder-only architecture, owing to its superior performance in various tasks and its inherent compatibility with SiMT. However, directly applying the Decoder-only architecture to SiMT poses challenges in terms of training and inference. To alleviate the above problems, we propose the first Decoder-only SiMT model, named Decoder-only Streaming Transformer (DST). Specifically, DST separately encodes the positions of the source and target prefixes, ensuring that the position of the target prefix remains unaffected by the expansion of the source prefix. Furthermore, we propose a Streaming Self-Attention (SSA) mechanism tailored for the Decoder-only architecture. It is capable of obtaining a translation policy by assessing the sufficiency of input source information and integrating with the soft-attention mechanism to generate translations. Experiments demonstrate that our approach achieves state-of-the-art performance on three translation tasks.

* Accepted to ACL 2024. 14 pages, 10 Tables, 5 Figures

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Jun 05, 2024

Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng

Abstract: Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a. streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an "All-in-One" seamless model. Experiments on the CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech is able to present high-quality intermediate results (i.e., ASR or translation results) during the simultaneous translation process, offering a more comprehensive real-time communication experience.

* Accepted to ACL 2024 main conference, Project Page: https://ictnlp.github.io/StreamSpeech-site/

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

May 29, 2024

Langlin Huang, Yang Feng

Abstract: Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across different languages spreads to the vocabulary, degrading translations involving low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualizing scope based on the input. Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found at https://github.com/ictnlp/Multiscale-Contextualization.

* Accepted by ACL 2024 Findings

Learnable Privacy Neurons Localization in Language Models

May 16, 2024

Ruizhe Chen, Tianxiang Hu, Yang Feng, Zuozhu Liu

Abstract: Concerns that Large Language Models (LLMs) memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate the privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize specific neurons that account for the memorization of PII in LLMs through adversarial training. Our investigations reveal that PII is memorized by a small subset of neurons across all layers, which exhibit PII specificity. Furthermore, we propose validating the potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.

* ACL 2024 main conference
