Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tong Liu

Sherman

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Apr 21, 2025

Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal(+7 more)

Figure 1 for MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Figure 2 for MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Figure 3 for MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Figure 4 for MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Abstract:Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.

Via

Access Paper or Ask Questions

Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Mar 14, 2025

Yingjie Zhang, Tong Liu, Zhe Zhao, Guozhu Meng, Kai Chen

Figure 1 for Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Figure 2 for Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Figure 3 for Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Figure 4 for Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

Abstract:Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses. These attacks exploit LLMs' difficulty in dynamically detecting harmful intents during the generation process. Traditional safety alignment methods, often relying on the initial few generation steps, are ineffective due to limited computational budget. This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content, significantly improving both the computational budget and effectiveness of mitigating harmful generation. Our approach uses a hybrid loss function operating on hidden states to directly improve LLMs' inherent awareness of toxity during generation. Furthermore, we redefine safe responses by generating semantically relevant answers to harmful queries, thereby increasing robustness against representation-mutation attacks. Evaluations across multiple LLMs demonstrate state-of-the-art defense performance against six different attack types, reducing Attack Success Rates by up to two orders of magnitude compared to previous state-of-the-art defense while preserving utility. This work advances LLM safety by addressing limitations of conventional alignment through dynamic, context-aware mitigation.

Via

Access Paper or Ask Questions

AVD2: Accident Video Diffusion for Accident Video Description

Feb 21, 2025

Cheng Li, Keyuan Zhou, Tong Liu, Yu Wang, Mingqiao Zhuang, Huan-ang Gao, Bu Jin, Hao Zhao

Abstract:Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and responses. Nonetheless, prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures due to the paucity of training data specific to accident scenarios. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating accident videos that aligned with detailed natural language descriptions and reasoning, resulting in the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset. Empirical results reveal that the integration of the EMM-AU dataset establishes state-of-the-art performance across both automated metrics and human evaluations, markedly advancing the domains of accident analysis and prevention. Project resources are available at https://an-answer-tree.github.io

* ICRA 2025, Project Page: https://an-answer-tree.github.io/

Via

Access Paper or Ask Questions

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Feb 21, 2025

Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp

Abstract:Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos-videos are treated as a sequence of visual events-remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

* Short paper (5 pages)

Via

Access Paper or Ask Questions

Multi-Class Traffic Assignment using Multi-View Heterogeneous Graph Attention Networks

Jan 15, 2025

Tong Liu, Hadi Meidani

Abstract:Solving traffic assignment problem for large networks is computationally challenging when conventional optimization-based methods are used. In our research, we develop an innovative surrogate model for a traffic assignment when multi-class vehicles are involved. We do so by employing heterogeneous graph neural networks which use a multiple-view graph attention mechanism tailored to different vehicle classes, along with additional links connecting origin-destination pairs. We also integrate the node-based flow conservation law into the loss function. As a result, our model adheres to flow conservation while delivering highly accurate predictions for link flows and utilization ratios. Through numerical experiments conducted on urban transportation networks, we demonstrate that our model surpasses traditional neural network approaches in convergence speed and predictive accuracy in both user equilibrium and system optimal versions of traffic assignment.

* 16 pages, 5 figures

Via

Access Paper or Ask Questions

FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Jan 11, 2025

Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp

Figure 1 for FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Figure 2 for FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Figure 3 for FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Figure 4 for FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings

Abstract:Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work~\citep{chen2024preference} empirically finds that DPO training \textit{rarely improves these misranked preference pairs}, despite its gradient emphasizing on these cases. We introduce FocalPO, a DPO variant that instead \textit{down-weighs} misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiment demonstrates that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B. Additionally, we empirically reveals how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.

Via

Access Paper or Ask Questions

All-domain Moveline Evolution Network for Click-Through Rate Prediction

Nov 18, 2024

Chen Gao, Zixin Zhao, Lv Shao, Tong Liu

Abstract:E-commerce app users exhibit behaviors that are inherently logically consistent. A series of multi-scenario user behaviors interconnect to form the scene-level all-domain user moveline, which ultimately reveals the user's true intention. Traditional CTR prediction methods typically focus on the item-level interaction between the target item and the historically interacted items. However, the scene-level interaction between the target item and the user moveline remains underexplored. There are two challenges when modeling the interaction with preceding all-domain user moveline: (i) Heterogeneity between items and scenes: Unlike traditional user behavior sequences that utilize items as carriers, the user moveline utilizes scenes as carriers. The heterogeneity between items and scenes complicates the process of aligning interactions within a unified representation space. (ii) Temporal misalignment of linked scene-level and item-level behaviors: In the preceding user moveline with a fixed sampling length, certain critical scene-level behaviors are closely linked to subsequent item-level behaviors. However, it is impossible to establish a complete temporal alignment that clearly identifies which specific scene-level behaviors correspond to which item-level behaviors. To address these challenges and pioneer modeling user intent from the perspective of the all-domain moveline, we propose All-domain Moveline Evolution Network (AMEN). AMEN not only transfers interactions between items and scenes to homogeneous representation spaces, but also introduces a Temporal Sequential Pairwise (TSP) mechanism to understand the nuanced associations between scene-level and item-level behaviors, ensuring that the all-domain user moveline differentially influences CTR predictions for user's favored and unfavored items. Online A/B testing demonstrates that our method achieves a +11.6% increase in CTCVR.

Via

Access Paper or Ask Questions

Collaborative Contrastive Network for Click-Through Rate Prediction

Nov 18, 2024

Chen Gao, Zixin Zhao, Sihao Hu, Lv Shao, Tong Liu

Figure 1 for Collaborative Contrastive Network for Click-Through Rate Prediction

Figure 2 for Collaborative Contrastive Network for Click-Through Rate Prediction

Figure 3 for Collaborative Contrastive Network for Click-Through Rate Prediction

Figure 4 for Collaborative Contrastive Network for Click-Through Rate Prediction

Abstract:E-commerce platforms provide entrances for customers to enter mini-apps to meet their specific shopping needs. At the entrance of a mini-app, a trigger item recommended based on customers' historical preferences, is displayed to attract customers to enter the mini-app. Existing Click-Through Rate (CTR) prediction approaches have two significant weaknesses: (i) A portion of customer entries is driven by their interest in the mini-app itself rather than the trigger item. In such cases, approaches highly hinging on the trigger item tend to recommend similar items, thus misunderstanding the customers' real intention; (ii) Approaches that consider customers' intention toward mini-apps, require the regular existence of mini-apps for customers to cultivate routine shopping habits, making such approaches less robust for mini-apps that are available for only short periods (1 or 3 days) in Explosive Promotional Scenarios (EPS), such as the Black Friday and China's Double 11 Shopping Carnival. To address the above-mentioned issues, we introduce a more general and robust CTR prediction approach, dubbed Collaborative Contrastive Network (CCN). Given a user, CCN learns to identify two item clusters that can represent the user's interests and disinterests, via leveraging the collaborative relationship of co-click/co-non-click or the non-collaborative relationship of mono-click as the supervision signal for contrastive learning. This paradigm does not need to explicitly estimate user's binary entry intention and avoids amplifying the impact of the trigger item. Online A/B testing on large-scale real-world data demonstrates that CCN sets a new state-of-the-art performance on Taobao, boosting CTR by 12.3% and order volume by 12.7%.

Via

Access Paper or Ask Questions

Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

Oct 16, 2024

Tong Liu, Hadi Meidani

Figure 1 for Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

Figure 2 for Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

Figure 3 for Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

Figure 4 for Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models

Abstract:Supply chain networks are critical to the operational efficiency of industries, yet their increasing complexity presents significant challenges in mapping relationships and identifying the roles of various entities. Traditional methods for constructing supply chain networks rely heavily on structured datasets and manual data collection, limiting their scope and efficiency. In contrast, recent advancements in Natural Language Processing (NLP) and large language models (LLMs) offer new opportunities for discovering and analyzing supply chain networks using unstructured text data. This paper proposes a novel approach that leverages LLMs to extract and process raw textual information from publicly available sources to construct a comprehensive supply chain graph. We focus on the civil engineering sector as a case study, demonstrating how LLMs can uncover hidden relationships among companies, projects, and other entities. Additionally, we fine-tune an LLM to classify entities within the supply chain graph, providing detailed insights into their roles and relationships. The results show that domain-specific fine-tuning improves classification accuracy, highlighting the potential of LLMs for industry-specific supply chain analysis. Our contributions include the development of a supply chain graph for the civil engineering sector, as well as a fine-tuned LLM model that enhances entity classification and understanding of supply chain networks.

* 11 pages, 4 figures

Via

Access Paper or Ask Questions

Upcycling Large Language Models into Mixture of Experts

Oct 10, 2024

Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, Bryan Catanzaro

Figure 1 for Upcycling Large Language Models into Mixture of Experts

Figure 2 for Upcycling Large Language Models into Mixture of Experts

Figure 3 for Upcycling Large Language Models into Mixture of Experts

Figure 4 for Upcycling Large Language Models into Mixture of Experts

Abstract:Upcycling pre-trained dense language models into sparse mixture-of-experts (MoE) models is an efficient approach to increase the model capacity of already trained models. However, optimal techniques for upcycling at scale remain unclear. In this work, we conduct an extensive study of upcycling methods and hyperparameters for billion-parameter scale language models. We propose a novel "virtual group" initialization scheme and weight scaling approach to enable upcycling into fine-grained MoE architectures. Through ablations, we find that upcycling outperforms continued dense model training. In addition, we show that softmax-then-topK expert routing improves over topK-then-softmax approach and higher granularity MoEs can help improve accuracy. Finally, we upcycled Nemotron-4 15B on 1T tokens and compared it to a continuously trained version of the same model on the same 1T tokens: the continuous trained model achieved 65.3% MMLU, whereas the upcycled model achieved 67.6%. Our results offer insights and best practices to effectively leverage upcycling for building MoE language models.

Via

Access Paper or Ask Questions