Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Huy Nguyen

On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

May 16, 2025

Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Abstract:Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

* 100 pages

Via

Access Paper or Ask Questions

AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification

Mar 11, 2025

Huy Nguyen, Kien Nguyen, Akila Pemasiri, Feng Liu, Sridha Sridharan, Clinton Fookes

Abstract:We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models are available at https://github.com/agvpreid25/AG-VPReID-Net.

* Accepted at Computer Vision and Pattern Recognition Conference (CVPR) 2025

Via

Access Paper or Ask Questions

Convergence Rates for Softmax Gating Mixture of Experts

Mar 05, 2025

Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Figure 1 for Convergence Rates for Softmax Gating Mixture of Experts

Figure 2 for Convergence Rates for Softmax Gating Mixture of Experts

Figure 3 for Convergence Rates for Softmax Gating Mixture of Experts

Figure 4 for Convergence Rates for Softmax Gating Mixture of Experts

Abstract:Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights. Despite its widespread use in practice, a comprehensive study on the effects of the softmax gating on the MoE has been lacking in the literature. To bridge this gap in this paper, we perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating, respectively. Furthermore, our theories also provide useful insights into the design of sample-efficient expert structures. In particular, we demonstrate that it requires polynomially many data points to estimate experts satisfying our proposed \emph{strong identifiability} condition, namely a commonly used two-layer feed-forward network. In stark contrast, estimating linear experts, which violate the strong identifiability condition, necessitates exponentially many data points as a result of intrinsic parameter interactions expressed in the language of partial differential equations. All the theoretical results are substantiated with a rigorous guarantee.

* Section 2 of this work comes from our previous paper titled "On Least Square Estimation in Softmax Gating Mixture of Experts" and published at the ICML 2024

Via

Access Paper or Ask Questions

On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Feb 05, 2025

Nghiem T. Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy M. H. Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho

Figure 1 for On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Figure 2 for On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Figure 3 for On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Figure 4 for On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Abstract:The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.

* 43 pages, 5 tables, 6 figures

Via

Access Paper or Ask Questions

RepLoRA: Reparameterizing Low-Rank Adaptation via the Perspective of Mixture of Experts

Feb 05, 2025

Tuan Truong, Chau Nguyen, Huy Nguyen, Minh Le, Trung Le, Nhat Ho

Abstract:Low-rank adaptation (LoRA) has emerged as a powerful method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained limited. This paper presents a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that simple reparameterizations of the LoRA matrices can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Reparameterized Low-rank Adaptation (RepLoRA), which incorporates lightweight MLPs to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. Notably, with limited data, RepLoRA surpasses LoRA by a margin of up to 40.0% and achieves LoRA's performance with only 30.0% of the training data, highlighting both the theoretical and empirical robustness of our PEFT method.

Via

Access Paper or Ask Questions

Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Jan 31, 2025

Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Nhat Ho

Figure 1 for Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Figure 2 for Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Figure 3 for Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Figure 4 for Adaptive Prompt: Unlocking the Power of Visual Prompt Tuning

Abstract:Visual Prompt Tuning (VPT) has recently emerged as a powerful method for adapting pre-trained vision models to downstream tasks. By introducing learnable prompt tokens as task-specific instructions, VPT effectively guides pre-trained transformer models with minimal overhead. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on recent insights into the connection between mixture of experts and prompt-based approaches, we identify a key limitation in VPT: the restricted functional expressiveness in prompt formulation. To address this limitation, we propose Visual Adaptive Prompt Tuning (VAPT), a new generation of prompts that redefines prompts as adaptive functions of the input. Our theoretical analysis shows that this simple yet intuitive approach achieves optimal sample efficiency. Empirical results on VTAB-1K and FGVC further demonstrate VAPT's effectiveness, with performance gains of 7.34% and 1.04% over fully fine-tuning baselines, respectively. Notably, VAPT also surpasses VPT by a substantial margin while using fewer parameters. These results highlight both the effectiveness and efficiency of our method and pave the way for future research to explore the potential of adaptive prompts.

* 55 pages, 10 figures, 18 tables. arXiv admin note: text overlap with arXiv:2410.02200

Via

Access Paper or Ask Questions

RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Jan 27, 2025

Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan

Figure 1 for RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Figure 2 for RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Figure 3 for RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Figure 4 for RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Abstract:Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.

* Under review at SoICT'24

Via

Access Paper or Ask Questions

Understanding Expert Structures on Minimax Parameter Estimation in Contaminated Mixture of Experts

Oct 16, 2024

Fanqi Yan, Huy Nguyen, Dung Le, Pedram Akbarian, Nhat Ho

Abstract:We conduct the convergence analysis of parameter estimation in the contaminated mixture of experts. This model is motivated from the prompt learning problem where ones utilize prompts, which can be formulated as experts, to fine-tune a large-scaled pre-trained model for learning downstream tasks. There are two fundamental challenges emerging from the analysis: (i) the proportion in the mixture of the pre-trained model and the prompt may converge to zero where the prompt vanishes during the training; (ii) the algebraic interaction among parameters of the pre-trained model and the prompt can occur via some partial differential equation and decelerate the prompt learning. In response, we introduce a distinguishability condition to control the previous parameter interaction. Additionally, we also consider various types of expert structures to understand their effects on the parameter estimation. In each scenario, we provide comprehensive convergence rates of parameter estimation along with the corresponding minimax lower bounds.

* Fanqi Yan, Huy Nguyen, Dung Le contributed equally to this work. 70 pages, 6 figures, 1 table

Via

Access Paper or Ask Questions

Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Oct 15, 2024

Pedram Akbarian, Huy Nguyen, Xing Han, Nhat Ho

Figure 1 for Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Figure 2 for Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Figure 3 for Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Figure 4 for Quadratic Gating Functions in Mixture of Experts: A Statistical Insight

Abstract:Mixture of Experts (MoE) models are highly effective in scaling model capacity while preserving computational efficiency, with the gating network, or router, playing a central role by directing inputs to the appropriate experts. In this paper, we establish a novel connection between MoE frameworks and attention mechanisms, demonstrating how quadratic gating can serve as a more expressive and efficient alternative. Motivated by this insight, we explore the implementation of quadratic gating within MoE models, identifying a connection between the self-attention mechanism and the quadratic gating. We conduct a comprehensive theoretical analysis of the quadratic softmax gating MoE framework, showing improved sample efficiency in expert and parameter estimation. Our analysis provides key insights into optimal designs for quadratic gating and expert functions, further elucidating the principles behind widely used attention mechanisms. Through extensive evaluations, we demonstrate that the quadratic gating MoE outperforms the traditional linear gating MoE. Moreover, our theoretical insights have guided the development of a novel attention mechanism, which we validated through extensive experiments. The results demonstrate its favorable performance over conventional models across various tasks.

Via

Access Paper or Ask Questions

Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

Oct 03, 2024

Minh Le, Chau Nguyen, Huy Nguyen, Quyen Tran, Trung Le, Nhat Ho

Figure 1 for Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

Figure 2 for Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

Figure 3 for Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

Figure 4 for Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

Abstract:Prompt-based techniques, such as prompt-tuning and prefix-tuning, have gained prominence for their efficiency in fine-tuning large pre-trained models. Despite their widespread adoption, the theoretical foundations of these methods remain limited. For instance, in prefix-tuning, we observe that a key factor in achieving performance parity with full fine-tuning lies in the reparameterization strategy. However, the theoretical principles underpinning the effectiveness of this approach have yet to be thoroughly examined. Our study demonstrates that reparameterization is not merely an engineering trick but is grounded in deep theoretical foundations. Specifically, we show that the reparameterization strategy implicitly encodes a shared structure between prefix key and value vectors. Building on recent insights into the connection between prefix-tuning and mixture of experts models, we further illustrate that this shared structure significantly improves sample efficiency in parameter estimation compared to non-shared alternatives. The effectiveness of prefix-tuning across diverse tasks is empirically confirmed to be enhanced by the shared structure, through extensive experiments in both visual and language domains. Additionally, we uncover similar structural benefits in prompt-tuning, offering new perspectives on its success. Our findings provide theoretical and empirical contributions, advancing the understanding of prompt-based methods and their underlying mechanisms.

* Minh Le, Chau Nguyen, Huy Nguyen contributed equally to this work. 50 pages, 8 tables, 2 figures

Via

Access Paper or Ask Questions