Tuan Nguyen

FLAT: Latent-Driven Arbitrary-Target Backdoor Attacks in Federated Learning

Aug 06, 2025

Tuan Nguyen, Khoa D Doan, Kok-Seng Wong

Abstract:Federated learning (FL) is vulnerable to backdoor attacks, yet most existing methods are limited by fixed-pattern or single-target triggers, making them inflexible and easier to detect. We propose FLAT (FL Arbitrary-Target Attack), a novel backdoor attack that leverages a latent-driven conditional autoencoder to generate diverse, target-specific triggers as needed. By introducing a latent code, FLAT enables the creation of visually adaptive and highly variable triggers, allowing attackers to select arbitrary targets without retraining and to evade conventional detection mechanisms. Our approach unifies attack success, stealth, and diversity within a single framework, introducing a new level of flexibility and sophistication to backdoor attacks in FL. Extensive experiments show that FLAT achieves high attack success and remains robust against advanced FL defenses. These results highlight the urgent need for new defense strategies to address latent-driven, multi-target backdoor threats in federated settings.

AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR

Jun 17, 2025

Tuan Nguyen, Huy-Dat Tran

Abstract:Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.
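The three-stage schedule described above can be sketched as a rule for which parameters are unfrozen at each stage. This is only an illustration, not the authors' implementation: the parameter names in `EXAMPLE_PARAMS` and the substring matching are hypothetical, and the assumption that stage 2 trains only the cross-attention weights is ours, since the abstract does not say which parameters are updated there.

```python
# Hypothetical parameter names; real encoder-decoder models name modules differently.
EXAMPLE_PARAMS = [
    "encoder.layers.0.self_attn.weight",
    "encoder.layers.0.ffn.weight",
    "decoder.layers.0.self_attn.weight",
    "decoder.layers.0.cross_attn.weight",
    "decoder.layers.0.ffn.weight",
]


def trainable_params(param_names, stage):
    """Return the parameter names to unfreeze for stage 1, 2, or 3."""
    if stage == 1:
        # Stage 1: decoder self-attention and feedforward layers (text-only data).
        keys = (".self_attn.", ".ffn.")
    elif stage == 2:
        # Stage 2 (assumed): only decoder cross-attention, on limited speech-text data.
        keys = (".cross_attn.",)
    elif stage == 3:
        # Stage 3: full fine-tuning, every parameter is trainable.
        return list(param_names)
    else:
        raise ValueError("stage must be 1, 2, or 3")
    return [
        name
        for name in param_names
        if name.startswith("decoder.") and any(key in name for key in keys)
    ]


if __name__ == "__main__":
    for stage in (1, 2, 3):
        print(stage, trainable_params(EXAMPLE_PARAMS, stage))
```

In a real training loop, the returned names would typically decide which parameters have gradients enabled before each stage begins.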

* This work has been submitted to the IEEE for possible publication. This paper is a preprint version submitted to the 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)

Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages

Jun 17, 2025

Tuan Nguyen, Huy-Dat Tran

Abstract:Code-switching (CS), common in multilingual settings, presents challenges for ASR due to scarce and costly transcribed data caused by linguistic complexity. This study investigates building CS-ASR using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We then use monolingual data augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM. This finding offers a cost-effective approach for CS-ASR development, benefiting research and industry.
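As a rough illustration of what phrase-level mixing could look like, the sketch below swaps phrases between two phrase-aligned parallel sentences. The abstract does not describe the actual procedure, so everything here is an assumption: the function name `mix_phrases`, the `switch_prob` parameter, and the requirement that both sentences are already segmented into aligned phrases are all hypothetical.

```python
import random


def mix_phrases(phrases_a, phrases_b, switch_prob=0.3, seed=0):
    """Build a synthetic code-switched sentence from two aligned phrase lists.

    phrases_a and phrases_b are assumed to be phrase-aligned translations of
    the same sentence. Each position keeps the phrase from language A, or
    switches to language B with probability switch_prob.
    """
    if len(phrases_a) != len(phrases_b):
        raise ValueError("phrase lists must be aligned and of equal length")
    rng = random.Random(seed)  # seeded so the output is reproducible
    mixed = [
        b if rng.random() < switch_prob else a
        for a, b in zip(phrases_a, phrases_b)
    ]
    return " ".join(mixed)


if __name__ == "__main__":
    # Illustrative Malay-English phrase pairs (not taken from the paper's data).
    malay = ["saya mahu", "pergi ke", "pasar malam"]
    english = ["I want", "to go to", "the night market"]
    print(mix_phrases(malay, english, switch_prob=0.5, seed=1))
```

A method that "mimics natural patterns" would presumably constrain where switches may occur; uniform random switching is used here only to keep the sketch short.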

* Accepted by Interspeech 2025

Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

Jun 17, 2025

Long-Vu Hoang, Tuan Nguyen, Tran Huy Dat

Abstract:This paper presents a novel non-invasive object classification approach using acoustic scattering, demonstrated through a case study on hair assessment. When an incident wave interacts with an object, it generates a scattered acoustic field encoding structural and material properties. By emitting acoustic stimuli and capturing the scattered signals from head-with-hair-sample objects, we classify hair type and moisture using AI-driven, deep-learning-based sound classification. We benchmark comprehensive methods, including (i) fully supervised deep learning, (ii) embedding-based classification, (iii) supervised foundation model fine-tuning, and (iv) self-supervised model fine-tuning. Our best strategy achieves nearly 90% classification accuracy by fine-tuning all parameters of a self-supervised model. These results highlight acoustic scattering as a privacy-preserving, non-contact alternative to visual classification, opening huge potential for applications in various industries.

* Accepted to Interspeech 2025

Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Jun 16, 2025

Tuan Nguyen, Long-Vu Hoang, Huy-Dat Tran

Abstract:This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER on the private test set of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
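For readers unfamiliar with the WER/CER metric reported above, the following is a minimal sketch of the standard definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, computed over words (WER) or characters (CER). It is not the challenge's scoring script; text normalization is omitted, and counting spaces as characters in `cer` is an assumption.

```python
def error_rate(ref_tokens, hyp_tokens):
    """Edit distance (substitutions, insertions, deletions) over reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate on whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters in this sketch."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    print(wer("the cat sat", "the dog sat"))
    print(cer("abc", "abd"))
```

CER is commonly used instead of WER for languages without whitespace-delimited words, which is why multilingual results are often reported as a combined WER/CER figure.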

* Technical report for Interspeech 2025 MLC-SLM Challenge

CAMME: Adaptive Deepfake Image Detection with Multi-Modal Cross-Attention

May 23, 2025

Naseem Khan, Tuan Nguyen, Amine Bermak, Issa Khalil

Abstract:The proliferation of sophisticated AI-generated deepfakes poses critical challenges for digital media authentication and societal security. While existing detection methods perform well within specific generative domains, they exhibit significant performance degradation when applied to manipulations produced by unseen architectures, a fundamental limitation as generative technologies rapidly evolve. We propose CAMME (Cross-Attention Multi-Modal Embeddings), a framework that dynamically integrates visual, textual, and frequency-domain features through a multi-head cross-attention mechanism to establish robust cross-domain generalization. Extensive experiments demonstrate CAMME's superiority over state-of-the-art methods, yielding improvements of 12.56% on natural scenes and 13.25% on facial deepfakes. The framework demonstrates exceptional resilience, maintaining over 91% accuracy under natural image perturbations and achieving 89.01% and 96.14% accuracy against PGD and FGSM adversarial attacks, respectively. Our findings validate that integrating complementary modalities through cross-attention enables more effective decision boundary realignment for reliable deepfake detection across heterogeneous generative architectures.

* 20 pages, 8 figures, 12 Tables

CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

Apr 27, 2025

Tuan Nguyen, Naseem Khan, Issa Khalil

Abstract:The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Generated conditionally from real images and textual prompts, these edits are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness, achieving detection rates above 94% under natural perturbations and 96% against adversarial attacks, with excellent generalization to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.

* 20 pages

Communication Optimization for Decentralized Learning atop Bandwidth-limited Edge Networks

Apr 16, 2025

Tingyang Sun, Tuan Nguyen, Ting He

Abstract:Decentralized federated learning (DFL) is a promising machine learning paradigm for bringing artificial intelligence (AI) capabilities to the network edge. Running DFL on top of edge networks, however, faces severe performance challenges due to the extensive parameter exchanges between agents. Most existing solutions for these challenges were based on simplistic communication models, which cannot capture the case of learning over a multi-hop bandwidth-limited network. In this work, we address this problem by jointly designing the communication scheme for the overlay network formed by the agents and the mixing matrix that controls the communication demands between the agents. By carefully analyzing the properties of our problem, we cast each design problem into a tractable optimization and develop an efficient algorithm with guaranteed performance. Our evaluations based on real topology and data show that the proposed algorithm can reduce the total training time by over 80% compared to the baseline without sacrificing accuracy, while significantly improving the computational efficiency over the state of the art.
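To make the role of a mixing matrix concrete, here is a small sketch that builds one with the well-known Metropolis-Hastings weighting rule. This is a generic textbook construction, not the optimized design proposed in the paper; the function name and the example ring topology are illustrative assumptions. Entry W[i][j] is the weight agent i gives to agent j's parameters when averaging, and a nonzero off-diagonal entry means the two agents must exchange parameters.

```python
def metropolis_mixing_matrix(num_agents, edges):
    """Build a symmetric, doubly stochastic mixing matrix for an undirected graph.

    edges is an iterable of (i, j) pairs of agents that communicate directly.
    """
    neighbors = [set() for _ in range(num_agents)]
    for i, j in edges:
        if i != j:  # ignore self-loops; the diagonal is set below
            neighbors[i].add(j)
            neighbors[j].add(i)

    W = [[0.0] * num_agents for _ in range(num_agents)]
    for i in range(num_agents):
        for j in neighbors[i]:
            # Metropolis-Hastings rule: weight depends on the larger degree.
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # The diagonal is still 0.0 here, so this sums only the neighbor weights.
        W[i][i] = 1.0 - sum(W[i])
    return W


if __name__ == "__main__":
    # Four agents connected in a ring (illustrative topology).
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    for row in metropolis_mixing_matrix(4, ring):
        print([round(w, 3) for w in row])
```

Zeroing more off-diagonal entries reduces the communication demand per round but generally slows the averaging, which is the kind of trade-off a joint design has to balance.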

* arXiv admin note: text overlap with arXiv:2408.04705

RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval

Jan 27, 2025

Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan

Abstract:Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.

* Under review at SoICT'24

Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Oct 10, 2024

Tuan Nguyen, Corinne Fredouille, Alain Ghio, Mathieu Balaguer, Virginie Woisard

Abstract:With the rise of SSL and ASR technologies, the Wav2Vec2 ASR-based model has been fine-tuned for automated speech disorder quality assessment tasks, yielding impressive results and setting a new baseline for Head and Neck Cancer speech contexts. This demonstrates that the ASR dimension from Wav2Vec2 closely aligns with assessment dimensions. Despite its effectiveness, this system remains a black box with no clear interpretation of the connection between the model ASR dimension and clinical assessments. This paper presents the first analysis of this baseline model for speech quality assessment, focusing on intelligibility and severity tasks. We conduct a layer-wise analysis to identify key layers and compare different SSL and ASR Wav2Vec2 models based on pre-trained data. Additionally, post-hoc XAI methods, including Canonical Correlation Analysis (CCA) and visualization techniques, are used to track model evolution and visualize embeddings for enhanced interpretability.

* Accepted at the Spoken Language Technology (SLT) Conference 2024
