Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dat Quoc Nguyen

BamiBERT: A New BERT-based Language Model for Vietnamese

Jul 02, 2026

Dat Quoc Nguyen, Thinh Pham, Chi Tran, Linh The Nguyen

Abstract:In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: https://huggingface.co/Qualcomm-AI-Research/BamiBERT

Via

Access Paper or Ask Questions

Vietnamese Automatic Speech Recognition: A Revisit

Mar 16, 2026

Thi Vu, Linh The Nguyen, Dat Quoc Nguyen

Abstract:Automatic Speech Recognition (ASR) performance is heavily dependent on the availability of large-scale, high-quality datasets. For low-resource languages, existing open-source ASR datasets often suffer from insufficient quality and inconsistent annotation, hindering the development of robust models. To address these challenges, we propose a novel and generalizable data aggregation and preprocessing pipeline designed to construct high-quality ASR datasets from diverse, potentially noisy, open-source sources. Our pipeline incorporates rigorous processing steps to ensure data diversity, balance, and the inclusion of crucial features like word-level timestamps. We demonstrate the effectiveness of our methodology by applying it to Vietnamese, resulting in a unified, high-quality 500-hour dataset that provides a foundation for training and evaluating state-of-the-art Vietnamese ASR systems. Our project page is available at https://github.com/qualcomm-ai-research/PhoASR.

* Accepted to EACL 2026 Findings

Via

Access Paper or Ask Questions

AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Oct 02, 2025

Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo, Dat Quoc Nguyen

Figure 1 for AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Figure 2 for AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Figure 3 for AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Figure 4 for AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications

Abstract:We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG). Our framework offers a pipeline for development efficiency with tools for raw dataset processing, fine-tuning data generation, text embedding & LLM fine-tuning, output evaluation, and building RAG systems locally. Experimental results show that our framework outperforms previous strong baselines and obtains new state-of-the-art question-answering performance on benchmark datasets.

Via

Access Paper or Ask Questions

Improving Table Understanding with LLMs and Entity-Oriented Search

Aug 23, 2025

Thi-Nhung Nguyen, Hoang Ngo, Dinh Phung, Thuy-Trang Vu, Dat Quoc Nguyen

Abstract:Our work addresses the challenges of understanding tables. Existing methods often struggle with the unpredictable nature of table content, leading to a reliance on preprocessing and keyword matching. They also face limitations due to the lack of contextual information, which complicates the reasoning processes of large language models (LLMs). To overcome these challenges, we introduce an entity-oriented search method to improve table understanding with LLMs. This approach effectively leverages the semantic similarities between questions and table data, as well as the implicit relationships between table cells, minimizing the need for data preprocessing and keyword matching. Additionally, it focuses on table entities, ensuring that table cells are semantically tightly bound, thereby enhancing contextual clarity. Furthermore, we pioneer the use of a graph query language for table understanding, establishing a new research direction. Experiments show that our approach achieves new state-of-the-art performances on standard benchmarks WikiTableQuestions and TabFact.

* Accepted to COLM 2025

Via

Access Paper or Ask Questions

Planning for Success: Exploring LLM Long-term Planning Capabilities in Table Understanding

Aug 23, 2025

Thi-Nhung Nguyen, Hoang Ngo, Dinh Phung, Thuy-Trang Vu, Dat Quoc Nguyen

Abstract:Table understanding is key to addressing challenging downstream tasks such as table-based question answering and fact verification. Recent works have focused on leveraging Chain-of-Thought and question decomposition to solve complex questions requiring multiple operations on tables. However, these methods often suffer from a lack of explicit long-term planning and weak inter-step connections, leading to miss constraints within questions. In this paper, we propose leveraging the long-term planning capabilities of large language models (LLMs) to enhance table understanding. Our approach enables the execution of a long-term plan, where the steps are tightly interconnected and serve the ultimate goal, an aspect that methods based on Chain-of-Thought and question decomposition lack. In addition, our method effectively minimizes the inclusion of unnecessary details in the process of solving the next short-term goals, a limitation of methods based on Chain-of-Thought. Extensive experiments demonstrate that our method outperforms strong baselines and achieves state-of-the-art performance on WikiTableQuestions and TabFact datasets.

* Accepted to CoNLL 2025

Via

Access Paper or Ask Questions

ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations

Jun 04, 2025

Quang Hieu Pham, Thuy Duong Nguyen, Tung Pham, Anh Tuan Luu, Dat Quoc Nguyen

Abstract:The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.

* Accepted to ACL 2025 Findings

Via

Access Paper or Ask Questions

Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Oct 21, 2024

Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen

Figure 1 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 2 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 3 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Figure 4 for Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Abstract:Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine model's behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrades LLMs' performance in RAG settings.

* Accepted to EMNLP 2024 Findings

Via

Access Paper or Ask Questions

RecGPT: Generative Pre-training for Text-based Recommendation

May 21, 2024

Hoang Ngo, Dat Quoc Nguyen

Figure 1 for RecGPT: Generative Pre-training for Text-based Recommendation

Figure 2 for RecGPT: Generative Pre-training for Text-based Recommendation

Figure 3 for RecGPT: Generative Pre-training for Text-based Recommendation

Figure 4 for RecGPT: Generative Pre-training for Text-based Recommendation

Abstract:We present the first domain-adapted and fully-trained large language model, RecGPT-7B, and its instruction-following variant, RecGPT-7B-Instruct, for text-based recommendation. Experimental results on rating prediction and sequential recommendation tasks show that our model, RecGPT-7B-Instruct, outperforms previous strong baselines. We are releasing our RecGPT models as well as their pre-training and fine-tuning datasets to facilitate future research and downstream applications in text-based recommendation. Public "huggingface" links to our RecGPT models and datasets are available at: https://github.com/VinAIResearch/RecGPT

* Accepted to the ACL 2024 main conference

Via

Access Paper or Ask Questions

Improving Vietnamese-English Medical Machine Translation

Mar 28, 2024

Nhu Vo, Dat Quoc Nguyen, Dung D. Le, Massimo Piccardi, Wray Buntine

Figure 1 for Improving Vietnamese-English Medical Machine Translation

Figure 2 for Improving Vietnamese-English Medical Machine Translation

Figure 3 for Improving Vietnamese-English Medical Machine Translation

Figure 4 for Improving Vietnamese-English Medical Machine Translation

Abstract:Machine translation for Vietnamese-English in the medical domain is still an under-explored research area. In this paper, we introduce MedEV -- a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs. We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset. Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction. We publicly release our dataset to promote further research.

* To appear in Proceedings of LREC-COLING 2024

Via

Access Paper or Ask Questions

JPIS: A Joint Model for Profile-based Intent Detection and Slot Filling with Slot-to-Intent Attention

Dec 16, 2023

Thinh Pham, Dat Quoc Nguyen

Abstract:Profile-based intent detection and slot filling are important tasks aimed at reducing the ambiguity in user utterances by leveraging user-specific supporting profile information. However, research in these two tasks has not been extensively explored. To fill this gap, we propose a joint model, namely JPIS, designed to enhance profile-based intent detection and slot filling. JPIS incorporates the supporting profile information into its encoder and introduces a slot-to-intent attention mechanism to transfer slot information representations to intent detection. Experimental results show that our JPIS substantially outperforms previous profile-based models, establishing a new state-of-the-art performance in overall accuracy on the Chinese benchmark dataset ProSLU.

* To appear in Proceedings of ICASSP 2024 (Camera-ready version)

Via

Access Paper or Ask Questions