Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dat Quoc Nguyen

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Sep 20, 2021
Nguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen

Figure 1 for BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks. Our BARTpho models are available at: https://github.com/VinAIResearch/BARTpho

Via

Access Paper or Ask Questions

Node Co-occurrence based Graph Neural Networks for Knowledge Graph Link Prediction

Apr 15, 2021
Dai Quoc Nguyen, Vinh Tong, Dinh Phung, Dat Quoc Nguyen

Figure 1 for Node Co-occurrence based Graph Neural Networks for Knowledge Graph Link Prediction

Figure 2 for Node Co-occurrence based Graph Neural Networks for Knowledge Graph Link Prediction

Figure 3 for Node Co-occurrence based Graph Neural Networks for Knowledge Graph Link Prediction

Figure 4 for Node Co-occurrence based Graph Neural Networks for Knowledge Graph Link Prediction

We introduce a novel embedding model, named NoKE, which aims to integrate co-occurrence among entities and relations into graph neural networks to improve knowledge graph completion (i.e., link prediction). Given a knowledge graph, NoKE constructs a single graph considering entities and relations as individual nodes. NoKE then computes weights for edges among nodes based on the co-occurrence of entities and relations. Next, NoKE utilizes vanilla GNNs to update vector representations for entity and relation nodes and then adopts a score function to produce the triple scores. Comprehensive experimental results show that our NoKE obtains state-of-the-art results on three new, challenging, and difficult benchmark datasets CoDEx for knowledge graph completion, demonstrating the power of its simplicity and effectiveness.

Via

Access Paper or Ask Questions

COVID-19 Named Entity Recognition for Vietnamese

Apr 08, 2021
Thinh Hung Truong, Mai Hoang Dao, Dat Quoc Nguyen

Figure 1 for COVID-19 Named Entity Recognition for Vietnamese

Figure 2 for COVID-19 Named Entity Recognition for Vietnamese

Figure 3 for COVID-19 Named Entity Recognition for Vietnamese

Figure 4 for COVID-19 Named Entity Recognition for Vietnamese

The current COVID-19 pandemic has lead to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese. Particularly, our dataset is annotated for the named entity recognition (NER) task with newly-defined entity types that can be used in other future epidemics. Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets. We empirically conduct experiments using strong baselines on our dataset, and find that: automatic Vietnamese word segmentation helps improve the NER results and the highest performances are obtained by fine-tuning pre-trained language models where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020). We publicly release our dataset at: https://github.com/VinAIResearch/PhoNER_COVID19

* To appear in Proceedings of NAACL 2021

Via

Access Paper or Ask Questions

Intent detection and slot filling for Vietnamese

Apr 05, 2021
Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen

Figure 1 for Intent detection and slot filling for Vietnamese

Figure 2 for Intent detection and slot filling for Vietnamese

Figure 3 for Intent detection and slot filling for Vietnamese

Figure 4 for Intent detection and slot filling for Vietnamese

Intent detection and slot filling are important tasks in spoken and natural language understanding. However, Vietnamese is a low-resource language in these research topics. In this paper, we present the first public intent detection and slot filling dataset for Vietnamese. In addition, we also propose a joint model for intent detection and slot filling, that extends the recent state-of-the-art JointBERT+CRF model with an intent-slot attention layer in order to explicitly incorporate intent context information into slot filling via "soft" intent label embedding. Experimental results on our Vietnamese dataset show that our proposed model significantly outperforms JointBERT+CRF. We publicly release our dataset and the implementation of our model at: https://github.com/VinAIResearch/JointIDSF

Via

Access Paper or Ask Questions

PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

Jan 05, 2021
Linh The Nguyen, Dat Quoc Nguyen

Figure 1 for PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

Figure 2 for PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

Figure 3 for PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

Figure 4 for PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing

We present the first multi-task learning model -- named PhoNLP -- for joint Vietnamese part-of-speech tagging, named entity recognition and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT (Nguyen and Nguyen, 2020) for each task independently. We publicly release PhoNLP as an open-source toolkit under the MIT License. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future research and applications in Vietnamese NLP. Our PhoNLP is available at https://github.com/VinAIResearch/PhoNLP

* 7 pages, 3 figures, 3 tables

Via

Access Paper or Ask Questions

WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets

Oct 16, 2020
Dat Quoc Nguyen, Thanh Vu, Afshin Rahimi, Mai Hoang Dao, Linh The Nguyen, Long Doan

Figure 1 for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets

Figure 2 for WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets

In this paper, we provide an overview of the WNUT-2020 shared task on the identification of informative COVID-19 English Tweets. We describe how we construct a corpus of 10K Tweets and organize the development and evaluation phases for this task. In addition, we also present a brief summary of results obtained from the final system evaluation submissions of 55 teams, finding that (i) many systems obtain very high performance, up to 0.91 F1 score, (ii) the majority of the submissions achieve substantially higher results than the baseline fastText (Joulin et al., 2017), and (iii) fine-tuning pre-trained language models on relevant language data followed by supervised training performs well in this task.

* In Proceedings of the 6th Workshop on Noisy User-generated Text

Via

Access Paper or Ask Questions

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Oct 05, 2020
Anh Tuan Nguyen, Mai Hoang Dao, Dat Quoc Nguyen

Figure 1 for A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Figure 2 for A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Figure 3 for A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019) on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) helps produce higher performances than the recent best multilingual language model XLM-R (Conneau et al., 2020).

* EMNLP 2020 (Findings)

Via

Access Paper or Ask Questions

A Label Attention Model for ICD Coding from Clinical Text

Jul 13, 2020
Thanh Vu, Dat Quoc Nguyen, Anthony Nguyen

Figure 1 for A Label Attention Model for ICD Coding from Clinical Text

Figure 2 for A Label Attention Model for ICD Coding from Clinical Text

Figure 3 for A Label Attention Model for ICD Coding from Clinical Text

Figure 4 for A Label Attention Model for ICD Coding from Clinical Text

ICD coding is a process of assigning the International Classification of Disease diagnosis codes to clinical/medical notes documented by health professionals (e.g. clinicians). This process requires significant human resources, and thus is costly and prone to error. To handle the problem, machine learning has been utilized for automatic ICD coding. Previous state-of-the-art models were based on convolutional neural networks, using a single/several fixed window sizes. However, the lengths and interdependence between text fragments related to ICD codes in clinical text vary significantly, leading to the difficulty of deciding what the best window sizes are. In this paper, we propose a new label attention model for automatic ICD coding, which can handle both the various lengths and the interdependence of the ICD code related text fragments. Furthermore, as the majority of ICD codes are not frequently used, leading to the extremely imbalanced data issue, we additionally propose a hierarchical joint learning mechanism extending our label attention model to handle the issue, using the hierarchical relationships among the codes. Our label attention model achieves new state-of-the-art results on three benchmark MIMIC datasets, and the joint learning mechanism helps improve the performances for infrequent codes.

* In Proceedings of IJCAI 2020 (Main Track)

Via

Access Paper or Ask Questions

BERTweet: A pre-trained language model for English Tweets

May 20, 2020
Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen

Figure 1 for BERTweet: A pre-trained language model for English Tweets

Figure 2 for BERTweet: A pre-trained language model for English Tweets

Figure 3 for BERTweet: A pre-trained language model for English Tweets

Figure 4 for BERTweet: A pre-trained language model for English Tweets

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet is trained using the RoBERTa pre-training procedure (Liu et al., 2019), with the same model configuration as BERT-base (Devlin et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet to facilitate future research and downstream applications on Tweet data. Our BERTweet is available at: https://github.com/VinAIResearch/BERTweet

Via

Access Paper or Ask Questions

PhoBERT: Pre-trained language models for Vietnamese

Mar 02, 2020
Dat Quoc Nguyen, Anh Tuan Nguyen

Figure 1 for PhoBERT: Pre-trained language models for Vietnamese

We present PhoBERT with two versions of "base" and "large"--the first public large-scale monolingual language models pre-trained for Vietnamese. We show that PhoBERT improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT is released at: https://github.com/VinAIResearch/PhoBERT

Via

Access Paper or Ask Questions