Bipesh Subedi

NepTam: A Nepali-Tamang Parallel Corpus and Baseline Machine Translation Experiments

Mar 14, 2026

Rupak Raj Ghimire, Bipesh Subedi, Balaram Prasain, Prakash Poudyal, Praveen Acharya, Nischal Karki, Rupak Tiwari, Rishikesh Kumar Sharma, Jenny Poudel, Bal Krishna Bal

Abstract: Modern translation systems rely heavily on large, high-quality parallel datasets for state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline involving data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in the NepTam20K dataset), expert translation into Tamang by native speakers of the language, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out using multilingual pre-trained models (mBART, M2M-100, and NLLB-200) and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).

* Accepted in LREC 2026
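
The abstract above mentions pre-processing of sentence-aligned pairs without giving details. As a rough illustration only, the sketch below shows one common way to clean a parallel corpus: normalize whitespace, drop pairs with an empty side, drop exact duplicates, and drop pairs whose token lengths differ implausibly. The function name `clean_parallel_pairs` and the `max_length_ratio` threshold are assumptions made for this example and are not taken from the paper.

```python
from typing import Iterable, List, Tuple


def clean_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_length_ratio: float = 3.0,  # hypothetical threshold, not from the paper
) -> List[Tuple[str, str]]:
    """Keep sentence pairs that are non-empty, unique, and length-plausible."""
    seen = set()
    kept = []
    for source, target in pairs:
        # Collapse runs of whitespace so near-identical pairs compare equal.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # an empty side means the pair is not usable
        if (source, target) in seen:
            continue  # exact duplicate pair
        src_len = len(source.split())
        tgt_len = len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue  # lengths differ too much to be a likely translation
        seen.add((source, target))
        kept.append((source, target))
    return kept


# Placeholder tokens stand in for real sentences.
sample = [
    ("a b c", "x y z"),
    ("a  b c", "x y z"),   # duplicate after whitespace normalization
    ("", "x"),             # empty source side
    ("a", "x y z w v"),    # length ratio 5 exceeds the threshold
]
cleaned = clean_parallel_pairs(sample)
```

A whitespace token count is a crude length measure; a real pipeline would likely use a tokenizer suited to Devanagari text.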

Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language

Feb 27, 2026

Nischal Karki, Bipesh Subedi, Prakash Poudyal, Rupak Raj Ghimire, Bal Krishna Bal

Abstract: Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in the Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on a balanced Nepali dataset containing 25,006 sentences across five conceptual domains. Performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.

* 5 pages, 2 figures. Accepted and presented at the Regional International Conference on Natural Language Processing (RegICON 2025), Gauhati University, Guwahati, India, November 27-29, 2025. To appear in the conference proceedings. Accepted papers list available at: https://www.regicon2025.in/accepted-papers
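
For readers unfamiliar with the weighted metrics named in the abstract above, the following is a minimal sketch of support-weighted precision, recall, and F1 for a multiclass classifier. It is not the authors' evaluation code; the function name `weighted_scores` and the placeholder labels are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Average per-class precision, recall, and F1, weighted by class support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder labels; the paper's five domain names are not used here.
y_true = ["topic_a", "topic_a", "topic_b", "topic_b"]
y_pred = ["topic_a", "topic_b", "topic_b", "topic_b"]
scores = weighted_scores(y_true, y_pred)
```

With support weighting, weighted recall equals plain accuracy, and on a balanced dataset the weighted and macro averages coincide.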

Retrieval and Generative Approaches for a Pregnancy Chatbot in Nepali with Stemmed and Non-Stemmed Data : A Comparative Study

Nov 12, 2023

Sujan Poudel, Nabin Ghimire, Bipesh Subedi, Saugat Singh

Abstract: The field of Natural Language Processing (NLP), which applies artificial intelligence to human language, has seen tremendous growth. Its applications, such as language translation, chatbots, virtual assistants, search autocomplete, and autocorrect, are widely used in domains including healthcare, customer service, and targeted advertising. This work proposes a health-domain chatbot that provides pregnancy-related information and explores two NLP-based approaches for developing it. The first is a multiclass classification-based retrieval approach using the BERT-based models multilingual BERT and multilingual DistilBERT; the second is a transformer-based generative chatbot. The performance of each approach is analyzed on both stemmed and non-stemmed Nepali datasets. The experimental results indicate that BERT-based pre-trained models perform well on non-stemmed data, whereas transformer models trained from scratch perform better on stemmed data. Among the models tested, DistilBERT achieved the highest training and validation accuracy, and a testing accuracy of 0.9165, with the retrieval-based architecture on the non-stemmed dataset. In the generative approach, the transformer achieved 1-gram and 2-gram BLEU scores of 0.3570 and 0.1413, respectively.

* 7 pages, 5 figures, 4 tables. In proceedings of the International Conference on Technologies for Computer, Electrical, Electronics & Communication (ICT-CEEL 2023), Bhaktapur, Nepal
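
The abstract above reports 1-gram and 2-gram BLEU without saying how they were computed. As a hedged sketch, the code below implements one standard reading: cumulative BLEU up to order `max_n` for a single candidate and a single reference, with clipped n-gram precision, uniform weights, a brevity penalty, and no smoothing. Whether the paper used cumulative or individual n-gram scores, or a particular library, is not stated, so this should be read as an illustration of the metric rather than a reproduction.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int) -> float:
    """Cumulative sentence-level BLEU with uniform weights and no smoothing."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total)
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# Placeholder tokens: same words as the reference, two of them swapped.
candidate = "a b c d".split()
reference = "a b d c".split()
bleu_1 = bleu(candidate, reference, 1)  # all unigrams match
bleu_2 = bleu(candidate, reference, 2)  # only one of three bigrams matches
```

The drop from `bleu_1` to `bleu_2` in this toy case mirrors the pattern in the reported scores: word choice can be right while word order is not.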

Nepali Video Captioning using CNN-RNN Architecture

Nov 05, 2023

Bipesh Subedi, Saugat Singh, Bal Krishna Bal

Abstract: This article presents a study on Nepali video captioning using deep neural networks. By integrating pre-trained CNNs with RNNs, the research focuses on generating precise and contextually relevant captions for Nepali videos. The approach involves dataset collection, data preprocessing, model implementation, and evaluation. The study enriches the MSVD dataset with Nepali captions obtained via Google Translate and trains various CNN-RNN architectures. It explores the effectiveness of CNNs (e.g., EfficientNetB0, ResNet101, VGG16) paired with different RNN decoders such as LSTM, GRU, and BiLSTM. Evaluation uses BLEU and METEOR metrics; the best model, EfficientNetB0 + BiLSTM with 1024 hidden dimensions, achieves a BLEU-4 score of 17 and a METEOR score of 46. The article also outlines challenges and future directions for advancing Nepali video captioning, offering a crucial resource for further research in this area.

* 6 pages, 5 figures, 3 tables. Presented at, and published in the proceedings of, the International Conference on Technologies for Computer, Electrical, Electronics & Communication (ICT-CEEL 2023), Bhaktapur, Nepal. Part-1 94-99
