Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Niloy Ganguly

Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Aug 05, 2024

Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Pawan Goyal, Niloy Ganguly, Prasenjit Dey, Ravi Kokku

Figure 1 for Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Figure 2 for Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Figure 3 for Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Figure 4 for Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization

Abstract:The ever-increasing volume of digital information necessitates efficient methods for users to extract key insights from lengthy documents. Aspect-based summarization offers a targeted approach, generating summaries focused on specific aspects within a document. Despite advancements in aspect-based summarization research, there is a continuous quest for improved model performance. Given that large language models (LLMs) have demonstrated the potential to revolutionize diverse tasks within natural language processing, particularly in the problem of summarization, this paper explores the potential of fine-tuning LLMs for the aspect-based summarization task. We evaluate the impact of fine-tuning open-source foundation LLMs, including Llama2, Mistral, Gemma and Aya, on a publicly available domain-specific aspect based summary dataset. We hypothesize that this approach will enable these models to effectively identify and extract aspect-related information, leading to superior quality aspect-based summaries compared to the state-of-the-art. We establish a comprehensive evaluation framework to compare the performance of fine-tuned LLMs against competing aspect-based summarization methods and vanilla counterparts of the fine-tuned LLMs. Our work contributes to the field of aspect-based summarization by demonstrating the efficacy of fine-tuning LLMs for generating high-quality aspect-based summaries. Furthermore, it opens doors for further exploration of using LLMs for targeted information extraction tasks across various NLP domains.

Via

Access Paper or Ask Questions

On The Persona-based Summarization of Domain-Specific Documents

Jun 06, 2024

Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Pawan Goyal, Niloy Ganguly, Prasenjit Dey, Ravi Kokku

Figure 1 for On The Persona-based Summarization of Domain-Specific Documents

Figure 2 for On The Persona-based Summarization of Domain-Specific Documents

Figure 3 for On The Persona-based Summarization of Domain-Specific Documents

Figure 4 for On The Persona-based Summarization of Domain-Specific Documents

Abstract:In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming, and storing information necessitates the generation of summaries from large information repositories. However, every persona of a domain has different requirements of information and hence their summarization. For example, in the healthcare domain, a persona-based (such as Doctor, Nurse, Patient etc.) approach is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. The summaries generated by two different humans have high variability and do not scale in cost and subject matter expertise as domains and personas grow. Further, AI-generated summaries using generic Large Language Models (LLMs) may not necessarily offer satisfactory accuracy for different domains unless they have been specifically trained on domain-specific data and can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: 1) We present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus and also show that we can effectively evaluate the summarization quality using AI-based critiquing. 2) We further show that AI-based critiquing has good concordance with Human-based critiquing of the summaries. Hence, such AI-based pipelines to generate domain-specific persona-based summaries can be easily scaled to other domains such as legal, enterprise documents, education etc. in a very efficient and cost-effective manner.

* ACL 2024 Findings (Association for Computational Linguistics)

Via

Access Paper or Ask Questions

Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

May 15, 2024

Subhendu Khatuya, Rajdeep Mukherjee, Akash Ghosh, Manjunath Hegde, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal

Figure 1 for Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Figure 2 for Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Figure 3 for Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Figure 4 for Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Abstract:We study the problem of automatically annotating relevant numerals (GAAP metrics) occurring in the financial documents with their corresponding XBRL tags. Different from prior works, we investigate the feasibility of solving this extreme classification problem using a generative paradigm through instruction tuning of Large Language Models (LLMs). To this end, we leverage metric metadata information to frame our target outputs while proposing a parameter efficient solution for the task using LoRA. We perform experiments on two recently released financial numeric labeling datasets. Our proposed model, FLAN-FinXC, achieves new state-of-the-art performances on both the datasets, outperforming several strong baselines. We explain the better scores of our proposed model by demonstrating its capability for zero-shot as well as the least frequently occurring tags. Also, even when we fail to predict the XBRL tags correctly, our generated output has substantial overlap with the ground-truth in majority of the cases.

* This work has been accepted to appear at North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Via

Access Paper or Ask Questions

MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

May 07, 2024

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

Figure 1 for MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Figure 2 for MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Figure 3 for MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Figure 4 for MEDVOC: Vocabulary Adaptation for Fine-tuning Pre-trained Language Models on Medical Text Summarization

Abstract:This work presents a dynamic vocabulary adaptation strategy, MEDVOC, for fine-tuning pre-trained language models (PLMs) like BertSumAbs, BART, and PEGASUS for improved medical text summarization. In contrast to existing domain adaptation approaches in summarization, MEDVOC treats vocabulary as an optimizable parameter and optimizes the PLM vocabulary based on fragment score conditioned only on the downstream task's reference summaries. Unlike previous works on vocabulary adaptation (limited only to classification tasks), optimizing vocabulary based on summarization tasks requires an extremely costly intermediate fine-tuning step on large summarization datasets. To that end, our novel fragment score-based hyperparameter search very significantly reduces this fine-tuning time -- from 450 days to less than 2 days on average. Furthermore, while previous works on vocabulary adaptation are often primarily tied to single PLMs, MEDVOC is designed to be deployable across multiple PLMs (with varying model vocabulary sizes, pre-training objectives, and model sizes) -- bridging the limited vocabulary overlap between the biomedical literature domain and PLMs. MEDVOC outperforms baselines by 15.74% in terms of Rouge-L in zero-shot setting and shows gains of 17.29% in high Out-Of-Vocabulary (OOV) concentrations. Our human evaluation shows MEDVOC generates more faithful medical summaries (88% compared to 59% in baselines). We make the codebase publicly available at https://github.com/gb-kgp/MEDVOC.

* 13 pages, Accepted to the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024 (Main) Track

Via

Access Paper or Ask Questions

Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

May 03, 2024

Subhendu Khatuya, Koushiki Sinha, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal

Figure 1 for Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

Figure 2 for Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

Figure 3 for Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

Figure 4 for Instruction-Guided Bullet Point Summarization of Long Financial Earnings Call Transcripts

Abstract:While automatic summarization techniques have made significant advancements, their primary focus has been on summarizing short news articles or documents that have clear structural patterns like scientific articles or government reports. There has not been much exploration into developing efficient methods for summarizing financial documents, which often contain complex facts and figures. Here, we study the problem of bullet point summarization of long Earning Call Transcripts (ECTs) using the recently released ECTSum dataset. We leverage an unsupervised question-based extractive module followed by a parameter efficient instruction-tuned abstractive module to solve this task. Our proposed model FLAN-FinBPS achieves new state-of-the-art performances outperforming the strongest baseline with 14.88% average ROUGE score gain, and is capable of generating factually consistent bullet point summaries that capture the important facts discussed in the ECTs.

* Accepted in SIGIR 2024

Via

Access Paper or Ask Questions

TIGQA:An Expert Annotated Question Answering Dataset in Tigrinya

Apr 26, 2024

Hailay Teklehaymanot, Dren Fazlija, Niloy Ganguly, Gourab K. Patro, Wolfgang Nejdl

Abstract:The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources.This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pretrained models.The notable disparities between human performance and best model performance underscore the potential for further enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC.

* LREC-COLING 2024
* 9 pages,3 figures, 7 tables,2 listings

Via

Access Paper or Ask Questions

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Apr 20, 2024

Soumyadeep Roy, Aparup Khatua, Fatemeh Ghoochani, Uwe Hadler, Wolfgang Nejdl, Niloy Ganguly

Figure 1 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 2 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 3 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Figure 4 for Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Abstract:GPT-4 demonstrates high accuracy in medical QA tasks, leading with an accuracy of 86.70%, followed by Med-PaLM 2 at 86.50%. However, around 14% of errors remain. Additionally, current works use GPT-4 to only predict the correct option without providing any explanation and thus do not provide any insight into the thinking process and reasoning used by GPT-4 or other LLMs. Therefore, we introduce a new domain-specific error taxonomy derived from collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset comprises 4153 GPT-4 correct responses and 919 incorrect responses to the United States Medical Licensing Examination (USMLE) respectively. These responses are quite long (258 words on average), containing detailed explanations from GPT-4 justifying the selected option. We then launch a large-scale annotation study using the Potato annotation platform and recruit 44 medical experts through Prolific, a well-known crowdsourcing platform. We annotated 300 out of these 919 incorrect data points at a granular level for different classes and created a multi-label span to identify the reasons behind the error. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is categorized as a "Reasonable response by GPT-4," by annotators. This sheds light on the challenge of discerning explanations that may lead to incorrect options, even among trained medical professionals. We also provide medical concepts and medical semantic predications extracted using the SemRep tool for every data point. We believe that it will aid in evaluating the ability of LLMs to answer complex medical questions. We make the resources available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy .

* 10 pages, 4 figures. Accepted for publication at the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

Via

Access Paper or Ask Questions

Order-Based Pre-training Strategies for Procedural Text Understanding

Apr 06, 2024

Abhilash Nandy, Yash Kulkarni, Pawan Goyal, Niloy Ganguly

Figure 1 for Order-Based Pre-training Strategies for Procedural Text Understanding

Figure 2 for Order-Based Pre-training Strategies for Procedural Text Understanding

Figure 3 for Order-Based Pre-training Strategies for Procedural Text Understanding

Figure 4 for Order-Based Pre-training Strategies for Procedural Text Understanding

Abstract:In this paper, we propose sequence-based pretraining methods to enhance procedural understanding in natural language processing. Procedural text, containing sequential instructions to accomplish a task, is difficult to understand due to the changing attributes of entities in the context. We focus on recipes, which are commonly represented as ordered instructions, and use this order as a supervision signal. Our work is one of the first to compare several 'order as-supervision' transformer pre-training methods, including Permutation Classification, Embedding Regression, and Skip-Clip, and shows that these methods give improved results compared to the baselines and SoTA LLMs on two downstream Entity-Tracking datasets: NPN-Cooking dataset in recipe domain and ProPara dataset in open domain. Our proposed methods address the non-trivial Entity Tracking Task that requires prediction of entity states across procedure steps, which requires understanding the order of steps. These methods show an improvement over the best baseline by 1.6% and 7-9% on NPN-Cooking and ProPara Datasets respectively across metrics.

* 8 pages (Accepted for publication at NAACL 2024 (Main Conference))

Via

Access Paper or Ask Questions

How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Mar 30, 2024

Akash Ghosh, B Venkata Sahith, Niloy Ganguly, Pawan Goyal, Mayank Singh

Figure 1 for How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Figure 2 for How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Figure 3 for How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Figure 4 for How Robust are the Tabular QA Models for Scientific Tables? A Study using Customized Dataset

Abstract:Question-answering (QA) on hybrid scientific tabular and textual data deals with scientific information, and relies on complex numerical reasoning. In recent years, while tabular QA has seen rapid progress, understanding their robustness on scientific information is lacking due to absence of any benchmark dataset. To investigate the robustness of the existing state-of-the-art QA models on scientific hybrid tabular data, we propose a new dataset, "SciTabQA", consisting of 822 question-answer pairs from scientific tables and their descriptions. With the help of this dataset, we assess the state-of-the-art Tabular QA models based on their ability (i) to use heterogeneous information requiring both structured data (table) and unstructured data (text) and (ii) to perform complex scientific reasoning tasks. In essence, we check the capability of the models to interpret scientific tables and text. Our experiments show that "SciTabQA" is an innovative dataset to study question-answering over scientific heterogeneous data. We benchmark three state-of-the-art Tabular QA models, and find that the best F1 score is only 0.462.

Via

Access Paper or Ask Questions

Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

Mar 08, 2024

Arijit Nag, Animesh Mukherjee, Niloy Ganguly, Soumen Chakrabarti

Figure 1 for Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

Figure 2 for Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

Figure 3 for Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

Figure 4 for Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs

Abstract:Large Language Models (LLMs) exhibit impressive zero/few-shot inference and generation quality for high-resource languages(HRLs). A few of them have been trained in low-resource languages (LRLs) and give decent performance. Owing to the prohibitive costs of training LLMs, they are usually used as a network service, with the client charged by the count of input and output tokens. The number of tokens strongly depends on the script and language, as well as the LLM's sub-word vocabulary. We show that LRLs are at a pricing disadvantage, because the well-known LLMs produce more tokens for LRLs than HRLs. This is because most currently popular LLMs are optimized for HRL vocabularies. Our objective is to level the playing field: reduce the cost of processing LRLs in contemporary LLMs while ensuring that predictive and generative qualities are not compromised. As means to reduce the number of tokens processed by the LLM, we consider code-mixing, translation, and transliteration of LRLs to HRLs. We perform an extensive study using the IndicXTREME dataset, covering 15 Indian languages, while using GPT-4 (one of the costliest LLM services released so far) as a commercial LLM. We observe and analyze interesting patterns involving token count, cost,and quality across a multitude of languages and tasks. We show that choosing the best policy to interact with the LLM can reduce cost by 90% while giving better or comparable performance, compared to communicating with the LLM in the original LRL.

Via

Access Paper or Ask Questions