Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Anthony Rios

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

Jun 02, 2026

Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios

Abstract:Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model's limited ability to use retrieved evidence effectively.

* 9 Pages, accepted to BioNLP Workshop at ACL 2026

Via

Access Paper or Ask Questions

Prompting Underestimates LLM Capability for Time Series Classification

Jan 06, 2026

Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios

Abstract:Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.

* 8 pages + Appendix and References, 9 figures

Via

Access Paper or Ask Questions

Can We Reliably Rank Model Performance across Domains without Labeled Data?

Oct 10, 2025

Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe, Anthony Rios

Abstract:Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.

* 8 pages + references and Appendix

Via

Access Paper or Ask Questions

Reflective Agreement: Combining Self-Mixture of Agents with a Sequence Tagger for Robust Event Extraction

Aug 26, 2025

Fatemeh Haji, Mazal Bethany, Cho-Yu Jason Chiang, Anthony Rios, Peyman Najafirad

Abstract:Event Extraction (EE) involves automatically identifying and extracting structured information about events from unstructured text, including triggers, event types, and arguments. Traditional discriminative models demonstrate high precision but often exhibit limited recall, particularly for nuanced or infrequent events. Conversely, generative approaches leveraging Large Language Models (LLMs) provide higher semantic flexibility and recall but suffer from hallucinations and inconsistent predictions. To address these challenges, we propose Agreement-based Reflective Inference System (ARIS), a hybrid approach combining a Self Mixture of Agents with a discriminative sequence tagger. ARIS explicitly leverages structured model consensus, confidence-based filtering, and an LLM reflective inference module to reliably resolve ambiguities and enhance overall event prediction quality. We further investigate decomposed instruction fine-tuning for enhanced LLM event extraction understanding. Experiments demonstrate our approach outperforms existing state-of-the-art event extraction methods across three benchmark datasets.

Via

Access Paper or Ask Questions

Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Jun 06, 2025

James Ford, Anthony Rios

Figure 1 for Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Figure 2 for Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Figure 3 for Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Figure 4 for Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Abstract:Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15\% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the \textsc{Text2Chart31} benchmark, our system reduces execution errors to 4.5\% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the \textsc{ChartX} benchmark, with an error rate of 4.6\%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3\% (\textsc{Text2Chart31}) and 7.2\% (\textsc{ChartX}) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.

* 8 pages

Via

Access Paper or Ask Questions

UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting

Jun 05, 2025

Sara Shields-Menard, Zach Reimers, Joshua Gardner, David Perry, Anthony Rios

Abstract:We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician's question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.

* Accepted to BioNLP 2025

Via

Access Paper or Ask Questions

Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Sep 27, 2024

James Ford, Xingmeng Zhao, Dan Schumacher, Anthony Rios

Figure 1 for Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Figure 2 for Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Figure 3 for Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Figure 4 for Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

Abstract:We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI's GPT-3.5 Turbo and Meta's Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.

Via

Access Paper or Ask Questions

Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Sep 17, 2024

Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, Peyman Najafirad

Figure 1 for Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Figure 2 for Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Figure 3 for Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Figure 4 for Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Abstract:Multi-agent strategies have emerged as a promising approach to enhance the reasoning abilities of Large Language Models (LLMs) by assigning specialized roles in the problem-solving process. Concurrently, Tree of Thoughts (ToT) methods have shown potential in improving reasoning for complex question-answering tasks by exploring diverse reasoning paths. A critical limitation in multi-agent reasoning is the 'Reasoner' agent's shallow exploration of reasoning paths. While ToT strategies could help mitigate this problem, they may generate flawed reasoning branches, which could harm the trustworthiness of the final answer. To leverage the strengths of both multi-agent reasoning and ToT strategies, we introduce a novel approach combining ToT-based Reasoner agents with a Thought Validator agent. Multiple Reasoner agents operate in parallel, employing ToT to explore diverse reasoning paths. The Thought Validator then scrutinizes these paths, considering a Reasoner's conclusion only if its reasoning is valid. This method enables a more robust voting strategy by discarding faulty reasoning paths, enhancing the system's ability to tackle tasks requiring systematic and trustworthy reasoning. Our method demonstrates superior performance compared to existing techniques when evaluated on the GSM8K dataset, outperforming the standard ToT strategy by an average 5.6\% across four LLMs.

Via

Access Paper or Ask Questions

Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Aug 30, 2024

Mazal Bethany, Emet Bethany, Brandon Wherry, Cho-Yu Chiang, Nishant Vishwamitra, Anthony Rios, Peyman Najafirad

Figure 1 for Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Figure 2 for Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Figure 3 for Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Figure 4 for Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs

Abstract:Event detection and text reasoning have become critical applications across various domains. While LLMs have recently demonstrated impressive progress in reasoning abilities, they often struggle with event detection, particularly due to the absence of training methods that consider causal relationships between event triggers and types. To address this challenge, we propose a novel approach for instruction fine-tuning LLMs for event detection. Our method introduces Semantic Causal Graphs (SCGs) to capture both causal relationships and contextual information within text. Building off of SCGs, we propose SCG Instructions for fine-tuning LLMs by focusing on event triggers and their relationships to event types, and employ Low-Rank Adaptation (LoRA) to help preserve the general reasoning abilities of LLMs. Our evaluations demonstrate that training LLMs with SCG Instructions outperforms standard instruction fine-tuning by an average of 35.69\% on Event Trigger Classification. Notably, our fine-tuned Mistral 7B model also outperforms GPT-4 on key event detection metrics by an average of 31.01\% on Event Trigger Identification, 37.40\% on Event Trigger Classification, and 16.43\% on Event Classification. We analyze the retention of general capabilities, observing only a minimal average drop of 2.03 points across six benchmarks. This comprehensive study investigates multiple LLMs for the event detection task across various datasets, prompting strategies, and training approaches.

Via

Access Paper or Ask Questions

Context Matters: An Empirical Study of the Impact of Contextual Information in Temporal Question Answering Systems

Jun 27, 2024

Dan Schumacher, Fatemeh Haji, Tara Grey, Niharika Bandlamudi, Nupoor Karnik, Gagana Uday Kumar, Jason Cho-Yu Chiang, Paul Rad, Nishant Vishwamitra, Anthony Rios

Abstract:Large language models (LLMs) often struggle with temporal reasoning, crucial for tasks like historical event analysis and time-sensitive information retrieval. Despite advancements, state-of-the-art models falter in handling temporal information, especially when faced with irrelevant or noisy contexts. This paper addresses this gap by empirically examining the robustness of temporal question-answering (TQA) systems trained on various context types, including relevant, irrelevant, slightly altered, and no context. Our findings indicate that training with a mix of these contexts enhances model robustness and accuracy. Additionally, we show that the position of context relative to the question significantly impacts performance, with question-first positioning yielding better results. We introduce two new context-rich TQA datasets, ContextAQA and ContextTQE, and provide comprehensive evaluations and guidelines for training robust TQA models. Our work lays the foundation for developing reliable and context-aware temporal QA systems, with broader implications for enhancing LLM robustness against diverse and potentially adversarial information.

Via

Access Paper or Ask Questions