Abstract: Reducing hallucinations in Large Language Models (LLMs) is essential for improving the accuracy of data extraction from large text corpora. Current methods, such as prompt engineering and chain-of-thought prompting, focus on individual documents but fail to consider relationships across a corpus. This paper introduces Peer Context Outlier Detection (P-COD), a novel approach that uses the relationships between documents to improve extraction accuracy. Our application domain is scientific literature summarization, where papers with similar experimental settings should draw similar conclusions. By comparing extracted data to validated peer information within the corpus, we adjust confidence scores and flag low-confidence results for expert review. High-confidence results, supported by peer validation, are considered reliable. Our experiments achieve up to 98% precision in outlier detection across six scientific domains, indicating that our design reduces hallucinations, enhances trust in automated systems, and lets researchers focus on ambiguous cases, streamlining data extraction workflows.
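The abstract does not specify how confidence scores are adjusted, so the following is only a minimal sketch of the general idea, assuming a single numeric extracted value per paper and a list of already validated values from peer papers with similar settings. The function names, the median/MAD agreement rule, and the `tolerance` and `threshold` parameters are all hypothetical and are not taken from P-COD itself.

```python
from statistics import median


def peer_adjusted_confidence(value, peer_values, base_confidence, tolerance=3.0):
    """Scale an extraction's confidence by its agreement with validated peers.

    Hypothetical rule (not from the paper): measure how far `value` lies from
    the peer median in units of the median absolute deviation (MAD), and
    shrink the confidence linearly to zero at `tolerance` MADs.
    """
    if len(peer_values) < 2:
        # Too few peers to judge agreement; leave the confidence unchanged.
        return base_confidence
    center = median(peer_values)
    mad = median([abs(v - center) for v in peer_values])
    scale = mad if mad > 0 else 1e-9  # avoid division by zero for identical peers
    deviation = abs(value - center) / scale
    agreement = max(0.0, 1.0 - deviation / tolerance)
    return base_confidence * agreement


def flag_for_review(confidence, threshold=0.5):
    """Return True when the adjusted confidence is low enough for expert review."""
    return confidence < threshold


# Illustrative values only: five validated peer measurements and two new extractions.
peers = [10.0, 10.5, 9.5, 10.2, 9.8]
for extracted in (10.1, 25.0):
    conf = peer_adjusted_confidence(extracted, peers, base_confidence=0.9)
    print(extracted, round(conf, 3), flag_for_review(conf))
```

In this sketch an extraction close to its peers keeps most of its confidence, while one far from the peer median drops to zero and is routed to an expert. A robust center and spread (median and MAD) are used so that a single bad peer value does not dominate the comparison.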
Abstract: Scientific discovery is slowed by fragmented literature that requires excessive human effort to gather, analyze, and understand. AI tools, including autonomous summarization and question answering, have been developed to aid in understanding scientific literature. However, these tools lack the structured, multi-step approach necessary for extracting deep insights from it. Large Language Models (LLMs) offer new possibilities for literature analysis, but remain unreliable due to hallucinations and incomplete extraction. We introduce Elhuyar, a multi-agent, human-in-the-loop system that integrates LLMs, structured AI, and human scientists to extract, analyze, and iteratively refine insights from scientific literature. The framework distributes tasks among specialized agents for filtering papers, extracting data, fitting models, and summarizing findings, with human oversight ensuring reliability. The system generates structured reports with extracted data, visualizations, model equations, and text summaries, enabling deeper inquiry through iterative refinement. Deployed in materials science, the system analyzed literature on tungsten under helium-ion irradiation and found that, across the reported experiments, helium bubble growth correlates exponentially with irradiation dose and temperature, offering insight for plasma-facing materials (PFMs) in fusion reactors. This demonstrates how AI-assisted literature review can uncover scientific patterns and accelerate discovery.
Abstract: In this work, we introduce Contextual Analog Logic with Multimodality (CALM). CALM unites symbolic reasoning with neural generation, enabling systems to make context-sensitive decisions grounded in real-world multi-modal data. Background: Classic bivalent logic systems cannot capture the nuance of human decision-making. They also require human grounding in multi-modal environments, which can be ad hoc, rigid, and brittle. Neural networks are good at extracting rich contextual information from multi-modal data, but lack interpretable structures for reasoning. Objectives: CALM aims to bridge the gap between logic and neural perception, creating an analog logic that can reason over multi-modal inputs. Without this integration, AI systems remain either brittle or unstructured, unable to generalize robustly to real-world tasks. In CALM, symbolic predicates evaluate to analog truth values computed by neural networks and constrained search. Methods: CALM represents each predicate using a domain tree, which iteratively refines the predicate's analog truth value as the contextual groundings of its entities are determined. Each refinement is predicted by neural networks capable of capturing multi-modal information and is filtered through a symbolic reasoning module to ensure constraint satisfaction. Results: In fill-in-the-blank object placement tasks, CALM achieved 92.2% accuracy, outperforming classical logic (86.3%) and LLM (59.4%) baselines. It also generated spatial heatmaps aligned with logical constraints and subtle human preferences, as shown by a human study. Conclusions: CALM demonstrates the potential to reason with logical structure while aligning with preferences in multi-modal environments. It lays the foundation for next-generation AI systems that require the precision and interpretability of logic and the multi-modal information processing of neural networks.
Abstract: Approximately 1.25 million people in the United States are treated each year for burn injuries. Precise burn injury classification is an important problem in medical AI. In this work, we propose an explainable human-in-the-loop framework for improving burn ultrasound classification models. Our framework leverages an explanation system based on the LIME classification explainer to corroborate and integrate a burn expert's knowledge, suggesting new features and ensuring the validity of the model. Using this framework, we discover that B-mode ultrasound classifiers can be enhanced by supplying textural features. More specifically, we confirm that texture features based on the Gray Level Co-occurrence Matrix (GLCM) of ultrasound frames can increase the accuracy of transfer-learned burn depth classifiers. We test our hypothesis on real data from porcine subjects. We show that the accuracy of burn depth classification improves from ~88% to ~94% once the classifier is modified according to our framework.
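To make the GLCM texture features concrete, here is a minimal pure-Python sketch that builds a normalized co-occurrence matrix for one pixel offset and derives two common texture statistics from it. The abstract does not say which GLCM statistics, offsets, or quantization levels the authors use, so the function names, the choice of contrast and homogeneity, and the tiny 4x4 example image are illustrative assumptions only.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for a single pixel offset.

    `image` is a list of rows of integer gray levels in [0, levels).
    Entry [i][j] is the fraction of pixel pairs (p, p + offset) whose
    gray levels are i and j respectively.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        # No valid pixel pairs for this offset (e.g. a single-column image).
        return [[0.0] * levels for _ in range(levels)]
    return [[n / total for n in row] for row in counts]


def contrast(p):
    """Weighted sum of squared gray-level differences; higher means rougher texture."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def homogeneity(p):
    """Weights pairs by closeness of gray levels; higher means smoother texture."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))


# Illustrative 4x4 frame quantized to 4 gray levels (not real ultrasound data).
frame = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(frame, levels=4)
features = [contrast(p), homogeneity(p)]
print(features)
```

In a pipeline like the one described, a feature vector of this kind would presumably be concatenated with the learned image features before classification; how the authors actually combine them is not stated in the abstract.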