Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sanjay Chawla

Qatar Computing Research Institute

Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Aug 11, 2025

Tianyi Zhou, Johanne Medina, Sanjay Chawla

Abstract:Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.

Via

Access Paper or Ask Questions

Solving Normalized Cut Problem with Constrained Action Space

May 20, 2025

Qize Jiang, Linsey Pang, Alice Gatti, Mahima Aggarwa, Giovanna Vantin, Xiaosong Ma, Weiwei Sun, Sanjay Chawla

Figure 1 for Solving Normalized Cut Problem with Constrained Action Space

Figure 2 for Solving Normalized Cut Problem with Constrained Action Space

Figure 3 for Solving Normalized Cut Problem with Constrained Action Space

Figure 4 for Solving Normalized Cut Problem with Constrained Action Space

Abstract:Reinforcement Learning (RL) has emerged as an important paradigm to solve combinatorial optimization problems primarily due to its ability to learn heuristics that can generalize across problem instances. However, integrating external knowledge that will steer combinatorial optimization problem solutions towards domain appropriate outcomes remains an extremely challenging task. In this paper, we propose the first RL solution that uses constrained action spaces to guide the normalized cut problem towards pre-defined template instances. Using transportation networks as an example domain, we create a Wedge and Ring Transformer that results in graph partitions that are shaped in form of Wedges and Rings and which are likely to be closer to natural optimal partitions. However, our approach is general as it is based on principles that can be generalized to other domains.

Via

Access Paper or Ask Questions

aiXamine: Simplified LLM Safety and Security

Apr 23, 2025

Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil

Figure 1 for aiXamine: Simplified LLM Safety and Security

Figure 2 for aiXamine: Simplified LLM Safety and Security

Figure 3 for aiXamine: Simplified LLM Safety and Security

Figure 4 for aiXamine: Simplified LLM Safety and Security

Abstract:Evaluating Large Language Models (LLMs) for safety and security remains a complex task, often requiring users to navigate a fragmented landscape of ad hoc benchmarks, datasets, metrics, and reporting formats. To address this challenge, we present aiXamine, a comprehensive black-box evaluation platform for LLM safety and security. aiXamine integrates over 40 tests (i.e., benchmarks) organized into eight key services targeting specific dimensions of safety and security: adversarial robustness, code security, fairness and bias, hallucination, model and data privacy, out-of-distribution (OOD) robustness, over-refusal, and safety alignment. The platform aggregates the evaluation results into a single detailed report per model, providing a detailed breakdown of model performance, test examples, and rich visualizations. We used aiXamine to assess over 50 publicly available and proprietary LLMs, conducting over 2K examinations. Our findings reveal notable vulnerabilities in leading models, including susceptibility to adversarial attacks in OpenAI's GPT-4o, biased outputs in xAI's Grok-3, and privacy weaknesses in Google's Gemini 2.0. Additionally, we observe that open-source models can match or exceed proprietary models in specific services such as safety alignment, fairness and bias, and OOD robustness. Finally, we identify trade-offs between distillation strategies, model size, training methods, and architectural choices.

Via

Access Paper or Ask Questions

aiXamine: LLM Safety and Security Simplified

Apr 21, 2025

Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, Issa Khalil

Figure 1 for aiXamine: LLM Safety and Security Simplified

Figure 2 for aiXamine: LLM Safety and Security Simplified

Figure 3 for aiXamine: LLM Safety and Security Simplified

Figure 4 for aiXamine: LLM Safety and Security Simplified

Via

Access Paper or Ask Questions

DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Mar 27, 2025

Dorde Popovic, Amin Sadeghi, Ting Yu, Sanjay Chawla, Issa Khalil

Figure 1 for DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Figure 2 for DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Figure 3 for DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Figure 4 for DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Abstract:Backdoor attacks are among the most effective, practical, and stealthy attacks in deep learning. In this paper, we consider a practical scenario where a developer obtains a deep model from a third party and uses it as part of a safety-critical system. The developer wants to inspect the model for potential backdoors prior to system deployment. We find that most existing detection techniques make assumptions that are not applicable to this scenario. In this paper, we present a novel framework for detecting backdoors under realistic restrictions. We generate candidate triggers by deductively searching over the space of possible triggers. We construct and optimize a smoothed version of Attack Success Rate as our search objective. Starting from a broad class of template attacks and just using the forward pass of a deep model, we reverse engineer the backdoor attack. We conduct extensive evaluation on a wide range of attacks, models, and datasets, with our technique performing almost perfectly across these settings.

Via

Access Paper or Ask Questions

A Perspective on Symbolic Machine Learning in Physical Sciences

Feb 25, 2025

Nour Makke, Sanjay Chawla

Abstract:Machine learning is rapidly making its pathway across all of the natural sciences, including physical sciences. The rate at which ML is impacting non-scientific disciplines is incomparable to that in the physical sciences. This is partly due to the uninterpretable nature of deep neural networks. Symbolic machine learning stands as an equal and complementary partner to numerical machine learning in speeding up scientific discovery in physics. This perspective discusses the main differences between the ML and scientific approaches. It stresses the need to develop and apply symbolic machine learning to physics problems equally, in parallel to numerical machine learning, because of the dual nature of physics research.

* Machine Learning and the Physical Sciences Workshop at NeurIPS 2024

Via

Access Paper or Ask Questions

Inferring Interpretable Models of Fragmentation Functions using Symbolic Regression

Jan 13, 2025

Nour Makke, Sanjay Chawla

Abstract:Machine learning is rapidly making its path into natural sciences, including high-energy physics. We present the first study that infers, directly from experimental data, a functional form of fragmentation functions. The latter represent a key ingredient to describe physical observables measured in high-energy physics processes that involve hadron production, and predict their values at different energy. Fragmentation functions can not be calculated in theory and have to be determined instead from data. Traditional approaches rely on global fits of experimental data using a pre-assumed functional form inspired from phenomenological models to learn its parameters. This novel approach uses a ML technique, namely symbolic regression, to learn an analytical model from measured charged hadron multiplicities. The function learned by symbolic regression resembles the Lund string function and describes the data well, thus representing a potential candidate for use in global FFs fits. This study represents an approach to follow in such QCD-related phenomenology studies and more generally in sciences.

Via

Access Paper or Ask Questions

Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

May 27, 2024

Enes Altinisik, Safa Messaoud, Husrev Taha Sencar, Hassan Sajjad, Sanjay Chawla

Figure 1 for Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

Figure 2 for Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

Figure 3 for Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

Figure 4 for Exploiting the Layered Intrinsic Dimensionality of Deep Models for Practical Adversarial Training

Abstract:Despite being a heavily researched topic, Adversarial Training (AT) is rarely, if ever, deployed in practical AI systems for two primary reasons: (i) the gained robustness is frequently accompanied by a drop in generalization and (ii) generating adversarial examples (AEs) is computationally prohibitively expensive. To address these limitations, we propose SMAAT, a new AT algorithm that leverages the manifold conjecture, stating that off-manifold AEs lead to better robustness while on-manifold AEs result in better generalization. Specifically, SMAAT aims at generating a higher proportion of off-manifold AEs by perturbing the intermediate deepnet layer with the lowest intrinsic dimension. This systematically results in better scalability compared to classical AT as it reduces the PGD chains length required for generating the AEs. Additionally, our study provides, to the best of our knowledge, the first explanation for the difference in the generalization and robustness trends between vision and language models, ie., AT results in a drop in generalization in vision models whereas, in encoder-based language models, generalization either improves or remains unchanged. We show that vision transformers and decoder-based models tend to have low intrinsic dimensionality in the earlier layers of the network (more off-manifold AEs), while encoder-based models have low intrinsic dimensionality in the later layers. We demonstrate the efficacy of SMAAT; on several tasks, including robustifying (i) sentiment classifiers, (ii) safety filters in decoder-based models, and (iii) retrievers in RAG setups. SMAAT requires only 25-33% of the GPU time compared to standard AT, while significantly improving robustness across all applications and maintaining comparable generalization.

Via

Access Paper or Ask Questions

S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic

May 02, 2024

Safa Messaoud, Billel Mokeddem, Zhenghai Xue, Linsey Pang, Bo An, Haipeng Chen, Sanjay Chawla

Abstract:Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity, and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, resulting in high computational complexity and variance (SQL), or follow a variational inference procedure that fits simplified actor distributions (e.g., Gaussian) for tractability (SAC). We propose Stein Soft Actor-Critic (S$^2$AC), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. Specifically, S$^2$AC uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. We derive a closed-form expression of the entropy of such policies. Our formula is computationally efficient and only depends on first-order derivatives and vector products. Empirical results show that S$^2$AC yields more optimal solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment, and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://github.com/SafaMessaoud/S2AC-Energy-Based-RL-with-Stein-Soft-Actor-Critic

* Accepted for publication at ICLR 2024

Via

Access Paper or Ask Questions

Out-of-Distribution Data: An Acquaintance of Adversarial Examples -- A Survey

Apr 08, 2024

Naveen Karunanayake, Ravin Gunawardena, Suranga Seneviratne, Sanjay Chawla

Figure 1 for Out-of-Distribution Data: An Acquaintance of Adversarial Examples -- A Survey

Figure 2 for Out-of-Distribution Data: An Acquaintance of Adversarial Examples -- A Survey

Figure 3 for Out-of-Distribution Data: An Acquaintance of Adversarial Examples -- A Survey

Figure 4 for Out-of-Distribution Data: An Acquaintance of Adversarial Examples -- A Survey

Abstract:Deep neural networks (DNNs) deployed in real-world applications can encounter out-of-distribution (OOD) data and adversarial examples. These represent distinct forms of distributional shifts that can significantly impact DNNs' reliability and robustness. Traditionally, research has addressed OOD detection and adversarial robustness as separate challenges. This survey focuses on the intersection of these two areas, examining how the research community has investigated them together. Consequently, we identify two key research directions: robust OOD detection and unified robustness. Robust OOD detection aims to differentiate between in-distribution (ID) data and OOD data, even when they are adversarially manipulated to deceive the OOD detector. Unified robustness seeks a single approach to make DNNs robust against both adversarial attacks and OOD inputs. Accordingly, first, we establish a taxonomy based on the concept of distributional shifts. This framework clarifies how robust OOD detection and unified robustness relate to other research areas addressing distributional shifts, such as OOD detection, open set recognition, and anomaly detection. Subsequently, we review existing work on robust OOD detection and unified robustness. Finally, we highlight the limitations of the existing work and propose promising research directions that explore adversarial and OOD inputs within a unified framework.

Via

Access Paper or Ask Questions