Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cody Fleming

LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Feb 19, 2026

Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala, Qisai Liu, Cody Fleming, Soumik Sarkar

Abstract:Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.

* 17th ACM/IEEE International Conference on Cyber-Physical Systems

Via

Access Paper or Ask Questions

LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation

Feb 07, 2026

Nitesh Subedi, Adam Haroon, Samuel Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar

Abstract:We propose LCLA (Language-Conditioned Latent Alignment), a framework for vision-language navigation that learns modular perception-action interfaces by aligning sensory observations to a latent representation of an expert policy. The expert is first trained with privileged state information, inducing a latent space sufficient for control, after which its latent interface and action head are frozen. A lightweight adapter is then trained to map raw visual-language observations, via a frozen vision-language model, into the expert's latent space, reducing the problem of visuomotor learning to supervised latent alignment rather than end-to-end policy optimization. This decoupling enforces a stable contract between perception and control, enabling expert behavior to be reused across sensing modalities and environmental variations. We instantiate LCLA and evaluate it on a vision-language indoor navigation task, where aligned latent spaces yield strong in-distribution performance and robust zero-shot generalization to unseen environments, lighting conditions, and viewpoints while remaining lightweight at inference time.

Via

Access Paper or Ask Questions

Toward Faithful and Complete Answer Construction from a Single Document

Feb 05, 2026

Zhaoyang Chen, Cody Fleming

Abstract:Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, directly applying LLMs lacks systematic mechanisms to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which fundamentally conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design enables consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24\% and 29\%, respectively, with a corresponding 31\% gain in F1-score. This effectively breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, while also mitigating generation truncation caused by length limitations. At the same time, we emphasize that EVE exhibits performance saturation due to the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.

Via

Access Paper or Ask Questions

Neural CDEs as Correctors for Learned Time Series Models

Dec 19, 2025

Muhammad Bilal Shahid, Prajwal Koirla, Cody Fleming

Abstract:Learned time-series models, whether continuous- or discrete-time, are widely used to forecast the states of a dynamical system. Such models generate multi-step forecasts either directly, by predicting the full horizon at once, or iteratively, by feeding back their own predictions at each step. In both cases, the multi-step forecasts are prone to errors. To address this, we propose a Predictor-Corrector mechanism where the Predictor is any learned time-series model and the Corrector is a neural controlled differential equation. The Predictor forecasts, and the Corrector predicts the errors of the forecasts. Adding these errors to the forecasts improves forecast performance. The proposed Corrector works with irregularly sampled time series and continuous- and discrete-time Predictors. Additionally, we introduce two regularization strategies to improve the extrapolation performance of the Corrector with accelerated training. We evaluate our Corrector with diverse Predictors, e.g., neural ordinary differential equations, Contiformer, and DLinear, on synthetic, physics simulation, and real-world forecasting datasets. The experiments demonstrate that the Predictor-Corrector mechanism consistently improves the performance compared to Predictor alone.

Via

Access Paper or Ask Questions

Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Jun 26, 2025

Prajwal Koirala, Cody Fleming

Abstract:Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the \textit{Single-Step Completion Policy} (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

Via

Access Paper or Ask Questions

Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Jun 17, 2025

Nitesh Subedi, Adam Haroon, Shreyan Ganguly, Samuel T. K. Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar

Figure 1 for Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Figure 2 for Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Figure 3 for Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Figure 4 for Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Abstract:Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that decouples this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though requiring 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with long-horizon planning and spatial reasoning. By providing this empirical baseline, we highlight both the capabilities and limitations of using foundation models as drop-in representations for embodied tasks, offering critical insights for robotics researchers facing practical design tradeoffs between system complexity and performance in resource-constrained scenarios. Our code is available at https://github.com/oadamharoon/text2nav

* 6 figures, 2 tables, Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)

Via

Access Paper or Ask Questions

HopCast: Calibration of Autoregressive Dynamics Models

Jan 27, 2025

Muhammad Bilal Shahid, Cody Fleming

Figure 1 for HopCast: Calibration of Autoregressive Dynamics Models

Figure 2 for HopCast: Calibration of Autoregressive Dynamics Models

Figure 3 for HopCast: Calibration of Autoregressive Dynamics Models

Figure 4 for HopCast: Calibration of Autoregressive Dynamics Models

Abstract:Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. These models are optimized to predict one step ahead and produce calibrated predictions if the predictive model can quantify uncertainty, such as deep ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method (e.g., Trajectory Sampling) to produce calibrated multi-step predictions. This paper introduces an approach named HopCast that uses the Modern Hopfield Network (MHN) to learn the residuals of a deterministic model that approximates the dynamical system. The MHN predicts the density of residuals based on a context vector at any timestep during autoregression. This approach produces calibrated multi-step predictions without uncertainty propagation and turns a deterministic model into a calibrated probabilistic model. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors with deep ensembles for multi-step predictions.

Via

Access Paper or Ask Questions

FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Dec 12, 2024

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

Figure 1 for FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Figure 2 for FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Figure 3 for FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Figure 4 for FAWAC: Feasibility Informed Advantage Weighted Regression for Persistent Safety in Offline Reinforcement Learning

Abstract:Safe offline reinforcement learning aims to learn policies that maximize cumulative rewards while adhering to safety constraints, using only offline data for training. A key challenge is balancing safety and performance, particularly when the policy encounters out-of-distribution (OOD) states and actions, which can lead to safety violations or overly conservative behavior during deployment. To address these challenges, we introduce Feasibility Informed Advantage Weighted Actor-Critic (FAWAC), a method that prioritizes persistent safety in constrained Markov decision processes (CMDPs). FAWAC formulates policy optimization with feasibility conditions derived specifically for offline datasets, enabling safe policy updates in non-parametric policy space, followed by projection into parametric space for constrained actor training. By incorporating a cost-advantage term into Advantage Weighted Regression (AWR), FAWAC ensures that the safety constraints are respected while maximizing performance. Additionally, we propose a strategy to address a more challenging class of problems that involves tempting datasets where trajectories are predominantly high-rewarded but unsafe. Empirical evaluations on standard benchmarks demonstrate that FAWAC achieves strong results, effectively balancing safety and performance in learning policies from the static datasets.

Via

Access Paper or Ask Questions

Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Dec 11, 2024

Prajwal Koirala, Zhanhong Jiang, Soumik Sarkar, Cody Fleming

Figure 1 for Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Figure 2 for Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Figure 3 for Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Figure 4 for Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

Abstract:In safe offline reinforcement learning (RL), the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.

Via

Access Paper or Ask Questions

F1tenth Autonomous Racing With Offline Reinforcement Learning Methods

Aug 08, 2024

Prajwal Koirala, Cody Fleming

Figure 1 for F1tenth Autonomous Racing With Offline Reinforcement Learning Methods

Figure 2 for F1tenth Autonomous Racing With Offline Reinforcement Learning Methods

Figure 3 for F1tenth Autonomous Racing With Offline Reinforcement Learning Methods

Figure 4 for F1tenth Autonomous Racing With Offline Reinforcement Learning Methods

Abstract:Autonomous racing serves as a critical platform for evaluating automated driving systems and enhancing vehicle mobility intelligence. This work investigates offline reinforcement learning methods to train agents within the dynamic F1tenth racing environment. The study begins by exploring the challenges of online training in the Austria race track environment, where agents consistently fail to complete the laps. Consequently, this research pivots towards an offline strategy, leveraging `expert' demonstration dataset to facilitate agent training. A waypoint-based suboptimal controller is developed to gather data with successful lap episodes. This data is then employed to train offline learning-based algorithms, with a subsequent analysis of the agents' cross-track performance, evaluating their zero-shot transferability from seen to unseen scenarios and their capacity to adapt to changes in environment dynamics. Beyond mere algorithm benchmarking in autonomous racing scenarios, this study also introduces and describes the machinery of our return-conditioned decision tree-based policy, comparing its performance with methods that employ fully connected neural networks, Transformers, and Diffusion Policies and highlighting some insights into method selection for training autonomous agents in driving interactions.

Via

Access Paper or Ask Questions