Jeff Schneider

Reasoning with Latent Diffusion in Offline Reinforcement Learning

Sep 12, 2023
Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, Glen Berseth

Offline reinforcement learning (RL) holds promise as a means to learn high-reward policies from a static dataset, without the need for further environment interactions. However, a key challenge in offline RL lies in effectively stitching portions of suboptimal trajectories from the static dataset while avoiding extrapolation errors arising due to a lack of support in the dataset. Existing approaches either use conservative methods that are tricky to tune and (as we show) struggle with multi-modal data, or rely on noisy Monte Carlo return-to-go samples for reward conditioning. In this work, we propose a novel approach that leverages the expressiveness of latent diffusion to model in-support trajectory sequences as compressed latent skills. This facilitates learning a Q-function while avoiding extrapolation error via batch-constraining. The latent space is also expressive and gracefully copes with multi-modal data. We show that the learned temporally abstract latent space encodes richer task-specific information for offline RL tasks than raw state-actions. This improves credit assignment and facilitates faster reward propagation during Q-learning. Our method demonstrates state-of-the-art performance on the D4RL benchmarks, particularly excelling in long-horizon, sparse-reward tasks.
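
A minimal sketch of the batch-constrained idea described above, for intuition only: short sub-trajectories are compressed into latent "skills", and the Q-learning target maximizes only over latents produced by the learned generative model, so the value function is never queried on out-of-support behavior. The module sizes, dimensions, and the stand-in sampler (re-encoding dataset segments in place of the paper's latent-diffusion prior) are illustrative assumptions, not the published implementation.

```python
# Hypothetical sketch of batch-constrained Q-learning over latent "skills".
# The latent-diffusion prior is replaced by a stand-in that re-encodes dataset
# sub-trajectories; names, shapes, and sizes are illustrative only.
import torch
import torch.nn as nn

LATENT_DIM, STATE_DIM, ACTION_DIM, H = 8, 17, 6, 4   # assumed sizes; H = skill horizon

class SeqEncoder(nn.Module):
    """Compresses an H-step (state, action) segment into a latent skill z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(H * (STATE_DIM + ACTION_DIM), 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM))

    def forward(self, segment):            # segment: (K, H * (STATE_DIM + ACTION_DIM))
        return self.net(segment)

class LatentQ(nn.Module):
    """Q(s, z): value of executing latent skill z starting from state s."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1)).squeeze(-1)

def q_target(q_net, encoder, next_state, next_segments, reward, gamma=0.99):
    """Batch-constrained target: the max over z ranges only over latents coming
    from the generative model of in-support skills, never over arbitrary z.
    next_state: (1, STATE_DIM); next_segments: (K, H * (STATE_DIM + ACTION_DIM))."""
    with torch.no_grad():
        z_candidates = encoder(next_segments)                 # (K, LATENT_DIM)
        state_rep = next_state.expand(z_candidates.shape[0], -1)
        best_in_support_value = q_net(state_rep, z_candidates).max()
        return reward + gamma ** H * best_in_support_value
```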

Kernelized Offline Contextual Dueling Bandits

Jul 21, 2023
Viraj Mehta, Ojash Neopane, Vikramjeet Das, Sen Lin, Jeff Schneider, Willie Neiswanger

Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that the agent can often choose the contexts at which to obtain human feedback in order to identify a good policy most efficiently, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound-style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
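
A hedged illustration of the upper-confidence-bound flavor of this approach: fit a kernelized model of a latent utility over (context, action) pairs, then request human feedback at the context where the confidence interval around the apparently best action is widest. The RBF kernel, the regression treatment of preference feedback, and the acquisition rule are simplifying assumptions for exposition, not the paper's exact method.

```python
# Illustrative numpy sketch of a UCB-style rule for choosing which context to
# query for preference feedback. Kernel, noise model, and acquisition form are
# assumptions; preferences are treated here as noisy utility observations.
import numpy as np

def rbf(X, Y, lengthscale=0.5):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Gaussian-process posterior over a latent utility f(context, action)."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf(X_test, X_train)
    K_ss = rbf(X_test, X_test)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def pick_query_context(contexts, actions, X_train, y_train, beta=2.0):
    """Query the context whose apparently best action is least certain,
    i.e. where a preference comparison should be most informative."""
    best_gap, best_context = -np.inf, None
    for context in contexts:
        X_test = np.array([np.concatenate([context, action]) for action in actions])
        mean, std = gp_posterior(X_train, y_train, X_test)
        ucb, lcb = mean + beta * std, mean - beta * std
        gap = ucb.max() - lcb[np.argmax(ucb)]   # uncertainty about the best action's utility
        if gap > best_gap:
            best_gap, best_context = gap, context
    return best_context
```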

Data Cross-Segmentation for Improved Generalization in Reinforcement Learning Based Algorithmic Trading

Jul 18, 2023
Vikram Duvvur, Aashay Mehta, Edward Sun, Bo Wu, Ken Yew Chan, Jeff Schneider

The use of machine learning in algorithmic trading systems is increasingly common. In a typical set-up, supervised learning is used to predict the future prices of assets, and those predictions drive a simple trading and execution strategy. This is quite effective when the predictions have sufficient signal, markets are liquid, and transaction costs are low. However, those conditions often do not hold in thinly traded financial markets and markets for differentiated assets such as real estate or vehicles. In these markets, the trading strategy must consider the long-term effects of taking positions that are relatively more difficult to change. In this work, we propose a Reinforcement Learning (RL) algorithm that trades based on signals from a learned predictive model and addresses these challenges. We test our algorithm on 20+ years of equity data from Bursa Malaysia.
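
The abstract does not spell out the algorithm, but the setting it describes can be made concrete with a toy environment: the observation carries a learned model's price prediction, and the reward charges for turnover, so the agent must weigh the long-term cost of changing a position that is hard to unwind. Everything below (class name, cost model, observation layout) is an illustrative assumption, not the paper's system.

```python
# Toy environment sketch for RL trading on a learned predictive signal in a
# thinly traded market. The cost model and observation layout are assumptions.
import numpy as np

class ThinMarketEnv:
    """Observation: (predicted signal, current position).
    Action: target position in [-1, 1]. Reward: PnL minus turnover cost."""
    def __init__(self, prices, predictions, cost_per_unit_turnover=0.01):
        assert len(prices) == len(predictions)
        self.prices, self.predictions = prices, predictions
        self.cost = cost_per_unit_turnover

    def reset(self):
        self.t, self.position = 0, 0.0
        return np.array([self.predictions[0], self.position])

    def step(self, action):
        turnover = abs(action - self.position)          # changing positions is expensive
        self.position = float(action)
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        reward = self.position * price_change - self.cost * turnover
        self.t += 1
        done = self.t >= len(self.prices) - 1
        obs = np.array([self.predictions[self.t], self.position])
        return obs, reward, done
```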

PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks

Jul 12, 2023
Ian Char, Jeff Schneider

Deep reinforcement learning (RL) has shown immense potential for learning to control systems through data alone. However, one challenge deep RL faces is that the full state of the system is often not observable. When this is the case, the policy needs to leverage the history of observations to infer the current state. At the same time, differences between the training and testing environments make it critical for the policy not to overfit to the sequence of observations it sees at training time. As such, the history encoder must strike a balance: flexible enough to extract relevant information, yet robust to changes in the environment. To strike this balance, we look to the PID controller for inspiration. We assert the PID controller's success shows that only summing and differencing are needed to accumulate information over time for many control tasks. Following this principle, we propose two architectures for encoding history: one that directly uses PID features and another that extends these core ideas and can be used in arbitrary control tasks. When compared with prior approaches, our encoders produce policies that are often more robust and achieve better performance on a variety of tracking tasks. Going beyond tracking tasks, our policies achieve 1.7x better performance on average over previous state-of-the-art methods on a suite of high-dimensional control tasks.
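
The first of the two architectures mentioned above (direct PID features) is simple enough to sketch: the history is summarized by the current tracking error, its running sum, and its difference, the three quantities a PID controller accumulates. The feature layout and target-signal convention below are assumptions for illustration, not the paper's exact encoder.

```python
# Minimal sketch of a history encoder built only from summing and differencing,
# in the spirit of PID control. Feature layout and conventions are assumed.
import numpy as np

class PIDHistoryEncoder:
    """Summarizes the observation history as (error, integral of error, derivative
    of error) per tracked signal, instead of feeding a raw sequence to an RNN."""
    def __init__(self, n_signals, dt=1.0):
        self.dt = dt
        self.integral = np.zeros(n_signals)
        self.prev_error = np.zeros(n_signals)

    def reset(self):
        self.integral[:] = 0.0
        self.prev_error[:] = 0.0

    def __call__(self, observation, target):
        error = target - observation                          # P: current tracking error
        self.integral += error * self.dt                      # I: running sum of error
        derivative = (error - self.prev_error) / self.dt      # D: differenced error
        self.prev_error = error
        return np.concatenate([error, self.integral, derivative])

# A policy network would then act on these features, e.g.
# action = policy(np.concatenate([observation, encoder(observation, target)]))
```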

Enhancing Visual Domain Adaptation with Source Preparation

Jun 16, 2023
Anirudha Ramesh, Anurag Ghosh, Christoph Mertz, Jeff Schneider

Robotic perception in diverse domains such as low-light scenarios, where new modalities like thermal imaging and specialized night-vision sensors are increasingly employed, remains a challenge, largely due to the limited availability of labeled data. Existing Domain Adaptation (DA) techniques, while promising as a way to leverage labels from existing well-lit RGB images, fail to consider the characteristics of the source domain itself. We holistically account for this factor by proposing Source Preparation (SP), a method to mitigate source domain biases. Our Almost Unsupervised Domain Adaptation (AUDA) framework, a label-efficient semi-supervised approach for robotic scenarios, employs Source Preparation (SP), Unsupervised Domain Adaptation (UDA), and Supervised Alignment (SA) from limited labeled data. We introduce CityIntensified, a novel dataset comprising temporally aligned image pairs captured from a high-sensitivity camera and an intensifier camera for semantic segmentation and object detection in low-light settings. We demonstrate the effectiveness of our method in semantic segmentation, with experiments showing that SP enhances UDA across a range of visual domains, with improvements of up to 40.64% in mIoU over the baseline, while making target models more robust to real-world shifts within the target domain. We show that AUDA is a label-efficient framework for effective DA, significantly improving target domain performance with only tens of labeled samples from the target domain.
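
A schematic sketch of how the three stages named above could fit together in code; the concrete Source Preparation transform (a simple photometric normalization here), the UDA loss (entropy minimization), and all hyperparameters are placeholders standing in for the paper's actual choices.

```python
# Schematic AUDA-style training skeleton: Source Preparation, then unsupervised
# adaptation on unlabeled target data, then Supervised Alignment on a few labels.
# The SP transform, UDA loss, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

def source_preparation(source_images):
    """Stand-in SP step: normalize per-image photometric statistics so the model
    does not latch onto source-specific appearance (assumed, for illustration)."""
    mean = source_images.mean(dim=(2, 3), keepdim=True)
    std = source_images.std(dim=(2, 3), keepdim=True) + 1e-6
    return (source_images - mean) / std

def train_auda(model, source_loader, target_unlabeled_loader, target_labeled_loader,
               epochs=1, uda_weight=0.1, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    cross_entropy = nn.CrossEntropyLoss()
    for _ in range(epochs):
        # Stages 1-2: supervised loss on prepared source batches plus an
        # unsupervised alignment term (here: prediction entropy) on target batches.
        for (x_src, y_src), x_tgt in zip(source_loader, target_unlabeled_loader):
            x_src = source_preparation(x_src)
            loss = cross_entropy(model(x_src), y_src)
            p_tgt = torch.softmax(model(x_tgt), dim=1)
            entropy = -(p_tgt * torch.log(p_tgt + 1e-8)).sum(dim=1).mean()
            loss = loss + uda_weight * entropy
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        # Stage 3: Supervised Alignment on the handful of labeled target samples.
        for x_tgt, y_tgt in target_labeled_loader:
            loss = cross_entropy(model(x_tgt), y_tgt)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```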

GUTS: Generalized Uncertainty-Aware Thompson Sampling for Multi-Agent Active Search

Apr 04, 2023
Nikhil Angad Bakshi, Tejus Gupta, Ramina Ghods, Jeff Schneider

Robotic solutions for quick disaster response are essential to ensure minimal loss of life, especially when the search area is too dangerous or too vast for human rescuers. We model this problem as an asynchronous multi-agent active-search task where each robot aims to efficiently seek objects of interest (OOIs) in an unknown environment. This formulation addresses the requirement that search missions should focus on quick recovery of OOIs rather than full coverage of the search region. Previous approaches fail to accurately model sensing uncertainty, account for occlusions due to foliage or terrain, or consider the requirement for heterogeneous search teams and robustness to hardware and communication failures. We present the Generalized Uncertainty-aware Thompson Sampling (GUTS) algorithm, which addresses these issues and is suitable for deployment on heterogeneous multi-robot systems for active search in large unstructured environments. We show through simulation experiments that GUTS consistently outperforms existing methods such as parallelized Thompson Sampling and exhaustive search, recovering all OOIs in 80% of all runs. In contrast, existing approaches recover all OOIs in less than 40% of all runs. We conduct field tests using our multi-robot system in an unstructured environment with a search area of approximately 75,000 sq. m. Our system demonstrates robustness to various failure modes, achieving full recovery of OOIs (where feasible) in every field run, and significantly outperforming our baseline.

* 7 pages, 5 figures, 1 table. For the associated video see https://youtu.be/K0jkzdQ_j2E. To appear in the International Conference on Robotics and Automation (ICRA) 2023.
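
A heavily simplified sketch of the Thompson-sampling core that the abstract describes: each agent draws its own sample from a shared belief over object locations and senses where that sample promises the most detections, which both respects sensing uncertainty and decorrelates the agents without central coordination. The per-cell Beta belief and the pseudo-count update below are illustrative stand-ins, not the GUTS algorithm.

```python
# Simplified Thompson-sampling step for decentralized multi-agent active search.
# The per-cell Beta belief and the noisy-sensing update are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def thompson_sensing_action(alpha, beta, candidate_regions, detect_prob=0.8):
    """alpha, beta: per-cell Beta parameters of the belief that a cell contains an OOI.
    Each agent draws its own posterior sample, so agents naturally spread out."""
    sampled_map = rng.beta(alpha, beta)                      # one posterior sample of the world
    scores = [detect_prob * sampled_map[region].sum()        # expected detections per region
              for region in candidate_regions]
    return int(np.argmax(scores))

def update_belief(alpha, beta, region, hits, detect_prob=0.8):
    """Asynchronous pseudo-count update after a noisy sensing action over `region`;
    `hits` is a 0/1 array of detections for the sensed cells."""
    alpha[region] += detect_prob * hits
    beta[region] += detect_prob * (1 - hits)
    return alpha, beta
```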

Near-optimal Policy Identification in Active Reinforcement Learning

Dec 19, 2022
Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, Ilija Bogunovic

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
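
A small sketch of the optimism-with-pessimism idea: given a kernelized LSVI fit that returns a mean Q estimate and a confidence width, the agent queries the generative model at the state where the optimistic and pessimistic value estimates disagree most, and can stop once that gap is small everywhere. The array-based interface is an assumption for exposition, not the paper's algorithm.

```python
# Sketch of active exploration via optimism/pessimism over a kernelized Q fit.
# q_mean and q_bonus (confidence widths) are assumed inputs from the LSVI step.
import numpy as np

def select_query_state(states, q_mean, q_bonus):
    """states: length-S list of candidate states; q_mean, q_bonus: (S, A) arrays."""
    optimistic_value = (q_mean + q_bonus).max(axis=1)     # UCB value of each state
    pessimistic_value = (q_mean - q_bonus).max(axis=1)    # LCB value of each state
    gap = optimistic_value - pessimistic_value            # disagreement about achievable value
    return states[int(np.argmax(gap))]

def uniformly_near_optimal(q_mean, q_bonus, epsilon):
    """True once the value gap is at most epsilon at every state, i.e. the greedy
    policy is near-optimal uniformly over the state space."""
    gap = (q_mean + q_bonus).max(axis=1) - (q_mean - q_bonus).max(axis=1)
    return bool(np.all(gap <= epsilon))
```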

Exploration via Planning for Information about the Optimal Trajectory

Oct 06, 2022
Viraj Mehta, Ian Char, Joseph Abbate, Rory Conlin, Mark D. Boyer, Stefano Ermon, Jeff Schneider, Willie Neiswanger

Many potential applications of reinforcement learning (RL) are stymied by the large numbers of samples required to learn an effective policy. This is especially true when applying RL to real-world control tasks, e.g., in the sciences or robotics, where executing a policy in the environment is costly. In popular RL algorithms, agents typically explore either by adding stochasticity to a reward-maximizing policy or by attempting to gather maximal information about environment dynamics without taking the given task into account. In this work, we develop a method that allows us to plan for exploration while taking both the task and the current knowledge about the dynamics into account. The key insight behind our approach is to plan an action sequence that maximizes the expected information gain about the optimal trajectory for the task at hand. We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines and 200x fewer samples than model-free methods on a diverse set of low-to-medium dimensional control tasks in both the open-loop and closed-loop control settings.

* Conference paper at NeurIPS 2022. Code available at https://github.com/fusion-ml/trajectory-information-rl. arXiv admin note: text overlap with arXiv:2112.05244
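
A hedged sketch of the planning objective in the abstract: candidate action sequences are scored by how much the posterior over dynamics still disagrees along states relevant to the (sampled) optimal trajectories, as a crude proxy for expected information gain about the optimal trajectory. The relevance weighting and the ensemble-as-posterior treatment are assumptions, not the paper's estimator.

```python
# Crude proxy for planning to gain information about the optimal trajectory:
# score action sequences by posterior disagreement at (state, action) pairs,
# weighted by proximity to sampled optimal trajectories. A heuristic stand-in,
# not the paper's information-gain estimator.
import numpy as np

def plan_for_information(start_state, candidate_plans, dynamics_samples, optimal_trajectories):
    """dynamics_samples: list of functions f_i(s, a) -> next state, drawn from the
    model posterior. optimal_trajectories: one list of (state, action) pairs per
    posterior sample, produced by an inner planner under that sample."""
    best_score, best_plan = -np.inf, None
    for plan in candidate_plans:
        score, state = 0.0, np.array(start_state, dtype=float)
        for action in plan:
            predictions = np.stack([f(state, action) for f in dynamics_samples])
            # How close is this state to the trajectories we care about resolving?
            relevance = np.mean([np.exp(-min(np.linalg.norm(state - s) for s, _ in traj))
                                 for traj in optimal_trajectories])
            score += relevance * predictions.var(axis=0).sum()    # posterior disagreement
            state = predictions.mean(axis=0)                      # roll forward on the mean
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan
```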

Cost Aware Asynchronous Multi-Agent Active Search

Oct 05, 2022
Arundhati Banerjee, Ramina Ghods, Jeff Schneider

Multi-agent active search requires autonomous agents to choose sensing actions that efficiently locate targets. In a realistic setting, agents must also consider the costs that their decisions incur. Previously proposed active search algorithms simplify the problem by ignoring uncertainty in the agent's environment, using myopic decision making, and/or overlooking costs. In this paper, we introduce an online active search algorithm to detect targets in an unknown environment by making adaptive cost-aware decisions regarding the agent's actions. Our algorithm combines principles from Thompson Sampling (for search-space exploration and decentralized multi-agent decision making), Monte Carlo Tree Search (for long-horizon planning), and Pareto-optimal confidence bounds (for multi-objective optimization in an unknown environment) to propose an online lookahead planner that removes all of these simplifications. We analyze the algorithm's performance in simulation to show its efficacy in cost-aware active search.
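
The multi-objective piece is the easiest to isolate: with confidence bounds on how many targets each sensing action might recover and a known cost per action, only actions whose (recovery, cost) trade-off is Pareto-optimal need to be passed to the lookahead planner (Monte Carlo Tree Search in the paper). The bound values in the example below are made up for illustration.

```python
# Sketch of Pareto filtering over (recovery confidence bounds, cost) before
# handing the surviving actions to a lookahead planner. Example values are made up.
import numpy as np

def pareto_optimal_actions(recovery_lcb, recovery_ucb, cost):
    """Keep action i unless some action j is at least as good even in the worst case
    (j's LCB beats i's UCB) and no more expensive."""
    keep = []
    for i in range(len(cost)):
        dominated = any(recovery_lcb[j] >= recovery_ucb[i] and cost[j] <= cost[i]
                        for j in range(len(cost)) if j != i)
        if not dominated:
            keep.append(i)
    return keep

# Three candidate sensing actions: action 2 is dominated by action 1
# (worse even optimistically, and more costly), so only 0 and 1 survive.
recovery_lcb = np.array([0.2, 0.6, 0.1])
recovery_ucb = np.array([0.5, 0.9, 0.3])
cost = np.array([3.0, 4.0, 5.0])
print(pareto_optimal_actions(recovery_lcb, recovery_ucb, cost))   # -> [0, 1]
```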

Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

Jul 21, 2022
Adam Villaflor, Zhe Huang, Swapnil Pande, John Dolan, Jeff Schneider

Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation.
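
A compact sketch of the test-time search the abstract describes: because the policy model and the world model are disentangled, each candidate policy can be evaluated against several possible futures, and the agent commits to the policy with the best worst-case predicted return. The latent enumeration and rollout interface are assumptions for illustration; the worst-case criterion is one reasonable choice of robustness objective, not necessarily the paper's exact one.

```python
# Sketch of robust test-time search with disentangled policy and world models.
# Latent enumeration, rollout interface, and the min-max criterion are assumptions.
import numpy as np

def robust_policy_search(state, policy_latents, world_latents, rollout_return):
    """rollout_return(state, z_policy, z_world) -> predicted return of following the
    policy conditioned on z_policy while the world evolves according to z_world."""
    best_score, best_latent = -np.inf, None
    for z_policy in policy_latents:
        # Evaluate this candidate behavior under every modeled future and keep its
        # worst-case outcome, avoiding optimism about how the world will unfold.
        worst_case = min(rollout_return(state, z_policy, z_world)
                         for z_world in world_latents)
        if worst_case > best_score:
            best_score, best_latent = worst_case, z_policy
    return best_latent
```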
