Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peter Stone

UT Austin, Sony AI

A Champion-level Vision-based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7

Apr 12, 2025

Hojoon Lee, Takuma Seno, Jun Jet Tai, Kaushik Subramanian, Kenta Kawamoto, Peter Stone, Peter R. Wurman

Abstract:Deep reinforcement learning has achieved superhuman racing performance in high-fidelity simulators like Gran Turismo 7 (GT7). It typically utilizes global features that require instrumentation external to a car, such as precise localization of agents and opponents, limiting real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. This agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network with the sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms GT7's built-drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.

* Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2025

Via

Access Paper or Ask Questions

Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Mar 26, 2025

Alexander Levine, Peter Stone, Amy Zhang

Figure 1 for Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Figure 2 for Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Figure 3 for Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Figure 4 for Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Abstract:While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.

Via

Access Paper or Ask Questions

Hyperspherical Normalization for Scalable Deep Reinforcement Learning

Feb 21, 2025

Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo

Abstract:Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at https://dojeon-ai.github.io/SimbaV2.

* 50 pages. Preprint

Via

Access Paper or Ask Questions

The Essentials of AI for Life and Society: An AI Literacy Course for the University Community

Jan 13, 2025

Joydeep Biswas, Don Fussell, Peter Stone, Kristin Patterson, Kristen Procko, Lea Sabatini, Zifan Xu

Figure 1 for The Essentials of AI for Life and Society: An AI Literacy Course for the University Community

Figure 2 for The Essentials of AI for Life and Society: An AI Literacy Course for the University Community

Figure 3 for The Essentials of AI for Life and Society: An AI Literacy Course for the University Community

Figure 4 for The Essentials of AI for Life and Society: An AI Literacy Course for the University Community

Abstract:We describe the development of a one-credit course to promote AI literacy at The University of Texas at Austin. In response to a call for the rapid deployment of class to serve a broad audience in Fall of 2023, we designed a 14-week seminar-style course that incorporated an interdisciplinary group of speakers who lectured on topics ranging from the fundamentals of AI to societal concerns including disinformation and employment. University students, faculty, and staff, and even community members outside of the University, were invited to enroll in this online offering: The Essentials of AI for Life and Society. We collected feedback from course participants through weekly reflections and a final survey. Satisfyingly, we found that attendees reported gains in their AI literacy. We sought critical feedback through quantitative and qualitative analysis, which uncovered challenges in designing a course for this general audience. We utilized the course feedback to design a three-credit version of the course that is being offered in Fall of 2024. The lessons we learned and our plans for this new iteration may serve as a guide to instructors designing AI courses for a broad audience.

* Accepted to EAAI-25: The 15th Symposium on Educational Advances in Artificial Intelligence, collocated with AAAI-25

Via

Access Paper or Ask Questions

Influencing Humans to Conform to Preference Models for RLHF

Jan 11, 2025

Stephane Hatgis-Kessell, W. Bradley Knox, Serena Booth, Scott Niekum, Peter Stone

Figure 1 for Influencing Humans to Conform to Preference Models for RLHF

Figure 2 for Influencing Humans to Conform to Preference Models for RLHF

Figure 3 for Influencing Humans to Conform to Preference Models for RLHF

Figure 4 for Influencing Humans to Conform to Preference Models for RLHF

Abstract:Designing a reinforcement learning from human feedback (RLHF) algorithm to approximate a human's unobservable reward function requires assuming, implicitly or explicitly, a model of human preferences. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human's reward function. In this paper, we conduct three human studies to asses whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human's unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of the learned reward functions. Overall we establish a novel research direction in model alignment: designing interfaces and training interventions to increase human conformance with the modeling assumptions of the algorithm that will learn from their input.

Via

Access Paper or Ask Questions

Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Dec 12, 2024

Adam Labiosa, Zhihan Wang, Siddhant Agarwal, William Cong, Geethika Hemkumar, Abhinav Narayan Harish, Benjamin Hong, Josh Kelle, Chen Li, Yuhao Li(+3 more)

Figure 1 for Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Figure 2 for Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Figure 3 for Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Figure 4 for Reinforcement Learning Within the Classical Robotics Stack: A Case Study in Robot Soccer

Abstract:Robot decision-making in partially observable, real-time, dynamic, and multi-agent environments remains a difficult and unsolved challenge. Model-free reinforcement learning (RL) is a promising approach to learning decision-making in such domains, however, end-to-end RL in complex environments is often intractable. To address this challenge in the RoboCup Standard Platform League (SPL) domain, we developed a novel architecture integrating RL within a classical robotics stack, while employing a multi-fidelity sim2real approach and decomposing behavior into learned sub-behaviors with heuristic selection. Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division. In this work, we fully describe our system's architecture and empirically analyze key design decisions that contributed to its success. Our approach demonstrates how RL-based behaviors can be integrated into complete robot behavior architectures.

* Submitted to ICRA 2025

Via

Access Paper or Ask Questions

RL Zero: Zero-Shot Language to Behaviors without any Supervision

Dec 07, 2024

Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum

Abstract:Rewards remain an uninterpretable way to specify tasks for Reinforcement Learning, as humans are often unable to predict the optimal behavior of any given reward function, leading to poor reward design and reward hacking. Language presents an appealing way to communicate intent to agents and bypass reward design, but prior efforts to do so have been limited by costly and unscalable labeling efforts. In this work, we propose a method for a completely unsupervised alternative to grounding language instructions in a zero-shot manner to obtain policies. We present a solution that takes the form of imagine, project, and imitate: The agent imagines the observation sequence corresponding to the language description of a task, projects the imagined sequence to our target domain, and grounds it to a policy. Video-language models allow us to imagine task descriptions that leverage knowledge of tasks learned from internet-scale video-text mappings. The challenge remains to ground these generations to a policy. In this work, we show that we can achieve a zero-shot language-to-behavior policy by first grounding the imagined sequences in real observations of an unsupervised RL agent and using a closed-form solution to imitation learning that allows the RL agent to mimic the grounded observations. Our method, RLZero, is the first to our knowledge to show zero-shot language to behavior generation abilities without any supervision on a variety of tasks on simulated domains. We further show that RLZero can also generate policies zero-shot from cross-embodied videos such as those scraped from YouTube.

* 27 pages

Via

Access Paper or Ask Questions

Proto Successor Measure: Representing the Space of All Possible Solutions of Reinforcement Learning

Nov 29, 2024

Siddhant Agarwal, Harshit Sikchi, Peter Stone, Amy Zhang

Abstract:Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment. Referred to as "zero-shot learning," this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present \emph{Proto Successor Measure}: the basis set for all possible solutions of Reinforcement Learning in a dynamical system. We provably show that any possible policy can be represented using an affine combination of these policy independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these basis corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using only interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions. Project page: https://agarwalsiddhant10.github.io/projects/psm.html.

* Under submission, 23 pages

Via

Access Paper or Ask Questions

Learning Memory Mechanisms for Decision Making through Demonstrations

Nov 13, 2024

William Yue, Bo Liu, Peter Stone

Figure 1 for Learning Memory Mechanisms for Decision Making through Demonstrations

Figure 2 for Learning Memory Mechanisms for Decision Making through Demonstrations

Figure 3 for Learning Memory Mechanisms for Decision Making through Demonstrations

Figure 4 for Learning Memory Mechanisms for Decision Making through Demonstrations

Abstract:In Partially Observable Markov Decision Processes, integrating an agent's history into memory poses a significant challenge for decision-making. Traditional imitation learning, relying on observation-action pairs for expert demonstrations, fails to capture the expert's memory mechanisms used in decision-making. To capture memory processes as demonstrations, we introduce the concept of memory dependency pairs $(p, q)$ indicating that events at time $p$ are recalled for decision-making at time $q$. We introduce AttentionTuner to leverage memory dependency pairs in Transformers and find significant improvements across several tasks compared to standard Transformers when evaluated on Memory Gym and the Long-term Memory Benchmark. Code is available at https://github.com/WilliamYue37/AttentionTuner.

Via

Access Paper or Ask Questions

PACER: Preference-conditioned All-terrain Costmap Generation

Oct 30, 2024

Luisa Mao, Garrett Warnell, Peter Stone, Joydeep Biswas

Figure 1 for PACER: Preference-conditioned All-terrain Costmap Generation

Figure 2 for PACER: Preference-conditioned All-terrain Costmap Generation

Figure 3 for PACER: Preference-conditioned All-terrain Costmap Generation

Figure 4 for PACER: Preference-conditioned All-terrain Costmap Generation

Abstract:In autonomous robot navigation, terrain cost assignment is typically performed using a semantics-based paradigm in which terrain is first labeled using a pre-trained semantic classifier and costs are then assigned according to a user-defined mapping between label and cost. While this approach is rapidly adaptable to changing user preferences, only preferences over the types of terrain that are already known by the semantic classifier can be expressed. In this paper, we hypothesize that a machine-learning-based alternative to the semantics-based paradigm above will allow for rapid cost assignment adaptation to preferences expressed over new terrains at deployment time without the need for additional training. To investigate this hypothesis, we introduce and study PACER, a novel approach to costmap generation that accepts as input a single birds-eye view (BEV) image of the surrounding area along with a user-specified preference context and generates a corresponding BEV costmap that aligns with the preference context. Using both real and synthetic data along with a combination of proposed training tasks, we find that PACER is able to adapt quickly to new user preferences while also exhibiting better generalization to novel terrains compared to both semantics-based and representation-learning approaches.

Via

Access Paper or Ask Questions