
Utsav Singh, Souradip Chakraborty, Wesley A. Suttle, Brian M. Sadler, Vinay P Namboodiri, Amrit Singh Bedi

Learning control policies to perform complex robotics tasks from human preference data presents significant challenges. On the one hand, the complexity of such tasks typically requires learning policies that perform a variety of subtasks, then combining them to achieve the overall goal. On the other hand, comprehensive, well-engineered reward functions are typically unavailable in such problems, while limited human preference data often is available; making efficient use of such data to guide learning is therefore essential. Methods for learning to perform complex robotics tasks from human preference data must overcome both of these challenges simultaneously. In this work, we introduce DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning, an efficient hierarchical approach that leverages direct preference optimization to learn a higher-level policy and reinforcement learning to learn a lower-level policy. DIPPER enjoys improved computational efficiency because it uses direct preference optimization in place of standard preference-based approaches such as reinforcement learning from human feedback. It also mitigates the well-known hierarchical reinforcement learning issues of non-stationarity and infeasible subgoal generation through primitive-informed regularization, inspired by a novel bi-level optimization formulation of the hierarchical reinforcement learning problem. To validate our approach, we perform extensive experimental analysis on a variety of challenging robotics tasks, demonstrating that DIPPER outperforms hierarchical and non-hierarchical baselines while ameliorating the non-stationarity and infeasible subgoal generation issues of hierarchical reinforcement learning.
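To make the higher-level training signal concrete, below is a minimal sketch of a DPO-style loss applied to a subgoal-generating policy. The `log_prob(state, subgoal)` interface, the reference policy, and the hyperparameter `beta` are illustrative assumptions, not the paper's exact implementation; the paper additionally applies primitive-informed regularization, which is omitted here.

```python
# Hedged sketch: a DPO-style loss for a higher-level subgoal policy.
# `policy.log_prob(state, goal)` is an assumed interface for illustration.
import torch
import torch.nn.functional as F

def dpo_subgoal_loss(policy, ref_policy, state, goal_w, goal_l, beta=0.1):
    """DPO loss on a preferred (goal_w) vs. dispreferred (goal_l) subgoal."""
    logp_w = policy.log_prob(state, goal_w)         # log pi(g_w | s)
    logp_l = policy.log_prob(state, goal_l)         # log pi(g_l | s)
    with torch.no_grad():
        ref_w = ref_policy.log_prob(state, goal_w)  # log pi_ref(g_w | s)
        ref_l = ref_policy.log_prob(state, goal_l)  # log pi_ref(g_l | s)
    # DPO pushes the policy's log-ratio margin toward the preference,
    # avoiding the separate reward-model + RL loop of RLHF.
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()
```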


In this work, we introduce PIPER: Primitive-Informed Preference-based Hierarchical reinforcement learning via Hindsight Relabeling, a novel approach that leverages preference-based learning to learn a reward model, and subsequently uses this reward model to relabel higher-level replay buffers. Since this reward is unaffected by lower-level primitive behavior, our relabeling-based approach mitigates the non-stationarity common in existing hierarchical approaches and demonstrates impressive performance across a range of challenging sparse-reward tasks. Since obtaining human feedback is typically impractical, we propose to replace the human-in-the-loop approach with our primitive-in-the-loop approach, which generates feedback using sparse rewards provided by the environment. Moreover, to prevent infeasible subgoal prediction and avoid degenerate solutions, we propose primitive-informed regularization that conditions higher-level policies to generate feasible subgoals for lower-level policies. We perform extensive experiments to show that PIPER mitigates non-stationarity in hierarchical reinforcement learning and achieves success rates above 50% in challenging, sparse-reward robotic environments, where most other baselines fail to achieve any significant progress.
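The relabeling step the abstract describes can be sketched as a pass over the stored higher-level transitions. The buffer layout and the `reward_model(s, g, s_next)` signature below are illustrative assumptions:

```python
# Hedged sketch: relabeling a higher-level replay buffer with a learned
# preference-based reward model, as the PIPER description suggests.

def relabel_buffer(buffer, reward_model):
    """Overwrite stored higher-level rewards with reward-model scores.

    Because the relabeled reward does not depend on the current
    lower-level policy, it remains valid as that policy improves,
    which is how relabeling mitigates non-stationarity.
    """
    for transition in buffer:
        s = transition["state"]
        g = transition["subgoal"]
        s_next = transition["next_state"]
        transition["reward"] = reward_model(s, g, s_next)
    return buffer
```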


Bhrij Patel, Wesley A. Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Amrit Singh Bedi, Dinesh Manocha

In the context of average-reward reinforcement learning, the requirement of oracle knowledge of the mixing time, a measure of how long a Markov chain under a fixed policy needs to reach its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic because estimating the mixing time is difficult and expensive in environments with large state spaces, necessitating impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest available dependence of $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$ on the mixing time relative to prior work. In a 2D gridworld goal-reaching navigation experiment, we demonstrate that MAC achieves higher reward than a previous policy-gradient-based method for average reward, Parameterized Policy Gradient with Advantage Estimation (PPGAE), especially when a relatively small training sample budget restricts trajectory length.
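For intuition, here is a minimal sketch of a standard MLMC gradient estimator of the kind the abstract describes: a random level $J$ is drawn geometrically, and a telescoping correction reweights gradient averages at dyadic trajectory lengths. The array layout and level cap `j_max` are illustrative choices:

```python
# Hedged sketch of a Multi-level Monte Carlo (MLMC) gradient estimator.
import numpy as np

def mlmc_gradient(grad_samples, j_max=10):
    """MLMC estimate from per-step gradient samples along one trajectory.

    grad_samples: array of shape (T, d) of single-step gradient
    estimates, with T >= 2**J for the drawn level J.
    """
    def avg(n):  # mean of the first n per-step gradient samples
        return grad_samples[:n].mean(axis=0)

    J = np.random.geometric(p=0.5)  # level J ~ Geometric(1/2)
    est = avg(1)                    # base level: a single sample
    if 2 ** J <= 2 ** j_max:
        # Telescoping correction, weighted by 2^J to offset the
        # geometrically small probability of drawing level J.
        est = est + (2 ** J) * (avg(2 ** J) - avg(2 ** (J - 1)))
    return est
```

The appeal of this construction is that most draws use short trajectories, so no a priori knowledge of the mixing time is needed to set the trajectory length.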


Wesley A. Suttle, Vipul K. Sharma, Krishna C. Kosaraju, S. Sivaranjani, Ji Liu, Vijay Gupta, Brian M. Sadler

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.
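One way to realize a sampling-based hard-constraint mechanism of this flavor is sketched below using a discrete-time control barrier function (CBF) condition. The dynamics model `step`, barrier `h`, and the resample-and-filter loop are illustrative assumptions, not the paper's exact construction:

```python
# Hedged sketch: sampling actions from the learned policy and keeping
# only those that satisfy a discrete-time CBF safety condition.

def safe_action(policy, step, h, x, alpha=0.1, n_samples=64):
    """Return a sampled action satisfying h(x') >= (1 - alpha) * h(x)."""
    candidates = policy.sample(x, n_samples)  # assumed interface: (n, a_dim)
    safe = [a for a in candidates if h(step(x, a)) >= (1 - alpha) * h(x)]
    if not safe:
        raise RuntimeError("no sampled action satisfied the CBF condition")
    return safe[0]  # e.g., first safe sample; a max-likelihood pick also works
```

Unlike a two-stage safety filter, the action executed here is one the policy itself proposed, which is the property that lets convergence arguments for the underlying RL method carry over.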


Deceptive path planning (DPP) is the problem of designing a path that hides its true goal from an outside observer. Existing methods for DPP rely on unrealistic assumptions, such as global state observability and perfect model knowledge, and are typically problem-specific, meaning that even minor changes to a previously solved problem can force expensive computation of an entirely new solution. Given these drawbacks, such methods do not generalize to unseen problem instances, lack scalability to realistic problem sizes, and preclude both on-the-fly tunability of deception levels and real-time adaptivity to changing environments. In this paper, we propose a reinforcement learning (RL)-based scheme for training policies to perform DPP over arbitrary weighted graphs that overcomes these issues. The core of our approach is the introduction of a local perception model for the agent, a new state space representation distilling the key components of the DPP problem, the use of graph neural network-based policies to facilitate generalization and scaling, and the introduction of new deception bonuses that translate the deception objectives of classical methods to the RL setting. Through extensive experimentation, we show that, without additional fine-tuning, at test time the resulting policies successfully generalize, scale, enjoy tunable levels of deception, and adapt in real time to changes in the environment.
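A deception bonus of the general kind the abstract alludes to can be sketched as the negative log-probability a Boltzmann-rational observer assigns to the agent's true goal given the path so far. The observer model, cost bookkeeping, and temperature `beta` below are illustrative assumptions:

```python
# Hedged sketch: a goal-obfuscation bonus from an observer's posterior
# over candidate goals; higher when the true goal looks unlikely.
import numpy as np

def deception_bonus(cost_so_far, cost_to_go, opt_cost, true_goal, goals, beta=1.0):
    # Suboptimality of the partial path w.r.t. each candidate goal g:
    # how much longer the path-through-current-state is than the optimal
    # start-to-g path. Smaller suboptimality => g looks more likely.
    subopt = np.array([cost_so_far + cost_to_go[g] - opt_cost[g] for g in goals])
    probs = np.exp(-beta * (subopt - subopt.min()))  # shifted for stability
    probs /= probs.sum()
    return -np.log(probs[goals.index(true_goal)])
```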


Integrated sensing and communications (ISAC) systems have gained significant interest because of their ability to jointly and efficiently access, utilize, and manage the scarce electromagnetic spectrum. The co-existence approach toward ISAC focuses on the receiver processing of overlaid radar and communications signals coming from independent transmitters. A specific ISAC coexistence problem is dual-blind deconvolution (DBD), wherein the transmit signals and channels of both radar and communications are unknown to the receiver. Prior DBD works ignore the evolution of the signal model over time. In this work, we consider a dynamic DBD scenario using a linear state space model (LSSM) such that, apart from the transmit signals and channels of both systems, the LSSM parameters are also unknown. We employ a factor graph representation to model these unknown variables. We avoid the conventional matrix inversion approach to estimate the unknown variables by using an efficient expectation-maximization algorithm, where each iteration employs a Gaussian message passing over the factor graph structure. Numerical experiments demonstrate the accurate estimation of radar and communications channels, including in the presence of noise.
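The alternating structure the abstract describes can be sketched as an EM loop over the LSSM, with a Kalman-style forward-backward pass standing in for the Gaussian factor-graph messages. The `kalman_smoother` helper, variable names, and the closed-form M-step shown are illustrative assumptions, not the paper's exact algorithm:

```python
# Hedged sketch: EM for a linear state space model (LSSM) with
# Gaussian message passing in the E-step.
import numpy as np

def em_lssm(y, A, C, Q, R, x0, P0, n_iters=20):
    for _ in range(n_iters):
        # E-step: forward-backward Gaussian message passing yields
        # posterior state means/covariances without inverting any
        # matrix over the full trajectory. (Assumed helper.)
        xs, Ps = kalman_smoother(y, A, C, Q, R, x0, P0)
        # M-step: re-estimate LSSM parameters from the smoothed states
        # (least-squares updates for A and C shown; Q, R analogous).
        X0, X1 = np.stack(xs[:-1]), np.stack(xs[1:])
        A = np.linalg.lstsq(X0, X1, rcond=None)[0].T
        C = np.linalg.lstsq(np.stack(xs), np.stack(y), rcond=None)[0].T
    return A, C, xs
```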


Hypercomplex signal processing (HSP) provides state-of-the-art tools to handle multidimensional signals by harnessing intrinsic correlation of the signal dimensions through Clifford algebra. Recently, the hypercomplex representation of the phase retrieval (PR) problem, wherein a complex-valued signal is estimated through its intensity-only projections, has attracted significant interest. The hypercomplex PR (HPR) arises in many optical imaging and computational sensing applications that usually comprise quaternion and octonion-valued signals. Analogous to the traditional PR, measurements in HPR may involve complex, hypercomplex, Fourier, and other sensing matrices. This set of problems opens opportunities for developing novel HSP tools and algorithms. This article provides a synopsis of the emerging areas and applications of HPR with a focus on optical imaging.
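For reference, the basic phase-retrieval measurement model that HPR builds on is shown below in the complex-valued case; the hypercomplex variants the article surveys replace complex arithmetic with quaternion or octonion algebra. Dimensions and the Gaussian sensing matrix are illustrative choices:

```python
# Hedged sketch: the intensity-only (phaseless) measurement model of PR.
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 64                                                # signal, measurement counts
x = rng.normal(size=n) + 1j * rng.normal(size=n)             # unknown complex signal
A = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))   # sensing matrix
y = np.abs(A @ x) ** 2                                       # intensity-only measurements
# Phase retrieval: recover x from (A, y); x is identifiable only up to a
# global phase factor, mirroring the right-phase ambiguity in HPR.
```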


Higher spectral and energy efficiencies are the envisioned defining characteristics of high data-rate sixth-generation (6G) wireless networks. One of the enabling technologies to meet these requirements is index modulation (IM), which transmits information through permutations of indices of spatial, frequency, or temporal media. In this paper, we propose novel electromagnetics-compliant designs of reconfigurable intelligent surface (RIS) apertures for realizing IM in 6G transceivers. We consider RIS modeling and implementation of spatial and subcarrier IMs, including beam steering, spatial multiplexing, and phase modulation capabilities. Numerical experiments show that our proposed RIS-aided IM implementations achieve lower bit error rates than traditional implementations. We further establish the programmability of these transceivers to vary the reflection phase and generate frequency harmonics for IM through full-wave electromagnetic analyses of a specific reflect-array metasurface implementation.
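The index-modulation principle itself can be illustrated in toy form: information bits select which of several RIS phase profiles the surface applies, and the receiver detects the index. The random profiles, cascaded-channel model, and nearest-hypothesis detector below are illustrative simplifications, not the paper's electromagnetics-compliant design:

```python
# Hedged sketch: index modulation via selectable RIS phase profiles.
import numpy as np

rng = np.random.default_rng(1)
n_elems, n_profiles = 32, 4                      # RIS elements; 2 bits per use
profiles = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_profiles, n_elems)))
h = rng.normal(size=n_elems) + 1j * rng.normal(size=n_elems)  # cascaded channel

idx = rng.integers(0, n_profiles)                # 2 information bits -> index
r = profiles[idx] @ h + 0.1 * (rng.normal() + 1j * rng.normal())  # received sample

# Maximum-likelihood index detection: nearest noiseless hypothesis.
detected = np.argmin(np.abs(profiles @ h - r))
```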


Signal processing over hypercomplex numbers arises in many optical imaging applications. In particular, spectral image or color stereo data are often processed using octonion algebra. Recently, eight-band multispectral image phase recovery, in which the eight bands must be recovered from phaseless measurements, has gained salience. In this paper, we tackle this hitherto unaddressed hypercomplex variant of the popular phase retrieval (PR) problem. We propose octonion Wirtinger flow (OWF) to recover an octonion signal from its intensity-only observations. However, contrary to the complex-valued Wirtinger flow, the non-associative nature of octonion algebra and the consequent lack of octonion derivatives make the extension to OWF non-trivial. We resolve this using the pseudo-real-matrix representation of octonions to perform the derivatives in each OWF update. We demonstrate that our approach recovers the octonion signal up to a right-octonion phase factor. Numerical experiments validate that OWF-based PR achieves high accuracy under both noiseless and noisy measurements.
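For context, below is a minimal sketch of the complex-valued Wirtinger flow update that OWF generalizes; per the abstract, the octonion version replaces each product with its pseudo-real-matrix counterpart. The step size and iteration count are illustrative:

```python
# Hedged sketch: classical complex-valued Wirtinger flow for PR.
import numpy as np

def wirtinger_flow(A, y, x0, mu=0.1, n_iters=500):
    """Gradient descent on sum_i (|a_i^H x|^2 - y_i)^2 / (2m)."""
    x = x0.copy()
    m = len(y)
    for _ in range(n_iters):
        Ax = A @ x
        # Wirtinger gradient of the intensity loss with respect to x.
        grad = A.conj().T @ ((np.abs(Ax) ** 2 - y) * Ax) / m
        x = x - mu * grad  # mu may need tuning/normalization in practice
    return x  # recovered up to a global phase factor
```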


Bhrij Patel, Kasun Weerakoon, Wesley A. Suttle, Alec Koppel, Brian M. Sadler, Amrit Singh Bedi, Dinesh Manocha

Reinforcement learning methods, while effective for learning robotic navigation strategies, are known to be highly sample inefficient. This sample inefficiency stems in part from failing to suitably balance the exploration-exploitation trade-off, especially in the presence of non-stationarity, during policy optimization. To balance exploration and exploitation for sample efficiency, we propose Ada-NAV, an adaptive trajectory length scheme in which the length grows as the policy's randomness, measured by its Shannon or differential entropy, decreases. Our adaptive trajectory length scheme emphasizes exploration at the beginning of training through more frequent gradient updates and emphasizes exploitation later on with longer trajectories. In gridworld, simulated robotic environments, and real-world robotic experiments, we demonstrate the merits of the approach over constant and randomly sampled trajectory lengths in terms of performance and sample efficiency. For a fixed sample budget, Ada-NAV yields an 18% increase in navigation success rate, a 20-38% decrease in navigation path length, and a 9.32% decrease in elevation cost compared to the policies obtained by the other methods. We also demonstrate that Ada-NAV can be transferred and integrated into a Clearpath Husky robot without significant performance degradation.
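An entropy-driven length schedule in this spirit can be sketched as follows; the exact mapping from entropy to trajectory length is an illustrative choice, not the paper's formula:

```python
# Hedged sketch: trajectory length that grows as policy entropy falls.
import numpy as np

def adaptive_traj_length(entropy, max_entropy, t_min=8, t_max=512):
    """Short rollouts (frequent updates) while the policy is random;
    longer rollouts as its entropy falls and it shifts to exploiting."""
    frac = np.clip(entropy / max_entropy, 0.0, 1.0)  # 1.0 = fully random
    return int(t_min + (1.0 - frac) * (t_max - t_min))
```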
