Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Amirhossein Afsharrad, Ahmadreza Moradipari, Sanjay Lall

Recently, bandit optimization has received significant attention in real-world safety-critical systems that involve repeated interactions with humans. While there exist various algorithms with performance guarantees in the literature, practical implementation of the algorithms has not received as much attention. This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits, by introducing a framework that leverages convex programming tools to create computationally efficient policies. In particular, we first characterize the properties of the optimal policy for safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems. We also numerically evaluate the performance of our proposed methods.

Via

Ahmadreza Moradipari, Mohammad Pedramfar, Modjtaba Shokrian Zini, Vaneet Aggarwal

In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time inhomogeneous reinforcement learning problem where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1-$dimension of the space of environments. We then find concrete bounds of $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how how our results are either the first of their kind or improve the state-of-the-art.

Via

Mahyar Abbasian, Taha Rajabzadeh, Ahmadreza Moradipari, Seyed Amir Hossein Aqajari, Hongsheng Lu, Amir Rahmani

Generative Adversarial Networks (GAN) have emerged as a formidable AI tool to generate realistic outputs based on training datasets. However, the challenge of exerting control over the generation process of GANs remains a significant hurdle. In this paper, we propose a novel methodology to address this issue by integrating a reinforcement learning (RL) agent with a latent-space GAN (l-GAN), thereby facilitating the generation of desired outputs. More specifically, we have developed an actor-critic RL agent with a meticulously designed reward policy, enabling it to acquire proficiency in navigating the latent space of the l-GAN and generating outputs based on specified tasks. To substantiate the efficacy of our approach, we have conducted a series of experiments employing the MNIST dataset, including arithmetic addition as an illustrative task. The outcomes of these experiments serve to validate our methodology. Our pioneering integration of an RL agent with a GAN model represents a novel advancement, holding great potential for enhancing generative networks in the future.

Via

Ahmadreza Moradipari, Sangjae Bae, Mahnoosh Alizadeh, Ehsan Moradi Pari, David Isele

Accurately modeling the behavior of traffic participants is essential for safely and efficiently navigating an autonomous vehicle through heavy traffic. We propose a method, based on the intelligent driver model, that allows us to accurately model individual driver behaviors from only a small number of frames using easily observable features. On average, this method makes prediction errors that have less than 1 meter difference from an oracle with full-information when analyzed over a 10-second horizon of highway driving. We then validate the efficiency of our method through extensive analysis against a competitive data-driven method such as Reinforcement Learning that may be of independent interest.

Via

Ahmadreza Moradipari, Mohammad Ghavamzadeh, Mahnoosh Alizadeh

We study a collaborative multi-agent stochastic linear bandit setting, where $N$ agents that form a network communicate locally to minimize their overall regret. In this setting, each agent has its own linear bandit problem (its own reward parameter) and the goal is to select the best global action w.r.t. the average of their reward parameters. At each round, each agent proposes an action, and one action is randomly selected and played as the network action. All the agents observe the corresponding rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents. We propose a distributed upper confidence bound (UCB) algorithm and prove a high probability bound on its $T$-round regret in which we include a linear growth of regret associated with each communication round. Our regret bound is of order $\mathcal{O}\Big(\sqrt{\frac{T}{N \log(1/|\lambda_2|)}}\cdot (\log T)^2\Big)$, where $\lambda_2$ is the second largest (in absolute value) eigenvalue of the communication matrix.

Via

Ahmadreza Moradipari, Mohammad Ghavamzadeh, Taha Rajabzadeh, Christos Thrampoulidis, Mahnoosh Alizadeh

In this work we investigate meta-learning (or learning-to-learn) approaches in multi-task linear stochastic bandit problems that can originate from multiple environments. Inspired by the work of [1] on meta-learning in a sequence of linear bandit problems whose parameters are sampled from a single distribution (i.e., a single environment), here we consider the feasibility of meta-learning when task parameters are drawn from a mixture distribution instead. For this problem, we propose a regularized version of the OFUL algorithm that, when trained on tasks with labeled environments, achieves low regret on a new task without requiring knowledge of the environment from which the new task originates. Specifically, our regret bound for the new algorithm captures the effect of environment misclassification and highlights the benefits over learning each task separately or meta-learning without recognition of the distinct mixture components.

Via

Ahmadreza Moradipari, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, Mohammad Ghavamzadeh

We study two model selection settings in stochastic linear bandits (LB). In the first setting, the reward parameter of the LB problem is arbitrarily selected from $M$ models represented as (possibly) overlapping balls in $\mathbb R^d$. However, the agent only has access to misspecified models, i.e., estimates of the centers and radii of the balls. We refer to this setting as parameter selection. In the second setting, which we refer to as feature selection, the expected reward of the LB problem is in the linear span of at least one of $M$ feature maps (models). For each setting, we develop and analyze an algorithm that is based on a reduction from bandits to full-information problems. This allows us to obtain regret bounds that are not worse (up to a $\sqrt{\log M}$ factor) than the case where the true model is known. Our parameter selection algorithm is OFUL-style and the one for feature selection is based on the SquareCB algorithm. We also show that the regret of our parameter selection algorithm scales logarithmically with model misspecification.

Via

Ahmadreza Moradipari, Christos Thrampoulidis, Mahnoosh Alizadeh

We study stage-wise conservative linear stochastic bandits: an instance of bandit optimization, which accounts for (unknown) safety constraints that appear in applications such as online advertising and medical trials. At each stage, the learner must choose actions that not only maximize cumulative reward across the entire time horizon but further satisfy a linear baseline constraint that takes the form of a lower bound on the instantaneous reward. For this problem, we present two novel algorithms, stage-wise conservative linear Thompson Sampling (SCLTS) and stage-wise conservative linear UCB (SCLUCB), that respect the baseline constraints and enjoy probabilistic regret bounds of order O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively. Notably, the proposed algorithms can be adjusted with only minor modifications to tackle different problem variations, such as constraints with bandit-feedback, or an unknown sequence of baseline actions. We discuss these and other improvements over the state-of-the-art. For instance, compared to existing solutions, we show that SCLTS plays the (non-optimal) baseline action at most O(\log{T}) times (compared to O(\sqrt{T})). Finally, we make connections to another studied form of safety constraints that takes the form of an upper bound on the instantaneous reward. While this incurs additional complexity to the learning process as the optimal action is not guaranteed to belong to the safe set at each round, we show that SCLUCB can properly adjust in this setting via a simple modification.

Via

Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, Christos Thrampoulidis

The design and performance analysis of bandit algorithms in the presence of stage-wise safety or reliability constraints has recently garnered significant interest. In this work, we consider the linear stochastic bandit problem under additional \textit{linear safety constraints} that need to be satisfied at each round. We provide a new safe algorithm based on linear Thompson Sampling (TS) for this problem and show a frequentist regret of order $\mathcal{O} (d^{3/2}\log^{1/2}d \cdot T^{1/2}\log^{3/2}T)$, which remarkably matches the results provided by [Abeille et al., 2017] for the standard linear TS algorithm in the absence of safety constraints. We compare the performance of our algorithm with a UCB-based safe algorithm and highlight how the inherently randomized nature of TS leads to a superior performance in expanding the set of safe actions the algorithm has access to at each round.

Via

Ahmadreza Moradipari, Sina Shahsavari, Ashkan Esmaeili, Farokh Marvasti

Inference and Estimation in Missing Information (MI) scenarios are important topics in Statistical Learning Theory and Machine Learning (ML). In ML literature, attempts have been made to enhance prediction through precise feature selection methods. In sparse linear models, LASSO is well-known in extracting the desired support of the signal and resisting against noisy systems. When sparse models are also suffering from MI, the sparse recovery and inference of the missing models are taken into account simultaneously. In this paper, we will introduce an approach which enjoys sparse regression and covariance matrix estimation to improve matrix completion accuracy, and as a result enhancing feature selection preciseness which leads to reduction in prediction Mean Squared Error (MSE). We will compare the effect of employing covariance matrix in enhancing estimation accuracy to the case it is not used in feature selection. Simulations show the improvement in the performance as compared to the case where the covariance matrix estimation is not used.

Via