The outbreak of the novel coronavirus (COVID-19) is unfolding as a major international crisis whose influence extends to every aspect of our daily lives. Effective testing allows infected individuals to be quarantined, thus reducing the spread of COVID-19, saving countless lives, and helping to restart the economy safely and securely. Developing a good testing strategy can be greatly aided by contact tracing, which provides health care providers with information about the whereabouts of infected patients in order to determine whom to test. Countries that have been more successful in corralling the virus typically use a ``test, treat, trace, test'' strategy that begins with testing individuals with symptoms, traces the contacts of positively tested individuals via a combination of patient memory, apps, WiFi, GPS, etc., then tests their contacts, and repeats this procedure. The problem is that such strategies are myopic and do not use the testing resources efficiently. This is especially the case with COVID-19, where symptoms may show up several days after the infection, or not at all: there is evidence to suggest that many COVID-19 carriers are asymptomatic yet may still spread the virus. Such greedy strategies miss population areas where the virus may be dormant and flare up in the future. In this paper, we show that the testing problem can be cast as a sequential learning-based resource allocation problem with constraints, where the input to the problem is provided by a time-varying social contact graph obtained through various contact tracing tools. We then develop efficient learning strategies that minimize the number of infected individuals. These strategies are based on policy iteration and look-ahead rules. We investigate fundamental performance bounds, and ensure that our solution is robust to errors in the input graph as well as in the tests themselves.
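As a minimal illustration of the look-ahead idea (a toy sketch, not the policy-iteration strategies developed in the paper; the contact graph, infection beliefs, and spread probability below are all hypothetical), one can score each individual by their current infection probability plus the expected secondary infections one step ahead, and spend the test budget on the highest scores:

\begin{verbatim}
# Toy one-step look-ahead test allocation on a hypothetical contact graph.
def lookahead_test_allocation(adj, p_infected, budget, spread_prob=0.1):
    # Score = own infection probability + expected number of neighbors
    # infected next step (one step beyond a purely greedy/myopic rule).
    scores = {v: p + p * spread_prob * len(adj.get(v, []))
              for v, p in p_infected.items()}
    return sorted(scores, key=scores.get, reverse=True)[:budget]

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
p_infected = {0: 0.05, 1: 0.30, 2: 0.02, 3: 0.10, 4: 0.01}
print(lookahead_test_allocation(adj, p_infected, budget=2))  # [1, 3]
\end{verbatim}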
We consider incremental inference problems from aggregate data for collective dynamics. In particular, we address the problem of estimating the aggregate marginals of a Markov chain from noisy aggregate observations in an incremental (online) fashion. We propose a sliding window Sinkhorn belief propagation (SW-SBP) algorithm that utilizes a sliding window filter of the most recent noisy aggregate observations along with encoded information from discarded observations. Our algorithm is built upon the recently proposed multi-marginal optimal transport based Sinkhorn belief propagation (SBP) algorithm, which leverages standard belief propagation and the Sinkhorn algorithm to solve inference problems from aggregate data. We demonstrate the performance of our algorithm on applications such as inferring population flow from aggregate observations.
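The sliding-window bookkeeping can be sketched as follows (a toy skeleton under the assumption that the summary is a plain list; in SW-SBP the discarded observations are instead encoded into belief-propagation messages):

\begin{verbatim}
from collections import deque

class SlidingWindow:
    # Keep the L most recent aggregate observations; when an observation
    # falls out of the window, fold it into a summary before discarding,
    # so its information is not lost.
    def __init__(self, length):
        self.buf = deque(maxlen=length)
        self.summary = []          # placeholder; an encoded message in SW-SBP

    def push(self, obs):
        if len(self.buf) == self.buf.maxlen:
            self.summary.append(self.buf[0])   # encode before eviction
        self.buf.append(obs)
\end{verbatim}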
We study multi-marginal optimal transport problems from a probabilistic graphical model perspective. We point out an elegant connection between the two when the underlying cost for optimal transport decomposes according to a graph structure. In particular, an entropy-regularized multi-marginal optimal transport problem is equivalent to a Bayesian marginal inference problem for probabilistic graphical models with the additional requirement that some of the marginal distributions are specified. This relation on the one hand extends the optimal transport as well as the probabilistic graphical model theories, and on the other hand leads to fast algorithms for multi-marginal optimal transport by leveraging the well-developed algorithms in Bayesian inference. Several numerical examples are provided to highlight the results.
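To make the stated equivalence concrete, here is the standard identity behind it (written for a generic discrete cost, with the fixed-marginal constraint possibly applying to only a subset of the $\mu_j$; the additive constant from the normalization of $K$ is absorbed into $\mathrm{const}$):
\[
\min_{\pi\in\Pi(\mu_1,\dots,\mu_J)} \;\langle C,\pi\rangle+\epsilon\sum_{\mathbf{x}}\pi(\mathbf{x})\log\pi(\mathbf{x})
\;=\;
\min_{\pi\in\Pi(\mu_1,\dots,\mu_J)} \;\epsilon\,\mathrm{KL}\!\left(\pi\,\middle\|\,K\right)+\mathrm{const},
\qquad K(\mathbf{x})\propto e^{-C(\mathbf{x})/\epsilon}.
\]
When $C(\mathbf{x})=\sum_{(i,j)\in E}C_{ij}(x_i,x_j)$ decomposes over a graph $(V,E)$, the Gibbs kernel $K$ factors in the same way, i.e., it is a probabilistic graphical model on $(V,E)$; minimizing the KL divergence subject to the specified marginals $\mu_j$ is then exactly the marginal inference problem described above.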
We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks. Users in social networks respond to their friends' activity and hence provide information about each other's preferences. In our model, when a learning algorithm recommends an article to a user, it observes not only that user's response (e.g., an ad click) but also the side-observations, i.e., the responses of the user's neighbors had they been presented with the same article. We model these observation dependencies by a graph $\mathcal{G}$ in which nodes correspond to users and edges correspond to social links. We derive a problem/instance-dependent lower bound on the regret of any consistent algorithm. We propose an optimization (linear programming) based data-driven learning algorithm that utilizes the structure of $\mathcal{G}$ in order to make recommendations to users, and show that it is asymptotically optimal, in the sense that its regret matches the lower bound as the number of rounds $T\to\infty$. We show that this asymptotically optimal regret is upper-bounded as $O\left(|\chi(\mathcal{G})|\log T\right)$, where $|\chi(\mathcal{G})|$ is the domination number of $\mathcal{G}$. In contrast, a naive application of the existing learning algorithms results in $O\left(N\log T\right)$ regret, where $N$ is the number of users.
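For intuition, the following toy simulation shows why side-observations help; it runs a plain UCB baseline on a hypothetical four-user graph with hypothetical click rates, not the LP-based asymptotically optimal algorithm proposed here:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}   # hypothetical social graph
mu = np.array([0.2, 0.5, 0.4, 0.7])            # hypothetical click rates
counts = np.zeros(4)
sums = np.zeros(4)

for t in range(1, 2001):
    n = np.maximum(counts, 1)
    ucb = np.where(counts > 0, sums / n + np.sqrt(2 * np.log(t) / n), np.inf)
    i = int(np.argmax(ucb))
    for j in [i] + adj[i]:         # observe the user *and* the neighbors
        counts[j] += 1
        sums[j] += rng.binomial(1, mu[j])

print(counts)   # well-connected users are learned with few direct pulls
\end{verbatim}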
One major obstacle that precludes the success of reinforcement learning in real-world applications is the lack of robustness, either to model uncertainties or to external disturbances, of the trained policies. Robustness is critical when policies are trained in simulation instead of the real-world environment. In this work, we propose a risk-aware algorithm to learn robust policies in order to bridge the gap between simulation training and real-world implementation. Our algorithm is based on the recently developed distributional RL framework. We incorporate the CVaR risk measure into sample-based distributional policy gradients (SDPG) to learn risk-averse policies that achieve robustness against a range of system disturbances. We validate the robustness of risk-aware SDPG on multiple environments.
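The CVaR objective itself is easy to state on return samples; a minimal sketch (the sample distribution below is hypothetical):

\begin{verbatim}
import numpy as np

def cvar(returns, alpha=0.1):
    # Mean of the worst alpha-fraction of sampled returns; maximizing
    # this penalizes the left tail rather than just the average return.
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(np.sort(returns)[:k].mean())

samples = np.random.default_rng(0).normal(1.0, 2.0, size=1000)
print(cvar(samples, alpha=0.05))   # well below the mean of ~1.0
\end{verbatim}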
We consider inference problems over probabilistic graphical models with aggregate data. In particular, we propose a new efficient belief-propagation-type algorithm over tree-structured graphs with polynomial computational complexity as well as a global convergence guarantee. This is in contrast to previous methods that either exhibit prohibitive complexity as the population grows or do not guarantee convergence. Our method is based on optimal transport, or more specifically, multi-marginal optimal transport theory. In particular, the inference problem with aggregate observations we consider in this paper can be seen as a structured multi-marginal optimal transport problem, where the cost function decomposes according to the underlying graph. Consequently, the celebrated Sinkhorn algorithm for multi-marginal optimal transport can be leveraged, together with the standard belief propagation algorithm, to establish an efficient inference scheme. We demonstrate the performance of our algorithm on applications such as inferring population flow from aggregate observations.
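The Sinkhorn step at the heart of the scheme can be sketched on the smallest possible tree, a single edge (the potential and marginals below are hypothetical; on larger trees these scalings are propagated as belief-propagation messages):

\begin{verbatim}
import numpy as np

def edge_sinkhorn(Psi, mu_left, mu_right, iters=100):
    # Iteratively rescale a pairwise potential so that its row and column
    # marginals match the observed aggregate marginals.
    P = Psi.astype(float).copy()
    for _ in range(iters):
        P *= (mu_left / P.sum(axis=1))[:, None]
        P *= (mu_right / P.sum(axis=0))[None, :]
    return P

Psi = np.array([[0.7, 0.3], [0.2, 0.8]])   # hypothetical edge potential
P = edge_sinkhorn(Psi, np.array([0.6, 0.4]), np.array([0.5, 0.5]))
print(P.sum(axis=1), P.sum(axis=0))        # ~[0.6 0.4], ~[0.5 0.5]
\end{verbatim}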
We design an adaptive controller (learning rule) for a networked control system (NCS) in which data packets containing control information are transmitted across a lossy wireless channel. We propose Upper Confidence Bounds for Networked Control Systems (UCB-NCS), a learning rule that maintains confidence intervals for the estimates of the plant parameters $(A_{(\star)},B_{(\star)})$ and the channel reliability $p_{(\star)}$, and utilizes the principle of optimism in the face of uncertainty while making control decisions. We provide non-asymptotic performance guarantees for UCB-NCS by analyzing its ``regret'', i.e., its performance gap relative to the scenario in which $(A_{(\star)},B_{(\star)},p_{(\star)})$ are known to the controller. We show that with high probability the regret can be upper-bounded as $\tilde{O}\left(C\sqrt{T}\right)$\footnote{Here $\tilde{O}$ hides logarithmic factors.}, where $T$ is the operating time horizon of the system and $C$ is a problem-dependent constant.
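The optimism step for the channel-reliability estimate can be sketched as follows (the confidence radius below is a generic Hoeffding-style bound, not the exact one used by UCB-NCS):

\begin{verbatim}
import numpy as np

def optimistic_reliability(successes, trials, t, delta=0.05):
    # Empirical packet-delivery rate plus a confidence radius; the
    # controller then acts as if the channel were this reliable.
    n = max(trials, 1)
    p_hat = successes / n
    radius = np.sqrt(np.log(2 * t / delta) / (2 * n))
    return min(1.0, p_hat + radius)

print(optimistic_reliability(successes=800, trials=1000, t=1000))  # ~0.873
\end{verbatim}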
We consider reinforcement learning (RL) in Markov Decision Processes (MDPs) in which at each time step the agent, in addition to earning a reward, also incurs an $M$-dimensional vector of costs. The objective is to design a learning rule that maximizes the cumulative reward earned over a finite time horizon of $T$ steps, while simultaneously ensuring that the cumulative cost expenditures are bounded appropriately. The consideration of cumulative cost expenditures departs from the existing RL literature in that the agent now additionally needs to balance the cost expenses in an \emph{online} manner while optimally managing the exploration-exploitation trade-off typically encountered in RL tasks. This is challenging since both exploration and exploitation necessarily require the agent to expend resources. When the constraints are placed on the average costs, we present a version of the UCB algorithm and prove that its reward and cost regrets are upper-bounded as $O\left(T_{M}S\sqrt{AT\log(T)}\right)$, where $T_{M}$ is the mixing time of the MDP, $S$ is the number of states, $A$ is the number of actions, and $T$ is the time horizon. We further show how to modify the algorithm in order to reduce the regrets of a desired subset of the $M$ costs, at the expense of increasing the regrets of the rewards and the remaining costs. We then consider RL under the constraint that the vector of cumulative cost expenditures up to each time $t\le T$ must be less than $\mathbf{c}^{ub}t$. We propose a ``finite ($B$)-state'' algorithm and show that its average reward is within $O\left(e^{-B}\right)$ of $r^{\star}$, the latter being the optimal average reward under the average cost constraints.
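A toy one-state sketch of the reward/cost balancing act (an illustrative decision rule under simplifying assumptions, not the algorithm analyzed in the paper):

\begin{verbatim}
import numpy as np

def constrained_ucb_action(reward_ucb, cost_lcb, budget_left, steps_left):
    # Among actions whose optimistic (lower-confidence) cost estimate fits
    # the remaining per-step budget, pick the highest reward UCB; if none
    # fits, fall back to the cheapest action.
    per_step = budget_left / max(steps_left, 1)
    feasible = [a for a in range(len(reward_ucb))
                if cost_lcb[a] <= per_step]
    if not feasible:
        return int(np.argmin(cost_lcb))
    return max(feasible, key=lambda a: reward_ucb[a])

print(constrained_ucb_action([1.0, 2.0, 3.0], [0.1, 0.5, 2.0],
                             budget_left=50.0, steps_left=100))  # -> 1
\end{verbatim}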
Distributional reinforcement learning (DRL) is a recent reinforcement learning framework whose success has been supported by various empirical studies. It relies on the key idea of replacing the expected return with the return distribution, which captures the intrinsic randomness of the long-term rewards. Most of the existing literature on DRL focuses on problems with a discrete action space and value-based methods. In this work, motivated by applications in robotics with continuous action-space control settings, we propose the sample-based distributional policy gradient (SDPG) algorithm. It models the return distribution using samples via a reparameterization technique widely used in generative modeling and inference. We compare SDPG with distributed distributional deterministic policy gradients (D4PG), the state-of-the-art policy gradient method in DRL. We apply SDPG and D4PG to multiple OpenAI Gym environments and observe that our algorithm shows better sample efficiency as well as higher rewards for most tasks.
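The reparameterization idea can be sketched with a toy linear generator in place of the neural network used by SDPG (the weights below are hypothetical):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
noise_dim = 4
W = rng.standard_normal(noise_dim)   # toy generator weights; in SDPG this
b = 0.5                              # is a network that also takes the
                                     # state-action pair as input

def sample_returns(n=32):
    # Push standard Gaussian noise through a deterministic generator to
    # obtain samples of the return distribution.
    z = rng.standard_normal((n, noise_dim))
    return z @ W + b

samples = sample_returns()
print(samples.mean(), samples.std())
\end{verbatim}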
Instrumental variable identification is a concept in causal statistics for estimating the counterfactual effect of a treatment $D$ on an output $Y$, controlling for covariates $X$, using observational data. Even when measurements of $(Y,D)$ are confounded, the treatment effect on the subpopulation of compliers can nonetheless be identified if an instrumental variable $Z$ is available, which is independent of $(Y,D)$ conditional on $X$ and the unmeasured confounder. We introduce a de-biased machine learning (DML) approach to estimating complier parameters with high-dimensional data. Complier parameters include the local average treatment effect, average complier characteristics, and complier counterfactual outcome distributions. In our approach, the de-biasing is itself performed by machine learning, a variant called de-biased machine learning via regularized Riesz representers (DML-RRR). We prove that our estimator is consistent, asymptotically normal, and semi-parametrically efficient. In experiments, our estimator outperforms state-of-the-art alternatives. We use it to estimate the effect of 401(k) participation on the distribution of net financial assets.
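For reference, a standard doubly robust estimator of the local average treatment effect looks as follows (a sketch with plug-in propensity weights; DML-RRR instead learns the Riesz representer directly via regularized regression):

\begin{verbatim}
import numpy as np

def dr_late(Y, D, Z, m1, m0, r1, r0, pi):
    # m1, m0: cross-fitted estimates of E[Y|Z=1,X] and E[Y|Z=0,X];
    # r1, r0: cross-fitted estimates of E[D|Z=1,X] and E[D|Z=0,X];
    # pi: estimated instrument propensity P(Z=1|X).  All are arrays
    # evaluated at the sample points.
    w = Z / pi - (1 - Z) / (1 - pi)
    num = (m1 - m0 + w * (Y - np.where(Z == 1, m1, m0))).mean()
    den = (r1 - r0 + w * (D - np.where(Z == 1, r1, r0))).mean()
    return num / den   # reduced form over first stage
\end{verbatim}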