Shie Mannor

Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)

Jun 07, 2016
Daniel J. Mankowitz, Timothy A. Mann, Shie Mannor

For complex, high-dimensional Markov Decision Processes (MDPs), it may be necessary to represent the policy with function approximation. A problem is misspecified whenever the representation cannot express any policy with acceptable performance. We introduce IHOMP, an approach for solving misspecified problems. IHOMP iteratively learns a set of context-specialized options and combines these options to solve an otherwise misspecified problem. Our main contribution is proving that IHOMP enjoys theoretical convergence guarantees. In addition, we extend IHOMP to exploit Option Interruption (OI), enabling it to decide where the learned options can be reused. Our experiments demonstrate that IHOMP can find near-optimal solutions to otherwise misspecified problems and that OI can further improve the solutions.

* arXiv admin note: text overlap with arXiv:1506.03624

Adaptive Skills, Adaptive Partitions (ASAP)

Jun 07, 2016
Daniel J. Mankowitz, Timothy A. Mann, Shie Mannor

We introduce the Adaptive Skills, Adaptive Partitions (ASAP) framework that (1) learns skills (i.e., temporally extended actions or options) as well as (2) where to apply them. We believe that both (1) and (2) are necessary for a truly general skill learning framework, which is a key building block needed to scale up to lifelong learning agents. The ASAP framework can also solve related new tasks simply by adapting where it applies its existing learned skills. We prove that ASAP converges to a local optimum under natural conditions. Finally, our experimental results, which include a RoboCup domain, demonstrate the ability of ASAP to learn where to reuse skills as well as solve multiple tasks with considerably less experience than solving each task from scratch.

Bending the Curve: Improving the ROC Curve Through Error Redistribution

May 21, 2016
Oran Richman, Shie Mannor

Classification performance is often not uniform over the data. Some areas in the input space are easier to classify than others. Features that hold information about the "difficulty" of the data may be non-discriminative and are therefore disregarded in the classification process. We propose a meta-learning approach where performance may be improved by post-processing. This improvement is done by establishing a dynamic threshold on the base-classifier results. Since the base classifier is treated as a "black box", the method presented can be used on any state-of-the-art classifier in order to try to improve its performance. We focus our attention on how to better control the true-positive/false-positive trade-off known as the ROC curve. We propose an algorithm for the derivation of optimal thresholds by redistributing the error depending on features that hold information about difficulty. We demonstrate the resulting benefit on both synthetic and real-life data.
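
The idea of a dynamic threshold can be illustrated with a minimal sketch. It is not the paper's algorithm: the scores, labels, binary difficulty flag, and threshold values below are all hand-picked toy assumptions, chosen only to show how a per-sample threshold that depends on a "difficulty" feature can move the operating point relative to a single global threshold.

```python
def roc_point(scores, labels, thresholds):
    """Return (TPR, FPR) when sample i is predicted positive iff scores[i] >= thresholds[i]."""
    tp = fp = pos = neg = 0
    for s, y, t in zip(scores, labels, thresholds):
        pred = s >= t
        if y == 1:
            pos += 1
            tp += pred
        else:
            neg += 1
            fp += pred
    return tp / pos, fp / neg


# Toy data (hypothetical): black-box classifier scores, true labels,
# and a binary flag marking samples from a "hard" region of the input space.
scores = [0.9, 0.8, 0.45, 0.6, 0.55, 0.3]
labels = [1, 1, 1, 0, 0, 0]
hard = [0, 0, 1, 0, 0, 1]

# One global threshold applied to every sample.
global_thresholds = [0.5] * len(scores)

# Dynamic threshold: a different cut-off for easy and hard samples (values assumed).
dynamic_thresholds = [0.4 if h else 0.7 for h in hard]

global_point = roc_point(scores, labels, global_thresholds)
dynamic_point = roc_point(scores, labels, dynamic_thresholds)
```

Here the base classifier is never retrained; only the decision rule applied to its scores changes, which matches the "black box" post-processing setting the abstract describes.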

A Reinforcement Learning System to Encourage Physical Activity in Diabetes Patients

May 13, 2016
Irit Hochberg, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, Elad Yom-Tov

Regular physical activity is known to be beneficial to people suffering from type 2 diabetes. Nevertheless, most such people are sedentary. Smartphones create new possibilities for helping people to adhere to their physical activity goals, through continuous monitoring and communication, coupled with personalized feedback. We provided 27 sedentary type 2 diabetes patients with a smartphone-based pedometer and a personal plan for physical activity. Patients were sent SMS messages encouraging physical activity at a frequency between once a day and once a week. Messages were personalized through a Reinforcement Learning (RL) algorithm which optimized messages to improve each participant's compliance with the activity regimen. The RL algorithm was compared to a static policy for sending messages and to weekly reminders. Our results show that participants who received messages generated by the RL algorithm increased the amount of activity and pace of walking, while the control group patients did not. Patients assigned to the RL algorithm group experienced a superior reduction in blood glucose levels (HbA1c) compared to control policies, and longer participation caused greater reductions in blood glucose levels. The learning algorithm improved gradually in predicting which messages would lead participants to exercise. Our results suggest that a mobile phone application coupled with a learning algorithm can improve adherence to exercise in diabetic patients. As a learning algorithm is automated and delivers personalized messages, it could be used in large populations of diabetic patients to improve health and glycemic control. Our results can be expanded to other areas where computer-led health coaching of humans may have a positive impact.

Hierarchical Decision Making In Electricity Grid Management

Mar 06, 2016
Gal Dalal, Elad Gilboa, Shie Mannor

The power grid is a complex and vital system that necessitates careful reliability management. Managing the grid is a difficult problem with multiple time scales of decision making and stochastic behavior due to renewable energy generation, variable demand, and unplanned outages. Solving this problem in the face of uncertainty requires a new methodology with tractable algorithms. In this work, we introduce a new model for hierarchical decision making in complex systems. We apply reinforcement learning (RL) methods to learn a proxy, i.e., a level of abstraction, for real-time power grid reliability. We devise an algorithm that alternates between slow time-scale policy improvement and fast time-scale value function approximation. We compare our results to prevailing heuristics and show the strength of our method.

Multi-user lax communications: a multi-armed bandit approach

Dec 02, 2015
Orly Avner, Shie Mannor

Inspired by cognitive radio networks, we consider a setting where multiple users share several channels, modeled as a multi-user multi-armed bandit (MAB) problem. The characteristics of each channel are unknown and are different for each user. Each user can choose between the channels, but her success depends on the particular channel chosen as well as on the selections of other users: if two users select the same channel, their messages collide and neither manages to send any data. Our setting is fully distributed, so there is no central control. As in many communication systems, the users cannot set up a direct communication protocol, so information exchange must be limited to a minimum. We develop an algorithm for learning a stable configuration for the multi-user MAB problem. We further offer both convergence guarantees and experiments inspired by real communication networks, including comparison to state-of-the-art algorithms.
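
The collision rule described above can be sketched as a single round of the environment. This is only an illustration of the reward model, not the learning algorithm from the paper; the function name `play_round`, the Bernoulli success model, and the `success_prob` table are assumptions made for the example.

```python
import random
from collections import Counter


def play_round(choices, success_prob, rng):
    """Simulate one round of a multi-user bandit with collisions.

    choices[u]          -- channel index picked by user u
    success_prob[u][c]  -- (assumed) success probability of user u on channel c,
                           which may differ from user to user
    Returns a list of 0/1 rewards, one per user.
    """
    counts = Counter(choices)
    rewards = []
    for user, channel in enumerate(choices):
        if counts[channel] > 1:
            # Two or more users on the same channel: every one of them fails.
            rewards.append(0)
        else:
            # Sole user on the channel: success depends on that user's own channel quality.
            rewards.append(1 if rng.random() < success_prob[user][channel] else 0)
    return rewards


rng = random.Random(0)
# Three users, two channels, all channels perfectly reliable (toy values).
perfect = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
collided = play_round([0, 0, 1], perfect, rng)
```

In this toy round, users 0 and 1 collide on channel 0 and receive nothing, while user 2 is alone on channel 1 and succeeds. A stable configuration in this model is one where no two users share a channel.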

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Nov 27, 2015
Assaf Hallak, Aviv Tamar, Remi Munos, Shie Mannor

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD($\lambda$), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD($\lambda$, $\beta$), where our introduced parameter $\beta$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($\lambda$, $\beta$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($\lambda$, $\beta$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $\beta$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
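
To make the role of $\beta$ concrete, here is a minimal sketch of one update step for the $\lambda = 0$ case with linear features. The abstract does not spell out the update, so the form of the follow-on trace used here, $F_t = \beta \rho_{t-1} F_{t-1} + 1$, is an assumption, as are the function name and argument names; setting $\beta = \gamma$ would correspond to the original emphatic trace with unit interest.

```python
def etd0_beta_step(theta, F_prev, rho_prev, rho, phi, phi_next, reward,
                   gamma, beta, alpha):
    """One assumed ETD(0, beta)-style update with linear value estimates.

    theta     -- current weight vector (list of floats)
    F_prev    -- previous follow-on trace value
    rho_prev  -- importance-sampling ratio of the previous step
    rho       -- importance-sampling ratio of the current step
    phi       -- feature vector of the current state
    phi_next  -- feature vector of the next state
    """
    # Follow-on trace: beta (rather than gamma) sets how fast past
    # importance-sampling ratios are forgotten.
    F = beta * rho_prev * F_prev + 1.0

    # Ordinary TD error under the linear value estimate theta . phi.
    v = sum(t * x for t, x in zip(theta, phi))
    v_next = sum(t * x for t, x in zip(theta, phi_next))
    delta = reward + gamma * v_next - v

    # Emphasis-weighted, importance-corrected TD update.
    step = alpha * F * rho * delta
    new_theta = [t + step * x for t, x in zip(theta, phi)]
    return new_theta, F


new_theta, F = etd0_beta_step(
    theta=[0.0, 0.0], F_prev=1.0, rho_prev=1.0, rho=1.0,
    phi=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0,
    gamma=0.9, beta=0.5, alpha=0.1,
)
```

With $\beta = 0$ the trace is always 1 and the step reduces to importance-weighted TD(0); larger $\beta$ lets products of past ratios accumulate in the trace, which is where the bias-variance trade-off mentioned above enters.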

* arXiv admin note: text overlap with arXiv:1508.03411

Learn on Source, Refine on Target: A Model Transfer Learning Framework with Random Forests

Nov 08, 2015
Noam Segev, Maayan Harel, Shie Mannor, Koby Crammer, Ran El-Yaniv

We propose novel model transfer-learning methods that refine a decision forest model M learned within a "source" domain using a training set sampled from a "target" domain, assumed to be a variation of the source. We present two random forest transfer algorithms. The first algorithm searches greedily for locally optimal modifications of each tree structure by trying to locally expand or reduce the tree around individual nodes. The second algorithm does not modify the structure, but only the parameters (thresholds) associated with decision nodes. We also propose to combine both methods by considering an ensemble that contains the union of the two forests. The proposed methods exhibit impressive experimental results over a range of problems.

* IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 1811-1824
* 2 columns, 14 pages, TPAMI submitted

Emphatic TD Bellman Operator is a Contraction

Aug 23, 2015
Assaf Hallak, Aviv Tamar, Shie Mannor

Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a $\sqrt{\gamma}$-contraction modulus (where $\gamma$ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
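
The step from a contraction modulus to an error bound follows a standard fixed-point argument, sketched below. The notation is assumed rather than taken from the abstract: $T$ is the Bellman operator with fixed point $V$, $\Pi$ is the projection onto the feature space, $\Phi\theta^*$ is the fixed point of $\Pi T$, and $\|\cdot\|$ is whichever norm the contraction holds in. The bound stated in the note itself may be sharper than this generic one.

```latex
% Assumed: \Pi T is a \sqrt{\gamma}-contraction, T V = V, and \Pi T \Phi\theta^* = \Phi\theta^*.
\begin{align*}
\|\Pi T u - \Pi T v\| &\le \sqrt{\gamma}\,\|u - v\|
  && \text{(contraction property)} \\
\|\Phi\theta^* - V\| &\le \|\Pi T \Phi\theta^* - \Pi T V\| + \|\Pi V - V\|
  && \text{(triangle inequality, using } \Pi T V = \Pi V\text{)} \\
&\le \sqrt{\gamma}\,\|\Phi\theta^* - V\| + \|\Pi V - V\| \\
\Longrightarrow\quad \|\Phi\theta^* - V\| &\le \frac{1}{1 - \sqrt{\gamma}}\,\|\Pi V - V\|
\end{align*}
```

The right-hand side is the best achievable representation error $\|\Pi V - V\|$ inflated by a factor that depends only on the discount factor.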

Reinforcement Learning for the Unit Commitment Problem

Jul 19, 2015
Gal Dalal, Shie Mannor

In this work we solve the day-ahead unit commitment (UC) problem by formulating it as a Markov decision process (MDP) and finding a low-cost policy for generation scheduling. We present two reinforcement learning algorithms, and devise a third one. We compare our results to previous work that uses simulated annealing (SA), and show a 27% improvement in operation costs, with a running time of 2.5 minutes (compared to 2.5 hours for the existing state of the art).

* Accepted and presented in IEEE PES PowerTech, Eindhoven 2015, paper ID 462731
