A key feature of intelligent behavior is the ability to learn abstract strategies that transfer to unfamiliar problems. Therefore, we present a novel architecture, based on memory-augmented networks, that is inspired by the von Neumann and Harvard architectures of modern computers. This architecture enables the learning of abstract algorithmic solutions via Evolution Strategies in a reinforcement learning setting. Applied to Sokoban, sliding block puzzle and robotic manipulation tasks, we show that the architecture can learn algorithmic solutions with strong generalization and abstraction: scaling to arbitrary task configurations and complexities, and being independent of both the data representation and the task domain.
Learning optimal feedback control laws capable of executing optimal trajectories is essential for many robotic applications. Such policies can be learned using reinforcement learning or planned using optimal control. While reinforcement learning is sample inefficient, optimal control only plans an optimal trajectory from a specific starting configuration. In this paper, we propose deep optimal feedback control to learn an optimal feedback policy rather than a single trajectory. By exploiting the inherent structure of the robot dynamics and a strictly convex action cost, we can derive principled cost functions such that the optimal policy naturally obeys the action limits and is globally optimal and stable on the training domain given the optimal value function. The corresponding optimal value function is learned end-to-end by embedding a deep differential network in the Hamilton-Jacobi-Bellman differential equation and minimizing the error of this equality while simultaneously decreasing the discounting from short- to far-sighted to enable the learning. Our proposed approach enables us to learn an optimal feedback control law in continuous time that, in contrast to existing approaches, generates an optimal trajectory from any point in state-space without the need for replanning. The resulting approach is evaluated on non-linear systems and achieves optimal feedback control, where standard optimal control methods require frequent replanning.
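As a sketch of the kind of equation referred to here, assume control-affine dynamics $\dot{x} = a(x) + B(x)\,u$, a separable reward $r(x,u) = q(x) - g(u)$ with strictly convex action cost $g$, and a discount rate $\rho$. These symbols and the specific form are illustrative assumptions and are not taken from the abstract. Under them, the discounted continuous-time Hamilton-Jacobi-Bellman equation and the resulting closed-form optimal action can be written as:

```latex
\begin{align}
\rho\, V^{*}(x) &= \max_{u}\Big[\, q(x) - g(u)
    + \nabla_x V^{*}(x)^{\top}\big(a(x) + B(x)\,u\big) \Big], \\
u^{*}(x) &= \nabla \tilde{g}\big(B(x)^{\top}\nabla_x V^{*}(x)\big),
\qquad
\tilde{g}(w) = \max_{u}\big[\, w^{\top}u - g(u) \big].
\end{align}
```

Here $\tilde{g}$ is the convex conjugate of $g$, so the inner maximization over $u$ has a closed form. If $g$ is chosen such that $\nabla\tilde{g}$ has bounded range (for example a scaled $\tanh$), the resulting action stays within the limits by construction.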
Sample-efficient exploration is crucial not only for discovering rewarding experiences but also for adapting to environment changes in a task-agnostic fashion. A principled treatment of the problem of optimal input synthesis for system identification is provided within the framework of sequential Bayesian experimental design. In this paper, we present an effective trajectory-optimization-based approximate solution of this otherwise intractable problem that models optimal exploration in an unknown Markov decision process (MDP). By interleaving episodic exploration with Bayesian nonlinear system identification, our algorithm takes advantage of the inductive bias to explore in a directed manner, without assuming prior knowledge of the MDP. Empirical evaluations indicate a clear advantage of the proposed algorithm in terms of the rate of convergence and the final model fidelity when compared to intrinsic-motivation-based algorithms employing exploration bonuses such as prediction error and information gain. Moreover, our method maintains a computational advantage over a recent model-based active exploration (MAX) algorithm, by focusing on the information gain along trajectories instead of seeking a global exploration policy. A reference implementation of our algorithm and the conducted experiments is publicly available.
Optimal control of stochastic nonlinear dynamical systems is a major challenge in the domain of robot learning. Given the intractability of the global control problem, state-of-the-art algorithms focus on approximate sequential optimization techniques that rely heavily on heuristics for regularization in order to achieve stable convergence. By building upon the duality between inference and control, we develop the view of Optimal Control as Input Estimation, devising a probabilistic stochastic optimal control formulation that iteratively infers the optimal input distributions by minimizing an upper bound of the control cost. Inference is performed through Expectation Maximization and message passing on a probabilistic graphical model of the dynamical system, and time-varying linear Gaussian feedback controllers are extracted from the joint state-action distribution. This perspective incorporates uncertainty quantification, effective initialization through priors, and the principled regularization inherent to the Bayesian treatment. Moreover, it can be shown that for deterministic linearized systems, our framework derives the maximum entropy linear quadratic optimal control law. We provide a complete and detailed derivation of our probabilistic approach and highlight its advantages in comparison to other deterministic and probabilistic solvers.
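For orientation, the following is a minimal sketch of the classical, deterministic finite-horizon linear quadratic regulator, which yields time-varying linear feedback gains via a backward Riccati recursion. It is not the inference-based method described above and has no entropy term; it only illustrates what a time-varying linear feedback controller looks like for a linear system. The function name, the choice of terminal cost equal to `Q`, and the sign convention `u_t = -K[t] @ x_t` are assumptions made for this example.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, horizon):
    """Backward Riccati recursion for x_{t+1} = A x_t + B u_t.

    Returns a list of gains K[0..horizon-1] such that u_t = -K[t] @ x_t
    minimizes the sum of x'Qx + u'Ru (terminal cost assumed to be x'Qx).
    """
    P = Q.copy()  # cost-to-go matrix at the final time step
    gains = []
    for _ in range(horizon):
        # Gain for the current step given the cost-to-go of the next step.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update of the cost-to-go matrix.
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()  # computed backward in time, returned forward in time
    return gains

# Usage on a scalar integrator with unit costs.
A = B = Q = R = np.array([[1.0]])
K = finite_horizon_lqr(A, B, Q, R, horizon=50)
```

For long horizons the early gains approach the stationary solution, while the gains near the end of the horizon differ, which is why the controller is time-varying.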
Generalization and adaptation of learned skills to novel situations is a core requirement for intelligent autonomous robots. Although contextual reinforcement learning provides a principled framework for learning and generalization of behaviors across related tasks, it generally relies on uninformed sampling of environments from an unknown, uncontrolled context distribution, thus missing the benefits of structured, sequential learning. We introduce a novel relative entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing for its gradual progression towards the target context distribution. Empirical evaluation shows that the proposed curriculum learning scheme drastically improves sample efficiency and enables learning in scenarios with both broad and sharp target context distributions in which classical approaches perform sub-optimally.
Camera-based tactile sensors are emerging as a promising inexpensive solution for tactile-enhanced manipulation tasks. A recently introduced FingerVision sensor was shown capable of generating reliable signals for force estimation, object pose estimation, and slip detection. In this paper, we build upon the FingerVision design, improving already existing control algorithms, and, more importantly, expanding its range of applicability to more challenging tasks by utilizing raw skin deformation data for control. In contrast to previous approaches that rely on the average deformation of the whole sensor surface, we directly employ local deviations of each spherical marker immersed in the silicone body of the sensor for feedback control and as input to learning tasks. We show that with such input, substances of varying texture and viscosity can be distinguished on the basis of tactile sensations evoked while stirring them. As another application, we learn a mapping between skin deformation and force applied to an object. To demonstrate the full range of capabilities of the proposed controllers, we deploy them in a challenging architectural assembly task that involves inserting a load-bearing element underneath a bendable plate at the point of maximum load.
Data-driven methods for time series forecasting that quantify uncertainty open important new possibilities for robot tasks with hard real-time constraints, allowing the robot system to make decisions that trade off reaction time against prediction accuracy. Despite recent advances in deep learning, it is still challenging to make accurate long-term predictions with the low latency required by real-time robotic systems. In this paper, we propose a deep conditional generative model for trajectory prediction that is learned from a data set of collected trajectories. Our method uses encoder and decoder deep networks that map complete or partial trajectories to a Gaussian-distributed latent space and back, allowing for fast inference of the future values of a trajectory given previous observations. The encoder and decoder networks are trained using stochastic gradient variational Bayes. In the experiments, we show that our model provides more accurate long-term predictions with a lower latency than popular models for trajectory forecasting such as recurrent neural networks or physical models based on differential equations. Finally, we test our proposed approach in a robot table tennis scenario to evaluate its performance in a robotic task with hard real-time constraints.
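Two standard ingredients of stochastic gradient variational Bayes with a Gaussian latent space are the reparameterized sample and the closed-form KL term against a standard normal prior. The sketch below shows only these two pieces in NumPy; the encoder and decoder networks are omitted, and the function names and the diagonal-covariance parameterization via `log_var` are assumptions made for illustration, not details from the abstract.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, diag(exp(log_var))) as a deterministic function of noise."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent axis."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Usage: a batch of 4 latent codes with 3 dimensions each.
rng = np.random.default_rng(0)
mu = np.zeros((4, 3))
log_var = np.zeros((4, 3))
z = reparameterize(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)
```

Writing the sample as `mu + sigma * eps` is what allows gradients to flow through `mu` and `log_var` when the same computation is expressed in an automatic differentiation framework.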
Robot table tennis systems require a vision system that can track the ball position with low latency and a high sampling rate. Altering the ball to simplify the tracking, for instance with an infrared coating, changes the physics of the ball trajectory. As a result, table tennis systems use custom tracking systems that track the ball using heuristic algorithms applied to RGB images captured with a set of cameras, while respecting the real-time constraints. However, these heuristic algorithms often report erroneous ball positions, and the table tennis policies typically need to incorporate additional heuristics to detect and possibly correct outliers. In this paper, we propose a vision system for object detection and tracking that focuses on reliability while providing real-time performance. Our assumption is that by using multiple cameras, we can find and discard the errors obtained in the object detection phase by checking for consistency with the positions reported by other cameras. We provide an open source implementation of the proposed tracking system to simplify future research in robot table tennis or related tracking applications with strong real-time requirements. We evaluate the proposed system thoroughly in simulation and on the real system, outperforming previous work. Furthermore, we show that the accuracy and robustness of the proposed system increase as more cameras are added. Finally, we evaluate the table tennis playing performance of an existing method on the real robot using the proposed vision system. We measure a slight increase in performance compared to a previous vision system even after removing all the heuristics previously present to filter out erroneous ball observations.
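The idea of discarding detections that are inconsistent with the other cameras can be illustrated with a deliberately simple stand-in: given several 3D position estimates of the ball (for example, one per camera pair), keep only those close to the coordinate-wise median and average the survivors. This median-based filter is an assumption chosen for brevity and is not the algorithm of the proposed system; the function name and the `tol` parameter are hypothetical.

```python
import numpy as np

def consistent_estimate(points, tol):
    """Fuse 3D estimates, discarding those far from the coordinate-wise median.

    points: array-like of shape (n, 3), one estimate per row.
    tol:    maximum Euclidean distance to the median for an estimate to count.
    Returns (fused_position or None, boolean inlier mask).
    """
    points = np.asarray(points, dtype=float)
    center = np.median(points, axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    inliers = dist <= tol
    if not inliers.any():
        return None, inliers  # no consistent subset found
    return points[inliers].mean(axis=0), inliers

# Usage: three consistent estimates and one gross outlier.
estimates = [[0.00, 0.00, 1.0],
             [0.01, 0.00, 1.0],
             [0.00, 0.01, 1.0],
             [2.00, 2.00, 2.0]]
position, mask = consistent_estimate(estimates, tol=0.05)
```

With more cameras there are more estimates to compare, so a single erroneous detection has less influence on the median, which is in line with the observation that robustness increases as cameras are added.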
Model-based Reinforcement Learning (MBRL) allows data-efficient learning, which is required in real-world applications such as robotics. However, despite its impressive data efficiency, MBRL does not achieve the final performance of state-of-the-art Model-free Reinforcement Learning (MFRL) methods. We leverage the strengths of both realms and propose an approach that obtains high performance with a small amount of data. In particular, we combine MFRL and Model Predictive Control (MPC). While MFRL's strength in exploration allows us to train a better forward dynamics model for MPC, MPC improves the performance of the MFRL policy by sampling-based planning. The experimental results on standard continuous control benchmarks show that our approach can achieve MFRL's level of performance while being as data-efficient as MBRL.
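One plausible way to combine a policy with sampling-based planning is policy-guided random shooting: roll out noisy versions of the policy through a learned dynamics model, score each rollout, and execute the first action of the best one. The sketch below illustrates that general idea only; it is not claimed to be the procedure used in this work, and the names `policy`, `model`, `reward`, and all parameters are hypothetical placeholders.

```python
import numpy as np

def policy_guided_shooting(state, policy, model, reward, horizon=10,
                           n_samples=64, noise_std=0.3,
                           low=-1.0, high=1.0, rng=None):
    """Return the first action of the best sampled rollout.

    policy(s) -> action, model(s, a) -> next state, reward(s, a) -> float.
    Candidate rollouts perturb the policy's actions with Gaussian noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    best_return, best_first_action = -np.inf, None
    for _ in range(n_samples):
        s = np.asarray(state, dtype=float)
        total, first_action = 0.0, None
        for t in range(horizon):
            a = np.asarray(policy(s), dtype=float)
            a = np.clip(a + rng.normal(0.0, noise_std, size=a.shape), low, high)
            if t == 0:
                first_action = a
            total += reward(s, a)
            s = model(s, a)  # predicted next state from the dynamics model
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action

# Usage on a toy 1D integrator whose goal is to reach the origin.
policy = lambda s: -0.5 * s
model = lambda s, a: s + a
reward = lambda s, a: -float(np.sum((s + a) ** 2))
action = policy_guided_shooting(np.array([1.0]), policy, model, reward,
                                rng=np.random.default_rng(0))
```

In a receding-horizon loop, only this first action would be executed before replanning from the next observed state.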