Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Julian Ibarz

QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Jul 02, 2018
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, Sergey Levine

Figure 1 for QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Figure 2 for QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Figure 3 for QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

Figure 4 for QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.

* 22 pages, 14 figures

Via

Access Paper or Ask Questions

Discrete Sequential Prediction of Continuous Actions for Deep RL

Jun 16, 2018
Luke Metz, Julian Ibarz, Navdeep Jaitly, James Davidson

Figure 1 for Discrete Sequential Prediction of Continuous Actions for Deep RL

Figure 2 for Discrete Sequential Prediction of Continuous Actions for Deep RL

Figure 3 for Discrete Sequential Prediction of Continuous Actions for Deep RL

Figure 4 for Discrete Sequential Prediction of Continuous Actions for Deep RL

It has long been assumed that high dimensional continuous control problems cannot be solved effectively by discretizing individual dimensions of the action space due to the exponentially large number of bins over which policies would have to be learned. In this paper, we draw inspiration from the recent success of sequence-to-sequence models for structured prediction problems to develop policies over discretized spaces. Central to this method is the realization that complex functions over high dimensional spaces can be modeled by neural networks that predict one dimension at a time. Specifically, we show how Q-values and policies over continuous spaces can be modeled using a next step prediction model over discretized dimensions. With this parameterization, it is possible to both leverage the compositional structure of action spaces during learning, as well as compute maxima over action spaces (approximately). On a simple example task we demonstrate empirically that our method can perform global search, which effectively gets around the local optimization issues that plague DDPG. We apply the technique to off-policy (Q-learning) methods and show that our method can achieve the state-of-the-art for off-policy methods on several continuous control tasks.

Via

Access Paper or Ask Questions

Guided Attention for Large Scale Scene Text Verification

Apr 23, 2018
Dafang He, Yeqing Li, Alexander Gorban, Derrall Heath, Julian Ibarz, Qian Yu, Daniel Kifer, C. Lee Giles

Figure 1 for Guided Attention for Large Scale Scene Text Verification

Figure 2 for Guided Attention for Large Scale Scene Text Verification

Figure 3 for Guided Attention for Large Scale Scene Text Verification

Figure 4 for Guided Attention for Large Scale Scene Text Verification

Many tasks are related to determining if a particular text string exists in an image. In this work, we propose a new framework that learns this task in an end-to-end way. The framework takes an image and a text string as input and then outputs the probability of the text string being present in the image. This is the first end-to-end framework that learns such relationships between text and images in scene text area. The framework does not require explicit scene text detection or recognition and thus no bounding box annotations are needed for it. It is also the first work in scene text area that tackles suh a weakly labeled problem. Based on this framework, we developed a model called Guided Attention. Our designed model achieves much better results than several state-of-the-art scene text reading based solutions for a challenging Street View Business Matching task. The task tries to find correct business names for storefront images and the dataset we collected for it is substantially larger, and more challenging than existing scene text dataset. This new real-world task provides a new perspective for studying scene text related problems. We also demonstrate the uniqueness of our task via a comparison between our problem and a typical Visual Question Answering problem.

* 18 pages

Via

Access Paper or Ask Questions

Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

Mar 28, 2018
Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz, Sergey Levine

Figure 1 for Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

Figure 2 for Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

Figure 3 for Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

Figure 4 for Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

In this paper, we explore deep reinforcement learning algorithms for vision-based robotic grasping. Model-free deep reinforcement learning (RL) has been successfully applied to a range of challenging environments, but the proliferation of algorithms makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping. To answer this question, we propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects. Off-policy learning enables utilization of grasping data over a wide variety of objects, and diversity is important to enable the method to generalize to new objects that were not seen during training. We evaluate the benchmark tasks against a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction. Our results indicate that several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms.

* 8 pages

Via

Access Paper or Ask Questions

Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Nov 18, 2017
Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, Sergey Levine

Figure 1 for Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Figure 2 for Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Figure 3 for Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Figure 4 for Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning

Deep reinforcement learning algorithms can learn complex behavioral skills, but real-world application of these methods requires a large amount of experience to be collected by the agent. In practical settings, such as robotics, this involves repeatedly attempting a task, resetting the environment between each attempt. However, not all tasks are easily or automatically reversible. In practice, this learning process requires extensive human intervention. In this work, we propose an autonomous method for safe and efficient reinforcement learning that simultaneously learns a forward and reset policy, with the reset policy resetting the environment for a subsequent attempt. By learning a value function for the reset policy, we can automatically determine when the forward policy is about to enter a non-reversible state, providing for uncertainty-aware safety aborts. Our experiments illustrate that proper use of the reset policy can greatly reduce the number of manual resets required to learn a task, can reduce the number of unsafe actions that lead to non-reversible states, and can automatically induce a curriculum.

* Videos of our experiments are available at: https://sites.google.com/site/mlleavenotrace/

Via

Access Paper or Ask Questions

End-to-End Learning of Semantic Grasping

Nov 09, 2017
Eric Jang, Sudheendra Vijayanarasimhan, Peter Pastor, Julian Ibarz, Sergey Levine

Figure 1 for End-to-End Learning of Semantic Grasping

Figure 2 for End-to-End Learning of Semantic Grasping

Figure 3 for End-to-End Learning of Semantic Grasping

Figure 4 for End-to-End Learning of Semantic Grasping

We consider the task of semantic robotic grasping, in which a robot picks up an object of a user-specified class using only monocular images. Inspired by the two-stream hypothesis of visual reasoning, we present a semantic grasping framework that learns object detection, classification, and grasp planning in an end-to-end fashion. A "ventral stream" recognizes object class while a "dorsal stream" simultaneously interprets the geometric relationships necessary to execute successful grasps. We leverage the autonomous data collection capabilities of robots to obtain a large self-supervised dataset for training the dorsal stream, and use semi-supervised label propagation to train the ventral stream with only a modest amount of human supervision. We experimentally show that our approach improves upon grasping systems whose components are not learned end-to-end, including a baseline method that uses bounding box detection. Furthermore, we show that jointly training our model with auxiliary data consisting of non-semantic grasping data, as well as semantically labeled images without grasp actions, has the potential to substantially improve semantic grasping performance.

* 14 pages

Via

Access Paper or Ask Questions

Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Sep 25, 2017
Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, Sergey Levine, Vincent Vanhoucke

Figure 1 for Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Figure 2 for Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Figure 3 for Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Figure 4 for Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping

Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. We study how randomized simulated environments and domain adaptation methods can be extended to train a grasping system to grasp novel objects from raw monocular RGB images. We extensively evaluate our approaches with a total of more than 25,000 physical test grasps, studying a range of simulation conditions and domain adaptation methods, including a novel extension of pixel-level domain adaptation that we term the GraspGAN. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples needed to achieve a given level of performance by up to 50 times, using only randomly generated simulated objects. We also show that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with 939,777 labeled real-world samples.

* 9 pages, 5 figures, 3 tables

Via

Access Paper or Ask Questions

Attention-based Extraction of Structured Information from Street View Imagery

Aug 20, 2017
Zbigniew Wojna, Alex Gorban, Dar-Shyang Lee, Kevin Murphy, Qian Yu, Yeqing Li, Julian Ibarz

Figure 1 for Attention-based Extraction of Structured Information from Street View Imagery

Figure 2 for Attention-based Extraction of Structured Information from Street View Imagery

Figure 3 for Attention-based Extraction of Structured Information from Street View Imagery

Figure 4 for Attention-based Extraction of Structured Information from Street View Imagery

We present a neural network model - based on CNNs, RNNs and a novel attention mechanism - which achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith'16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we show that it also performs well on an even more challenging dataset derived from Google Street View, in which the goal is to extract business names from store fronts. Finally, we study the speed/accuracy tradeoff that results from using CNN feature extractors of different depths. Surprisingly, we find that deeper is not always better (in terms of accuracy, as well as speed). Our resulting model is simple, accurate and fast, allowing it to be used at scale on a variety of challenging real-world text extraction problems.

* Updated references, added link to the source code

Via

Access Paper or Ask Questions

End-to-End Interpretation of the French Street Name Signs Dataset

Feb 13, 2017
Raymond Smith, Chunhui Gu, Dar-Shyang Lee, Huiyi Hu, Ranjith Unnikrishnan, Julian Ibarz, Sacha Arnoud, Sophia Lin

Figure 1 for End-to-End Interpretation of the French Street Name Signs Dataset

Figure 2 for End-to-End Interpretation of the French Street Name Signs Dataset

Figure 3 for End-to-End Interpretation of the French Street Name Signs Dataset

Figure 4 for End-to-End Interpretation of the French Street Name Signs Dataset

We introduce the French Street Name Signs (FSNS) Dataset consisting of more than a million images of street name signs cropped from Google Street View images of France. Each image contains several views of the same street name sign. Every image has normalized, title case folded ground-truth text as it would appear on a map. We believe that the FSNS dataset is large and complex enough to train a deep network of significant complexity to solve the street name extraction problem "end-to-end" or to explore the design trade-offs between a single complex engineered network and multiple sub-networks designed and trained to solve sub-problems. We present such an "end-to-end" network/graph for Tensor Flow and its results on the FSNS dataset.

* Computer Vision - ECCV 2016 Workshops Volume 9913 of the series Lecture Notes in Computer Science pp 411-426
* Presented at the IWRR workshop at ECCV 2016

Via

Access Paper or Ask Questions

Large Scale Business Discovery from Street Level Imagery

Feb 02, 2016
Qian Yu, Christian Szegedy, Martin C. Stumpe, Liron Yatziv, Vinay Shet, Julian Ibarz, Sacha Arnoud

Figure 1 for Large Scale Business Discovery from Street Level Imagery

Figure 2 for Large Scale Business Discovery from Street Level Imagery

Figure 3 for Large Scale Business Discovery from Street Level Imagery

Figure 4 for Large Scale Business Discovery from Street Level Imagery

Search with local intent is becoming increasingly useful due to the popularity of the mobile device. The creation and maintenance of accurate listings of local businesses worldwide is time consuming and expensive. In this paper, we propose an approach to automatically discover businesses that are visible on street level imagery. Precise business store front detection enables accurate geo-location of businesses, and further provides input for business categorization, listing generation, etc. The large variety of business categories in different countries makes this a very challenging problem. Moreover, manual annotation is prohibitive due to the scale of this problem. We propose the use of a MultiBox based approach that takes input image pixels and directly outputs store front bounding boxes. This end-to-end learning approach instead preempts the need for hand modeling either the proposal generation phase or the post-processing phase, leveraging large labelled training datasets. We demonstrate our approach outperforms the state of the art detection techniques with a large margin in terms of performance and run-time efficiency. In the evaluation, we show this approach achieves human accuracy in the low-recall settings. We also provide an end-to-end evaluation of business discovery in the real world.

Via

Access Paper or Ask Questions