Large, high-capacity models trained on diverse datasets have shown remarkable success at efficiently tackling downstream applications. In domains from NLP to computer vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots, collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website $\href{https://robotics-transformer-x.github.io}{\text{robotics-transformer-x.github.io}}$.
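For readers who want to inspect the data, below is a minimal sketch of loading one constituent dataset in the standardized RLDS format via `tensorflow_datasets`. The GCS path, version string, and per-step field names follow the project's public examples and should be treated as assumptions here, not guaranteed stable identifiers.

```python
# Minimal sketch: load one Open X-Embodiment constituent dataset (RLDS format).
# The builder directory below is an assumption based on public examples.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/fractal20220817_data/0.1.0"
)
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # Each RLDS episode is a nested dict; 'steps' holds the trajectory
    # as an inner tf.data.Dataset of per-timestep dicts.
    for step in episode["steps"].take(2):
        print(step["observation"].keys(), step["action"])
```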
Enabling robots to learn novel visuomotor skills in a data-efficient manner remains an unsolved problem with myriad challenges. A popular paradigm for tackling this problem is to leverage large unlabeled datasets that contain many behaviors and then adapt a policy to a specific task using a small amount of task-specific human supervision (i.e., interventions or demonstrations). However, how best to leverage the narrow task-specific supervision and balance it with offline data remains an open question. Our key insight in this work is that task-specific data not only provides new data for an agent to train on but can also inform the type of prior data the agent should use for learning. Concretely, we propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset (including many sub-optimal behaviors). The agent is then jointly trained on the expert and queried data. We observe that our method learns to query only the transitions relevant to the task, filtering out sub-optimal or task-irrelevant data, and is thereby able to learn more effectively from the mix of task-specific and offline data than by naively mixing the data or using the task-specific data alone. Furthermore, we find that our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images. See https://sites.google.com/view/behaviorretrieval for videos and code.
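As a rough illustration of the querying step (not the paper's exact method), the sketch below scores offline transitions by their embedding similarity to the expert set, assuming a pretrained encoder `embed` over transitions; the cosine similarity and top-fraction cutoff are illustrative choices.

```python
# Sketch of similarity-based querying from an unlabeled offline dataset.
# `embed` is assumed to be a pretrained encoder mapping a batch of
# transitions to embedding vectors (an assumption, not the paper's
# exact architecture).
import numpy as np

def query_relevant(offline_data, expert_data, embed, top_frac=0.1):
    z_expert = embed(expert_data)    # shape (num_expert, d)
    z_offline = embed(offline_data)  # shape (num_offline, d)
    # Normalize so the dot product below is cosine similarity.
    z_expert = z_expert / np.linalg.norm(z_expert, axis=1, keepdims=True)
    z_offline = z_offline / np.linalg.norm(z_offline, axis=1, keepdims=True)
    # Score each offline transition by its best match among expert transitions.
    scores = (z_offline @ z_expert.T).max(axis=1)
    # Keep the top fraction; the agent then trains jointly on
    # expert data plus these queried transitions.
    k = int(top_frac * len(scores))
    return np.argsort(scores)[-k:]
```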
Humans are capable of completing a range of challenging manipulation tasks that require reasoning jointly over modalities such as vision, touch, and sound. Moreover, many such tasks are partially observed; for example, taking a notebook out of a backpack will lead to visual occlusion and require reasoning over the history of audio or tactile information. While robust tactile sensing can be costly to capture on robots, microphones near or on a robot's gripper are a cheap and easy way to acquire audio feedback of contact events, which can be a surprisingly valuable data source for perception in the absence of vision. Motivated by the potential for sound to mitigate visual occlusion, we aim to learn a set of challenging partially-observed manipulation tasks from visual and audio inputs. Our proposed system learns these tasks by combining offline imitation learning from a modest number of tele-operated demonstrations and online finetuning using human-provided interventions. In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by ~20%. Finally, we find that our system can complete a set of challenging, partially-observed tasks on a Franka Emika Panda robot, such as extracting keys from a bag, with a 70% success rate, 50% higher than a policy that does not use audio.
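A minimal sketch of one way to fuse the two modalities for a behavior-cloned policy, assuming a convolutional image encoder and a mel-spectrogram audio encoder; the encoder sizes, input shapes, and action dimension are illustrative assumptions, not the paper's exact model.

```python
# Sketch: a policy that fuses camera images with gripper-microphone audio.
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    def __init__(self, action_dim=7):
        super().__init__()
        self.vision = nn.Sequential(  # encodes a 3x64x64 RGB frame -> 64-dim
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio = nn.Sequential(   # encodes a 1x64x64 spectrogram -> 32-dim
            nn.Conv2d(1, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 + 32, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image, spectrogram):
        # Concatenate per-modality features, then regress actions.
        z = torch.cat([self.vision(image), self.audio(spectrogram)], dim=-1)
        return self.head(z)

# Behavior cloning on demonstrations; human interventions collected online
# can be appended to the same buffer during finetuning.
policy = AudioVisualPolicy()
img, spec = torch.randn(8, 3, 64, 64), torch.randn(8, 1, 64, 64)
loss = ((policy(img, spec) - torch.randn(8, 7)) ** 2).mean()
```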
This paper improves wind power prediction via weather-forecast-contextualized Long Short-Term Memory (LSTM) models. Initially, only wind power data was fed to a generic LSTM, but this model performed poorly, exhibiting erratic and naive behavior even on low-variance sections of the data. To address this issue, weather forecast data was added to better contextualize the power data, and LSTM modifications were made to address specific model shortcomings. The models were evaluated using both a Normalized Mean Absolute Error and the Naive Ratio (NR), a score introduced in this paper to quantify the unwanted presence of naive character in trained models. Results showed increased accuracy with the addition of weather forecast data to the modified models, as well as a decrease in naive character. Key contributions include the improved LSTM variants, the use of weather forecast data, and the introduction of a new model performance index.
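The paper's exact NR definition is not reproduced here; the sketch below shows one plausible formulation of a naive-character score, comparing a model's error against the ground truth with its distance from a lag-1 persistence (naive) forecast.

```python
# Sketch: one plausible naive-character score (the paper's exact NR
# definition may differ). Predictions that hug the previous observed
# value rather than the target drive this ratio up.
import numpy as np

def naive_ratio(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    naive = y_true[:-1]                 # persistence forecast: previous value
    pred, true = y_pred[1:], y_true[1:]
    err_vs_truth = np.mean(np.abs(pred - true))
    err_vs_naive = np.mean(np.abs(pred - naive))
    # Ratio >> 1 means predictions sit closer to the naive forecast
    # than to the ground truth, i.e. strong naive character.
    return err_vs_truth / (err_vs_naive + 1e-12)
```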
This project develops and trains a Recurrent Neural Network (RNN) that monitors sleeping infants through an auxiliary microphone for cases of Sudden Infant Death Syndrome (SIDS), which manifests as sudden or gradual respiratory arrest. To minimize invasiveness and maximize economic viability, an electret microphone and parabolic concentrator, paired with a specially designed and tuned amplifier circuit, were used as a highly sensitive audio monitoring device that fed data to the RNN model. The RNN was trained and operated in the frequency domain, where respiratory activity is most distinguishable from noise. In both training and operation, a Fourier transform and autoencoder compression were applied to the raw audio, and the transformed audio data was fed into the model in 1/8-second time steps. In operation, the model flagged each perceived breath, and the times between breaths were analyzed with a statistical t-test for slope to detect dangerous trends. The complete system achieved 92.5% accuracy on continuous data and responded within 11.25 seconds on data emulating total respiratory arrest. Because the trained model is compatible with many off-the-shelf devices such as Android phones and Raspberry Pis, deployment on free-standing processing hardware is a feasible future goal.
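The trend check can be illustrated with a standard t-test on a regression slope over inter-breath intervals; the minimum sample count and significance level below are illustrative assumptions, not the project's tuned values.

```python
# Sketch: t-test for slope on inter-breath intervals. A significantly
# positive slope (lengthening gaps between flagged breaths) raises an alarm.
import numpy as np
from scipy import stats

def breath_trend_alarm(breath_times, alpha=0.05):
    intervals = np.diff(breath_times)     # seconds between flagged breaths
    if len(intervals) < 3:
        return False                      # too few points for a slope test
    res = stats.linregress(np.arange(len(intervals)), intervals)
    # linregress reports a two-sided p-value for slope != 0;
    # halve it for the one-sided test that the slope is positive.
    return res.slope > 0 and res.pvalue / 2 < alpha

# Example: steadily lengthening gaps between breaths trip the alarm.
times = np.cumsum([1.0, 1.1, 1.3, 1.6, 2.0, 2.5, 3.1])
print(breath_trend_alarm(times))  # True
```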