Sai Rajeswar

Efficient Dynamics Modeling in Interactive Environments with Koopman Theory

Jul 12, 2023
Arnab Kumar Mondal, Siba Smarak Panigrahi, Sai Rajeswar, Kaleem Siddiqi, Siamak Ravanbakhsh

The accurate modeling of dynamics in interactive environments is critical for successful long-range prediction. Such a capability could advance Reinforcement Learning (RL) and planning algorithms, but achieving it is challenging. Inaccuracies in model estimates can compound, resulting in increased errors over long horizons. We approach this problem through the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution, while accounting for the agent's action at every time step. Our approach also enables stability analysis and better control over gradients through time. Taken together, these advantages result in significant improvements over existing approaches, both in the efficiency and the accuracy of modeling dynamics over extended horizons. We also report promising experimental results in dynamics modeling for both model-based planning and model-free RL.
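
Below is a minimal NumPy sketch of the core idea (not the paper's code): with a Koopman-style linear latent model z_{t+1} = A z_t + B u_t, a length-T rollout has a closed form, so every horizon can be computed in parallel as a convolution of the action sequence with the kernel A^k B. The matrices here are random stand-ins for quantities that would be learned jointly with an encoder.

import numpy as np

rng = np.random.default_rng(0)
d, m, T = 8, 2, 16                        # latent dim, action dim, horizon
A = rng.normal(scale=0.3, size=(d, d))    # linear latent transition (learned in practice)
B = rng.normal(size=(d, m))               # action-effect matrix (learned in practice)
z0 = rng.normal(size=d)                   # initial latent state (from an encoder in practice)
U = rng.normal(size=(T, m))               # action sequence u_0, ..., u_{T-1}

# Sequential rollout: z_{t+1} = A z_t + B u_t
z = z0.copy()
for t in range(T):
    z = A @ z + B @ U[t]

# Closed-form (convolution-like) rollout of the same horizon:
# z_T = A^T z_0 + sum_{k=0}^{T-1} A^{T-1-k} B u_k
powers = [np.linalg.matrix_power(A, k) for k in range(T + 1)]
z_parallel = powers[T] @ z0 + sum(powers[T - 1 - k] @ B @ U[k] for k in range(T))

assert np.allclose(z, z_parallel)         # both rollouts agree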

* 18 pages, 3 figures 

Choreographer: Learning and Adapting Skills in Imagination

Nov 23, 2022
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, Sai Rajeswar

Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes and is able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data and by collecting data simultaneously with an exploration policy. The skills can be used to effectively adapt to downstream tasks, as we show on the URL benchmark, where we outperform previous approaches from both pixel and state inputs. The learned skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World. Project website: https://skillchoreographer.github.io/
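
As a rough illustration of the adaptation step described above (a hedged sketch, not the authors' implementation), a meta-controller can score each learned skill by rolling it out in imagination inside the world model and prefer skills with higher predicted task return. The world_model, skill policies, and task_reward below are hypothetical placeholders.

import numpy as np

def evaluate_skills_in_imagination(world_model, skills, z0, task_reward, horizon=15):
    """Imagined return of each skill policy, starting from latent state z0."""
    returns = []
    for skill_policy in skills:              # each skill can be rolled out in parallel
        z, ret = z0, 0.0
        for _ in range(horizon):
            a = skill_policy(z)              # skill-conditioned action
            z = world_model.step(z, a)       # imagined latent transition
            ret += task_reward(z)            # task reward predicted from the latent
        returns.append(ret)
    return np.array(returns)

# A meta-controller could then sample skills in proportion to these imagined
# returns (e.g. via a softmax) and fine-tune the most promising ones on-task.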

Unsupervised Model-based Pre-training for Data-efficient Control from Pixels

Sep 24, 2022
Sai Rajeswar, Pietro Mazzaglia, Tim Verbelen, Alexandre Piché, Bart Dhoedt, Aaron Courville, Alexandre Lacoste

Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed at this but require large amounts of interaction between the agent and the environment. To alleviate the issue, unsupervised RL proposes to employ self-supervised interaction and learning in order to adapt faster to future tasks. Yet, whether current unsupervised strategies improve generalization capabilities is still unclear, especially in visual control settings. In this work, we design an effective unsupervised RL strategy for data-efficient visual control. First, we show that world models pre-trained with data collected using unsupervised RL can facilitate adaptation to future tasks. Then, we analyze several design choices for adapting efficiently, effectively reusing the agent's pre-trained components and learning and planning in imagination with our hybrid planner, which we dub Dyna-MPC. By combining the findings of a large-scale empirical study, we establish an approach that strongly improves performance on the Unsupervised RL Benchmark, requiring 20$\times$ less data to match the performance of supervised methods. The approach also demonstrates robust performance on the Real-World RL benchmark, hinting that it generalizes to noisy environments.
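
For intuition on planning in imagination with a learned model, here is a hedged sketch of a generic CEM-style planner with value bootstrapping. It is an illustration in the spirit of the hybrid planner described above, not the paper's exact Dyna-MPC; the step, reward, and value callables are assumed to come from a pre-trained world model and agent.

import numpy as np

def plan_in_imagination(step, reward, value, z0, action_dim,
                        horizon=10, candidates=64, iters=4, top_k=8, seed=0):
    """Pick the next action by refining sampled action sequences in imagination."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences and score them inside the latent model.
        actions = mean + std * rng.normal(size=(candidates, horizon, action_dim))
        scores = np.zeros(candidates)
        for i in range(candidates):
            z = z0
            for t in range(horizon):
                scores[i] += reward(z, actions[i, t])
                z = step(z, actions[i, t])
            scores[i] += value(z)                       # value bootstrap beyond the horizon
        elite = actions[np.argsort(scores)[-top_k:]]    # keep the best sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]                                      # execute only the first action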

* Presented at DARL Workshop @ ICML 2022 

Multi-label Iterated Learning for Image Classification with Label Ambiguity

Nov 23, 2021
Sai Rajeswar, Pau Rodriguez, Soumye Singhal, David Vazquez, Aaron Courville

Transfer learning from large-scale pre-trained models has become essential for many computer vision tasks. Recent studies have shown that datasets like ImageNet are weakly labeled, since images with multiple object classes present are assigned a single label. This ambiguity biases models towards a single prediction, which could result in the suppression of classes that tend to co-occur in the data. Inspired by the language emergence literature, we propose multi-label iterated learning (MILe) to incorporate the inductive biases of multi-label learning from single labels using the framework of iterated learning. MILe is a simple yet effective procedure that builds a multi-label description of the image by propagating binary predictions through successive generations of teacher and student networks with a learning bottleneck. Experiments show that our approach exhibits systematic benefits on ImageNet accuracy as well as ReaL F1 score, which indicates that MILe deals better with label ambiguity than the standard training procedure, even when fine-tuning from self-supervised weights. We also show that MILe is effective at reducing label noise, achieving state-of-the-art performance on real-world large-scale noisy datasets such as WebVision. Furthermore, MILe improves performance in class-incremental settings such as IIRC, and it is robust to distribution shifts. Code: https://github.com/rajeswar18/MILe
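
A schematic sketch of the iterated-learning loop described above (hedged: the callables, threshold, and number of generations are illustrative placeholders, not the paper's exact training recipe). A teacher briefly fits the single-label data, its thresholded per-class predictions become multi-label pseudo-labels, a freshly initialized student fits those, and the student seeds the next generation.

def multilabel_iterated_learning(make_model, fit_single, fit_multi, predict_proba,
                                 data, generations=5, thresh=0.5):
    """data: list of (image, single_label) pairs; all callables are user-supplied stubs."""
    teacher = make_model()
    for _ in range(generations):
        fit_single(teacher, data)                         # short supervised phase (the bottleneck)
        pseudo = [(x, predict_proba(teacher, x) > thresh) # binarized multi-label predictions
                  for x, _ in data]
        student = make_model()                            # fresh student each generation
        fit_multi(student, pseudo)                        # learn from the pseudo-labels
        teacher = student                                 # the student becomes the next teacher
    return teacher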

Touch-based Curiosity for Sparse-Reward Tasks

Apr 01, 2021
Sai Rajeswar, Cyril Ibrahim, Nitin Surya, Florian Golemo, David Vazquez, Aaron Courville, Pedro O. Pinheiro

Robots in many real-world settings have access to force/torque sensors in their grippers, and tactile sensing is often necessary in tasks that involve contact-rich motion. In this work, we leverage surprise from mismatches in touch feedback to guide exploration in hard sparse-reward reinforcement learning tasks. Our approach, Touch-based Curiosity (ToC), learns what interactions with visible objects are supposed to "feel" like. We encourage exploration by rewarding interactions where the expectation and the experience do not match. In our proposed method, an initial task-independent exploration phase is followed by an on-task learning phase, in which the original interactions are relabeled with on-task rewards. We test our approach on a range of touch-intensive robot arm tasks (e.g., pushing objects, opening doors), which we also release as part of this work. Across multiple experiments in a simulated setting, we demonstrate that our method is able to learn these difficult tasks through sparse reward and curiosity alone. We compare our cross-modal approach to single-modality (touch- or vision-only) approaches as well as other curiosity-based methods, and find that our method performs better and is more sample-efficient.
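
The cross-modal signal can be summarized in a few lines (a minimal sketch, with a hypothetical predict_touch model standing in for the learned predictor): the agent predicts touch readings from visual input, and the intrinsic reward is the error between predicted and experienced touch.

import numpy as np

def touch_curiosity_reward(predict_touch, visual_obs, touch_obs):
    """Higher reward when the experienced touch differs from what vision predicts."""
    expected = predict_touch(visual_obs)                  # expected force/torque reading
    return float(np.mean((np.asarray(expected) - np.asarray(touch_obs)) ** 2))

# During the task-free phase this reward drives exploration; the collected
# interactions are later relabeled with the sparse on-task reward.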

Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation

Apr 17, 2020
Sai Rajeswar, Fahim Mannan, Florian Golemo, Jérôme Parent-Lévesque, David Vazquez, Derek Nowrouzezahrai, Aaron Courville

We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach that solves this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on 3D-IQTT, a novel benchmark we developed to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
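
A schematic wiring of the four components, for orientation only (the encoder, decoder, renderer, and critic here are hypothetical callables standing in for the paper's networks, not its implementation):

def pix2shape_step(encoder, decoder, renderer, critic, real_image):
    """One forward pass through the four-component pipeline."""
    z = encoder(real_image)            # (i) latent 3D scene code
    surfels = decoder(z)               # (ii) explicit 2.5D surfel-based reconstruction
    fake_image = renderer(surfels)     # (iii) differentiable rendering back to 2D
    # (iv) the critic scores rendered vs. real images; the encoder and decoder
    # are trained adversarially to make the two indistinguishable.
    return critic(fake_image), critic(real_image)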

* International Journal of Computer Vision, (2020), 1-16  
* This is a pre-print of an article published in International Journal of Computer Vision. The final authenticated version is available online at: https://doi.org/10.1007/s11263-020-01322-1 

Adversarial Computation of Optimal Transport Maps

Jun 24, 2019
Jacob Leygonie, Jennifer She, Amjad Almahairi, Sai Rajeswar, Aaron Courville

Computing optimal transport maps between continuous, high-dimensional distributions is a challenging problem in optimal transport (OT). Generative adversarial networks (GANs) are powerful generative models which have been successfully applied to learn maps across high-dimensional domains. However, little is known about the nature of the map learned with a GAN objective. To address this problem, we propose a generative adversarial model in which the discriminator's objective is the $2$-Wasserstein metric. We show that during training, our generator follows the $W_2$-geodesic between the initial and the target distributions. As a consequence, it reproduces an optimal map at the end of training. We validate our approach empirically in both low-dimensional and high-dimensional continuous settings, and show that it outperforms prior methods on image data.
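
For reference, the quadratic-cost Monge problem behind this claim can be written as follows (standard OT notation; when an optimal map exists, e.g. for absolutely continuous $P$, it attains the squared $2$-Wasserstein distance):

$$ W_2^2(P, Q) \;=\; \inf_{T:\, T_{\#}P = Q} \; \mathbb{E}_{x \sim P}\,\big\| x - T(x) \big\|_2^2, $$

so a generator whose pushforward matches $Q$ while minimizing this transport cost recovers an optimal map $T^*$.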

Augmented CycleGAN: Learning Many-to-Many Mappings from Unpaired Data

Jun 18, 2018
Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, Aaron Courville

Learning inter-domain mappings from unpaired data can improve performance in structured prediction tasks, such as image segmentation, by reducing the need for paired data. CycleGAN was recently proposed for this problem, but critically assumes the underlying inter-domain mapping is approximately deterministic and one-to-one. This assumption renders the model ineffective for tasks requiring flexible, many-to-many mappings. We propose a new model, called Augmented CycleGAN, which learns many-to-many mappings between domains. We examine Augmented CycleGAN qualitatively and quantitatively on several image datasets.
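
A hedged sketch of one direction of the augmented cycle, based on the abstract (the generators G_ab, G_ba and latent encoders E_a, E_b are illustrative placeholders, and the exact losses differ in the paper): each domain is paired with a latent code, so a single input can map to many outputs, and the cycle reconstructs both the input and its latent.

def augmented_cycle_a_to_b(G_ab, G_ba, E_a, E_b, a, z_b):
    """One A -> B -> A pass over the latent-augmented spaces."""
    b_fake = G_ab(a, z_b)              # many-to-many: the output depends on the latent z_b
    z_a_fake = E_a(a, b_fake)          # infer a latent pairing a with b_fake
    a_rec = G_ba(b_fake, z_a_fake)     # map back to domain A using that latent
    z_b_rec = E_b(b_fake, a_rec)       # recover the latent code z_b
    return a_rec, z_b_rec              # cycle losses encourage a_rec ~ a and z_b_rec ~ z_b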

* ICML 2018 