Cordelia Schmid

Masking Modalities for Cross-modal Video Retrieval

Nov 01, 2021
Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.

* Accepted at WACV 2022
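
The modality-masking idea described in the abstract above (hide one whole modality, predict it from the other two) can be illustrated with a toy data-preparation step. This is only a sketch under stated assumptions: the function name `mask_one_modality`, the dictionary layout, and the tiny feature vectors are hypothetical and are not taken from the paper or its code.

```python
import random

# The three modalities named in the abstract.
MODALITIES = ("appearance", "audio", "speech")


def mask_one_modality(features, rng):
    """Hide one whole modality and return it as the prediction target.

    `features` maps each modality name to its features (hypothetical layout).
    Returns (visible_inputs, masked_name, target).
    """
    masked = rng.choice(MODALITIES)
    # The encoder would only see the two remaining modalities.
    visible = {name: feat for name, feat in features.items() if name != masked}
    target = features[masked]
    return visible, masked, target


# Toy clip with made-up 2-D features per modality.
clip = {
    "appearance": [0.1, 0.4],
    "audio": [0.7, 0.2],
    "speech": [0.3, 0.9],
}
visible, masked, target = mask_one_modality(clip, random.Random(0))
```

In a real training loop the `visible` features would feed the video encoder and a loss would compare its prediction with `target`; both are omitted here.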

History Aware Multimodal Transformer for Vision-and-Language Navigation

Oct 25, 2021
Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models the spatial relations between images in a panoramic observation, and finally takes into account the temporal relations between panoramas in the history. It then jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.

* Accepted at NeurIPS 2021; project page at https://cshizhe.github.io/projects/vln_hamt.html

Differentiable Rendering with Perturbed Optimizers

Oct 18, 2021
Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier

Reasoning about 3D scenes from their 2D image projections is one of the core problems in computer vision. Solutions to this inverse and ill-posed problem typically involve a search for models that best explain observed image data. Notably, images depend both on the properties of observed scenes and on the process of image formation. Hence, if optimization techniques should be used to explain images, it is crucial to design differentiable functions for the projection of 3D scenes into images, also known as differentiable rendering. Previous approaches to differentiable rendering typically replace non-differentiable operations by smooth approximations, impacting the subsequent 3D estimation. In this paper, we take a more general approach and study differentiable renderers through the prism of randomized optimization and the related notion of perturbed optimizers. In particular, our work highlights the link between some well-known differentiable renderer formulations and randomly smoothed optimizers, and introduces differentiable perturbed renderers. We also propose a variance reduction mechanism to alleviate the computational burden inherent to perturbed optimizers and introduce an adaptive scheme to automatically adjust the smoothing parameters of the rendering process. We apply our method to 3D scene reconstruction and demonstrate its advantages on the tasks of 6D pose estimation and 3D mesh reconstruction. By providing informative gradients that can be used as a strong supervisory signal, we demonstrate the benefits of perturbed renderers to obtain more accurate solutions when compared to the state-of-the-art alternatives using smooth gradient approximations.
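
The notion of a perturbed optimizer mentioned in the abstract above can be illustrated on the simplest discrete operation, an argmax. The sketch below is a generic Monte Carlo illustration, not the renderer from the paper: the function name `perturbed_argmax`, the Gaussian noise choice, and the sample count are assumptions made for the example.

```python
import random


def perturbed_argmax(scores, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[one_hot(argmax(scores + sigma * Z))].

    Z is standard Gaussian noise (an assumption for this sketch).
    With sigma = 0 this is the hard, piecewise-constant argmax; with
    sigma > 0 the output varies smoothly with the scores.
    """
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(n_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        counts[noisy.index(max(noisy))] += 1
    return [c / n_samples for c in counts]


hard = perturbed_argmax([1.0, 2.0, 1.5], sigma=0.0)  # one-hot on the largest score
soft = perturbed_argmax([1.0, 2.0, 1.5], sigma=1.0)  # mass spread over all entries
```

The Monte Carlo average is also where the computational burden noted in the abstract comes from: each estimate needs many perturbed evaluations, which is what a variance reduction mechanism would aim to cut down.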

Airbert: In-domain Pretraining for Vision-and-Language Navigation

Aug 20, 2021
Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.

* To be published at ICCV 2021. The webpage at https://airbert-vln.github.io/ links to our dataset, code and models

Towards unconstrained joint hand-object reconstruction from RGB videos

Aug 16, 2021
Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid

Our work aims to obtain 3D reconstructions of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is available. In this paper we first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions. Our method relies on cues obtained with common methods for object detection, hand pose estimation and instance segmentation. We quantitatively evaluate our approach and show that it can be applied to datasets with varying levels of difficulty for which training data is unavailable.

* Project website: https://hassony2.github.io/homan.html

CCVS: Context-aware Controllable Video Synthesis

Jul 16, 2021
Guillaume Le Moing, Jean Ponce, Cordelia Schmid

This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: it conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control. The prediction model is doubly autoregressive, in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in the appearance and temporal domains is used to further improve the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds even more flexibility by affording simple mechanisms for handling multimodal ancillary information for controlling the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space) and taking into account the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

Jul 01, 2021
Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning. In this work, we propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic. This high-level policy predicts intermediate states halfway to the goal using the value function as a reachability metric. We don't require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy, and incorporate this prior into a KL-constrained policy iteration scheme to speed up and regularize learning. Imagined subgoals are used during policy learning, but not during test time, where we only apply the learned policy. We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.

* ICML 2021. See the project webpage at https://www.di.ens.fr/willow/research/ris/
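
The abstract above describes subgoals that lie "halfway to the goal" according to the value function. One way to read this, sketched below, is to score each candidate subgoal by the harder of its two legs (state to subgoal, subgoal to goal) and keep the candidate with the lowest score. Everything here is an illustrative assumption: the 1-D state space, the toy `value` function, the max-of-two-legs score, and the name `pick_subgoal` are not taken from the paper's code, and the paper trains a high-level policy rather than searching over a candidate list.

```python
def value(state, goal):
    """Toy goal-conditioned value: negative distance on a 1-D line.

    Stands in for a learned value function used as a reachability metric.
    """
    return -abs(goal - state)


def pick_subgoal(state, goal, candidates):
    """Return the candidate whose harder leg is as easy as possible."""
    def cost(subgoal):
        # |V| is small when the second argument is easy to reach from the first.
        return max(abs(value(state, subgoal)), abs(value(subgoal, goal)))

    return min(candidates, key=cost)


# Start at 0, goal at 10: the midpoint candidate balances both legs.
subgoal = pick_subgoal(0.0, 10.0, [1.0, 3.0, 5.0, 8.0])
```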

Attention Bottlenecks for Multimodal Fusion

Jun 30, 2021
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
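
A back-of-the-envelope count helps show why routing cross-modal information through a few bottleneck latents can reduce cost, as the abstract above claims. The sketch assumes that, in a fusion layer, each modality attends only over its own tokens plus the shared bottleneck tokens; that reading, the token counts, and the function names are assumptions for illustration and are not figures from the paper.

```python
def attention_pairs_full(n_video, n_audio):
    """Query-key pairs when all tokens of both modalities attend to each other."""
    total = n_video + n_audio
    return total * total


def attention_pairs_bottleneck(n_video, n_audio, n_bottleneck):
    """Query-key pairs when each modality attends over itself plus the bottleneck."""
    video_side = (n_video + n_bottleneck) ** 2
    audio_side = (n_audio + n_bottleneck) ** 2
    return video_side + audio_side


# Hypothetical token counts: 196 video tokens, 100 audio tokens, 4 bottleneck tokens.
full = attention_pairs_full(196, 100)
bottleneck = attention_pairs_bottleneck(196, 100, 4)
```

With these made-up sizes the bottleneck layout needs 50816 query-key pairs against 87616 for full pairwise attention; the gap widens as the two token sequences grow.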

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps

Jun 28, 2021
Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov

High Definition (HD) maps are maps with precise definitions of road lanes and rich semantics of the traffic rules. They are critical for several key stages in an autonomous driving system, including motion forecasting and planning. However, there is only a small amount of real-world road topologies and geometries, which significantly limits our ability to test whether the self-driving stack generalizes to new, unseen scenarios. To address this issue, we introduce a new challenging task to generate HD maps. In this work, we explore several autoregressive models using different data representations, including sequence, plain graph, and hierarchical graph. We propose HDMapGen, a hierarchical graph generation model capable of producing high-quality and diverse HD maps through a coarse-to-fine approach. Experiments on the Argoverse dataset and an in-house dataset show that HDMapGen significantly outperforms baseline methods. Additionally, we demonstrate that HDMapGen achieves high scalability and efficiency.

Residual Reinforcement Learning from Demonstrations

Jun 15, 2021
Minttu Alakuijala, Gabriel Dulac-Arnold, Julien Mairal, Jean Ponce, Cordelia Schmid

Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal. We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations. Learning from images, proprioceptive inputs and a sparse task-completion reward relaxes the requirement of accessing full state features, such as object and target positions. In addition, replacing the base controller with a policy learned from demonstrations removes the dependency on a hand-engineered controller in favour of a dataset of demonstrations, which can be provided by non-experts. Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning, and is capable of solving high-dimensional, sparse-reward tasks out of reach for RL from scratch.
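
The residual formulation in the abstract above amounts to adding a learned correction to the action proposed by a base policy. A minimal sketch follows, assuming actions are lists of floats bounded in [-1, 1]; the function name `residual_action`, the bounds, and the clipping step are illustrative assumptions rather than details from the paper.

```python
def residual_action(base_action, residual, low=-1.0, high=1.0):
    """Combine a base policy's action with a learned residual correction.

    Each component is summed, then clipped to the (assumed) action bounds.
    """
    return [min(high, max(low, b + r)) for b, r in zip(base_action, residual)]


# Base action from a policy learned from demonstrations (made-up numbers),
# plus a small correction from the RL-trained residual policy.
action = residual_action([0.5, -0.9], [0.2, -0.3])
```

Here the second component is clipped at the lower bound because the sum falls outside the assumed action range.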
