Ivan Laptev

Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos

Nov 02, 2021
Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, Josef Sivic

In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces exerted on the human body. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions. This is cast as a large-scale trajectory optimization problem. Second, we develop a method to automatically recognize from the input video the 2D position and timing of contacts between the person and the object or the ground, thereby significantly reducing the complexity of the optimization. Third, we validate our approach on a recent video+MoCap dataset capturing typical parkour actions, and demonstrate its performance on a new dataset of Internet videos showing people manipulating a variety of tools in unconstrained environments.

* arXiv admin note: substantial text overlap with arXiv:1904.02683

History Aware Multimodal Transformer for Vision-and-Language Navigation

Oct 25, 2021
Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. To remember previously visited locations and actions taken, most approaches to VLN implement memory using recurrent states. Instead, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models the spatial relations between images in a panoramic observation, and finally takes into account the temporal relations between panoramas in the history. It then jointly combines text, history and the current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks, including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves a new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR), high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back). We demonstrate HAMT to be particularly effective for navigation tasks with longer trajectories.

* Accepted in NeurIPS 2021; project page at https://cshizhe.github.io/projects/vln_hamt.html

Differentiable Rendering with Perturbed Optimizers

Oct 18, 2021
Quentin Le Lidec, Ivan Laptev, Cordelia Schmid, Justin Carpentier

Reasoning about 3D scenes from their 2D image projections is one of the core problems in computer vision. Solutions to this inverse and ill-posed problem typically involve a search for models that best explain observed image data. Notably, images depend both on the properties of observed scenes and on the process of image formation. Hence, if optimization techniques are to be used to explain images, it is crucial to design differentiable functions for the projection of 3D scenes into images, also known as differentiable rendering. Previous approaches to differentiable rendering typically replace non-differentiable operations with smooth approximations, impacting the subsequent 3D estimation. In this paper, we take a more general approach and study differentiable renderers through the prism of randomized optimization and the related notion of perturbed optimizers. In particular, our work highlights the link between some well-known differentiable renderer formulations and randomly smoothed optimizers, and introduces differentiable perturbed renderers. We also propose a variance reduction mechanism to alleviate the computational burden inherent to perturbed optimizers and introduce an adaptive scheme to automatically adjust the smoothing parameters of the rendering process. We apply our method to 3D scene reconstruction and demonstrate its advantages on the tasks of 6D pose estimation and 3D mesh reconstruction. By providing informative gradients that can be used as a strong supervisory signal, we demonstrate the benefits of perturbed renderers to obtain more accurate solutions when compared to the state-of-the-art alternatives using smooth gradient approximations.
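
As a rough illustration of the general notion of a perturbed optimizer mentioned in the abstract above, the sketch below smooths a hard argmax by averaging it over random Gaussian perturbations of its input. It is not the paper's renderer: the function names, the Gaussian noise model, and the parameters `sigma`, `num_samples` and `seed` are assumptions chosen for illustration only.

```python
import random


def hard_argmax(theta):
    """One-hot indicator of the largest entry (piecewise constant in theta)."""
    best = max(range(len(theta)), key=lambda i: theta[i])
    return [1.0 if i == best else 0.0 for i in range(len(theta))]


def perturbed_argmax(theta, sigma=0.5, num_samples=1000, seed=0):
    """Monte Carlo estimate of E[hard_argmax(theta + sigma * Z)], Z ~ N(0, I).

    The average of many one-hot outputs varies smoothly with theta,
    unlike the hard argmax itself.
    """
    rng = random.Random(seed)
    mean = [0.0] * len(theta)
    for _ in range(num_samples):
        noisy = [t + sigma * rng.gauss(0.0, 1.0) for t in theta]
        for i, y in enumerate(hard_argmax(noisy)):
            mean[i] += y / num_samples
    return mean


if __name__ == "__main__":
    scores = [1.0, 1.2, 0.5]
    print(hard_argmax(scores))
    print(perturbed_argmax(scores, sigma=0.5, num_samples=2000))
```

Larger `sigma` gives a smoother but more biased output, and the Monte Carlo average needs many samples to be stable, which is the kind of computational burden a variance reduction mechanism would target.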

Reconstructing and grounding narrated instructional videos in 3D

Sep 10, 2021
Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges with three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide-and-conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in the obtained 3D reconstructions. We demonstrate the effectiveness of our approach for the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with corresponding objects in 3D.

Airbert: In-domain Pretraining for Vision-and-Language Navigation

Aug 20, 2021
Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, Cordelia Schmid

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.

* To be published in ICCV 2021. The webpage at https://airbert-vln.github.io/ links to our dataset, code and models

Towards unconstrained joint hand-object reconstruction from RGB videos

Aug 16, 2021
Yana Hasson, Gül Varol, Ivan Laptev, Cordelia Schmid

Our work aims to obtain 3D reconstructions of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is available. In this paper we first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions. Our method relies on cues obtained with common methods for object detection, hand pose estimation and instance segmentation. We quantitatively evaluate our approach and show that it can be applied to datasets with varying levels of difficulty for which training data is unavailable.

* Project website: https://hassony2.github.io/homan.html

Goal-Conditioned Reinforcement Learning with Imagined Subgoals

Jul 01, 2021
Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning. In this work, we propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic. This high-level policy predicts intermediate states halfway to the goal using the value function as a reachability metric. We don't require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy, and incorporate this prior into a KL-constrained policy iteration scheme to speed up and regularize learning. Imagined subgoals are used during policy learning, but not during test time, where we only apply the learned policy. We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.

* ICML 2021. See the project webpage at https://www.di.ens.fr/willow/research/ris/

XCiT: Cross-Covariance Image Transformers

Jun 18, 2021
Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
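
To make the "transposed" attention idea in the abstract above concrete, here is a minimal single-head sketch in plain Python. It builds a d x d attention map over feature channels from the keys and queries, so the map's size does not grow with the number of tokens N. The L2 normalization along the token axis, the `temperature` parameter, the softmax axis and all function names are assumptions of this sketch, not details given in the abstract.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def l2_normalize_columns(m):
    """Scale each feature channel (column) to unit norm across tokens."""
    normed = []
    for col in transpose(m):
        norm = math.sqrt(sum(x * x for x in col)) or 1.0
        normed.append([x / norm for x in col])
    return transpose(normed)


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_covariance_attention(q, k, v, temperature=1.0):
    """q, k, v: N x d lists (N tokens, d channels). Returns (N x d output, d x d map)."""
    q_hat = l2_normalize_columns(q)
    k_hat = l2_normalize_columns(k)
    # d x d channel-to-channel scores: cost grows linearly with N.
    scores = matmul(transpose(q_hat), k_hat)
    scores = [[s / temperature for s in row] for row in scores]
    attn = softmax_rows(scores)
    # Each output channel is a weighted mix of the value channels.
    return matmul(v, transpose(attn)), attn


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [0.0, 1.0]]
    out, attn = cross_covariance_attention(tokens, tokens, tokens)
    print(len(attn), "x", len(attn[0]), "attention map for", len(tokens), "tokens")
```

By contrast, token self-attention would form an N x N map, which is where the quadratic cost in the number of tokens comes from.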

Segmenter: Transformer for Semantic Segmentation

May 12, 2021
Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid

Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based approaches, our approach allows modeling global context already at the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on moderate-sized datasets available for semantic segmentation. The linear decoder already yields excellent results, but the performance can be further improved by a mask transformer generating class masks. We conduct an extensive ablation study to show the impact of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation. It outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.

* Code available at https://github.com/rstrudel/segmenter
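
The point-wise linear decoder described in the abstract can be pictured with the toy sketch below, which is independent of the authors' released code. Each patch embedding is mapped to class scores by one shared linear layer, and the highest-scoring class becomes that patch's label. The function name `linear_decode`, the row-major patch order and the toy weights are assumptions for illustration.

```python
def linear_decode(patch_embeddings, weights, biases, grid_width):
    """Assign one class per patch with a shared linear layer.

    patch_embeddings: list of embedding vectors, one per patch (row-major order).
    weights: one weight vector per class; biases: one bias per class.
    Returns a 2D grid of class indices with `grid_width` patches per row.
    """
    labels = []
    for emb in patch_embeddings:
        logits = [
            sum(w * e for w, e in zip(row, emb)) + b
            for row, b in zip(weights, biases)
        ]
        labels.append(max(range(len(logits)), key=lambda c: logits[c]))
    return [labels[r:r + grid_width] for r in range(0, len(labels), grid_width)]


if __name__ == "__main__":
    # Four patches on a 2 x 2 grid, 2-dimensional embeddings, 2 classes.
    patches = [[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.5, 0.6]]
    class_weights = [[1.0, 0.0], [0.0, 1.0]]
    class_biases = [0.0, 0.0]
    for row in linear_decode(patches, class_weights, class_biases, grid_width=2):
        print(row)
```

Because the same weights are applied to every patch independently, any use of context has to come from the transformer encoder that produced the embeddings, not from this decoder.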

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Mar 30, 2021
Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often inapplicable in practice for large-scale retrieval given the cost of the cross-attention mechanisms required for each sample at test time. This work combines the best of both worlds. We make the following three contributions. First, we equip transformer-based models with a new fine-grained cross-attention architecture, providing significant improvements in retrieval accuracy whilst preserving scalability. Second, we introduce a generic approach for combining a Fast dual encoder model with our Slow but accurate transformer-based model via distillation and re-ranking. Finally, we validate our approach on the Flickr30K image dataset where we show an increase in inference speed by several orders of magnitude while having results competitive to the state of the art. We also extend our method to the video domain, improving the state of the art on the VATEX dataset.

* Accepted to CVPR 2021
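
The retrieve-then-re-rank idea from the abstract above can be sketched as follows: a cheap dual-encoder score (a dot product between precomputed embeddings) shortlists a few candidates, and only those are passed to an expensive scorer. Everything here is a hypothetical stand-in: `slow_score` represents a cross-attention model, the brute-force scan replaces approximate nearest neighbour search, and the distillation step is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query_vec, item_vecs, slow_score, top_k=3):
    """Shortlist with fast dot-product scores, then re-rank with a slow scorer.

    slow_score: callable taking an item index and returning a relevance score;
    it is called only on the top_k shortlisted items.
    """
    fast = [(dot(query_vec, vec), idx) for idx, vec in enumerate(item_vecs)]
    shortlist = [idx for _, idx in sorted(fast, reverse=True)[:top_k]]
    return sorted(shortlist, key=slow_score, reverse=True)


if __name__ == "__main__":
    items = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
    expensive = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}
    print(retrieve_then_rerank([1.0, 0.0], items, expensive.get, top_k=2))
```

The trade-off is visible in the toy data: an item that the slow scorer would rank first is never considered if the fast stage does not shortlist it, so the quality of the fast model bounds the final result.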
