Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Cordelia Schmid

Thoth

M&M Mix: A Multimodal Multiview Transformer Ensemble

Jun 20, 2022

Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid

Figure 1 for M&M Mix: A Multimodal Multiview Transformer Ensemble

Figure 2 for M&M Mix: A Multimodal Multiview Transformer Ensemble

Figure 3 for M&M Mix: A Multimodal Multiview Transformer Ensemble

Figure 4 for M&M Mix: A Multimodal Multiview Transformer Ensemble

Abstract:This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year's winning entry.

* Technical report for Epic-Kitchens challenge 2022

Via

Access Paper or Ask Questions

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Jun 16, 2022

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Figure 1 for Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Figure 2 for Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Figure 3 for Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Figure 4 for Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Abstract:Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.

* 23 pages; 5 figures

Via

Access Paper or Ask Questions

AVATAR: Unconstrained Audiovisual Speech Recognition

Jun 15, 2022

Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Abstract:Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all other prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.

Via

Access Paper or Ask Questions

Weakly-supervised segmentation of referring expressions

May 12, 2022

Robin Strudel, Ivan Laptev, Cordelia Schmid

Figure 1 for Weakly-supervised segmentation of referring expressions

Figure 2 for Weakly-supervised segmentation of referring expressions

Figure 3 for Weakly-supervised segmentation of referring expressions

Figure 4 for Weakly-supervised segmentation of referring expressions

Abstract:Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been addressed in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGgmentation (TSEG) that learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. Our approach TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.

Via

Access Paper or Ask Questions

Learning to Answer Visual Questions from Web Videos

May 11, 2022

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Figure 1 for Learning to Answer Visual Questions from Web Videos

Figure 2 for Learning to Answer Visual Questions from Web Videos

Figure 3 for Learning to Answer Visual Questions from Web Videos

Figure 4 for Learning to Answer Visual Questions from Web Videos

Abstract:Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html

* Accepted at the TPAMI Special Issue on the Best Papers of ICCV 2021. Journal extension of the conference paper arXiv:2012.00451. 16 pages, 13 figures

Via

Access Paper or Ask Questions

Assembly Planning from Observations under Physical Constraints

Apr 20, 2022

Thomas Chabal, Robin Strudel, Etienne Arlaud, Jean Ponce, Cordelia Schmid

Figure 1 for Assembly Planning from Observations under Physical Constraints

Figure 2 for Assembly Planning from Observations under Physical Constraints

Figure 3 for Assembly Planning from Observations under Physical Constraints

Figure 4 for Assembly Planning from Observations under Physical Constraints

Abstract:This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation. The proposed algorithm uses a simple combination of physical stability constraints, convex optimization and Monte Carlo tree search to plan assemblies as sequences of pick-and-place operations represented by STRIPS operators. It is efficient and, most importantly, robust to the errors in object detection and pose estimation unavoidable in any real robotic system. The proposed approach is demonstrated with thorough experiments on a UR5 manipulator.

* See the project webpage at https://www.di.ens.fr/willow/research/assembly-planning/

Via

Access Paper or Ask Questions

Learning Audio-Video Modalities from Image Captions

Apr 01, 2022

Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

Figure 1 for Learning Audio-Video Modalities from Image Captions

Figure 2 for Learning Audio-Video Modalities from Image Captions

Figure 3 for Learning Audio-Video Modalities from Image Captions

Figure 4 for Learning Audio-Video Modalities from Image Captions

Abstract:A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are in the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new large-scale, weakly labelled audio-video captioning dataset consisting of millions of paired clips and captions. We show that training a multimodal transformed based model on this data achieves competitive performance on video retrieval and video captioning, matching or even outperforming HowTo100M pretraining with 20x fewer clips. We also show that our mined clips are suitable for text-audio pretraining, and achieve state of the art results for the task of audio retrieval.

Via

Access Paper or Ask Questions

TubeDETR: Spatio-Temporal Video Grounding with Transformers

Mar 30, 2022

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Figure 1 for TubeDETR: Spatio-Temporal Video Grounding with Transformers

Figure 2 for TubeDETR: Spatio-Temporal Video Grounding with Transformers

Figure 3 for TubeDETR: Spatio-Temporal Video Grounding with Transformers

Figure 4 for TubeDETR: Spatio-Temporal Video Grounding with Transformers

Abstract:We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.

* Accepted at CVPR 2022 (Oral); 17 pages; 8 figures

Via

Access Paper or Ask Questions

Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

Mar 11, 2022

Quentin Le Lidec, Louis Montaut, Cordelia Schmid, Ivan Laptev, Justin Carpentier

Figure 1 for Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

Figure 2 for Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

Figure 3 for Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

Figure 4 for Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

Abstract:Optimal control (OC) algorithms such as Differential Dynamic Programming (DDP) take advantage of the derivatives of the dynamics to efficiently control physical systems. Yet, in the presence of nonsmooth dynamical systems, such class of algorithms are likely to fail due, for instance, to the presence of discontinuities in the dynamics derivatives or because of non-informative gradient. On the contrary, reinforcement learning (RL) algorithms have shown better empirical results in scenarios exhibiting non-smooth effects (contacts, frictions, etc). Our approach leverages recent works on randomized smoothing (RS) to tackle non-smoothness issues commonly encountered in optimal control, and provides key insights on the interplay between RL and OC through the prism of RS methods. This naturally leads us to introduce the randomized Differential Dynamic Programming (R-DDP) algorithm accounting for deterministic but non-smooth dynamics in a very sample-efficient way. The experiments demonstrate that our method is able to solve classic robotic problems with dry friction and frictional contacts, where classical OC algorithms are likely to fail and RL algorithms require in practice a prohibitive number of samples to find an optimal solution.

Via

Access Paper or Ask Questions

The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Feb 28, 2022

Pia Bideau, Erik Learned-Miller, Cordelia Schmid, Karteek Alahari

Figure 1 for The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Figure 2 for The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Figure 3 for The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Figure 4 for The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields

Abstract:Both a good understanding of geometrical concepts and a broad familiarity with objects lead to our excellent perception of moving objects. The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer and even camouflage. How humans perceive moving objects so reliably is a longstanding research question in computer vision and borrows findings from related areas such as psychology, cognitive science and physics. One approach to the problem is to teach a deep network to model all of these effects. This contrasts with the strategy used by human vision, where cognitive processes and body design are tightly coupled and each is responsible for certain aspects of correctly identifying moving objects. Similarly from the computer vision perspective, there is evidence that classical, geometry-based techniques are better suited to the "motion-based" parts of the problem, while deep networks are more suitable for modeling appearance. In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly. We present a novel probabilistic model to estimate the camera's rotation given the motion field. We then rectify the flow field to obtain a rotation-compensated motion field for subsequent segmentation. This strategy of first estimating camera motion, and then allowing a network to learn the remaining parts of the problem, yields improved results on the widely used DAVIS benchmark as well as the recently published motion segmentation data set MoCA (Moving Camouflaged Animals).

Via

Access Paper or Ask Questions