Josef Sivic

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Apr 07, 2018
Antoine Miech, Ivan Laptev, Josef Sivic

Joint understanding of video and language is an active research area with many applications. Prior work in this domain typically relies on learning text-video embeddings. One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources. To this end, we propose a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training. As a result, our framework can learn improved text-video embeddings simultaneously from image and video datasets. We also show the generalization of MEE to other input modalities such as face descriptors. We evaluate our method on the task of video retrieval and report results on the MPII Movie Description and MSR-VTT datasets. The proposed MEE model demonstrates significant improvements and outperforms previously reported methods on both text-to-video and video-to-text retrieval tasks. Code is available at: https://github.com/antoine77340/Mixture-of-Embedding-Experts
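
As a rough illustration of the mixture-of-experts idea, the PyTorch sketch below gives each input modality its own embedding 'expert', derives the mixture weights from the text, and masks out missing modalities so that the weights are renormalised over the experts that are actually present. The class name, plain linear experts and dimensions are illustrative assumptions; the paper's gated embedding modules and training objective are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfEmbeddingExperts(nn.Module):
    """Illustrative sketch: one embedding 'expert' per input modality plus
    text-derived mixture weights; missing modalities are masked out and the
    weights renormalised over the experts that are actually present."""

    def __init__(self, text_dim, modality_dims, embed_dim=256):
        super().__init__()
        self.video_experts = nn.ModuleList([nn.Linear(d, embed_dim) for d in modality_dims])
        self.text_experts = nn.ModuleList([nn.Linear(text_dim, embed_dim) for _ in modality_dims])
        self.weight_proj = nn.Linear(text_dim, len(modality_dims))

    def forward(self, text, video_feats, available):
        # text: (B, text_dim), video_feats: list of (B, d_i), available: (B, M) 0/1 mask
        logits = self.weight_proj(text).masked_fill(available == 0, float('-inf'))
        weights = F.softmax(logits, dim=1)             # renormalised over present modalities
        sims = []
        for feats, f_v, f_t in zip(video_feats, self.video_experts, self.text_experts):
            v = F.normalize(f_v(feats), dim=1)
            t = F.normalize(f_t(text), dim=1)
            sims.append((v * t).sum(dim=1))            # cosine similarity of this expert
        return (weights * torch.stack(sims, dim=1)).sum(dim=1)

# Two modalities (e.g. appearance and audio), the second one missing for sample 1.
mee = MixtureOfEmbeddingExperts(text_dim=300, modality_dims=[2048, 128])
text = torch.randn(2, 300)
video = [torch.randn(2, 2048), torch.randn(2, 128)]
available = torch.tensor([[1.0, 1.0], [1.0, 0.0]])
print(mee(text, video, available).shape)   # torch.Size([2])
```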

Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions

Apr 04, 2018
Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla

Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds. Practical visual localization approaches need to be robust to a wide variety of viewing conditions, including day-night changes, as well as weather and seasonal variations, while providing highly accurate 6 degree-of-freedom (6DOF) camera pose estimates. In this paper, we introduce the first benchmark datasets specifically designed for analyzing the impact of such factors on visual localization. Using carefully created ground truth poses for query images taken under a wide variety of conditions, we evaluate the impact of various factors on 6DOF camera pose estimation accuracy through extensive experiments with state-of-the-art localization approaches. Based on our results, we draw conclusions about the difficulty of different conditions, showing that long-term localization is far from solved, and propose promising avenues for future work, including sequence-based localization approaches and the need for better local features. Our benchmark is available at visuallocalization.net.

* Accepted to CVPR 2018 as a spotlight 
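
The benchmark reports how often estimated poses fall within given position and orientation error thresholds. As a small illustration (not the benchmark's own evaluation code), the snippet below computes the standard 6DOF pose errors between an estimated and a ground-truth camera pose: the distance between camera centres and the angle of the relative rotation.

```python
import numpy as np

def pose_errors(R_est, c_est, R_gt, c_gt):
    """Position error (in the units of the camera centres) and orientation
    error (in degrees) between an estimated and a ground-truth 6DOF pose."""
    position_err = np.linalg.norm(c_est - c_gt)
    R_rel = R_est @ R_gt.T                                    # relative rotation
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    orientation_err = np.degrees(np.arccos(cos_angle))
    return position_err, orientation_err

# Toy example: a pose 0.5 m off and rotated 2 degrees about the y-axis.
a = np.radians(2.0)
R_est = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
print(pose_errors(R_est, np.array([0.5, 0.0, 0.0]), np.eye(3), np.zeros(3)))
```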

Learnable pooling with Context Gating for video classification

Mar 05, 2018
Antoine Miech, Ivan Laptev, Josef Sivic

Current methods for video analysis often extract frame-level features using pre-trained convolutional neural networks (CNNs). Such features are then aggregated over time, e.g., by simple temporal averaging or by more sophisticated recurrent neural networks such as long short-term memory (LSTM) or gated recurrent units (GRU). In this work we revise existing video representations and study alternative methods for temporal aggregation. We first explore clustering-based aggregation layers and propose a two-stream architecture aggregating audio and visual features. We then introduce a learnable non-linear unit, named Context Gating, aiming to model interdependencies among network activations. Our experimental results show the advantage of both improvements for the task of video classification. In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.

* Presented at the YouTube-8M CVPR'17 Workshop. Winning model of the Kaggle challenge. Under review at TPAMI 
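
The Context Gating unit has a compact form: a learned sigmoid gate that reweights its input, y = sigmoid(Wx + b) * x. The PyTorch sketch below shows this basic element-wise gate only; the clustering-based pooling layers and any normalisation variants discussed in the paper are omitted.

```python
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    """Element-wise gating of activations: y = sigmoid(W x + b) * x."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)) * x

x = torch.randn(4, 1024)                 # e.g. a pooled audio-visual descriptor
print(ContextGating(1024)(x).shape)      # torch.Size([4, 1024])
```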

Localizing Moments in Video with Natural Language

Aug 04, 2017
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.

* ICCV 2017 
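
As a rough sketch of the retrieval setup rather than the MCN architecture itself, the snippet below scores a candidate moment against a query by embedding the moment's local features, the video's global features and the normalised temporal endpoints into a shared space and ranking by distance. The encoder choices, dimensions and helper names are assumptions.

```python
import torch
import torch.nn as nn

class MomentScorer(nn.Module):
    """Schematic moment scorer: embed [local avg, global avg, endpoints]
    and a query vector into a shared space, rank moments by distance."""
    def __init__(self, feat_dim, query_dim, embed_dim=256):
        super().__init__()
        self.video_fc = nn.Linear(2 * feat_dim + 2, embed_dim)
        self.query_fc = nn.Linear(query_dim, embed_dim)

    def forward(self, frame_feats, start, end, query):
        # frame_feats: (T, feat_dim); start/end: frame indices; query: (query_dim,)
        T = frame_feats.shape[0]
        local = frame_feats[start:end + 1].mean(dim=0)       # features inside the moment
        global_ = frame_feats.mean(dim=0)                    # whole-video context
        endpoints = torch.tensor([start / T, end / T], dtype=frame_feats.dtype)
        moment = self.video_fc(torch.cat([local, global_, endpoints]))
        q = self.query_fc(query)
        return -torch.norm(moment - q)                       # higher score = better match

scorer = MomentScorer(feat_dim=512, query_dim=300)
frames = torch.randn(30, 512)
print(scorer(frames, 5, 10, torch.randn(300)))
```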

Weakly-supervised learning of visual relations

Jul 29, 2017
Julia Peyre, Ivan Laptev, Cordelia Schmid, Josef Sivic

This paper introduces a novel approach for modeling visual relations between pairs of objects. We call a relation a triplet of the form (subject, predicate, object), where the predicate is typically a preposition (e.g., 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge comes from the difficulty of obtaining annotations, especially at the box level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration of pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third, we introduce a new challenging dataset of unusual relations (UnRel) together with an exhaustive annotation, which enables accurate evaluation of visual relation retrieval. We show experimentally that our model achieves state-of-the-art results on the Visual Relationship dataset, significantly improving performance on previously unseen relations (zero-shot learning), and we confirm this observation on our newly introduced UnRel dataset.
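
The spatial-configuration features from the first contribution can be illustrated with a generic box-pair encoding; the exact descriptor used in the paper differs, so the helper below should be read as a simple example of describing the layout of a (subject, object) pair, not as the paper's feature.

```python
import numpy as np

def pair_spatial_features(box_s, box_o):
    """Encode the spatial configuration of a (subject, object) box pair.
    Boxes are [x1, y1, x2, y2]; returns relative offset, relative scale and IoU."""
    def centre_wh(b):
        return (b[0] + b[2]) / 2, (b[1] + b[3]) / 2, b[2] - b[0], b[3] - b[1]
    xs, ys, ws, hs = centre_wh(box_s)
    xo, yo, wo, ho = centre_wh(box_o)
    # Intersection-over-union of the two boxes.
    ix = max(0, min(box_s[2], box_o[2]) - max(box_s[0], box_o[0]))
    iy = max(0, min(box_s[3], box_o[3]) - max(box_s[1], box_o[1]))
    inter = ix * iy
    iou = inter / (ws * hs + wo * ho - inter)
    return np.array([(xo - xs) / ws, (yo - ys) / hs,
                     np.log(wo / ws), np.log(ho / hs), iou])

print(pair_spatial_features([10, 10, 50, 80], [30, 60, 90, 120]))
```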

Learning from Video and Text via Large-Scale Discriminative Clustering

Jul 27, 2017
Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic

Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks. Such applications include person and action recognition, text-to-video alignment, object co-segmentation and co-localization in videos and images. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply the proposed method to the problem of weakly supervised learning of actions and actors from movies together with the corresponding movie scripts. Scaling up the learning problem to 66 feature-length movies enables us to significantly improve weakly supervised action recognition.

* To appear in ICCV 2017 
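
The Block-Coordinate Frank-Wolfe idea can be shown on a toy problem with the same shape as discriminative clustering: a convex quadratic minimised over an assignment matrix whose rows lie on the probability simplex. The objective, block choice and step size below are generic textbook choices, not the formulation optimised in the paper.

```python
import numpy as np

def bcfw_simplex(A, B, n_iters=2000, seed=0):
    """Block-Coordinate Frank-Wolfe on   min_Y 0.5 * ||A Y - B||_F^2
    with every row of Y constrained to the probability simplex.
    One block = one row of Y; the linear minimisation oracle over a simplex
    just picks the vertex with the smallest gradient coordinate."""
    rng = np.random.default_rng(seed)
    n, k = A.shape[1], B.shape[1]
    Y = np.full((n, k), 1.0 / k)                  # feasible start: uniform rows
    for t in range(n_iters):
        i = rng.integers(n)                       # pick a random block (row of Y)
        grad_i = A[:, i] @ (A @ Y - B)            # gradient of the objective w.r.t. row i
        s = np.zeros(k)
        s[np.argmin(grad_i)] = 1.0                # LMO: best simplex vertex for this block
        gamma = 2.0 * n / (t + 2.0 * n)           # standard decreasing BCFW step size
        Y[i] = (1.0 - gamma) * Y[i] + gamma * s   # convex update keeps the row feasible
    return Y

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
Y_true = np.eye(5)[rng.integers(5, size=20)]      # 20 samples, 5 clusters, one-hot rows
print(np.round(bcfw_simplex(A, A @ Y_true)[:3], 2))
```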

Convolutional neural network architecture for geometric matching

Apr 13, 2017
Ignacio Rocco, Relja Arandjelović, Josef Sivic

We address the problem of determining correspondences between two images in agreement with a geometric model, such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching, and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation, and that our matching layer significantly increases generalization capabilities to never-before-seen images. Finally, we show that the same model can perform both instance-level and category-level matching, giving state-of-the-art results on the challenging Proposal Flow dataset.

* In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) 
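
The matching component can be sketched as a dense correlation layer that compares every location of one feature map with every location of the other. The normalisation and output layout below follow common practice and only approximate the layer described in the paper.

```python
import torch
import torch.nn.functional as F

def correlation_layer(feat_a, feat_b):
    """Dense correlation of two CNN feature maps of shape (B, C, H, W):
    each spatial location of image A is compared with every location of
    image B, giving a (B, H*W, H, W) volume of match scores."""
    b, c, h, w = feat_a.shape
    fa = F.normalize(feat_a, dim=1).view(b, c, h * w)      # (B, C, HW)
    fb = F.normalize(feat_b, dim=1).view(b, c, h * w)      # (B, C, HW)
    corr = torch.bmm(fb.transpose(1, 2), fa)               # (B, HW_b, HW_a)
    return corr.view(b, h * w, h, w)

print(correlation_layer(torch.randn(2, 256, 15, 15),
                        torch.randn(2, 256, 15, 15)).shape)   # torch.Size([2, 225, 15, 15])
```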

ActionVLAD: Learning spatio-temporal aggregation for action classification

Apr 10, 2017
Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.

* Accepted to CVPR 2017. Project page: https://rohitgirdhar.github.io/ActionVLAD/ 
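
The two findings can be sketched independently of the particular aggregation layer: gather local descriptors from all frames and pool them jointly over space and time, but keep the appearance and motion streams as separate representations. In the snippet below, pooling_layer is a stand-in for any order-invariant aggregation of local descriptors (the paper uses a VLAD-style layer; see the NetVLAD sketch further down this page).

```python
import torch

def actionvlad_style_pooling(rgb_maps, flow_maps, pooling_layer):
    """Flatten conv feature maps from all T frames of each stream into one bag
    of local descriptors, pool each bag jointly over space and time, and keep
    the two stream representations separate before concatenation.
    rgb_maps, flow_maps: (T, C, H, W); pooling_layer: maps (N, C) -> (D,)."""
    def as_descriptor_bag(maps):
        t, c, h, w = maps.shape
        return maps.permute(0, 2, 3, 1).reshape(t * h * w, c)
    rgb_repr = pooling_layer(as_descriptor_bag(rgb_maps))
    flow_repr = pooling_layer(as_descriptor_bag(flow_maps))
    return torch.cat([rgb_repr, flow_repr])

# With simple averaging as a stand-in aggregation:
average = lambda descriptors: descriptors.mean(dim=0)
out = actionvlad_style_pooling(torch.randn(8, 512, 7, 7), torch.randn(8, 512, 7, 7), average)
print(out.shape)   # torch.Size([1024])
```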

Unsupervised Learning from Narrated Instruction Videos

Jun 28, 2016
Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

* Appears in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 21 pages 
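
The joint constraints that link the two clustering problems essentially require the discovered steps to appear in the same order in both modalities. The dynamic program below only illustrates such an ordering constraint, assigning video frames to already-ordered steps, and is not the discriminative-clustering formulation used in the paper.

```python
import numpy as np

def ordered_assignment(sim):
    """Assign each video frame to one of K ordered steps so that step indices
    are non-decreasing in time, maximising total frame-step similarity.
    sim: (T, K) similarity between frame t and step k (steps already ordered,
    e.g. by the text clustering). Returns the chosen step index per frame."""
    T, K = sim.shape
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0] = sim[0]
    for t in range(1, T):
        prefix_best = np.maximum.accumulate(dp[t - 1])     # best score over steps <= k
        prefix_arg = np.array([np.argmax(dp[t - 1][:k + 1]) for k in range(K)])
        dp[t] = sim[t] + prefix_best
        back[t] = prefix_arg
    # Backtrack from the best final step.
    steps = np.zeros(T, dtype=int)
    steps[-1] = int(np.argmax(dp[-1]))
    for t in range(T - 1, 0, -1):
        steps[t - 1] = back[t, steps[t]]
    return steps

sim = np.random.default_rng(0).random((20, 4))
print(ordered_assignment(sim))
```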

NetVLAD: CNN architecture for weakly supervised place recognition

May 02, 2016
Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic

We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.

* Appears in: IEEE Computer Vision and Pattern Recognition (CVPR) 2016 
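
The core NetVLAD layer, which soft-assigns each local descriptor to learned cluster centres and accumulates residuals, can be written compactly. The sketch below is a simplified version: centre initialisation, dimensionality reduction and the weakly supervised ranking loss described in the abstract are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Simplified NetVLAD layer: soft-assign each local descriptor to K learned
    cluster centres and accumulate the residuals into a single video/image
    descriptor; trainable end-to-end via backpropagation."""
    def __init__(self, num_clusters, dim):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1)   # soft-assignment logits

    def forward(self, x):
        # x: (B, D, H, W) dense local descriptors from a CNN
        b, d, h, w = x.shape
        soft_assign = F.softmax(self.conv(x).view(b, -1, h * w), dim=1)   # (B, K, HW)
        x_flat = x.view(b, d, h * w)                                      # (B, D, HW)
        # Residuals of every descriptor to every centre, weighted by the assignment.
        residual = x_flat.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(-1)  # (B, K, D, HW)
        vlad = (residual * soft_assign.unsqueeze(2)).sum(dim=-1)          # (B, K, D)
        vlad = F.normalize(vlad, dim=2)                    # intra-cluster normalisation
        return F.normalize(vlad.view(b, -1), dim=1)        # final L2-normalised descriptor

layer = NetVLAD(num_clusters=64, dim=512)
print(layer(torch.randn(2, 512, 16, 16)).shape)   # torch.Size([2, 32768])
```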