Andrew Zisserman

Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

Jun 27, 2020
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called Repnet, with a synthetic dataset that is generated from a large unlabeled video collection by sampling short clips of varying lengths and repeating them with different periods and counts. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our model substantially exceeds the state-of-the-art performance on existing periodicity (PERTUBE) and repetition counting (QUVA) benchmarks. We also collect a new challenging dataset called Countix (~90 times larger than existing datasets) which captures the challenges of repetition counting in real-world videos. Project webpage: https://sites.google.com/view/repnet
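
To make the idea of a temporal self-similarity representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-frame embeddings are already available, and it assumes negative squared Euclidean distance followed by a row-wise softmax as the similarity measure; the function name `self_similarity` and the `temperature` parameter are illustrative choices that the abstract does not specify.

```python
import math

def self_similarity(embeddings, temperature=1.0):
    """Return a row-softmaxed temporal self-similarity matrix.

    `embeddings` is a list of per-frame feature vectors. The similarity
    measure (negative squared Euclidean distance) is an assumption made
    for this sketch.
    """
    n = len(embeddings)
    matrix = []
    for i in range(n):
        logits = [
            -sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            / temperature
            for j in range(n)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix

# Toy "video": three distinct frame embeddings repeated three times,
# i.e. a repetition with a period of 3 frames.
cycle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = cycle * 3
tsm = self_similarity(frames)
```

In this toy case, row 0 of `tsm` takes its largest values at columns 0, 3 and 6, so the repetition shows up as evenly spaced peaks regardless of what the frames depict. That is the intuition behind using such a matrix as a class-agnostic bottleneck.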

* Accepted at CVPR 2020. Project webpage: https://sites.google.com/view/repnet

LSD-C: Linearly Separable Deep Clusters

Jun 17, 2020
Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman

We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of the minibatch based on a similarity metric. It then groups the connected samples into clusters and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets together with a binary cross-entropy loss on the predictions that the associated pairs of samples belong to the same cluster. This way, the feature representation of the network will evolve such that similar samples in this feature space will belong to the same linearly separated cluster. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
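
The pairwise objective described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the released code: the cosine-similarity threshold used to build the pairwise connections, the dot product of cluster-probability vectors used as the "same cluster" prediction, and the names `pairwise_targets` and `pairwise_bce` are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pairwise_targets(features, threshold=0.9):
    """Connect two samples (target 1.0) when their features are similar enough.

    Thresholded cosine similarity is an assumed similarity metric.
    """
    n = len(features)
    return [
        [1.0 if cosine(features[i], features[j]) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]

def pairwise_bce(cluster_probs, targets, eps=1e-7):
    """Binary cross-entropy between pairwise targets and same-cluster predictions.

    The prediction for a pair is the dot product of the two samples'
    cluster-probability vectors (an assumption of this sketch).
    """
    n = len(cluster_probs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = sum(a * b for a, b in zip(cluster_probs[i], cluster_probs[j]))
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logs finite
            t = targets[i][j]
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / (n * n)

# Toy minibatch: samples 0 and 1 are close in feature space, sample 2 is not.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
targets = pairwise_targets(features)

consistent = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]    # agrees with the connections
inconsistent = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]  # contradicts them
loss_consistent = pairwise_bce(consistent, targets)
loss_inconsistent = pairwise_bce(inconsistent, targets)
```

Cluster assignments that agree with the pairwise connections give the lower loss, which is the pressure that pulls connected samples into the same cluster during training.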

* Code available at https://github.com/srebuffi/lsd-clusters

The AVA-Kinetics Localized Human Actions Video Dataset

May 20, 2020
Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman

This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from https://research.google.com/ava/

* 8 pages, 8 figures

Condensed Movies: Story Based Retrieval with Contextual Embeddings

May 08, 2020
Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman

Our objective in this work is the long range understanding of the narrative structure of movies. Instead of considering the entire movie, we propose to learn from the key scenes of the movie, providing a condensed look at the full storyline. To this end, we make the following four contributions: (i) We create the Condensed Movie Dataset (CMD) consisting of the key scenes from over 3K movies: each key scene is accompanied by a high level semantic description of the scene, character face tracks, and metadata about the movie. Our dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use. It is also an order of magnitude larger than existing movie datasets in the number of movies; (ii) We introduce a new story-based text-to-video retrieval task on this dataset that requires a high level understanding of the plotline; (iii) We provide a deep network baseline for this task on our dataset, combining character, speech and visual cues into a single video embedding; and finally (iv) We demonstrate how the addition of context (both past and future) improves retrieval performance.

VGGSound: A Large-scale Audio-Visual Dataset

Apr 29, 2020
Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. Third, we investigate various Convolutional Neural Network (CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions. Code and the dataset are available at http://www.robots.ox.ac.uk/~vgg/data/vggsound/

* ICASSP2020

Monocular Depth Estimation with Self-supervised Instance Adaptation

Apr 13, 2020
Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi

Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision. However, in robotics applications, multiple views of a scene may or may not be available, depending on the actions of the robot, switching between monocular and multi-view reconstruction. To address this mixed setting, we propose a new approach that extends any off-the-shelf self-supervised monocular depth reconstruction system to use more than one image at test time. Our method builds on a standard prior learned to perform monocular reconstruction, but uses self-supervision at test time to further improve the reconstruction accuracy when multiple images are available. When used to update the correct components of the model, this approach is highly effective. On the standard KITTI benchmark, our self-supervised method consistently outperforms all the previous methods with an average 25% reduction in absolute error for the three common setups (monocular, stereo and monocular+stereo), and comes very close in accuracy when compared to the fully-supervised state-of-the-art methods.

* IROS submission, 7 pages

Speech2Action: Cross-modal Supervision for Action Recognition

Mar 30, 2020
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.

* Accepted to CVPR 2020

Visual Grounding in Video for Unsupervised Word Translation

Mar 26, 2020
Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.

* CVPR 2020

Compact Deep Aggregation for Set Retrieval

Mar 26, 2020
Yujie Zhong, Relja Arandjelović, Andrew Zisserman

The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem -- that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in each image, and given a query for multiple identities, the goal is then to retrieve, in order, images which contain all the identities, all but one, etc. To this end, we make the following contributions: first, we propose a CNN architecture -- SetNet -- to achieve the objective: it learns face descriptors and their aggregation over a set to produce a compact fixed length descriptor designed for set retrieval, and the score of an image is a count of the number of identities that match the query; second, we show that this compact descriptor has minimal loss of discriminability up to two faces per image, and degrades slowly after that -- far exceeding a number of baselines; third, we explore the speed vs. retrieval quality trade-off for set retrieval using this compact descriptor; and, finally, we collect and annotate a large dataset of images containing varying numbers of celebrities, which we use for evaluation and release publicly.
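
The retrieval setting can be illustrated with a deliberately simplified sketch. It is not SetNet: it assumes idealized one-hot identity descriptors and plain sum aggregation in place of the learned descriptors and learned aggregation. The helper names `aggregate` and `score`, and the toy identities and images, are hypothetical.

```python
def aggregate(face_descriptors):
    """Combine a set of face descriptors into one fixed-length vector.

    Plain summation is an assumption for this sketch; the paper learns
    the aggregation.
    """
    dim = len(face_descriptors[0])
    return [sum(d[k] for d in face_descriptors) for k in range(dim)]

def score(query_descriptors, image_vector):
    """Sum of dot products between each query identity and the image vector.

    With orthonormal descriptors this equals the number of query
    identities present in the image.
    """
    return sum(
        sum(q * x for q, x in zip(query, image_vector))
        for query in query_descriptors
    )

# Idealized (one-hot) identity descriptors, purely for illustration.
ids = {
    "A": [1, 0, 0, 0],
    "B": [0, 1, 0, 0],
    "C": [0, 0, 1, 0],
    "D": [0, 0, 0, 1],
}
images = {"img1": ["A", "B"], "img2": ["A", "C"], "img3": ["C", "D"]}
vectors = {
    name: aggregate([ids[person] for person in people])
    for name, people in images.items()
}

# Query for identities A and B; rank images by how many of them they contain.
query = [ids["A"], ids["B"]]
ranking = sorted(vectors, key=lambda name: score(query, vectors[name]), reverse=True)
```

Here the image containing both queried identities ranks first, followed by the image containing one, then the image containing none. This matches the "all the identities, all but one, etc." ordering described in the abstract.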

* 20 pages
