Joao Carreira

TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

Jun 14, 2023
Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman

We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time. Visualizations, source code, and pretrained models can be found on our project webpage.
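
To make the two-stage design concrete, here is a minimal Python sketch of the idea: a matching stage that independently picks the best-correlating location in every frame, and a refinement stage that nudges the track toward higher local correlation. The shapes, the gradient-ascent refinement, and the function names are illustrative assumptions, not the released TAPIR code.

    import torch
    import torch.nn.functional as F

    def match_stage(query_feat, frame_feats):
        """Stage 1: independently find the best-matching location in every frame.

        query_feat:  (C,)         feature sampled at the query point
        frame_feats: (T, C, H, W) per-frame feature maps
        returns:     (T, 2)       initial (x, y) track in feature-grid coordinates
        """
        T, C, H, W = frame_feats.shape
        cost = torch.einsum('c,tchw->thw', query_feat, frame_feats)   # correlation maps
        flat = cost.view(T, -1).argmax(dim=1)
        ys, xs = flat // W, flat % W
        return torch.stack([xs, ys], dim=1).float()

    def refine_stage(track, query_feat, frame_feats, iters=4, lr=0.5):
        """Stage 2: iteratively nudge each point toward higher local correlation
        (a crude stand-in for TAPIR's learned iterative refinement)."""
        T, C, H, W = frame_feats.shape
        track = track.clone().requires_grad_(True)
        for _ in range(iters):
            # normalize coordinates to [-1, 1] for grid_sample
            norm = torch.stack([track[:, 0] / (W - 1), track[:, 1] / (H - 1)], -1) * 2 - 1
            sampled = F.grid_sample(frame_feats, norm.view(T, 1, 1, 2),
                                    align_corners=True).view(T, C)
            score = (sampled * query_feat).sum()
            grad, = torch.autograd.grad(score, track)
            track = (track + lr * grad).detach().requires_grad_(True)
        return track.detach()

    # usage, with random features standing in for a real backbone
    feats = torch.randn(24, 64, 32, 32)      # 24 frames of 64-d feature maps
    query = feats[0, :, 10, 12]              # feature at the query point in frame 0
    trajectory = refine_stage(match_stage(query, feats), query, feats)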

Compressed Vision for Efficient Video Understanding

Oct 06, 2022
Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representations.

* ACCV 
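
The two ingredients described above, a video model that consumes neural-compression latents directly and a small network that applies augmentations in latent space, can be sketched roughly as follows. Module names, shapes, and the GRU temporal model are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class LatentAugmenter(nn.Module):
        """Maps a latent code plus augmentation parameters to an augmented latent,
        approximating the effect of the corresponding pixel-space augmentation."""
        def __init__(self, latent_dim=256, aug_dim=4):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(latent_dim + aug_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))

        def forward(self, z, aug_params):         # z: (B, T, D), aug_params: (B, aug_dim)
            p = aug_params[:, None, :].expand(-1, z.shape[1], -1)
            return z + self.net(torch.cat([z, p], dim=-1))   # residual update

    class LatentVideoClassifier(nn.Module):
        """A regular temporal model whose 'frames' are compressed latent codes."""
        def __init__(self, latent_dim=256, num_classes=600):
            super().__init__()
            self.temporal = nn.GRU(latent_dim, 512, batch_first=True)
            self.head = nn.Linear(512, num_classes)

        def forward(self, z):                     # z: (B, T, D)
            _, h = self.temporal(z)
            return self.head(h[-1])

    # usage: latents would come from a pretrained neural compressor (assumed available)
    z = torch.randn(2, 128, 256)     # 2 clips, 128 latent time steps
    aug = torch.rand(2, 4)           # e.g. hypothetical crop-box parameters
    logits = LatentVideoClassifier()(LatentAugmenter()(z, aug))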

Hierarchical Perceiver

Feb 22, 2022
Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by exclusively using global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on the ImageNet, AudioSet and PASCAL VOC datasets.
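
A rough sketch of the locality idea: instead of one global cross-attention over all inputs, the inputs are split into groups, each group is attended to by its own small set of latents, and the per-group latents are concatenated to form the smaller input of the next level. Group counts, dimensions, and the single attention block below are illustrative assumptions, not the HiP configuration.

    import torch
    import torch.nn as nn

    class LocalCrossAttentionBlock(nn.Module):
        def __init__(self, dim=128, groups=16, latents_per_group=8, heads=4):
            super().__init__()
            self.groups = groups
            self.latents = nn.Parameter(torch.randn(latents_per_group, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):          # x: (B, N, dim), with N divisible by `groups`
            B, N, D = x.shape
            xg = x.reshape(B * self.groups, N // self.groups, D)   # split into local groups
            q = self.latents.unsqueeze(0).expand(xg.shape[0], -1, -1)
            out, _ = self.attn(q, xg, xg)            # cross-attend within each group
            return out.reshape(B, -1, D)             # (B, groups * latents_per_group, dim)

    # usage: two levels form a simple hierarchy over 4096 inputs
    x = torch.randn(2, 4096, 128)
    h1 = LocalCrossAttentionBlock(groups=64)(x)      # 4096 inputs -> 64 * 8 = 512 latents
    h2 = LocalCrossAttentionBlock(groups=8)(h1)      # 512 latents -> 8 * 8 = 64 latents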

Towards Learning Universal Audio Representations

Dec 01, 2021
Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, Joao Carreira, Aaron van den Oord

The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models do not generalize outside of their domains. We observe that more robust audio representations can be learned with the SimCLR objective; however, the model's transferability depends heavily on the model architecture. We find the Slowfast architecture is good at learning rich representations required by different domains, but its performance is affected by the normalization scheme. Based on these findings, we propose a novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance across all domains.
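
For reference, the SimCLR objective mentioned above is the NT-Xent contrastive loss between two augmented views of the same clip. A minimal sketch, with a placeholder encoder and an assumed temperature, looks like this:

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.1):
        """z1, z2: (B, D) embeddings of two augmented views of the same B clips."""
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2B, D), unit norm
        sim = z @ z.t() / temperature                         # cosine similarity logits
        sim.fill_diagonal_(float('-inf'))                     # exclude self-pairs
        B = z1.shape[0]
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
        return F.cross_entropy(sim, targets)                  # each view must find its partner

    # usage with a placeholder encoder on flattened log-mel spectrogram crops
    encoder = torch.nn.Sequential(torch.nn.Linear(64 * 100, 256), torch.nn.ReLU(),
                                  torch.nn.Linear(256, 128))
    view1, view2 = torch.randn(8, 64 * 100), torch.randn(8, 64 * 100)
    loss = nt_xent_loss(encoder(view1), encoder(view2))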

Gradient Forward-Propagation for Large-Scale Temporal Video Modelling

Jul 12, 2021
Mateusz Malinowski, Dimitrios Vytiniotis, Grzegorz Swirszcz, Viorica Patraucean, Joao Carreira

How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.

* Accepted to CVPR 2021. arXiv admin note: text overlap with arXiv:2001.06232 
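
A toy sketch of the staggered forward pass with temporal skip connections described above: each module acts as a pipeline stage that, at every frame, consumes the newest activation from the stage below and adds its own previous output as a skip connection across time. Gradient forward-propagation itself is omitted here, and the names and blending scheme are assumptions rather than the paper's design.

    import torch
    import torch.nn as nn

    class SkipStage(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            self.state = None                      # this stage's output at the previous tick

        def forward(self, x_below):
            y = self.f(x_below)
            if self.state is not None:
                y = y + self.state                 # skip connection across time steps
            self.state = y.detach()
            return y

    stages = nn.ModuleList([SkipStage() for _ in range(3)])
    pipeline = [None] * len(stages)                # activations travelling between stages

    with torch.no_grad():                          # forward-only demonstration
        for t in range(8):                         # stream of frames
            frame = torch.randn(4, 64)
            inputs = [frame] + pipeline[:-1]       # stage k reads stage k-1's previous output
            pipeline = [stages[k](inputs[k]) if inputs[k] is not None else None
                        for k in range(len(stages))]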

Perceiver: General Perception with Iterative Attention

Mar 04, 2021
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver, a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
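
The asymmetric attention mechanism can be sketched as a small learned latent array that repeatedly cross-attends to a large input byte array, so the cost grows linearly with the number of inputs rather than quadratically. The snippet below is a simplified illustration; the dimensions, depth, and omitted pieces (Fourier position encodings, latent self-attention blocks, per-block MLPs) are assumptions, not the published configuration.

    import torch
    import torch.nn as nn

    class TinyPerceiver(nn.Module):
        def __init__(self, input_dim=3, dim=256, num_latents=128, depth=4, classes=1000):
            super().__init__()
            self.proj = nn.Linear(input_dim, dim)
            self.latents = nn.Parameter(torch.randn(num_latents, dim))
            self.cross = nn.ModuleList(
                [nn.MultiheadAttention(dim, 8, batch_first=True) for _ in range(depth)])
            self.head = nn.Linear(dim, classes)

        def forward(self, byte_array):        # byte_array: (B, N, input_dim), N can be huge
            kv = self.proj(byte_array)
            z = self.latents.unsqueeze(0).expand(byte_array.shape[0], -1, -1)
            for attn in self.cross:           # iteratively distill into the latent bottleneck
                z = z + attn(z, kv, kv, need_weights=False)[0]
            return self.head(z.mean(dim=1))

    # usage: 224*224 = 50,176 RGB pixels attended to directly, with no convolutions
    logits = TinyPerceiver()(torch.randn(1, 224 * 224, 3))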

Sideways: Depth-Parallel Training of Video Models

Jan 17, 2020
Mateusz Malinowski, Grzegorz Swirszcz, Joao Carreira, Viorica Patraucean

We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input streams such as videos to develop a more efficient training scheme? Here, we explore an alternative to backpropagation; we overwrite network activations whenever new ones, i.e., from new frames, become available. Such a more gradual accumulation of information from both passes breaks the precise correspondence between gradients and activations, leading to theoretically more noisy weight updates. Counter-intuitively, we show that Sideways training of deep convolutional video networks not only still converges, but can also potentially exhibit better generalization compared to standard synchronized backpropagation.
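
A single-device toy version of the scheme described above: at every frame, each layer takes one forward step on the newest activation available from the layer below and applies the newest gradient signal available from the layer above, even though that gradient was computed from an older frame's activations. This is an illustrative approximation of the idea, not the paper's implementation (in particular, matching delayed outputs to targets is glossed over).

    import torch
    import torch.nn as nn

    layers = nn.ModuleList([nn.Linear(32, 64), nn.Linear(64, 64), nn.Linear(64, 10)])
    opt = torch.optim.SGD(layers.parameters(), lr=1e-3)
    prev_out = [None] * len(layers)    # each layer's output from the previous tick
    grad_buf = [None] * len(layers)    # dL/d(output of layer k), one tick old

    def sideways_tick(frame, target):
        global prev_out, grad_buf
        opt.zero_grad()
        inputs = [frame] + prev_out[:-1]          # layer k reads layer k-1's previous output
        new_out = [None] * len(layers)
        new_grad = [None] * len(layers)
        for k, layer in enumerate(layers):
            if inputs[k] is None:                 # pipeline not yet filled
                continue
            x = inputs[k].detach().requires_grad_(True)
            y = layer(x)
            new_out[k] = y.detach()
            if k == len(layers) - 1:              # top layer: loss on the newest prediction
                nn.functional.cross_entropy(y, target).backward()
            elif grad_buf[k] is not None:         # reuse the older gradient from above
                y.backward(grad_buf[k])
            if k > 0 and x.grad is not None:
                new_grad[k - 1] = x.grad          # gradient keeps flowing, one tick later
        opt.step()
        prev_out, grad_buf = new_out, new_grad

    # usage: a stream of frames from one "video"
    for t in range(16):
        sideways_tick(torch.randn(4, 32), torch.randint(0, 10, (4,)))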

A Short Note on the Kinetics-700 Human Action Dataset

Jul 15, 2019
Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman

We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.

* arXiv admin note: substantial text overlap with arXiv:1808.01340 