Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andrea Vedaldi

Long-term Tracking in the Wild: A Benchmark

Aug 10, 2018

Jack Valmadre, Luca Bertinetto, João F. Henriques, Ran Tao, Andrea Vedaldi, Arnold Smeulders, Philip Torr, Efstratios Gavves

Figure 1 for Long-term Tracking in the Wild: A Benchmark

Figure 2 for Long-term Tracking in the Wild: A Benchmark

Figure 3 for Long-term Tracking in the Wild: A Benchmark

Figure 4 for Long-term Tracking in the Wild: A Benchmark

Abstract:We introduce the OxUvA dataset and benchmark for evaluating single-object tracking algorithms. Benchmarks have enabled great strides in the field of object tracking by defining standardized evaluations on large sets of diverse videos. However, these works have focused exclusively on sequences that are just tens of seconds in length and in which the target is always visible. Consequently, most researchers have designed methods tailored to this "short-term" scenario, which is poorly representative of practitioners' needs. Aiming to address this disparity, we compile a long-term, large-scale tracking dataset of sequences with average length greater than two minutes and with frequent target object disappearance. The OxUvA dataset is much larger than the object tracking datasets of recent years: it comprises 366 sequences spanning 14 hours of video. We assess the performance of several algorithms, considering both the ability to locate the target and to determine whether it is present or absent. Our goal is to offer the community a large and diverse benchmark to enable the design and evaluation of tracking methods ready to be used "in the wild". The project website is http://oxuva.net

* To appear at ECCV 2018

Via

Access Paper or Ask Questions

Semi-convolutional Operators for Instance Segmentation

Jul 27, 2018

David Novotny, Samuel Albanie, Diane Larlus, Andrea Vedaldi

Figure 1 for Semi-convolutional Operators for Instance Segmentation

Figure 2 for Semi-convolutional Operators for Instance Segmentation

Figure 3 for Semi-convolutional Operators for Instance Segmentation

Figure 4 for Semi-convolutional Operators for Instance Segmentation

Abstract:Object detection and instance segmentation are dominated by region-based methods such as Mask RCNN. However, there is a growing interest in reducing these problems to pixel labeling tasks, as the latter could be more efficient, could be integrated seamlessly in image-to-image network architectures as used in many other tasks, and could be more accurate for objects that are not well approximated by bounding boxes. In this paper we show theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators. At the same time, we show that simple modifications, which we call semi-convolutional, have a much better chance of succeeding at this task. We use the latter to show a connection to Hough voting as well as to a variant of the bilateral kernel that is spatially steered by a convolutional network. We demonstrate that these operators can also be used to improve approaches such as Mask RCNN, demonstrating better segmentation of complex biological shapes and PASCAL VOC categories than achievable by Mask RCNN alone.

* Accepted as a conference paper at ECCV 2018

Via

Access Paper or Ask Questions

Invariant Information Distillation for Unsupervised Image Segmentation and Clustering

Jul 21, 2018

Xu Ji, João F. Henriques, Andrea Vedaldi

Figure 1 for Invariant Information Distillation for Unsupervised Image Segmentation and Clustering

Figure 2 for Invariant Information Distillation for Unsupervised Image Segmentation and Clustering

Figure 3 for Invariant Information Distillation for Unsupervised Image Segmentation and Clustering

Figure 4 for Invariant Information Distillation for Unsupervised Image Segmentation and Clustering

Abstract:We present a new method that learns to segment and cluster images without labels of any kind. A simple loss based on information theory is used to extract meaningful representations directly from raw images. This is achieved by maximising mutual information of images known to be related by spatial proximity or randomized transformations, which distills their shared abstract content. Unlike much of the work in unsupervised deep learning, our learned function outputs segmentation heatmaps and discrete classifications labels directly, rather than embeddings that need further processing to be usable. The loss can be formulated as a convolution, making it the first end-to-end unsupervised learning method that learns densely and efficiently for semantic segmentation. Implemented using realistic settings on generic deep neural network architectures, our method attains superior performance on COCO-Stuff and ISPRS-Potsdam for segmentation and STL for clustering, beating state-of-the-art baselines.

Via

Access Paper or Ask Questions

Inductive Visual Localisation: Factorised Training for Superior Generalisation

Jul 21, 2018

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

Figure 1 for Inductive Visual Localisation: Factorised Training for Superior Generalisation

Figure 2 for Inductive Visual Localisation: Factorised Training for Superior Generalisation

Figure 3 for Inductive Visual Localisation: Factorised Training for Superior Generalisation

Figure 4 for Inductive Visual Localisation: Factorised Training for Superior Generalisation

Abstract:End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition. However, RNNs often struggle to generalise to sequences longer than the ones encountered during training. In this work, we propose to optimise neural networks explicitly for induction. The idea is to first decompose the problem in a sequence of inductive steps and then to explicitly train the RNN to reproduce such steps. Generalisation is achieved as the RNN is not allowed to learn an arbitrary internal state; instead, it is tasked with mimicking the evolution of a valid state. In particular, the state is restricted to a spatial memory map that tracks parts of the input image which have been accounted for in previous steps. The RNN is trained for single inductive steps, where it produces updates to the memory in addition to the desired output. We evaluate our method on two different visual recognition problems involving visual sequences: (1) text spotting, i.e. joint localisation and reading of text in images containing multiple lines (or a block) of text, and (2) sequential counting of objects in aerial images. We show that inductive training of recurrent models enhances their generalisation ability on challenging image datasets.

* In BMVC 2018 (spotlight)

Via

Access Paper or Ask Questions

Large scale evaluation of local image feature detectors on homography datasets

Jul 20, 2018

Karel Lenc, Andrea Vedaldi

Figure 1 for Large scale evaluation of local image feature detectors on homography datasets

Figure 2 for Large scale evaluation of local image feature detectors on homography datasets

Figure 3 for Large scale evaluation of local image feature detectors on homography datasets

Figure 4 for Large scale evaluation of local image feature detectors on homography datasets

Abstract:We present a large scale benchmark for the evaluation of local feature detectors. Our key innovation is the introduction of a new evaluation protocol which extends and improves the standard detection repeatability measure. The new protocol is better for assessment on a large number of images and reduces the dependency of the results on unwanted distractors such as the number of detected features and the feature magnification factor. Additionally, our protocol provides a comprehensive assessment of the expected performance of detectors under several practical scenarios. Using images from the recently-introduced HPatches dataset, we evaluate a range of state-of-the-art local feature detectors on two main tasks: viewpoint and illumination invariant detection. Contrary to previous detector evaluations, our study contains an order of magnitude more image sequences, resulting in a quantitative evaluation significantly more robust to over-fitting. We also show that traditional detectors are still very competitive when compared to recent deep-learning alternatives.

* Accepted to BMVC 2018

Via

Access Paper or Ask Questions

Cross Pixel Optical Flow Similarity for Self-Supervised Learning

Jul 15, 2018

Aravindh Mahendran, James Thewlis, Andrea Vedaldi

Figure 1 for Cross Pixel Optical Flow Similarity for Self-Supervised Learning

Figure 2 for Cross Pixel Optical Flow Similarity for Self-Supervised Learning

Figure 3 for Cross Pixel Optical Flow Similarity for Self-Supervised Learning

Figure 4 for Cross Pixel Optical Flow Similarity for Self-Supervised Learning

Abstract:We propose a novel method for learning convolutional neural image representations without manual supervision. We use motion cues in the form of optical flow, to supervise representations of static images. The obvious approach of training a network to predict flow from a single image can be needlessly difficult due to intrinsic ambiguities in this prediction task. We instead propose a much simpler learning goal: embed pixels such that the similarity between their embeddings matches that between their optical flow vectors. At test time, the learned deep network can be used without access to video or flow information and transferred to tasks such as image classification, detection, and segmentation. Our method, which significantly simplifies previous attempts at using motion for self-supervision, achieves state-of-the-art results in self-supervision using motion cues, competitive results for self-supervision in general, and is overall state of the art in self-supervised pretraining for semantic image segmentation, as demonstrated on standard benchmarks.

Via

Access Paper or Ask Questions

ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

Jul 06, 2018

Oliver Groth, Fabian B. Fuchs, Ingmar Posner, Andrea Vedaldi

Figure 1 for ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

Figure 2 for ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

Figure 3 for ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

Figure 4 for ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking

Abstract:Physical intuition is pivotal for intelligent agents to perform complex tasks. In this paper we investigate the passive acquisition of an intuitive understanding of physical principles as well as the active utilisation of this intuition in the context of generalised object stacking. To this end, we provide: a simulation-based dataset featuring 20,000 stack configurations composed of a variety of elementary geometric primitives richly annotated regarding semantics and structural stability. We train visual classifiers for binary stability prediction on the ShapeStacks data and scrutinise their learned physical intuition. Due to the richness of the training data our approach also generalises favourably to real-world scenarios achieving state-of-the-art stability prediction on a publicly available benchmark of block towers. We then leverage the physical intuition learned by our model to actively construct stable stacks and observe the emergence of an intuitive notion of stackability - an inherent object affordance - induced by the active stacking task. Our approach performs well even in challenging conditions where it considerably exceeds the stack height observed during training or in cases where initially unstable structures must be stabilised via counterbalancing.

* revised version to appear at ECCV 2018

Via

Access Paper or Ask Questions

Conditional Image Generation for Learning the Structure of Visual Objects

Jun 20, 2018

Tomas Jakab, Ankush Gupta, Hakan Bilen, Andrea Vedaldi

Figure 1 for Conditional Image Generation for Learning the Structure of Visual Objects

Figure 2 for Conditional Image Generation for Learning the Structure of Visual Objects

Figure 3 for Conditional Image Generation for Learning the Structure of Visual Objects

Figure 4 for Conditional Image Generation for Learning the Structure of Visual Objects

Abstract:In this paper, we consider the problem of learning landmarks for object categories without any manual annotations. We cast this as the problem of conditionally generating an image of an object from another one, where the images differ by acquisition time and/or viewpoint. The process is aided by providing the generator with a keypoint-like representation extracted from the target image through a tight bottleneck. This encourages the representation to distil information about the object geometry, which changes from source to target, while the appearance, which is shared between the source and target, is read off from the source alone. Conditioning simplifies the generation task significantly, to the point that adopting a simple perceptual loss instead of more sophisticated approaches such as adversarial training is sufficient to learn landmarks. We show that our method is applicable to a large variety of datasets - faces, people, 3D objects, and digits - without any modifications. We further demonstrate that we can learn landmarks from synthetic image deformations or videos, all without manual supervision, while outperforming state-of-the-art unsupervised landmark detectors.

Via

Access Paper or Ask Questions

Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

Jun 15, 2018

Fabian B. Fuchs, Oliver Groth, Adam R. Kosiorek, Alex Bewley, Markus Wulfmeier, Andrea Vedaldi, Ingmar Posner

Figure 1 for Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

Figure 2 for Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

Figure 3 for Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

Figure 4 for Neural Stethoscopes: Unifying Analytic, Auxiliary and Adversarial Network Probing

Abstract:Model interpretability and systematic, targeted model adaptation present central tenets in machine learning for addressing limited or biased datasets. In this paper, we introduce neural stethoscopes as a framework for quantifying the degree of importance of specific factors of influence in deep networks as well as for actively promoting and suppressing information as appropriate. In doing so we unify concepts from multitask learning as well as training with auxiliary and adversarial losses. We showcase the efficacy of neural stethoscopes in an intuitive physics domain. Specifically, we investigate the challenge of visually predicting stability of block towers and demonstrate that the network uses visual cues which makes it susceptible to biases in the dataset. Through the use of stethoscopes we interrogate the accessibility of specific information throughout the network stack and show that we are able to actively de-bias network predictions as well as enhance performance via suitable auxiliary and adversarial stethoscope losses.

Via

Access Paper or Ask Questions

Meta-learning with differentiable closed-form solvers

May 21, 2018

Luca Bertinetto, João F. Henriques, Philip H. S. Torr, Andrea Vedaldi

Figure 1 for Meta-learning with differentiable closed-form solvers

Figure 2 for Meta-learning with differentiable closed-form solvers

Figure 3 for Meta-learning with differentiable closed-form solvers

Figure 4 for Meta-learning with differentiable closed-form solvers

Abstract:Adapting deep networks to new concepts from few examples is extremely challenging, due to the high computational and data requirements of standard fine-tuning procedures. Most works on meta-learning and few-shot learning have thus focused on simple learning techniques for adaptation, such as nearest neighbors or gradient descent. Nonetheless, the machine learning literature contains a wealth of methods that learn non-deep models very efficiently. In this work we propose to use these fast convergent methods as the main adaptation mechanism for few-shot learning. The main idea is to teach a deep network to use standard machine learning tools, such as logistic regression, as part of its own internal model, enabling it to quickly adapt to novel tasks. This requires back-propagating errors through the solver steps. While normally the matrix operations involved would be costly, the small number of examples works to our advantage, by making use of the Woodbury identity. We propose both iterative and closed-form solvers, based on logistic regression and ridge regression components. Our methods achieve excellent performance on three few-shot learning benchmarks, showing competitive performance on Omniglot and surpassing all state-of-the-art alternatives on miniImageNet and CIFAR-100.

Via

Access Paper or Ask Questions