Laurens van der Maaten

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Dec 20, 2016
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

Revisiting Visual Question Answering Baselines

Nov 22, 2016
Allan Jabri, Armand Joulin, Laurens van der Maaten

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to support "reasoning". For multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. We explore variants of the model and study its transferability between both datasets. We also present an error analysis of our model that suggests a key problem of current VQA systems lies in the lack of visual grounding of concepts that occur in the questions and answers. Overall, our results suggest that the performance of current VQA systems is not significantly better than that of systems designed to exploit dataset biases.

* European Conference on Computer Vision
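
The binary formulation described in the abstract can be illustrated with a minimal sketch. It assumes pre-computed feature vectors and a single linear layer followed by a sigmoid; the names `score_triplet` and `pick_answer`, the feature layout, and the weights are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score_triplet(image_feat, question_feat, answer_feat, weights, bias=0.0):
    """Probability that one (image, question, answer) triplet is correct.

    Assumption: the three feature vectors are simply concatenated and fed
    to a linear scorer; a real system would learn `weights` from data.
    """
    x = list(image_feat) + list(question_feat) + list(answer_feat)
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return sigmoid(z)


def pick_answer(image_feat, question_feat, candidates, weights, bias=0.0):
    """Score each candidate answer independently and return the best index."""
    scores = [
        score_triplet(image_feat, question_feat, a, weights, bias)
        for a in candidates
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Because the answer is an input rather than an output class, the same scorer can in principle be applied to answers that never appeared as a class label, which is one reason this formulation differs from a multi-class classifier.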

Approximated and User Steerable tSNE for Progressive Visual Analytics

Jun 16, 2016
Nicola Pezzotti, Boudewijn P. F. Lelieveldt, Laurens van der Maaten, Thomas Höllt, Elmar Eisemann, Anna Vilanova

Progressive Visual Analytics aims at improving the interactivity in existing analytics techniques by means of visualization as well as interaction with intermediate results. One key method for data analysis is dimensionality reduction, for example, to produce 2D embeddings that can be visualized and analyzed efficiently. t-Distributed Stochastic Neighbor Embedding (tSNE) is a well-suited technique for the visualization of high-dimensional data. tSNE can create meaningful intermediate results but suffers from a slow initialization that constrains its application in Progressive Visual Analytics. We introduce a controllable tSNE approximation (A-tSNE), which trades off speed and accuracy, to enable interactive data exploration. We offer real-time visualization techniques, including a density-based solution and a Magic Lens to inspect the degree of approximation. With this feedback, the user can decide on local refinements and steer the approximation level during the analysis. We demonstrate our technique with several datasets, in a real-world research scenario and for the real-time analysis of high-dimensional streams to illustrate its effectiveness for interactive data analysis.

Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance

Mar 25, 2016
Kevin van Hecke, Guido de Croon, Laurens van der Maaten, Daniel Hennes, Dario Izzo

Self-Supervised Learning (SSL) is a reliable learning mechanism in which a robot uses an original, trusted sensor cue for training to recognize an additional, complementary sensor cue. We study for the first time in SSL how a robot's learning behavior should be organized, so that the robot can keep performing its task in the case that the original cue becomes unavailable. We study this persistent form of SSL in the context of a flying robot that has to avoid obstacles based on distance estimates from the visual cue of stereo vision. Over time it will learn to also estimate distances based on monocular appearance cues. A strategy is introduced that has the robot switch from stereo vision based flight to monocular flight, with stereo vision purely used as 'training wheels' to avoid imminent collisions. This strategy is shown to be an effective approach to the 'feedback-induced data bias' problem as also experienced in learning from demonstration. Both simulations and real-world experiments with a stereo vision equipped AR drone 2.0 show the feasibility of this approach, with the robot successfully using monocular vision to avoid obstacles in a 5 x 5 room. The experiments show the potential of persistent SSL as a robust learning approach to enhance the capabilities of robots. Moreover, the abundant training data coming from the robot's own sensors makes it possible to gather the large data sets necessary for deep learning approaches.

Modeling Time Series Similarity with Siamese Recurrent Networks

Mar 15, 2016
Wenjie Pei, David M. J. Tax, Laurens van der Maaten

Traditional techniques for measuring similarities between time series are based on handcrafted similarity measures, whereas more recent learning-based approaches cannot exploit external supervision. We combine ideas from time-series modeling and metric learning, and study siamese recurrent networks (SRNs) that minimize a classification loss to learn a good similarity measure between time series. Specifically, our approach learns a vectorial representation for each time series in such a way that similar time series are modeled by similar representations, and dissimilar time series by dissimilar representations. Because they are similarity prediction models, SRNs are particularly well-suited to challenging scenarios such as signature recognition, in which each person is a separate class and very few examples per class are available. We demonstrate the potential merits of SRNs in within-domain and out-of-domain classification experiments and in one-shot learning experiments on tasks such as signature, voice, and sign language recognition.

* 11 pages

Time Series Classification using the Hidden-Unit Logistic Model

Jan 19, 2016
Wenjie Pei, Hamdi Dibeklioğlu, David M. J. Tax, Laurens van der Maaten

We present a new model for time series classification, called the hidden-unit logistic model, that uses binary stochastic hidden units to model latent structure in the data. The hidden units are connected in a chain structure that models temporal dependencies in the data. Compared to prior models for time series classification such as the hidden conditional random field, our model can model very complex decision boundaries because the number of latent states grows exponentially with the number of hidden units. We demonstrate the strong performance of our model in experiments on a variety of (computer vision) tasks, including handwritten character recognition, speech recognition, facial expression recognition, and action recognition. We also present a state-of-the-art system for facial action unit detection based on the hidden-unit logistic model.

* 17 pages, 4 figures, 3 tables

Learning Visual Features from Large Weakly Supervised Data

Nov 06, 2015
Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.

Marginalizing Corrupted Features

Feb 27, 2014
Laurens van der Maaten, Minmin Chen, Stephen Tyree, Kilian Weinberger

The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data and our predictors may overfit. Overfitting may be combatted, for example, by adding a regularizer to the training objective or by defining a prior over the model parameters and performing Bayesian inference. In this paper, we propose a third, alternative approach to combat overfitting: we extend the training set with infinitely many artificial training examples that are obtained by corrupting the original training data. We show that this approach is practical and efficient for a range of predictors and corruption models. Our approach, called marginalized corrupted features (MCF), trains robust predictors by minimizing the expected value of the loss function under the corruption model. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are also more robust to feature deletion at test time.
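
To make "minimizing the expected value of the loss function under the corruption model" concrete, here is a small sketch for one illustrative case: a quadratic loss with unbiased blankout corruption, where each feature is independently set to zero with probability `q` and otherwise rescaled by `1 / (1 - q)`. Under that assumption the expectation has a closed form (squared error on the clean input plus a variance term), which the code checks against brute-force enumeration. The function names and the choice of loss and corruption model are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product


def expected_quadratic_loss(w, x, y, q):
    """Closed-form E[(w . x_tilde - y)^2] under unbiased blankout noise.

    With E[x_tilde_d] = x_d and Var[x_tilde_d] = x_d^2 * q / (1 - q),
    the expectation is the clean squared error plus a variance penalty.
    """
    pred = sum(wd * xd for wd, xd in zip(w, x))
    variance = (q / (1.0 - q)) * sum((wd * xd) ** 2 for wd, xd in zip(w, x))
    return (pred - y) ** 2 + variance


def enumerated_quadratic_loss(w, x, y, q):
    """Same expectation by summing over all 2^D keep/drop patterns."""
    total = 0.0
    for keep in product([0, 1], repeat=len(x)):
        prob = 1.0
        pred = 0.0
        for wd, xd, k in zip(w, x, keep):
            prob *= (1.0 - q) if k else q
            pred += wd * (xd / (1.0 - q)) * k  # dropped features contribute 0
        total += prob * (pred - y) ** 2
    return total
```

The closed form never materializes a corrupted example, which is the sense in which training on "infinitely many" corrupted copies can remain practical.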

Barnes-Hut-SNE

Mar 08, 2013
Laurens van der Maaten

The paper presents an O(N log N) implementation of t-SNE, an embedding technique that is commonly used for the visualization of high-dimensional data in scatter plots and that normally runs in O(N^2). The new implementation uses vantage-point trees to compute sparse pairwise similarities between the input data objects, and it uses a variant of the Barnes-Hut algorithm (an algorithm used by astronomers to perform N-body simulations) to approximate the forces between the corresponding points in the embedding. Our experiments show that the new algorithm, called Barnes-Hut-SNE, leads to substantial computational advantages over standard t-SNE, and that it makes it possible to learn embeddings of data sets with millions of objects.
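
The core idea behind a Barnes-Hut style approximation can be sketched without building the full tree: when a group of embedding points is far from the query point relative to the group's size, their individual repulsive contributions are replaced by a single contribution from their centre of mass, multiplied by the group size. The sketch below assumes a 2D embedding, the unnormalised Student-t repulsion term, and the conventional acceptance test (cell width divided by distance below a threshold `theta`). All function names and the default `theta` are hypothetical, and the exact criterion used in the paper may differ.

```python
import math


def kernel_force(yi, yj):
    """Unnormalised repulsion exerted on yi by a single point yj."""
    dx, dy = yi[0] - yj[0], yi[1] - yj[1]
    k = 1.0 / (1.0 + dx * dx + dy * dy)  # Student-t kernel with one degree of freedom
    return (k * k * dx, k * k * dy)


def centre_of_mass(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def exact_repulsion(yi, cell_points):
    """Sum the repulsion from every point in the cell, one at a time."""
    fx = fy = 0.0
    for p in cell_points:
        gx, gy = kernel_force(yi, p)
        fx += gx
        fy += gy
    return (fx, fy)


def summarized_repulsion(yi, cell_points):
    """Treat the whole cell as len(cell_points) copies of its centre of mass."""
    gx, gy = kernel_force(yi, centre_of_mass(cell_points))
    n = len(cell_points)
    return (n * gx, n * gy)


def cell_is_far_enough(yi, cell_points, theta=0.5):
    """Assumed acceptance test: cell width over distance to its centre < theta."""
    xs = [p[0] for p in cell_points]
    ys = [p[1] for p in cell_points]
    width = max(max(xs) - min(xs), max(ys) - min(ys))
    cx, cy = centre_of_mass(cell_points)
    dist = math.hypot(yi[0] - cx, yi[1] - cy)
    return dist > 0.0 and width / dist < theta
```

In a full implementation the cells would come from a quadtree over the embedding, so that each query point visits only a logarithmic number of cells; this sketch only shows why summarizing a distant cell loses little accuracy.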
