Chen Sun

Multi-modal Transformer for Video Retrieval

Jul 21, 2020
Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.

* ECCV 2020 (spotlight paper)
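
The caption-to-video retrieval task described in the abstract above can be illustrated with a minimal ranking sketch. It assumes that captions and videos have already been encoded into a shared embedding space and that cosine similarity is used for scoring; the abstract does not specify the similarity function, and the names `cosine` and `rank_videos` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(caption_emb, video_embs):
    """Return video ids sorted from most to least similar to the caption.

    video_embs maps a video id to its embedding. Both the shared embedding
    space and the cosine score are assumptions made for this illustration.
    """
    return sorted(
        video_embs,
        key=lambda vid: cosine(caption_emb, video_embs[vid]),
        reverse=True,
    )


videos = {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [1.0, 1.0]}
ranking = rank_videos([1.0, 0.0], videos)
print(ranking)
```

In this toy example the caption embedding points along the first axis, so video "b" is ranked first and video "a" last.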

What makes for good views for contrastive learning

May 20, 2020
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification (73% top-1 linear readoff with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training. Code: http://github.com/HobbitLong/PyContrast

* submitted to ECCV 2020
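
To make the link between a contrastive loss and the mutual information between views more concrete, here is a minimal sketch of the InfoNCE loss and the MI lower bound it implies, log N minus the loss. InfoNCE is a common contrastive objective; the abstract above does not state which estimator the authors use, so treat this as an assumption. The function names are hypothetical.

```python
import math


def info_nce_loss(sim):
    """Mean InfoNCE loss for an N x N matrix of similarity scores.

    sim[i][j] scores view 1 of sample i against view 2 of sample j;
    the diagonal entries are the positive pairs (assumed layout).
    """
    total = 0.0
    for i, row in enumerate(sim):
        # Numerically stable log-sum-exp over the row.
        top = max(row)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_denominator - row[i]
    return total / len(sim)


def mi_lower_bound(sim):
    """InfoNCE lower bound on the MI between the two views: log N - loss."""
    return math.log(len(sim)) - info_nce_loss(sim)


uninformative = [[0.0, 0.0], [0.0, 0.0]]   # scores carry no information
well_aligned = [[10.0, 0.0], [0.0, 10.0]]  # positives clearly separated
print(mi_lower_bound(uninformative), mi_lower_bound(well_aligned))
```

With uninformative scores the bound is zero, and with well-separated positives it approaches its ceiling of log N, which is why the bound saturates for small batches.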

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

May 08, 2020
Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights). This paper introduces VectorNet, a hierarchical graph neural network that first exploits the spatial locality of individual road components represented by vectors and then models the high-order interactions among all components. In contrast to most recent approaches, which render trajectories of moving agents and road context information as bird-eye images and encode them with convolutional neural networks (ConvNets), our approach operates on a vector representation. By operating on the vectorized high definition (HD) maps and agent trajectories, we avoid lossy rendering and computationally intensive ConvNet encoding steps. To further boost VectorNet's capability in learning context features, we propose a novel auxiliary task to recover the randomly masked out map entities and agent trajectories based on their context. We evaluate VectorNet on our in-house behavior prediction benchmark and the recently released Argoverse forecasting dataset. Our method achieves on par or better performance than the competitive rendering approach on both benchmarks while saving over 70% of the model parameters with an order of magnitude reduction in FLOPs. It also outperforms the state of the art on the Argoverse dataset.

* CVPR 2020
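
As a rough illustration of the vectorized representation mentioned in the abstract above, the sketch below turns a polyline (for example, a lane centerline or an agent trajectory) into a list of vectors and pools per-vector features into one polyline-level feature. The exact vector layout (start point, end point, polyline id) and the use of element-wise max pooling are assumptions made for this example, not details taken from the abstract; `polyline_to_vectors` and `max_pool` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Vector = Tuple[float, float, float, float, int]


def polyline_to_vectors(points: List[Point], polyline_id: int) -> List[Vector]:
    """Link consecutive points into (x_start, y_start, x_end, y_end, polyline_id)."""
    return [
        (x0, y0, x1, y1, polyline_id)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def max_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over per-vector features (order-invariant aggregation)."""
    return [max(column) for column in zip(*features)]


lane = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
vectors = polyline_to_vectors(lane, polyline_id=7)
# Pool only the geometric part here; a real model would pool learned features.
pooled = max_pool([v[:4] for v in vectors])
print(vectors, pooled)
```

A polyline with n points yields n - 1 vectors, so no rasterization step is needed before the components are passed to a graph network.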

Speech2Action: Cross-modal Supervision for Action Recognition

Mar 30, 2020
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.

* Accepted to CVPR 2020
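
The weak-labelling step described in the abstract above could look roughly like the following sketch: keep a speech segment only when the classifier's top action score is confident enough. The confidence threshold and its value are assumptions made for illustration (the abstract does not say how predictions are filtered), and `weak_label` and `label_segments` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def weak_label(scores: Dict[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the top-scoring action if its score reaches the threshold, else None."""
    if not scores:
        return None
    action, score = max(scores.items(), key=lambda item: item[1])
    return action if score >= threshold else None


def label_segments(
    segments: List[Tuple[str, Dict[str, float]]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep (segment_id, action) pairs for confidently classified segments only."""
    labelled = []
    for segment_id, scores in segments:
        action = weak_label(scores, threshold)
        if action is not None:
            labelled.append((segment_id, action))
    return labelled


# Made-up classifier scores for two speech segments.
predictions = [
    ("seg-1", {"phone": 0.95, "drive": 0.03}),
    ("seg-2", {"phone": 0.40, "drive": 0.45}),
]
labels = label_segments(predictions)
print(labels)
```

Only the first segment survives the filter; the video clips aligned with the surviving segments would then serve as weakly labelled training data.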

Unsupervised Learning of Object Structure and Dynamics from Videos

Jun 19, 2019
Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.

Contrastive Bidirectional Transformer for Temporal Representation Learning

Jun 13, 2019
Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper aims at learning representations for long sequences of continuous signals. Recently, the BERT model has demonstrated the effectiveness of stacked transformers for representing sequences of discrete signals (i.e. word tokens). Inspired by its success, we adopt the stacked transformer architecture, but generalize its training objective to maximize the mutual information between the masked signals and the bidirectional context via contrastive loss. This enables the model to handle continuous signals, such as visual features. We further consider the case when there are multiple sequences that are semantically aligned at the sequence level but not at the element level (e.g. video and ASR), where we propose to use a Transformer to estimate the mutual information between the two sequences, which is again maximized via contrastive loss. We demonstrate the effectiveness of the learned representations on modeling long video sequences for action anticipation and video captioning. The results show that our method, referred to as Contrastive Bidirectional Transformer (CBT), outperforms various baselines significantly. Furthermore, we improve over the state of the art.

Intra-Ensemble in Neural Networks

Apr 09, 2019
Yuan Gao, Zixiang Cai, Yimin Chen, Wenke Chen, Kan Yang, Chen Sun, Cong Yao

Improving model performance is a central problem in machine learning, including deep learning. However, stand-alone neural networks suffer from diminishing returns as more layers are stacked. Ensembling is a useful technique to further enhance model performance, but training several independent deep neural networks multiplies the resource cost. In this work, we propose Intra-Ensemble, an end-to-end strategy with stochastic training operations to train several sub-networks simultaneously within one neural network. The additional parameter count is marginal since the majority of parameters are shared. Meanwhile, stochastic training increases the diversity of the weight-sharing sub-networks, which significantly enhances intra-ensemble performance. Extensive experiments demonstrate the applicability of intra-ensemble on various kinds of datasets and network architectures. Our models achieve results comparable with state-of-the-art architectures on CIFAR-10 and CIFAR-100.

Relational Action Forecasting

Apr 08, 2019
Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid

This paper focuses on multi-person action forecasting in videos. More precisely, given a history of H previous frames, the goal is to detect actors and to predict their future actions for the next T frames. Our approach jointly models temporal and spatial interactions among different actors by constructing a recurrent graph, using actor proposals obtained with Faster R-CNN as nodes. Our method learns to select a subset of discriminative relations without requiring explicit supervision, thus enabling us to tackle challenging visual data. We refer to our model as Discriminative Relational Recurrent Network (DRRN). Evaluation of action prediction on AVA demonstrates the effectiveness of our proposed method compared to simpler baselines. Furthermore, we significantly improve performance on the task of early action classification on J-HMDB, from the previous SOTA of 48% to 60%.

* CVPR 2019 (oral)

VideoBERT: A Joint Model for Video and Language Representation Learning

Apr 03, 2019
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid

Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use this model in a number of tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state-of-the-art on video captioning, and quantitative results verify that the model learns high-level semantic features.
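
The phrase "visual tokens derived from vector quantization" in the abstract above can be illustrated with a minimal nearest-centroid sketch: each continuous clip feature is replaced by the index of its closest codebook entry, giving a discrete token sequence that a BERT-style model can consume. How the codebook is built (for example, by clustering) is not covered here, and the distance metric and the names `quantize` and `tokenize_clip_features` are assumptions made for this example.

```python
from typing import List, Sequence


def quantize(feature: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid nearest to the feature (squared Euclidean)."""

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(centroids)),
        key=lambda k: squared_distance(feature, centroids[k]),
    )


def tokenize_clip_features(
    features: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Map a sequence of clip features to a sequence of discrete visual tokens."""
    return [quantize(feature, centroids) for feature in features]


codebook = [(0.0, 0.0), (10.0, 10.0)]              # toy two-entry codebook
clip_features = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0)]
tokens = tokenize_clip_features(clip_features, codebook)
print(tokens)
```

The resulting token ids play the same role for video that word-piece ids play for text, which is what allows a single model to be trained over both modalities.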

Affordance Learning In Direct Perception for Autonomous Driving

Mar 20, 2019
Chen Sun, Jean M. Uwabeza Vianney, Dongpu Cao

Recent developments in autonomous driving involve high-level computer vision and detailed road scene understanding. Today, most autonomous vehicles use a mediated perception approach for path planning and control, which relies heavily on high-definition 3D maps and real-time sensors. Recent research efforts aim to substitute the massive HD maps with coarse road attributes. In this paper, we follow the direct perception based method to train a deep neural network for affordance learning in autonomous driving. Our goal is to develop an affordance learning model based on freely available Google Street View panoramas and OpenStreetMap road vector attributes. Driving scene understanding can be achieved by learning affordances from the images captured by car-mounted cameras. Such scene understanding may be useful for corroborating base maps such as HD maps, so that the required data storage is minimized and the data can be processed in real time. We experimentally compare road attribute identification by human volunteers and by our model. Our results indicate that this method could serve as a cheaper way to collect training data for autonomous driving. The cross-validation results also indicate the effectiveness of our model.

* 9 pages, 13 figures
