Li Jing

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Oct 09, 2022
Shraman Pramanick, Li Jing, Sayan Nag, Jiachen Zhu, Hardik Shah, Yann LeCun, Rama Chellappa

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
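
To make the local alignment concrete, here is a minimal sketch of matching image patches to text tokens with entropic optimal transport (Sinkhorn iterations). VoLTA's actual criterion is graph optimal transport, which also accounts for intra-modal structure; the function, hyperparameters, and toy features below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn_alignment(patches, tokens, eps=0.1, n_iters=50):
    """Entropic optimal-transport alignment between patch and token embeddings.
    patches: (N, D) and tokens: (M, D), both L2-normalized.
    Returns the (N, M) transport plan and the resulting OT cost."""
    cost = 1.0 - patches @ tokens.T                     # cosine distance as the ground cost
    K = torch.exp(-cost / eps)                          # Gibbs kernel
    a = torch.full((patches.size(0),), 1.0 / patches.size(0))
    b = torch.full((tokens.size(0),), 1.0 / tokens.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iters):                            # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                  # transport plan T = diag(u) K diag(v)
    return plan, (plan * cost).sum()

# toy usage: 49 image patches aligned to 12 text tokens
patches = F.normalize(torch.randn(49, 256), dim=-1)
tokens = F.normalize(torch.randn(12, 256), dim=-1)
plan, ot_cost = sinkhorn_alignment(patches, tokens)
```

The resulting transport plan can be read directly as a soft patch-to-token matching, which is the kind of explicit, self-normalized, interpretable low-level criterion the abstract describes.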

MSA-GCN: Multiscale Adaptive Graph Convolution Network for Gait Emotion Recognition

Sep 19, 2022
Yunfei Yin, Li Jing, Faliang Huang, Guangchao Yang, Zhuowei Wang

Gait emotion recognition plays a crucial role in intelligent systems. Most existing methods recognize emotions by focusing on local actions over time, ignoring that the effective time-domain ranges of different emotions differ and that local actions during walking are quite similar. Thus, emotions should be represented by global states rather than indirect local actions. To address these issues, a novel Multiscale Adaptive Graph Convolution Network (MSA-GCN) is presented in this work, which constructs dynamic temporal receptive fields and performs multiscale information aggregation to recognize emotions. In our model, an adaptive selective spatial-temporal graph convolution is designed to select the convolution kernel dynamically and obtain the soft spatio-temporal features of different emotions. Moreover, a Cross-Scale mapping Fusion Mechanism (CSFM) is designed to construct an adaptive adjacency matrix that enhances information interaction and reduces redundancy. Compared with previous state-of-the-art methods, the proposed method achieves the best performance on two public datasets, improving mAP by 2%. We also conduct extensive ablation studies to show the effectiveness of the different components of our method.
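
The abstract's idea of selecting the temporal convolution kernel dynamically can be illustrated with a selective-kernel-style block: several temporal kernels run in parallel and a learned gate mixes them per sample. This is a hedged toy sketch; the class name, kernel sizes, and gating design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SelectiveTemporalConv(nn.Module):
    """Toy adaptive-selective temporal convolution for skeleton sequences.
    Parallel temporal kernels of different sizes are mixed by a learned,
    input-dependent soft selection. Input x: (batch, channels, frames, joints)."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0))
             for k in kernel_sizes]
        )
        self.gate = nn.Sequential(                       # predicts one weight per branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(kernel_sizes), kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        feats = torch.stack([branch(x) for branch in self.branches], dim=1)  # (B, K, C, T, V)
        weights = self.gate(x).unsqueeze(2)                                  # (B, K, 1, 1, 1)
        return (weights * feats).sum(dim=1)              # input-dependent kernel selection

x = torch.randn(2, 64, 48, 16)                           # 48 frames, 16 skeleton joints
y = SelectiveTemporalConv(64)(x)                         # same shape as x
```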

Masked Siamese ConvNets

Jun 15, 2022
Li Jing, Jiachen Zhu, Yann LeCun

Self-supervised learning has shown superior performance to supervised methods on various vision benchmarks. The siamese network, which encourages embeddings to be invariant to distortions, is one of the most successful self-supervised visual representation learning approaches. Among all the augmentation methods, masking is the most general and straightforward one, with the potential to be applied to all kinds of input while requiring the least amount of domain knowledge. However, masked siamese networks require a particular inductive bias and in practice only work well with Vision Transformers. This work empirically studies the problems behind masked siamese networks with ConvNets. We propose several empirical designs to gradually overcome these problems. Our method performs competitively on low-shot image classification and outperforms previous methods on object detection benchmarks. We discuss several remaining issues and hope this work can provide useful data points for future general-purpose self-supervised learning.
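
As a rough illustration of masking as a siamese-view augmentation for ConvNets, the sketch below masks random square patches of each image and produces two masked views that a siamese loss could consume. Filling masked regions with noise rather than zeros is one possible design choice here, not necessarily the paper's; the function and parameter names are made up for illustration.

```python
import torch

def random_patch_mask(images, patch=32, mask_ratio=0.3, noise_fill=True):
    """Randomly mask square patches of the input images (toy masking augmentation).
    images: (B, C, H, W) tensor; H and W are assumed divisible by `patch`."""
    B, C, H, W = images.shape
    gh, gw = H // patch, W // patch
    keep = torch.rand(B, 1, gh, gw) > mask_ratio             # per-patch keep decision
    mask = keep.float().repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    fill = torch.randn_like(images) if noise_fill else torch.zeros_like(images)
    return images * mask + fill * (1.0 - mask)               # masked view

# two masked views of the same batch for a siamese objective
x = torch.randn(8, 3, 224, 224)
view1, view2 = random_patch_mask(x), random_patch_mask(x)
```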

Positive Unlabeled Contrastive Learning

Jun 01, 2022
Anish Acharya, Sujay Sanghavi, Li Jing, Bhargav Bhushanam, Dhruv Choudhary, Michael Rabbat, Inderjit Dhillon

Self-supervised pretraining on unlabeled data followed by supervised finetuning on labeled data is a popular paradigm for learning from limited labeled examples. In this paper, we investigate and extend this paradigm to the classical positive unlabeled (PU) setting: the weakly supervised task of learning a binary classifier using only a few labeled positive examples and a set of unlabeled samples. We propose a novel PU learning objective, positive unlabeled Noise Contrastive Estimation (puNCE), that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data. The underlying idea is to assign each training sample an individual weight: labeled positives are given unit weight, while unlabeled samples are duplicated, with one copy labeled positive and the other negative, with weights $\pi$ and $(1-\pi)$ respectively, where $\pi$ denotes the class prior. Extensive experiments across vision and natural language tasks reveal that puNCE consistently improves over existing unsupervised and supervised contrastive baselines under limited supervision.
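
The weighting scheme described above maps naturally to code: unlabeled samples are duplicated into a positive copy with weight $\pi$ and a negative copy with weight $(1-\pi)$, while labeled positives keep unit weight. The sketch below plugs those weights into a generic supervised-contrastive-style term; the exact way the weights enter the loss is an illustrative reading, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def punce_style_loss(z, pu_labels, prior, temperature=0.1):
    """Weighted contrastive sketch in the spirit of puNCE.
    z: (N, D) embeddings; pu_labels: (N,) with 1 = labeled positive, 0 = unlabeled;
    prior: the class prior pi."""
    pos, unl = z[pu_labels == 1], z[pu_labels == 0]
    # duplicate unlabeled samples: one positive copy (weight pi), one negative copy (weight 1 - pi)
    emb = torch.cat([pos, unl, unl], dim=0)
    lbl = torch.cat([torch.ones(len(pos)), torch.ones(len(unl)), torch.zeros(len(unl))])
    w = torch.cat([torch.ones(len(pos)),
                   torch.full((len(unl),), prior),
                   torch.full((len(unl),), 1.0 - prior)])

    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T / temperature
    eye = torch.eye(len(emb), dtype=torch.bool)
    sim = sim.masked_fill(eye, float('-inf'))            # never contrast a copy with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    same = (lbl[:, None] == lbl[None, :]) & ~eye         # same-pseudo-label pairs act as positives
    pair_w = w[None, :] * same                           # weight each positive pair by the partner's weight
    per_anchor = -(pair_w * log_prob.masked_fill(eye, 0.0)).sum(1) / pair_w.sum(1).clamp_min(1e-8)
    return (w * per_anchor).sum() / w.sum()              # anchors weighted by their own weight

# toy usage: 16 embeddings, 5 labeled positives, class prior 0.4
z = torch.randn(16, 128)
pu_labels = torch.cat([torch.ones(5), torch.zeros(11)])
loss = punce_style_loss(z, pu_labels, prior=0.4)
```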

Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials

Mar 02, 2022
Andrew Ma, Yang Zhang, Thomas Christensen, Hoi Chun Po, Li Jing, Liang Fu, Marin Soljačić

Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses, with high accuracy, whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput strategy for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. In this way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.

* Main text: 6 pages, 3 figures; supplementary materials: 37 pages, 59 figures, 2 tables 
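
As a toy illustration of the heuristic rule from the abstract above, the sketch below scores a chemical formula by a composition-weighted sum of per-element topogivity values and diagnoses the material as topological when the score is positive. The numerical values in the table are placeholders invented for the example, not the learned topogivities from the paper, and the simple formula parser is an assumption.

```python
import re

# Hypothetical placeholder topogivity values, NOT the learned ones from the paper.
TOPOGIVITY = {"Bi": 2.3, "Sb": 1.6, "Te": 1.1, "Se": 0.4, "O": -1.8, "Si": -0.9}

def parse_formula(formula):
    """Split a simple chemical formula like 'Bi2Se3' into element counts."""
    counts = {}
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[elem] = counts.get(elem, 0.0) + (float(num) if num else 1.0)
    return counts

def topogivity_score(formula, table=TOPOGIVITY):
    """Composition-weighted sum of per-element topogivities; a positive
    score heuristically diagnoses the material as topological."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return sum(table.get(e, 0.0) * n / total for e, n in counts.items())

for material in ["Bi2Se3", "SiO2"]:
    score = topogivity_score(material)
    print(material, round(score, 3), "topological" if score > 0 else "trivial")
```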

Equivariant Contrastive Learning

Oct 28, 2021
Rumen Dangovski, Li Jing, Charlotte Loh, Seungwook Han, Akash Srivastava, Brian Cheung, Pulkit Agrawal, Marin Soljačić

In state-of-the-art self-supervised learning (SSL), pre-training produces semantically good representations by encouraging them to be invariant under meaningful transformations prescribed by human knowledge. In fact, the property of invariance is a trivial instance of a broader class called equivariance, which can be intuitively understood as the property that representations transform according to the way the inputs transform. Here, we show that rather than using only invariance, pre-training that encourages non-trivial equivariance to some transformations, while maintaining invariance to other transformations, can be used to improve the semantic quality of representations. Specifically, we extend popular SSL methods to a more general framework which we name Equivariant Self-Supervised Learning (E-SSL). In E-SSL, a simple additional pre-training objective encourages equivariance by predicting the transformations applied to the input. We demonstrate E-SSL's effectiveness empirically on several popular computer vision benchmarks. Furthermore, we demonstrate the usefulness of E-SSL for applications beyond computer vision; in particular, we show its utility on regression problems in photonics science. We will release our code.

* 17 pages, 5 figures 
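
A minimal sketch of the auxiliary objective described in the abstract above: apply a random four-fold rotation to the input and train a small head to predict which rotation was used, adding this cross-entropy term to the base SSL loss. Rotation is used here as a familiar example of such a transformation; the backbone, head, and loss weighting are illustrative stand-ins rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate_batch(x):
    """Apply a random multiple of 90 degrees to each image and return the
    rotated batch together with the rotation class (0-3)."""
    labels = torch.randint(0, 4, (x.size(0),))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2)) for img, k in zip(x, labels)])
    return rotated, labels

class EquivariancePredictor(nn.Module):
    """Small head that predicts which rotation was applied, given backbone
    features; this auxiliary objective encourages the representation to be
    equivariant (rather than invariant) to rotations."""
    def __init__(self, feat_dim, n_classes=4):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats, rot_labels):
        return F.cross_entropy(self.head(feats), rot_labels)

# toy usage: the tiny backbone stands in for any base SSL encoder
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
images = torch.randn(8, 3, 32, 32)
rotated, rot_labels = rotate_batch(images)
equiv_loss = EquivariancePredictor(256)(backbone(rotated), rot_labels)
total_loss = equiv_loss              # in practice: base_ssl_loss + lambda * equiv_loss
```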

Understanding Dimensional Collapse in Contrastive Self-supervised Learning

Oct 18, 2021
Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian

Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem, where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.

* 15 pages, 10 figures 
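
A minimal sketch of the projector-free idea from the abstract above: following the paper's description, a standard InfoNCE loss is applied directly to a subvector of the backbone representation instead of to the output of a trainable projector. The subvector size d0 = 360 and the temperature are assumed values for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between two batches of embeddings; positives are
    the corresponding rows of z1 and z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def directclr_loss(r1, r2, d0=360, temperature=0.1):
    """DirectCLR-style loss: no trainable projector; the contrastive loss is
    applied to the first d0 dimensions of the backbone representation."""
    return info_nce(r1[:, :d0], r2[:, :d0], temperature)

# toy usage with random 2048-d ResNet-style representations of two views
r1, r2 = torch.randn(16, 2048), torch.randn(16, 2048)
loss = directclr_loss(r1, r2)
```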

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Mar 04, 2021
Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn representations which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant representations. Most current methods avoid such collapsed solutions by careful implementation details. We propose an objective function that naturally avoids such collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins requires neither large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. It allows the use of very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with the current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.

* 13 pages, 5 figures 
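
The objective described in the abstract above translates almost directly into code: standardize the two views' embeddings along the batch dimension, form their cross-correlation matrix, and penalize its deviation from the identity. The sketch below follows that description; the off-diagonal weight is an assumed, typical value rather than a tuned one.

```python
import torch

def barlow_twins_loss(z1, z2, lambda_offdiag=5e-3):
    """Barlow Twins objective: drive the cross-correlation matrix between the
    two views' embeddings towards the identity (invariance on the diagonal,
    redundancy reduction off the diagonal)."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)        # standardize along the batch dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / n                         # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag

# toy usage: embeddings of two distorted views of the same batch
z1, z2 = torch.randn(128, 256), torch.randn(128, 256)
loss = barlow_twins_loss(z1, z2)
```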

Data-Informed Global Sparseness in Attention Mechanisms for Deep Neural Networks

Nov 20, 2020
Ileana Rugina, Rumen Dangovski, Li Jing, Preslav Nakov, Marin Soljačić

The attention mechanism is a key component of the neural revolution in Natural Language Processing (NLP). As the size of attention-based models has been scaling with the available computational resources, a number of pruning techniques have been developed to detect and to exploit sparseness in such models in order to make them more efficient. The majority of such efforts have focused on looking for attention patterns and then hard-coding them to achieve sparseness, or on pruning the weights of the attention mechanisms based on statistical information from the training data. In this paper, we marry these two lines of research by proposing Attention Pruning (AP): a novel pruning framework that collects observations about the attention patterns in a fixed dataset and then induces a global sparseness mask for the model. Through attention pruning, we find that about 90% of the attention computation can be reduced for language modelling and about 50% for machine translation and natural language inference (prediction with BERT on GLUE tasks), while maintaining the quality of the results. Additionally, using our method, we discovered important distinctions between self- and cross-attention patterns, which could guide future NLP research in attention-based modelling. Our approach could help develop better models for existing or for new NLP applications, and generally for any model that relies on attention mechanisms. Our implementation and instructions to reproduce the experiments are available at https://github.com/irugina/AP.

* 9 pages, 6 figures, 9 tables 
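
A hedged sketch of the overall recipe from the abstract above: collect attention maps on a fixed dataset, average them into a per-head pattern, keep only the strongest positions as a single global mask, and apply that mask at inference time. The averaging scheme, keep ratio, and masking mechanics below are illustrative assumptions rather than the procedure in the released implementation.

```python
import torch

def build_global_attention_mask(attention_maps, keep_ratio=0.1):
    """Average attention over a fixed dataset and keep only the strongest
    keep_ratio of (head, query, key) positions as a global sparseness mask."""
    total, count = None, 0
    for attn in attention_maps:                       # accumulate dataset-level statistics
        total = attn if total is None else total + attn
        count += 1
    mean_attn = total / count                         # (heads, seq, seq) average pattern
    k = max(1, int(keep_ratio * mean_attn.numel()))
    threshold = mean_attn.flatten().topk(k).values.min()
    return mean_attn >= threshold                     # boolean mask, True = keep

def apply_mask(attn_scores, mask):
    """Suppress pruned positions before the softmax.  In practice one would
    also ensure every query keeps at least one unpruned key."""
    return attn_scores.masked_fill(~mask, float('-inf'))

# toy usage: 10 random attention maps with 8 heads over 64 tokens
maps = [torch.softmax(torch.randn(8, 64, 64), dim=-1) for _ in range(10)]
mask = build_global_attention_mask(maps)
scores = apply_mask(torch.randn(8, 64, 64), mask)
```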