Sadeep Jayasumana

SPEGTI: Structured Prediction for Efficient Generative Text-to-Image Models

Aug 14, 2023
Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar

Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at a significant computational cost: nearly all of these models are iterative and require running inference multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model, and show that it works in conjunction with the recently proposed Muse model. The MRF encodes the compatibility among image tokens at different spatial locations and enables us to significantly reduce the required number of Muse prediction steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, SPEGTI, uses the proposed MRF to speed up Muse by 1.5x with no loss in output image quality.
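
The abstract gives no implementation details, but the core idea of treating MRF inference as a differentiable layer can be sketched in a few lines. In this hypothetical sketch, the class name TokenMRFLayer, the mean-field update, and the 4-neighbor message scheme are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenMRFLayer(nn.Module):
    """Hypothetical sketch of a differentiable MRF over image tokens.

    Given unary logits from a partial Muse pass (one logit vector per
    spatial location over the token vocabulary), it runs a few mean-field
    updates with a learned token-compatibility matrix so that neighboring
    token distributions agree, standing in for extra Muse refinement steps.
    """

    def __init__(self, vocab_size: int, num_iters: int = 3):
        super().__init__()
        # Learned pairwise compatibility between token labels; zero init
        # makes the layer an identity at the start of training.
        self.compat = nn.Parameter(torch.zeros(vocab_size, vocab_size))
        self.num_iters = num_iters

    def forward(self, unary_logits: torch.Tensor) -> torch.Tensor:
        # unary_logits: (batch, height, width, vocab_size)
        q = F.softmax(unary_logits, dim=-1)
        for _ in range(self.num_iters):
            # Message from the 4-neighborhood: average neighbor beliefs.
            # (torch.roll wraps at image borders; a simplification.)
            msg = (torch.roll(q, 1, dims=1) + torch.roll(q, -1, dims=1)
                   + torch.roll(q, 1, dims=2) + torch.roll(q, -1, dims=2)) / 4.0
            # Pass neighbor beliefs through the compatibility matrix,
            # combine with the unaries, and renormalize.
            q = F.softmax(unary_logits + msg @ self.compat, dim=-1)
        return q  # refined per-location token distributions
```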

EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Jan 27, 2023
Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, Sanjiv Kumar

Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the IR literature, which rely only on the teacher's scalar scores over the training data, on two fronts: it provides stronger signals about local geometry via embedding matching, and it attains better coverage of the data manifold globally via query generation. Embedding matching aligns the representations of the teacher and student models, while query generation explores the data manifold to reduce the discrepancies between the student and teacher where training data is sparse. Our distillation approach is theoretically justified and applies to both dual-encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual-pooling-based scorer for the CE model that facilitates a distillation-friendly embedding geometry, especially for DE student models.
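
A minimal sketch of the kind of objective described above, assuming teacher and student embeddings of the same dimension (the function name and the exact weighting are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def embed_distill_loss(s_q, s_d, t_q, t_d, alpha=1.0):
    """Hedged sketch of a distillation objective in the spirit of
    EmbedDistill: match teacher scores AND teacher embedding geometry.

    s_q, s_d: student query/document embeddings, (batch, dim)
    t_q, t_d: teacher query/document embeddings, (batch, dim)
    Assumes matching dimensions; otherwise a learned projection is needed.
    """
    # Score distillation: student dot-product scores should match the
    # teacher's scalar scores (what prior IR distillation relies on).
    score_loss = F.mse_loss((s_q * s_d).sum(-1), (t_q * t_d).sum(-1))
    # Embedding matching: pull student embeddings toward the teacher's,
    # transferring the relative geometry of queries and documents.
    embed_loss = F.mse_loss(s_q, t_q) + F.mse_loss(s_d, t_d)
    return score_loss + alpha * embed_loss
```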

When does mixup promote local linearity in learned representations?

Oct 28, 2022
Arslan Chaudhry, Aditya Krishna Menon, Andreas Veit, Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar

Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance and has been heavily used as part of semi-supervised learning techniques such as MixMatch (Berthelot et al., 2019) and Interpolation Consistency Training (ICT) (Verma et al., 2019). In this paper, we look at mixup through a representation-learning lens in a semi-supervised learning setup. In particular, we study the role of mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the mixup loss, which enforces linearity in the last network layer, propagate linearity to the earlier layers? and (2) how does enforcing a stronger mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of mixup on vision datasets such as CIFAR-10, CIFAR-100, and SVHN. Our results show that supervised mixup training does not make all the network layers linear; in fact, the intermediate layers become more non-linear during mixup training compared to a network trained without mixup. However, when mixup is used as an unsupervised loss, we observe that all the network layers become more linear, resulting in faster training convergence.
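
For reference, the basic mixup construction analyzed in the paper is only a few lines (this is the standard formulation of Zhang et al., 2018, not code from this paper):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Standard mixup: a convex combination of two training points.

    y1, y2 are one-hot label vectors; alpha controls the Beta distribution
    from which the mixing coefficient is drawn.
    """
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```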

* NeurIPS 2022 (First Workshop on Interpolation and Beyond)  

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

May 12, 2021
Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank Reddi, Sanjiv Kumar

Negative sampling schemes enable efficient training given a large number of classes by offering a means to approximate a computationally expensive loss function that takes all labels into account. In this paper, we present a new connection between these schemes and loss-modification techniques for countering label imbalance. We show that different negative sampling schemes implicitly trade off performance on dominant versus rare labels. Further, we provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance. We empirically verify our findings on long-tail classification and retrieval benchmarks.
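
One way to picture the two biases is a sampled softmax whose logits carry two separate corrections; the sketch below is illustrative only and is not the paper's exact estimator:

```python
import torch
import torch.nn.functional as F

def corrected_sampled_loss(logits, pos_idx, neg_idx, sample_p, label_freq, tau=1.0):
    """Illustrative sketch: a sampled softmax over one positive and a set of
    sampled negatives, with separate corrections for the two biases the
    paper disentangles.

    logits: (num_classes,) scores for one example
    pos_idx: index of the true label (0-dim long tensor)
    neg_idx: indices of sampled negatives (long tensor)
    sample_p: probability each class had of being sampled (sampling bias)
    label_freq: empirical class priors (labeling bias)
    """
    idx = torch.cat([pos_idx.view(1), neg_idx])
    z = logits[idx]
    # Correct sampling bias: subtract the log sampling probability, as in
    # standard sampled-softmax estimators.
    z = z - torch.log(sample_p[idx])
    # Correct labeling bias: a logit adjustment based on class priors.
    z = z + tau * torch.log(label_freq[idx])
    # The positive sits at position 0 of the sampled set.
    return F.cross_entropy(z.view(1, -1), torch.zeros(1, dtype=torch.long))
```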

* To appear in ICML 2021 

Balancing Constraints and Submodularity in Data Subset Selection

Apr 26, 2021
Srikumar Ramalingam, Daniel Glasner, Kaushal Patel, Raviteja Vemulapalli, Sadeep Jayasumana, Sanjiv Kumar

Deep learning has yielded extraordinary results in vision and natural language processing, but this achievement comes at a cost. Most deep learning models require enormous resources during training, both in terms of computation and in human labeling effort. In this paper, we show that one can achieve accuracy similar to that of traditional deep-learning models while using less training data. Much of the previous work in this area relies on uncertainty or some form of diversity to select subsets of a larger training set. Submodularity, a discrete analogue of convexity, has been exploited to model diversity in various settings, including data subset selection. In contrast to prior methods, we propose a novel diversity-driven objective function, and balancing constraints on class labels and decision boundaries using matroids. This allows us to use efficient greedy algorithms with approximation guarantees for subset selection. We outperform baselines on standard image classification datasets such as CIFAR-10, CIFAR-100, and ImageNet. In addition, we show that the proposed balancing constraints can play a key role in boosting performance on long-tailed datasets such as CIFAR-100-LT.
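
A greedy selection loop under a simple partition-matroid constraint (a per-class budget) might look as follows; all names and the specific constraint are illustrative assumptions, not the paper's exact algorithm:

```python
def greedy_subset(items, labels, gain_fn, per_class_budget):
    """Sketch of constrained greedy subset selection.

    items: list of candidate example indices
    labels: class label of each candidate (indexable by item)
    gain_fn(S, i): marginal gain of adding i to the current subset S;
        submodularity of the objective is what makes this greedy choice
        near-optimal.
    per_class_budget: dict capping selections per class, i.e. a simple
        partition-matroid constraint in the spirit of the balancing
        constraints described above.
    """
    selected, counts = [], {}
    remaining = set(items)
    while remaining:
        # Only candidates whose class budget is not exhausted are feasible.
        feasible = [i for i in remaining
                    if counts.get(labels[i], 0) < per_class_budget[labels[i]]]
        if not feasible:
            break
        best = max(feasible, key=lambda i: gain_fn(selected, i))
        if gain_fn(selected, best) <= 0:
            break  # no candidate improves the objective
        selected.append(best)
        counts[labels[best]] = counts.get(labels[best], 0) + 1
        remaining.remove(best)
    return selected
```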

Kernelized Classification in Deep Networks

Dec 08, 2020
Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar

In this paper, we propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned feature vectors. We introduce a nonlinear classification layer by using the kernel trick on the softmax cross-entropy loss function during training and the scorer function during testing. Furthermore, we study the choice of kernel functions one could use with this framework and show that the optimal kernel function for a given problem can be learned automatically within the deep network itself using the usual backpropagation and gradient descent methods. To this end, we exploit a classic mathematical result on the positive definite kernels on the unit n-sphere embedded in the (n+1)-dimensional Euclidean space. We show the usefulness of the proposed nonlinear classification layer on several vision datasets and tasks.
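
A minimal sketch of such a layer, assuming a learned non-negative power series in the cosine similarity (one classical family of positive definite kernels on the unit sphere); the class name and parameterization are illustrative, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelizedClassifier(nn.Module):
    """Sketch of a kernelized classification layer: class scores are a
    kernel evaluated between the normalized feature vector and per-class
    trainable weight vectors, instead of a plain dot product.
    """

    def __init__(self, feat_dim: int, num_classes: int, degree: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        # Learnable mixing coefficients for the kernel; kept non-negative
        # below so the kernel stays positive definite on the sphere.
        self.coeffs = nn.Parameter(torch.ones(degree + 1))
        self.degree = degree

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cosine similarity = dot product of unit-normalized vectors.
        cos = F.normalize(x, dim=-1) @ F.normalize(self.weight, dim=-1).T
        # k(x, w) = sum_d c_d * cos^d with c_d >= 0: a non-negative power
        # series in the cosine, positive definite on the unit sphere by
        # Schoenberg's classical characterization.
        c = F.softplus(self.coeffs)
        scores = sum(c[d] * cos.pow(d) for d in range(self.degree + 1))
        return scores  # feed to softmax cross-entropy as usual
```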

Long-tail learning via logit adjustment

Jul 14, 2020
Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, Sanjiv Kumar

Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes naïve learning biased towards dominant labels. In this paper, we present two simple modifications of standard softmax cross-entropy training to cope with these challenges. Our techniques revisit the classic idea of logit adjustment based on the label frequencies, either applied post-hoc to a trained model, or enforced in the loss during training. Such adjustment encourages a large relative margin between logits of rare versus dominant labels. These techniques unify and generalise several recent proposals in the literature, while possessing firmer statistical grounding and empirical performance.
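
The two variants are simple enough to state in code. The sketch below follows the abstract's description, with a scaling parameter tau and empirical class priors; the variable names are ours:

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, targets, class_priors, tau=1.0):
    """Loss-time variant: offset each logit by tau * log(prior) before the
    softmax cross-entropy, enlarging the relative margin for rare labels."""
    adjusted = logits + tau * torch.log(class_priors)
    return F.cross_entropy(adjusted, targets)

def posthoc_adjusted_predict(logits, class_priors, tau=1.0):
    """Post-hoc variant: subtract tau * log(prior) from a trained model's
    logits at prediction time."""
    return (logits - tau * torch.log(class_priors)).argmax(dim=-1)
```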

Bipartite Conditional Random Fields for Panoptic Segmentation

Dec 11, 2019
Sadeep Jayasumana, Kanchana Ranasinghe, Mayuka Jayawardhana, Sahan Liyanaarachchi, Harsha Ranasinghe

We tackle the panoptic segmentation problem with a conditional random field (CRF) model. Panoptic segmentation involves assigning a semantic label and an instance label to each pixel of a given image. At each pixel, the semantic label and the instance label should be compatible. Furthermore, a good panoptic segmentation should have a number of other desirable properties, such as spatial and color consistency of the labeling (similar-looking neighboring pixels should have the same semantic and instance labels). To tackle this problem, we propose a CRF model, named Bipartite CRF or BCRF, with two types of random variables for semantic and instance labels. In this formulation, various energies are defined within and across the two types of random variables to encourage a consistent panoptic segmentation. We propose an efficient mean-field-based inference algorithm for solving the CRF and empirically show its convergence properties. This algorithm is fully differentiable, and therefore BCRF inference can be included as a trainable module in a deep network. In the experimental evaluation, we quantitatively and qualitatively show that BCRF yields superior panoptic segmentation results in practice.
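
A toy rendering of the bipartite mean-field updates, keeping only the cross-compatibility term and omitting the spatial and color energies; names and the exact update rule are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def bcrf_mean_field(sem_logits, inst_logits, cross_compat, num_iters=5):
    """Illustrative mean-field sketch for a bipartite CRF: two sets of
    per-pixel distributions (semantic and instance) exchange messages
    through a learned cross-compatibility energy table.

    sem_logits:  (num_pixels, num_semantic_labels) unary logits
    inst_logits: (num_pixels, num_instance_labels) unary logits
    cross_compat: (num_instance_labels, num_semantic_labels) energies,
        where high values penalize incompatible label pairs
    """
    q_sem = F.softmax(sem_logits, dim=-1)
    q_inst = F.softmax(inst_logits, dim=-1)
    for _ in range(num_iters):
        # Each field is updated given the other's current beliefs; every
        # step is differentiable, so the whole loop can sit inside a deep
        # network as a trainable module.
        q_sem = F.softmax(sem_logits - q_inst @ cross_compat, dim=-1)
        q_inst = F.softmax(inst_logits - q_sem @ cross_compat.T, dim=-1)
    return q_sem, q_inst
```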

Prototypical Priors: From Improving Classification to Zero-Shot Learning

Apr 25, 2018
Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, Philip Torr

Recent works on zero-shot learning make use of side information, such as visual attributes or natural-language semantics, to define the relations between output visual classes and then use these relationships to draw inference on new, unseen classes at test time. In a novel extension to this idea, we propose the use of visual prototypical concepts as side information. For most real-world visual object categories, it may be difficult to establish a unique prototype. However, in cases such as traffic signs, brand logos, flags, and even natural-language characters, these prototypical templates are available and can be leveraged for improved recognition performance. The present work proposes a way to incorporate this prototypical information in a deep learning framework. Using prototypes as prior information, the deep network learns to project input images into the prototypical embedding space while minimizing the final classification loss. In our experiments with two different datasets of traffic signs and brand logos, prototypical embeddings incorporated in a conventional convolutional neural network improve recognition performance. Recognition accuracy on the Belga logo dataset is especially noteworthy and establishes a new state of the art. In zero-shot learning scenarios, the same system can be directly deployed to draw inference on unseen classes by simply adding the prototypical information for these new classes at test time. Thus, unlike earlier approaches, testing on seen and unseen classes is handled using the same pipeline, and the system can be tuned to trade off seen- and unseen-class performance as the task requires. Comparison with one of the latest works in the zero-shot learning domain yields top results on the two datasets mentioned above.
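
A minimal sketch of classifying against fixed prototype embeddings, with zero-shot inference by appending prototypes for new classes; the class name, cosine scorer, and overall design are illustrative assumptions, not the paper's exact pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeClassifier(nn.Module):
    """Sketch of classification against fixed prototypical embeddings.

    proto_embed holds one embedding per class, computed once from the
    prototype images (e.g., canonical traffic-sign or logo templates).
    The backbone learns to project input images into this space; adding
    rows for new classes at test time gives zero-shot inference with the
    same pipeline.
    """

    def __init__(self, backbone: nn.Module, proto_embed: torch.Tensor):
        super().__init__()
        self.backbone = backbone
        # Fixed prior information, not a trainable parameter.
        self.register_buffer("proto_embed", proto_embed)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        z = F.normalize(self.backbone(images), dim=-1)
        p = F.normalize(self.proto_embed, dim=-1)
        return z @ p.T  # similarity logits over (seen + unseen) classes
```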

* 12 pages, 6 figures, 2 tables; in British Machine Vision Conference (BMVC), 2015 

Higher Order Conditional Random Fields in Deep Neural Networks

Jul 29, 2016
Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, Philip Torr

We address the problem of semantic segmentation using deep learning. Most segmentation systems include a Conditional Random Field (CRF) to produce a structured output that is consistent with the image's visual features. Recent deep learning approaches have incorporated CRFs into Convolutional Neural Networks (CNNs), with some even training the CRF end-to-end with the rest of the network. However, these approaches have not employed higher order potentials, which have previously been shown to significantly improve segmentation performance. In this paper, we demonstrate that two types of higher order potential, based on object detections and superpixels, can be included in a CRF embedded within a deep network. We design these higher order potentials to allow inference with the differentiable mean field algorithm. As a result, all the parameters of our richer CRF model can be learned end-to-end with our pixelwise CNN classifier. We achieve state-of-the-art segmentation performance on the PASCAL VOC benchmark with these trainable higher order potentials.
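
One mean-field step for a superpixel-consistency potential might look as follows; this is a simplified illustration (the paper's exact potentials, including the detection-based term, are omitted):

```python
import torch
import torch.nn.functional as F

def superpixel_potential_update(q, superpixel_id, weight=1.0):
    """Sketch of one mean-field step for a superpixel-based higher-order
    potential: each pixel is nudged toward the average belief of its
    superpixel, encouraging label consistency within regions.

    q: (num_pixels, num_labels) current per-pixel label beliefs
    superpixel_id: (num_pixels,) long tensor of superpixel indices
    """
    num_sp = int(superpixel_id.max()) + 1
    # Mean belief within each superpixel via a differentiable
    # scatter-average, so the update can be trained end-to-end.
    sums = torch.zeros(num_sp, q.shape[1], dtype=q.dtype).index_add_(
        0, superpixel_id, q)
    counts = torch.zeros(num_sp, dtype=q.dtype).index_add_(
        0, superpixel_id, torch.ones_like(superpixel_id, dtype=q.dtype))
    sp_mean = sums / counts.clamp(min=1.0).unsqueeze(-1)
    # Nudge each pixel's beliefs toward its superpixel's mean belief and
    # renormalize (log q stands in for the unaries in this toy version).
    return F.softmax(
        torch.log(q.clamp(min=1e-9)) + weight * sp_mean[superpixel_id], dim=-1)
```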

* ECCV 2016 