Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rameswar Panda

Richard

Dynamic Network Quantization for Efficient Video Inference

Aug 23, 2021

Ximeng Sun, Rameswar Panda, Chun-Fu Chen, Aude Oliva, Rogerio Feris, Kate Saenko

Figure 1 for Dynamic Network Quantization for Efficient Video Inference

Figure 2 for Dynamic Network Quantization for Efficient Video Inference

Figure 3 for Dynamic Network Quantization for Efficient Video Inference

Figure 4 for Dynamic Network Quantization for Efficient Video Inference

Abstract:Deep convolutional networks have recently achieved great success in video recognition, yet their practical realization remains a challenge due to the large amount of computational resources required to achieve robust recognition. Motivated by the effectiveness of quantization for boosting efficiency, in this paper, we propose a dynamic network quantization framework, that selects optimal precision for each frame conditioned on the input for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network, to produce a dynamic policy indicating which numerical precision to be used per frame in recognizing videos. We train both networks effectively using standard backpropagation with a loss to achieve both competitive performance and resource efficiency required for video recognition. Extensive experiments on four challenging diverse benchmark datasets demonstrate that our proposed approach provides significant savings in computation and memory usage while outperforming the existing state-of-the-art methods.

* ICCV 2021 Camera Ready Version

Via

Access Paper or Ask Questions

An Image Classifier Can Suffice For Video Understanding

Jun 30, 2021

Quanfu Fan, Chun-Fu, Chen, Rameswar Panda

Figure 1 for An Image Classifier Can Suffice For Video Understanding

Figure 2 for An Image Classifier Can Suffice For Video Understanding

Figure 3 for An Image Classifier Can Suffice For Video Understanding

Figure 4 for An Image Classifier Can Suffice For Video Understanding

Abstract:We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal. It composes input frames into a super image to train an image classifier to fulfill the task of action recognition, in exactly the same way as classifying an image. We prove the viability of such an idea by demonstrating strong and promising performance on four public datasets including Kinetics400, Something-to-something (V2), MiT and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performed CNN approaches based on spatio-temporal modeling. our code and models will be made available at https://github.com/IBM/sifar-pytorch.

Via

Access Paper or Ask Questions

IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Jun 23, 2021

Bowen Pan, Yifan Jiang, Rameswar Panda, Zhangyang Wang, Rogerio Feris, Aude Oliva

Figure 1 for IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Figure 2 for IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Figure 3 for IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Figure 4 for IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Abstract:The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory cost. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4X speed-up for state-of-the-art models like DeiT and TimeSformer, by only sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable with substantial visual evidence, making vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate that the interpretability that naturally emerged in our framework can outperform the raw attention learned by the original visual transformer, as well as those generated by off-the-shelf interpretation methods, with both qualitative and quantitative results. Project Page: http://people.csail.mit.edu/bpan/ia-red/.

Via

Access Paper or Ask Questions

Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Jun 14, 2021

Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Richard J. Radke

Figure 1 for Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Figure 2 for Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Figure 3 for Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Figure 4 for Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Abstract:Most existing works in few-shot learning rely on meta-learning the network on a large base dataset which is typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domain. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature. STARTUP was the first method that tackles this problem using self-training. However, it uses a fixed teacher pretrained on a labeled base dataset to create soft labels for the unlabeled target samples. As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal. We propose a simple dynamic distillation-based approach to facilitate unlabeled images from the novel/base dataset. We impose consistency regularization by calculating predictions from the weakly-augmented versions of the unlabeled images from a teacher network and matching it with the strongly augmented versions of the same images from a student network. The parameters of the teacher network are updated as exponential moving average of the parameters of the student network. We show that the proposed network learns representation that can be easily adapted to the target domain even though it has not been trained with target-specific classes during the pretraining phase. Our model outperforms the current state-of-the art method by 4.4% for 1-shot and 3.6% for 5-shot classification in the BSCD-FSL benchmark, and also shows competitive performance on traditional in-domain few-shot learning task. Our code will be available at: https://github.com/asrafulashiq/dynamic-cdfsl.

Via

Access Paper or Ask Questions

RegionViT: Regional-to-Local Attention for Vision Transformers

Jun 04, 2021

Chun-Fu Chen, Rameswar Panda, Quanfu Fan

Figure 1 for RegionViT: Regional-to-Local Attention for Vision Transformers

Figure 2 for RegionViT: Regional-to-Local Attention for Vision Transformers

Figure 3 for RegionViT: Regional-to-Local Attention for Vision Transformers

Figure 4 for RegionViT: Regional-to-Local Attention for Vision Transformers

Abstract:Vision transformer (ViT) has recently showed its strong capability in achieving comparable results to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits the same architecture from the natural language processing directly, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employ a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extract global information among all regional tokens and then the local self-attention exchanges the information among one regional token and the associated local tokens via self-attention. Therefore, even though local self-attention confines the scope in a local region but it can still receive global information. Extensive experiments on three vision tasks, including image classification, object detection and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants including many concurrent works. Our source codes and models will be publicly available.

Via

Access Paper or Ask Questions

AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

May 12, 2021

Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris

Figure 1 for AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Figure 2 for AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Figure 3 for AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Figure 4 for AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Abstract:Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational expense limits its impact for many real-world applications. In this paper, we propose an adaptive multi-modal learning framework, called AdaMML, that selects on-the-fly the optimal modalities for each segment conditioned on the input for efficient video recognition. Specifically, given a video segment, a multi-modal policy network is used to decide what modalities should be used for processing by the recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on four challenging diverse datasets demonstrate that our proposed adaptive approach yields 35%-55% reduction in computation when compared to the traditional baseline that simply uses all the modalities irrespective of the input, while also achieving consistent improvements in accuracy over the state-of-the-art methods.

Via

Access Paper or Ask Questions

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

May 05, 2021

Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath(+3 more)

Figure 1 for Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Figure 2 for Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Figure 3 for Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Figure 4 for Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Abstract:Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval, and temporal action localization, showing state-of-the-art results on four different datasets.

Via

Access Paper or Ask Questions

Detector-Free Weakly Supervised Grounding by Separation

Apr 20, 2021

Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda(+7 more)

Figure 1 for Detector-Free Weakly Supervised Grounding by Separation

Figure 2 for Detector-Free Weakly Supervised Grounding by Separation

Figure 3 for Detector-Free Weakly Supervised Grounding by Separation

Figure 4 for Detector-Free Weakly Supervised Grounding by Separation

Abstract:Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing `text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to $8.5\%$ over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above $7\%$) over the detector-based approaches for WSG.

Via

Access Paper or Ask Questions

A Broad Study on the Transferability of Visual Representations with Contrastive Learning

Apr 01, 2021

Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, Rogerio Feris

Figure 1 for A Broad Study on the Transferability of Visual Representations with Contrastive Learning

Figure 2 for A Broad Study on the Transferability of Visual Representations with Contrastive Learning

Figure 3 for A Broad Study on the Transferability of Visual Representations with Contrastive Learning

Figure 4 for A Broad Study on the Transferability of Visual Representations with Contrastive Learning

Abstract:Tremendous progress has been made in visual representation learning, notably with the recent success of self-supervised contrastive learning methods. Supervised contrastive learning has also been shown to outperform its cross-entropy counterparts by leveraging labels for choosing where to contrast. However, there has been little work to explore the transfer capability of contrastive learning to a different domain. In this paper, we conduct a comprehensive study on the transferability of learned representations of different contrastive approaches for linear evaluation, full-network transfer, and few-shot recognition on 12 downstream datasets from different domains, and object detection tasks on MSCOCO and VOC0712. The results show that the contrastive approaches learn representations that are easily transferable to a different downstream task. We further observe that the joint objective of self-supervised contrastive loss with cross-entropy/supervised-contrastive loss leads to better transferability of these models over their supervised counterparts. Our analysis reveals that the representations learned from the contrastive approaches contain more low/mid-level semantics than cross-entropy models, which enables them to quickly adapt to a new task. Our codes and models will be publicly available to facilitate future research on transferability of visual representations.

Via

Access Paper or Ask Questions

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Mar 27, 2021

Chun-Fu Chen, Quanfu Fan, Rameswar Panda

Figure 1 for CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Figure 2 for CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Figure 3 for CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Figure 4 for CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Abstract:The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that the proposed approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2\%

Via

Access Paper or Ask Questions