Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Minsu Cho

Contrastive Mean-Shift Learning for Generalized Category Discovery

Apr 15, 2024

Sua Choi, Dahyun Kang, Minsu Cho

Figure 1 for Contrastive Mean-Shift Learning for Generalized Category Discovery

Figure 2 for Contrastive Mean-Shift Learning for Generalized Category Discovery

Figure 3 for Contrastive Mean-Shift Learning for Generalized Category Discovery

Figure 4 for Contrastive Mean-Shift Learning for Generalized Category Discovery

Abstract:We address the problem of generalized category discovery (GCD) that aims to partition a partially labeled collection of images; only a small part of the collection is labeled and the total number of target classes is unknown. To address this generalized image clustering problem, we revisit the mean-shift algorithm, i.e., a classic, powerful technique for mode seeking, and incorporate it into a contrastive learning framework. The proposed method, dubbed Contrastive Mean-Shift (CMS) learning, trains an image encoder to produce representations with better clustering properties by an iterative process of mean shift and contrastive update. Experiments demonstrate that our method, both in settings with and without the total number of clusters being known, achieves state-of-the-art performance on six public GCD benchmarks without bells and whistles.

* Accepted at CVPR 2024

Via

Access Paper or Ask Questions

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Apr 09, 2024

Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid

Abstract:This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).

* CVPR 2024

Via

Access Paper or Ask Questions

Learning Correlation Structures for Vision Transformers

Apr 05, 2024

Manjin Kim, Paul Hongsuck Seo, Cordelia Schmid, Minsu Cho

Figure 1 for Learning Correlation Structures for Vision Transformers

Figure 2 for Learning Correlation Structures for Vision Transformers

Figure 3 for Learning Correlation Structures for Vision Transformers

Figure 4 for Learning Correlation Structures for Vision Transformers

Abstract:We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.

* Accepted to CVPR 2024

Via

Access Paper or Ask Questions

NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Dec 12, 2023

Yoonwoo Jeong, Jinwoo Lee, Chiheon Kim, Minsu Cho, Doyup Lee

Figure 1 for NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Figure 2 for NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Figure 3 for NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Figure 4 for NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image

Abstract:Transfer learning of large-scale Text-to-Image (T2I) models has recently shown impressive potential for Novel View Synthesis (NVS) of diverse objects from a single image. While previous methods typically train large models on multi-view datasets for NVS, fine-tuning the whole parameters of T2I models not only demands a high cost but also reduces the generalization capacity of T2I models in generating diverse images in a new domain. In this study, we propose an effective method, dubbed NVS-Adapter, which is a plug-and-play module for a T2I model, to synthesize novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models. NVS-Adapter consists of two main components; view-consistency cross-attention learns the visual correspondences to align the local details of view features, and global semantic conditioning aligns the semantic structure of generated views with the reference view. Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views and also achieve high performance on benchmarks without full fine-tuning of T2I models. The code and data are publicly available in ~\href{https://postech-cvlab.github.io/nvsadapter/}{https://postech-cvlab.github.io/nvsadapter/}.

* Project Page: https://postech-cvlab.github.io/nvsadapter/

Via

Access Paper or Ask Questions

Activity Grammars for Temporal Action Segmentation

Dec 07, 2023

Dayoung Gong, Joonseok Lee, Deunsol Jung, Suha Kwak, Minsu Cho

Figure 1 for Activity Grammars for Temporal Action Segmentation

Figure 2 for Activity Grammars for Temporal Action Segmentation

Figure 3 for Activity Grammars for Temporal Action Segmentation

Figure 4 for Activity Grammars for Temporal Action Segmentation

Abstract:Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.

* Accepted to NeurIPS 2023

Via

Access Paper or Ask Questions

Towards More Practical Group Activity Detection: A New Benchmark and Model

Dec 05, 2023

Dongkeun Kim, Youngkil Song, Minsu Cho, Suha Kwak

Figure 1 for Towards More Practical Group Activity Detection: A New Benchmark and Model

Figure 2 for Towards More Practical Group Activity Detection: A New Benchmark and Model

Figure 3 for Towards More Practical Group Activity Detection: A New Benchmark and Model

Figure 4 for Towards More Practical Group Activity Detection: A New Benchmark and Model

Abstract:Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Caf\'e. Unlike existing datasets, Caf\'e is constructed primarily for GAD and presents more practical evaluation scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Caf\'e, where it outperformed previous work in terms of both accuracy and inference speed. Both our dataset and code base will be open to the public to promote future research on GAD.

* Project page: https://cvlab.postech.ac.kr/research/CAFE

Via

Access Paper or Ask Questions

Efficient Semantic Matching with Hypercolumn Correlation

Nov 07, 2023

Seungwook Kim, Juhong Min, Minsu Cho

Abstract:Recent studies show that leveraging the match-wise relationships within the 4D correlation map yields significant improvements in establishing semantic correspondences - but at the cost of increased computation and latency. In this work, we focus on the aspect that the performance improvements of recent methods can also largely be attributed to the usage of multi-scale correlation maps, which hold various information ranging from low-level geometric cues to high-level semantic contexts. To this end, we propose HCCNet, an efficient yet effective semantic matching method which exploits the full potential of multi-scale correlation maps, while eschewing the reliance on expensive match-wise relationship mining on the 4D correlation map. Specifically, HCCNet performs feature slicing on the bottleneck features to yield a richer set of intermediate features, which are used to construct a hypercolumn correlation. HCCNet can consequently establish semantic correspondences in an effective manner by reducing the volume of conventional high-dimensional convolution or self-attention operations to efficient point-wise convolutions. HCCNet demonstrates state-of-the-art or competitive performances on the standard benchmarks of semantic matching, while incurring a notably lower latency and computation overhead compared to the existing SoTA methods.

* Accepted to WACV 2024. 17 pages including references and supplementary

Via

Access Paper or Ask Questions

Locality-Aware Generalizable Implicit Neural Representation

Oct 12, 2023

Doyup Lee, Chiheon Kim, Minsu Cho, Wook-Shin Han

Figure 1 for Locality-Aware Generalizable Implicit Neural Representation

Figure 2 for Locality-Aware Generalizable Implicit Neural Representation

Figure 3 for Locality-Aware Generalizable Implicit Neural Representation

Figure 4 for Locality-Aware Generalizable Implicit Neural Representation

Abstract:Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.

* 19 pages, 12 figures

Via

Access Paper or Ask Questions

Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Oct 11, 2023

Jungtaek Kim, Jeongbeen Yoon, Minsu Cho

Figure 1 for Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Figure 2 for Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Figure 3 for Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Figure 4 for Generalized Neural Sorting Networks with Error-Free Differentiable Swap Functions

Abstract:Sorting is a fundamental operation of all computer systems, having been a long-standing significant research topic. Beyond the problem formulation of traditional sorting algorithms, we consider sorting problems for more abstract yet expressive inputs, e.g., multi-digit images and image fragments, through a neural sorting network. To learn a mapping from a high-dimensional input to an ordinal variable, the differentiability of sorting networks needs to be guaranteed. In this paper we define a softening error by a differentiable swap function, and develop an error-free swap function that holds non-decreasing and differentiability conditions. Furthermore, a permutation-equivariant Transformer network with multi-head attention is adopted to capture dependency between given inputs and also leverage its model capacity with self-attention. Experiments on diverse sorting benchmarks show that our methods perform better than or comparable to baseline methods.

* 19 pages, 7 figures, 21 tables

Via

Access Paper or Ask Questions

PriViT: Vision Transformers for Fast Private Inference

Oct 06, 2023

Naren Dhyani, Jianqiao Mo, Minsu Cho, Ameya Joshi, Siddharth Garg, Brandon Reagen, Chinmay Hegde

Abstract:The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications. However, ViTs are ill-suited for private inference using secure multi-party computation (MPC) protocols, due to the large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We propose PriViT, a gradient based algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually simple, easy to implement, and achieves improved performance over existing approaches for designing MPC-friendly transformer architectures in terms of achieving the Pareto frontier in latency-accuracy. We confirm these improvements via experiments on several standard image classification tasks. Public code is available at https://github.com/NYU-DICE-Lab/privit.

* 18 pages, 14 figures

Via

Access Paper or Ask Questions