Kai Han

and Other Contributors

BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

Oct 18, 2024

Shaozhe Hao, Xuantong Liu, Xianbiao Qi, Shihao Zhao, Bojia Zi, Rong Xiao, Kai Han, Kwan-Yee K. Wong

Abstract: We introduce BiGR, a novel conditional image generation model using compact binary latent codes for generative training, focusing on enhancing both generation and representation capabilities. BiGR is the first conditional generative model that unifies generation and discrimination within the same framework. BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction. Additionally, we introduce a novel entropy-ordered sampling method to enable efficient image generation. Extensive experiments validate BiGR's superior performance in generation quality, as measured by FID-50k, and representation capabilities, as evidenced by linear-probe accuracy. Moreover, BiGR showcases zero-shot generalization across various vision tasks, enabling applications such as image inpainting, outpainting, editing, interpolation, and enrichment, without the need for structural modifications. Our findings suggest that BiGR unifies generative and discriminative tasks effectively, paving the way for further advancements in the field.

* Project page: https://haoosz.github.io/BiGR
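
The abstract names an entropy-ordered sampling method but does not describe it. As a rough illustration only, the sketch below orders masked tokens by the entropy of their predicted binary codes, assuming each token carries independent Bernoulli bit probabilities and that the most confident (lowest-entropy) tokens are filled in first. The function names and the toy probabilities are hypothetical and are not taken from BiGR.

```python
import math

def bit_entropy(p):
    """Binary entropy (in nats) of a Bernoulli(p) bit; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def token_entropy(bit_probs):
    """Entropy of one binary code, assuming its bits are independent."""
    return sum(bit_entropy(p) for p in bit_probs)

def entropy_order(masked_tokens):
    """Return masked positions sorted from lowest to highest entropy.

    masked_tokens maps a position to its predicted bit probabilities.
    """
    return sorted(masked_tokens, key=lambda pos: token_entropy(masked_tokens[pos]))

# Toy predictions for three masked positions (made-up numbers).
predictions = {
    0: [0.5, 0.5, 0.5],     # maximally uncertain
    1: [0.99, 0.01, 0.98],  # confident
    2: [0.9, 0.6, 0.2],     # in between
}
order = entropy_order(predictions)
print(order)
```

In an iterative masked-generation loop, a schedule would take a prefix of this ordering at each step and re-predict the remaining positions.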

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Oct 14, 2024

Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang

Abstract: Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While training-based video-LLMs deliver high performance, they often require substantial resources for training and inference. Conversely, training-free approaches offer a more efficient alternative by adapting pre-trained image-LLM models for video tasks without additional training, but they face inference efficiency bottlenecks due to the large number of visual tokens generated from video frames. In this work, we present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. The proposed framework decouples the spatial and temporal dimensions and performs temporal frame sampling and spatial RoI cropping, respectively, based on task-specific prompts. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. Extensive experiments demonstrate that our approach achieves competitive results with significantly fewer tokens, offering an optimal trade-off between accuracy and computational efficiency compared to state-of-the-art video LLMs. The code will be available at https://github.com/contrastive/FreeVideoLLM.

* Tech report
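
To make the token saving from temporal frame sampling and spatial RoI cropping concrete, here is a minimal sketch. It assumes that a per-frame relevance score against the prompt and an RoI box are already available; the abstract does not say how either is computed, so the scores, the box, and all function names below are hypothetical.

```python
def sample_frames(scores, k):
    """Keep the k frames with the highest prompt relevance, in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def crop_roi(token_grid, top, left, height, width):
    """Keep only the patch tokens inside a rectangular region of interest."""
    return [row[left:left + width] for row in token_grid[top:top + height]]

def count_tokens(frames):
    """Total number of patch tokens across a list of 2D token grids."""
    return sum(len(row) for grid in frames for row in grid)

# Hypothetical video: 8 frames, each a 4x4 grid of patch-token ids.
video = [[[(f, r, c) for c in range(4)] for r in range(4)] for f in range(8)]
# Assumed prompt-to-frame similarity scores (made-up numbers).
scores = [0.1, 0.7, 0.2, 0.9, 0.3, 0.8, 0.05, 0.4]

kept = sample_frames(scores, k=3)
reduced = [crop_roi(video[f], top=1, left=1, height=2, width=2) for f in kept]
print(kept, count_tokens(video), count_tokens(reduced))
```

In this toy setting the two steps multiply: keeping 3 of 8 frames and a 2x2 crop of each 4x4 grid leaves 12 of the original 128 tokens.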

AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation

Oct 09, 2024

Yukang Cao, Liang Pan, Kai Han, Kwan-Yee K. Wong, Ziwei Liu

Abstract: Recent advancements in diffusion models have led to significant improvements in the generation and animation of 4D full-body human-object interactions (HOI). Nevertheless, existing methods primarily focus on SMPL-based motion generation, which is limited by the scarcity of realistic large-scale interaction data. This constraint affects their ability to create everyday HOI scenes. This paper addresses this challenge using a zero-shot approach with a pre-trained diffusion model. Despite this potential, achieving our goals is difficult due to the diffusion model's lack of understanding of "where" and "how" objects interact with the human body. To tackle these issues, we introduce AvatarGO, a novel framework designed to generate animatable 4D HOI scenes directly from textual inputs. Specifically, 1) for the "where" challenge, we propose LLM-guided contact retargeting, which employs Lang-SAM to identify the contact body part from text prompts, ensuring precise representation of human-object spatial relations. 2) For the "how" challenge, we introduce correspondence-aware motion optimization that constructs motion fields for both human and object models using the linear blend skinning function from SMPL-X. Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling penetration issues. Extensive experiments with existing methods validate AvatarGO's superior generation and animation capabilities on a variety of human-object pairs and diverse poses. As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.

* Project page: https://yukangcao.github.io/AvatarGO/

CusConcept: Customized Visual Concept Decomposition with Diffusion Models

Oct 01, 2024

Zhi Xu, Shaozhe Hao, Kai Han

Abstract: Enabling generative models to decompose visual concepts from a single image is a complex and challenging problem. In this paper, we study a new and challenging task, customized concept decomposition, wherein the objective is to leverage diffusion models to decompose a single image and generate visual concepts from various perspectives. To address this challenge, we propose a two-stage framework, CusConcept (short for Customized Visual Concept Decomposition), to extract customized visual concept embedding vectors that can be embedded into prompts for text-to-image generation. In the first stage, CusConcept employs a vocabulary-guided concept decomposition mechanism to build vocabularies along human-specified conceptual axes. The decomposed concepts are obtained by retrieving corresponding vocabularies and learning anchor weights. In the second stage, joint concept refinement is performed to enhance the fidelity and quality of generated images. We further curate an evaluation benchmark for assessing the performance of the open-world concept decomposition task. Our approach can effectively generate high-quality images of the decomposed concepts and produce related lexical predictions as secondary outcomes. Extensive qualitative and quantitative experiments demonstrate the effectiveness of CusConcept.

Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks

Aug 30, 2024

Hongjun Wang, Sagar Vaze, Kai Han

Abstract: Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research. Code: https://github.com/Visual-AI/Dissect-OOD-OSR

* Accepted to IJCV, preprint version; v2: add supplementary
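
The abstract does not list which magnitude-sensitive scoring rules it refers to. As an illustration, the sketch below compares three widely used scores on toy logits: maximum softmax probability (MSP), maximum logit, and the energy (log-sum-exp) score. Choosing these three is an assumption made here, and the logits are made-up numbers. Softmax normalisation discards the overall logit level, while the other two scores retain it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability; unchanged when a constant is added to all logits."""
    return max(softmax(logits))

def max_logit_score(logits):
    """Largest raw logit; grows with the overall logit level."""
    return max(logits)

def energy_score(logits):
    """Log-sum-exp of the logits (negative free energy); higher suggests in-distribution."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

# Two toy inputs with the same relative logits but different overall levels.
low = [2.0, 1.0, 0.0]
high = [z + 5.0 for z in low]

for name, score in [("msp", msp_score), ("max_logit", max_logit_score), ("energy", energy_score)]:
    print(name, score(low), score(high))
```

MSP assigns both inputs the same score, whereas the max-logit and energy scores separate them by the size of the shift.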

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Aug 21, 2024

Jonathan Roberts, Kai Han, Samuel Albanie

Abstract: Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB comprises 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.

Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Aug 13, 2024

Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang

Abstract: Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer a significant performance drop when the compression degrees are mismatched between the training and inference stages, which limits the application of token compression on off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. In advance, we additionally perform a fast parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any downstream off-the-shelf models with any mismatched training and inference compression degrees to acquire universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom can yield improvements of up to 2.3%, 1.5%, and 2.0% in the average performance of DeiT-B, respectively. Code: https://github.com/JieShibo/ToCom

* Accepted to ECCV2024
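
The idea of a plugin that "describes the gap between models across different compression degrees" can be pictured as weight-space arithmetic. The sketch below is a toy version under strong assumptions: the plugin is treated as a plain per-parameter difference and is added directly to a downstream model's weights. The abstract says ToCom is obtained by parameter-efficient self-distillation, which is not reproduced here. All names and numbers are hypothetical.

```python
def subtract(a, b):
    """Element-wise a - b for two weight dictionaries with matching keys."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}

def add(a, b):
    """Element-wise a + b for two weight dictionaries with matching keys."""
    return {k: [x + y for x, y in zip(a[k], b[k])] for k in a}

# Toy pre-trained weights adapted to two compression degrees (made-up numbers).
pretrained_at_r0 = {"proj": [1.0, 2.0, 3.0]}
pretrained_at_r8 = {"proj": [1.5, 1.5, 3.25]}

# Plugin: the gap between the two degrees, measured once on the pre-trained model.
plugin_r0_to_r8 = subtract(pretrained_at_r8, pretrained_at_r0)

# A downstream model fine-tuned at degree r0 that must run at degree r8.
downstream_at_r0 = {"proj": [0.0, 4.0, 2.0]}
downstream_for_r8 = add(downstream_at_r0, plugin_r0_to_r8)
print(downstream_for_r8)
```

The point of the arithmetic is that the plugin is computed once on the pre-trained model and then reused across downstream models without retraining them.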

HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

Aug 08, 2024

Hongjun Wang, Sagar Vaze, Kai Han

Abstract: Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same domain. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains to the labelled set. Our proposed 'HiLo' networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.

* 39 pages, 9 figures, 26 tables
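
The intuition that semantic and domain clusterings should be independent can be checked on discrete cluster labels, where mutual information is zero exactly when the two labelings are independent. The sketch below computes empirical mutual information between two toy labelings. It only illustrates the quantity being minimized: the abstract does not specify the estimator HiLo applies to its feature representations, and the labels here are made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete labelings."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts.
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# Toy cluster assignments for eight images.
semantic = [0, 0, 1, 1, 0, 0, 1, 1]
domain_independent = [0, 1, 0, 1, 0, 1, 0, 1]  # domain tells nothing about semantics
domain_entangled = list(semantic)              # domain fully determines semantics

mi_independent = mutual_information(semantic, domain_independent)
mi_entangled = mutual_information(semantic, domain_entangled)
print(mi_independent, mi_entangled)
```

Driving this quantity toward zero corresponds to the first case, where knowing an image's domain cluster gives no information about its semantic cluster.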

LatentArtiFusion: An Effective and Efficient Histological Artifacts Restoration Framework

Jul 29, 2024

Zhenqi He, Wenrui Liu, Minghao Yin, Kai Han

Abstract: Histological artifacts pose challenges for both pathologists and Computer-Aided Diagnosis (CAD) systems, leading to errors in analysis. Current approaches for histological artifact restoration, based on Generative Adversarial Networks (GANs) and pixel-level Diffusion Models, suffer from performance limitations and computational inefficiencies. In this paper, we propose a novel framework, LatentArtiFusion, which leverages the latent diffusion model (LDM) to reconstruct histological artifacts with high performance and computational efficiency. Unlike traditional pixel-level diffusion frameworks, LatentArtiFusion executes the restoration process in a lower-dimensional latent space, significantly improving computational efficiency. Moreover, we introduce a novel regional artifact reconstruction algorithm in latent space to prevent mistransfer in non-artifact regions, distinguishing our approach from GAN-based methods. Through extensive experiments on real-world histology datasets, LatentArtiFusion demonstrates remarkable speed, outperforming state-of-the-art pixel-level diffusion frameworks by more than 30X. It also consistently surpasses GAN-based methods by at least 5% across multiple evaluation metrics. Furthermore, we evaluate the effectiveness of our proposed framework in downstream tissue classification tasks, showcasing its practical utility. Code is available at https://github.com/bugs-creator/LatentArtiFusion.

* Accepted to DGM4MICCAI2024
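
One common way to confine a restoration to a region, borrowed here from mask-based inpainting, is to blend the restored and original latents with a binary mask so that values outside the artifact region stay untouched. The sketch below shows that blending on a toy 2D latent. Treating it as the paper's regional reconstruction algorithm would be an assumption, since the abstract gives no details. All names and values are hypothetical.

```python
def blend_latents(original, restored, mask):
    """Take restored values where mask is 1 and original values where mask is 0."""
    return [
        [r if m else o for o, r, m in zip(o_row, r_row, m_row)]
        for o_row, r_row, m_row in zip(original, restored, mask)
    ]

# Toy 2x3 latent maps (made-up numbers).
original = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
restored = [[9.0, 9.0, 9.0],
            [9.0, 9.0, 9.0]]
# Artifact mask downsampled to latent resolution: 1 marks an artifact location.
mask = [[0, 1, 1],
        [0, 0, 1]]

blended = blend_latents(original, restored, mask)
print(blended)
```

Locations with mask value 0 keep their original latent values exactly, which is the property that prevents changes from leaking into non-artifact regions.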

PromptCCD: Learning Gaussian Mixture Prompt Pool for Continual Category Discovery

Jul 26, 2024

Fernando Julio Cendra, Bingchen Zhao, Kai Han

Abstract: We tackle the problem of Continual Category Discovery (CCD), which aims to automatically discover novel categories in a continuous stream of unlabeled data while mitigating the challenge of catastrophic forgetting -- an open problem that persists even in conventional, fully supervised continual learning. To address this challenge, we propose PromptCCD, a simple yet effective framework that utilizes a Gaussian Mixture Model (GMM) as a prompting method for CCD. At the core of PromptCCD lies the Gaussian Mixture Prompting (GMP) module, which acts as a dynamic pool that updates over time to facilitate representation learning and prevent forgetting during category discovery. Moreover, GMP enables on-the-fly estimation of category numbers, allowing PromptCCD to discover categories in unlabeled data without prior knowledge of the category numbers. We extend the standard evaluation metric for Generalized Category Discovery (GCD) to CCD and benchmark state-of-the-art methods on diverse public datasets. PromptCCD significantly outperforms existing methods, demonstrating its effectiveness. Project page: https://visual-ai.github.io/promptccd.

* ECCV 2024, Project page: https://visual-ai.github.io/promptccd
