Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jun Xiao

MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Nov 25, 2024

Feifei Shao, Ping Liu, Zhao Wang, Yawei Luo, Hongwei Wang, Jun Xiao

Figure 1 for MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Figure 2 for MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Figure 3 for MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Figure 4 for MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Abstract:Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable $4.1\%$ improvement in the part segmentation task and delivers consistent gains across various PCP applications.

* 15 pages, 6 figures, 3 tables

Via

Access Paper or Ask Questions

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Nov 25, 2024

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen

Abstract:With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at https://github.com/Dawn-LX/CausalCache-VDM

* Technical Report. Code is available at https://github.com/Dawn-LX/CausalCache-VDM

Via

Access Paper or Ask Questions

Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

Nov 19, 2024

Jun Xiao, Zihang Lyu, Hao Xie, Cong Zhang, Yakun Ju, Changjian Shui, Kin-Man Lam

Figure 1 for Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

Figure 2 for Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

Figure 3 for Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

Figure 4 for Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models

Abstract:Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consider data consistency solely in the spatial domain, often resulting in distorted image content. In this paper, we propose a novel frequency-aware guidance loss that can be integrated into various diffusion models in a plug-and-play manner. Our proposed guidance loss, based on 2D discrete wavelet transform, simultaneously enforces content consistency in both the spatial and frequency domains. Experimental results demonstrate the effectiveness of our method in three blind restoration tasks: blind image deblurring, imaging through turbulence, and blind restoration for multiple degradations. Notably, our method achieves a significant improvement in PSNR score, with a remarkable enhancement of 3.72\,dB in image deblurring. Moreover, our method exhibits superior capability in generating images with rich details and reduced distortion, leading to the best visual quality.

* 17 pages, 6 figures, has been accepted by the ECCV 2024: AIM workshop

Via

Access Paper or Ask Questions

Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Nov 15, 2024

Yushen Zuo, Jun Xiao, Kin-Chung Chan, Rongkang Dong, Cuixin Yang, Zongqi He, Hao Xie, Kin-Man Lam

Figure 1 for Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Figure 2 for Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Figure 3 for Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Figure 4 for Towards Multi-View Consistent Style Transfer with One-Step Diffusion via Vision Conditioning

Abstract:The stylization of 3D scenes is an increasingly attractive topic in 3D vision. Although image style transfer has been extensively researched with promising results, directly applying 2D style transfer methods to 3D scenes often fails to preserve the structural and multi-view properties of 3D environments, resulting in unpleasant distortions in images from different viewpoints. To address these issues, we leverage the remarkable generative prior of diffusion-based models and propose a novel style transfer method, OSDiffST, based on a pre-trained one-step diffusion model (i.e., SD-Turbo) for rendering diverse styles in multi-view images of 3D scenes. To efficiently adapt the pre-trained model for multi-view style transfer on small datasets, we introduce a vision condition module to extract style information from the reference style image to serve as conditional input for the diffusion model and employ LoRA in diffusion model for adaptation. Additionally, we consider color distribution alignment and structural similarity between the stylized and content images using two specific loss functions. As a result, our method effectively preserves the structural information and multi-view consistency in stylized images without any 3D information. Experiments show that our method surpasses other promising style transfer methods in synthesizing various styles for multi-view images of 3D scenes. Stylized images from different viewpoints generated by our method achieve superior visual quality, with better structural integrity and less distortion. The source code is available at https://github.com/YushenZuo/OSDiffST.

* Accepted by ECCV 2024 AI for Visual Arts Workshop and Challenges, 18 pages, 7 figures

Via

Access Paper or Ask Questions

Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Oct 12, 2024

Zhi-Song Liu, Li-Wen Wang, Jun Xiao, Vicky Kalogeiton

Figure 1 for Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Figure 2 for Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Figure 3 for Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Figure 4 for Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Abstract:Image style transfer has attracted widespread attention in the past few years. Despite its remarkable results, it requires additional style images available as references, making it less flexible and inconvenient. Using text is the most natural way to describe the style. More importantly, text can describe implicit abstract styles, like styles of specific artists or art movements. In this paper, we propose a Contrastive Learning for Artistic Style Transfer (CLAST) that leverages advanced image-text encoders to control arbitrary style transfer. We introduce a supervised contrastive training strategy to effectively extract style descriptions from the image-text model (i.e., CLIP), which aligns stylization with the text description. To this end, we also propose a novel and efficient adaLN based state space models that explore style-content fusion. Finally, we achieve a text-driven image style transfer. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods in artistic style transfer. More importantly, it does not require online fine-tuning and can render a 512x512 image in 0.03s.

* 18 pages, 8 figures

Via

Access Paper or Ask Questions

Event-Customized Image Generation

Oct 03, 2024

Zhen Wang, Yilei Jiang, Dong Zheng, Jun Xiao, Long Chen

Figure 1 for Event-Customized Image Generation

Figure 2 for Event-Customized Image Generation

Figure 3 for Event-Customized Image Generation

Figure 4 for Event-Customized Image Generation

Abstract:Customized Image Generation, generating customized images with user-specified concepts, has raised significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneer works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient ''exactly same'' reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the ''event'' as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we proposed a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.

Via

Access Paper or Ask Questions

RADAR: Robust Two-stage Modality-incomplete Industrial Anomaly Detection

Oct 02, 2024

Bingchen Miao, Wenqiao Zhang, Juncheng Li, Siliang Tang, Zhaocheng Li, Haochen Shi, Jun Xiao, Yueting Zhuang

Abstract:Multimodal Industrial Anomaly Detection (MIAD), utilizing 3D point clouds and 2D RGB images to identify the abnormal region of products, plays a crucial role in industrial quality inspection. However, the conventional MIAD setting presupposes that all 2D and 3D modalities are paired, overlooking the fact that multimodal data collected from the real world is often imperfect due to missing modalities. Consequently, MIAD models that demonstrate robustness against modal-incomplete data are highly desirable in practice. To address this practical challenge, we introduce a first-of-its-kind study that comprehensively investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD), to consider the imperfect learning environment in which the multimodal information may be incomplete. Not surprisingly, we discovered that most existing MIAD approaches are inadequate for addressing MIIAD challenges, leading to significant performance degradation on the MIIAD benchmark we developed. In this paper, we propose a novel two-stage Robust modAlity-imcomplete fusing and Detecting frAmewoRk, abbreviated as RADAR. Our bootstrapping philosophy is to enhance two stages in MIIAD, improving the robustness of the Multimodal Transformer: i) In feature fusion, we first explore learning modality-incomplete instruction, guiding the pre-trained Multimodal Transformer to robustly adapt to various modality-incomplete scenarios, and implement adaptive parameter learning based on a HyperNetwork; ii) In anomaly detection, we construct a real-pseudo hybrid module to highlight the distinctiveness of modality combinations, further enhancing the robustness of the MIIAD model. Our experimental results demonstrate that the proposed RADAR significantly surpasses conventional MIAD methods in terms of effectiveness and robustness on our newly created MIIAD dataset, underscoring its practical application value.

Via

Access Paper or Ask Questions

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Sep 30, 2024

Jiacong Wang, Bohong Wu, Haiyong Jiang, Xun Zhou, Xin Xiao, Haoyuan Guo, Jun Xiao

Figure 1 for World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Figure 2 for World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Figure 3 for World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Figure 4 for World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Abstract:Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous researches on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture of specialists in caption and OCR, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability. Our code is available at https://github.com/foundation-multimodal-models/World2Code.

* Accepted at EMNLP 2024 Main Conference, 16pages

Via

Access Paper or Ask Questions

From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation

Jul 12, 2024

Hanrong Shi, Lin Li, Jun Xiao, Yueting Zhuang, Long Chen

Abstract:Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware features, which inherently focus on the contours and boundaries of objects. To bridge this gap, we propose a model-agnostic Curricular shApe-aware FEature (CAFE) learning strategy for PSG. Specifically, we incorporate shape-aware features (i.e., mask features and boundary features) into PSG, moving beyond reliance solely on bbox features. Furthermore, drawing inspiration from human cognition, we propose to integrate shape-aware features in an easy-to-hard manner. To achieve this, we categorize the predicates into three groups based on cognition learning difficulty and correspondingly divide the training process into three stages. Each stage utilizes a specialized relation classifier to distinguish specific groups of predicates. As the learning difficulty of predicates increases, these classifiers are equipped with features of ascending complexity. We also incorporate knowledge distillation to retain knowledge acquired in earlier stages. Due to its model-agnostic nature, CAFE can be seamlessly incorporated into any PSG model. Extensive experiments and ablations on two PSG tasks under both robust and zero-shot PSG have attested to the superiority and robustness of our proposed CAFE, which outperforms existing state-of-the-art methods by a large margin.

* Accepted by IJCV

Via

Access Paper or Ask Questions

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Jun 16, 2024

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao

Figure 1 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 2 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 3 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Figure 4 for ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Abstract:With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporal consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on last frames of previous clip. However, existing approaches all involve bidirectional computations, which restricts the receptive context of each autoregression step, and results in the model lacking long-term dependencies. Inspired from the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into VDM, which forces each generated frame to depend on its previous frames. For Frame as Prompt, we inject the conditional frames by concatenating them with noisy frames (frames to be generated) along the temporal axis. Consequently, we present Video Diffusion GPT (ViD-GPT). Based on the two key designs, in each autoregression step, it is able to acquire long-term context from prompting frames concatenated by all previously generated frames. Additionally, we bring the kv-cache mechanism to VDMs, which eliminates the redundant computation from overlapped frames, significantly boosting the inference speed. Extensive experiments demonstrate that our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation. Code will be available at https://github.com/Dawn-LX/Causal-VideoGen.

* Code will be available at https://github.com/Dawn-LX/Causal-VideoGen

Via

Access Paper or Ask Questions