Lei Zhang

Sid

Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation

Jul 17, 2024

Kaixin Bai, Lei Zhang, Zhaopeng Chen, Fang Wan, Jianwei Zhang

Abstract:Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both RGB and physically realistic depth images, surpassing previous dataset generation tools. We create an RGBD dataset tailored for robotic industrial grasping scenarios and evaluate it across various tasks, including object detection, instance segmentation, and embedding sim2real visual perception in industrial robotic grasping. By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings. Project details are available at https://baikaixinpublic.github.io/structured light 3D synthesizer/.

* 7 pages, 2024 IEEE International Conference on Robotics and Automation

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

Jul 13, 2024

Ruihuang Li, Zhengqiang Zhang, Chenhang He, Zhiyuan Ma, Vishal M. Patel, Lei Zhang

Abstract:Recent vision-language pre-training models have exhibited remarkable generalization ability in zero-shot recognition tasks. Previous open-vocabulary 3D scene understanding methods mostly focus on training 3D models using either image or text supervision while neglecting the collective strength of all modalities. In this work, we propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space for maximizing their synergistic benefits. Instead of extracting coarse view- or region-level text prompts, we leverage large vision-language models to extract complete category information and scalable scene descriptions to build the text modality, and take image modality as the bridge to build dense point-pixel-text associations. Besides, in order to enhance the generalization ability of the 2D model for downstream 3D tasks without compromising the open-vocabulary capability, we employ a dual-path integration approach to combine frozen CLIP visual features and learnable mask features. Extensive experiments show that our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.

* Accepted by ECCV 2024

LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models

Jul 12, 2024

Yabin Zhang, Wenjie Zhu, Chenhang He, Lei Zhang

Abstract:Out-of-distribution (OOD) detection is crucial for model reliability, as it identifies samples from unknown classes and reduces errors due to unexpected inputs. Vision-Language Models (VLMs) such as CLIP are emerging as powerful tools for OOD detection by integrating multi-modal information. However, the practical application of such systems is challenged by manual prompt engineering, which demands domain expertise and is sensitive to linguistic nuances. In this paper, we introduce Label-driven Automated Prompt Tuning (LAPT), a novel approach to OOD detection that reduces the need for manual prompt engineering. We develop distribution-aware prompts with in-distribution (ID) class names and negative labels mined automatically. Training samples linked to these class labels are collected autonomously via image synthesis and retrieval methods, allowing for prompt learning without manual effort. We utilize a simple cross-entropy loss for prompt optimization, with cross-modal and cross-distribution mixing strategies to reduce image noise and explore the intermediate space between distributions, respectively. The LAPT framework operates autonomously, requiring only ID class names as input and eliminating the need for manual intervention. With extensive experiments, LAPT consistently outperforms manually crafted prompts, setting a new standard for OOD detection. Moreover, LAPT not only enhances the distinction between ID and OOD samples, but also improves the ID classification accuracy and strengthens the generalization robustness to covariate shifts, resulting in outstanding performance in challenging full-spectrum OOD detection tasks. Code is available at https://github.com/YBZh/LAPT.

* ECCV 2024; code and supplementary material are available at: https://github.com/YBZh/LAPT
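The LAPT abstract mentions a cross-entropy loss combined with mixing strategies but does not give the formulas. The sketch below is only one plausible reading: it assumes a mixup-style convex combination of two feature vectors and their labels, followed by a soft-label cross-entropy. The function names (`mix`, `soft_cross_entropy`) and the weight `lam` are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import List


def mix(a: List[float], b: List[float], lam: float) -> List[float]:
    """Convex combination lam * a + (1 - lam) * b (assumed mixup-style mixing)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def soft_cross_entropy(logits: List[float], target: List[float]) -> float:
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for t, v in zip(target, logits))


# Hypothetical example: mix an ID sample with a negative-label sample.
id_label = [1.0, 0.0]    # one-hot: in-distribution class
neg_label = [0.0, 1.0]   # one-hot: negative (OOD proxy) class
mixed_label = mix(id_label, neg_label, lam=0.25)
loss = soft_cross_entropy([0.0, 0.0], mixed_label)
```

Under this assumption, the same `mix` helper could be applied to an image feature and a text feature of the same class (cross-modal) or to samples from the ID and negative sets (cross-distribution); how the paper actually does this may differ.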

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Jul 11, 2024

Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li(+3 more)

Abstract:Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

* 14 pages, 9 figures

A Text-to-Game Engine for UGC-Based Role-Playing Games

Jul 11, 2024

Lei Zhang, Xuezheng Peng, Shuyi Yang, Feiyang Wang

Abstract:The shift from professionally generated content (PGC) to user-generated content (UGC) has revolutionized various media formats, from text to video. With the rapid advancements in generative AI, a similar shift is set to transform the game industry, particularly in the realm of role-playing games (RPGs). This paper introduces a new framework for a text-to-game engine that utilizes foundation models to convert simple textual inputs into complex, interactive RPG experiences. The engine dynamically renders the game story in a multi-modal format and adjusts the game character, environment, and mechanics in real-time in response to player actions. Using this framework, we developed the "Zagii" game engine, which has successfully supported hundreds of RPG games across a diverse range of genres and facilitated tens of thousands of online user gameplay instances. This validates the effectiveness of our framework. Our work showcases the potential for a more open and democratized gaming paradigm, highlighting the transformative impact of generative AI on the game life cycle.

* 13 pages, 11 figures

Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

Jul 11, 2024

Suqi Song, Chenxu Zhang, Peng Zhang, Pengkun Li, Fenglong Song, Lei Zhang

Abstract:Urban waterlogging poses a major risk to public safety and infrastructure. Conventional methods using water-level sensors require high maintenance and can hardly achieve full coverage. Recent advances employ surveillance camera imagery and deep learning for detection, yet these struggle amidst scarce data and adverse environmental conditions. In this paper, we establish a challenging Urban Waterlogging Benchmark (UW-Bench) under diverse adverse conditions to advance real-world applications. We propose a Large-Small Model co-adapter paradigm (LSM-adapter), which harnesses the substantial generic segmentation potential of a large model and the specific task-directed guidance of a small model. Specifically, a Triple-S Prompt Adapter module and a Dynamic Prompt Combiner are proposed to generate and then merge multiple prompts for mask decoder adaptation. Meanwhile, a Histogram Equalization Adapter module is designed to infuse image-specific information for image encoder adaptation. Results and analysis show the challenge of our benchmark and the superiority of our algorithm. Project page: https://github.com/zhang-chenxu/LSM-Adapter

* ECCV 2024

AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review

Jul 09, 2024

Rui Jin, Derun Li, Dehui Xiang, Lei Zhang, Hailing Zhou, Fei Shi, Weifang Zhu, Jing Cai, Tao Peng, Xinjian Chen

Abstract:Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The advent of precision medicine and a significant increase in clinical capacity have spurred the need for various data-driven tasks in the field of medical imaging. Recently, numerous machine learning and data mining tools have been integrated into various medical areas, including image segmentation. This article proposes a new classification method that differentiates supervision types, either in number or kind, during the training phase. Subsequently, we conducted a survey on artificial intelligence (AI)-based automatic prostate segmentation methods, examining the advantages and limitations of each. Additionally, we introduce variants of evaluation metrics for the verification and performance assessment of the segmentation method and summarize the current challenges. Finally, future research directions and development trends are discussed, reflecting the outcomes of our literature survey, suggesting high-precision detection and treatment of prostate cancer as a promising avenue.

EMBANet: A Flexible Efficient Multi-branch Attention Network

Jul 07, 2024

Keke Zu, Hu Zhang, Jian Lu, Lei Zhang, Chen Xu

Abstract:This work presents a novel module, namely multi-branch concat (MBC), to process the input tensor and obtain the multi-scale feature map. The proposed MBC module brings new degrees of freedom (DoF) for the design of attention networks by allowing the type of transformation operators and the number of branches to be flexibly adjusted. Two important transformation operators, multiplex and split, are considered in this work, both of which can represent multi-scale features at a more granular level and increase the range of receptive fields. By integrating the MBC and attention module, a multi-branch attention (MBA) module is consequently developed to capture the channel-wise interaction of feature maps for establishing the long-range channel dependency. By substituting the 3x3 convolutions in the bottleneck blocks of the ResNet with the proposed MBA, a novel block, namely efficient multi-branch attention (EMBA), is obtained, which can be easily plugged into state-of-the-art backbone CNN models. Furthermore, a new backbone network called EMBANet is established by stacking the EMBA blocks. The proposed EMBANet is extensively evaluated on representative computer vision tasks including classification, detection, and segmentation, and it demonstrates consistently superior performance over the popular backbones.
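The "split" operator of the MBC module is described only at a high level in the abstract. As a toy illustration, the sketch below assumes that split means dividing the input channels into equal groups, passing each group through a branch with its own kernel size, and concatenating the results. It uses 1D signals and a simple box filter in place of learned convolutions; all names (`box_filter`, `multi_branch_concat_split`) are hypothetical and do not come from the paper.

```python
from typing import List

Channel = List[float]


def box_filter(signal: Channel, k: int) -> Channel:
    """Same-length moving average with odd kernel size k and zero padding."""
    if k < 1 or k % 2 == 0:
        raise ValueError("kernel size must be a positive odd integer")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i:i + k]) / k for i in range(len(signal))]


def multi_branch_concat_split(channels: List[Channel],
                              kernel_sizes: List[int]) -> List[Channel]:
    """Split channels into len(kernel_sizes) groups, filter each group with
    its branch's kernel size, and concatenate the outputs channel-wise."""
    n_branches = len(kernel_sizes)
    if n_branches == 0 or len(channels) % n_branches != 0:
        raise ValueError("channel count must be divisible by branch count")
    group = len(channels) // n_branches
    out: List[Channel] = []
    for b, k in enumerate(kernel_sizes):
        for ch in channels[b * group:(b + 1) * group]:
            out.append(box_filter(ch, k))
    return out


# Two channels, two branches: kernel size 1 (identity) and kernel size 3.
features = [[1.0, 2.0, 3.0], [3.0, 3.0, 3.0]]
multi_scale = multi_branch_concat_split(features, kernel_sizes=[1, 3])
```

The output keeps the channel count of the input while each group sees a different receptive field, which is the multi-scale property the abstract attributes to MBC. The "multiplex" operator and the attention step are not covered here.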

TokenPacker: Efficient Visual Projector for Multimodal LLM

Jul 02, 2024

Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

Abstract:The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly. Some recent works have introduced a resampler or abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. Specifically, we first interpolate the visual features as a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89%, while achieving comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source code can be found at https://github.com/CircleRadon/TokenPacker.

* 16 pages, Code: https://github.com/CircleRadon/TokenPacker
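The 75%~89% compression range in the TokenPacker abstract is consistent with downsampling the visual token grid by a factor of 2 or 3 per side, since 1 - 1/4 = 75% and 1 - 1/9 is about 88.9%. The abstract does not state the factors, so this is an inferred reading. The short sketch below works through the arithmetic; the 24 x 24 input grid (576 tokens) is an assumption chosen for illustration, and the function names are hypothetical.

```python
def compression_ratio(downsample: int) -> float:
    """Fraction of tokens removed when each grid side shrinks by `downsample`."""
    if downsample < 1:
        raise ValueError("downsample factor must be >= 1")
    return 1.0 - 1.0 / (downsample ** 2)


def tokens_after(grid_side: int, downsample: int) -> int:
    """Number of tokens left after reducing a square grid of visual tokens."""
    side = grid_side // downsample
    return side * side


GRID_SIDE = 24  # assumed 24 x 24 = 576 visual tokens before compression

for factor in (2, 3):
    kept = tokens_after(GRID_SIDE, factor)
    removed = compression_ratio(factor)
    print(f"factor {factor}: {kept} tokens kept, {removed:.1%} removed")
```

Under these assumptions the low-resolution point query would hold 144 or 64 tokens instead of 576, which is where the efficiency gain for the LLM would come from.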

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

Jul 02, 2024

Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang

Abstract:By leveraging text-to-image diffusion priors, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have focused on learning a text-to-3D generative network that amortizes multiple text-3D relations and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale up to a large number of text prompts due to the difficulty of aligning the pretrained diffusion prior with the distribution of rendered images from various text prompts. Current state-of-the-art methods such as Variational Score Distillation finetune the pretrained diffusion model to minimize the noise prediction error so as to align the distributions, but this is unstable to train and impairs the model's ability to comprehend numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and can scale up to 100k prompts. It reduces the noise prediction error without changing the weights of the pre-trained diffusion model, thus keeping its strong ability to comprehend prompts. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training, high-quality 3D content synthesis, and its superior prompt consistency, especially under a large prompt corpus.

* Accepted by ECCV 2024. Code available at https://github.com/theEricMa/ScaleDreamer
