Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhen Lei

RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

Dec 17, 2024

Yiheng Li, Yang Yang, Zhen Lei

Figure 1 for RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

Figure 2 for RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

Figure 3 for RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

Figure 4 for RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

Abstract:In radar-camera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each interest region and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and random initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results. Our implementation is available at https://github.com/liyih/RCTrans.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data

Dec 02, 2024

Ivan DeAndres-Tame, Ruben Tolosana, Pietro Melzi, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Luis F. Gomez, Aythami Morales, Julian Fierrez(+49 more)

Abstract:Synthetic data is gaining increasing popularity for face recognition technologies, mainly due to the privacy concerns and challenges associated with obtaining real data, including diverse scenarios, quality, and demographic groups, among others. It also offers some advantages over real data, such as the large amount of data that can be generated or the ability to customize it to adapt to specific problem-solving needs. To effectively use such data, face recognition models should also be specifically designed to exploit synthetic data to its fullest potential. In order to promote the proposal of novel Generative AI methods and synthetic data, and investigate the application of synthetic data to better train face recognition systems, we introduce the 2nd FRCSyn-onGoing challenge, based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024. This is an ongoing challenge that provides researchers with an accessible platform to benchmark i) the proposal of novel Generative AI methods and synthetic data, and ii) novel face recognition systems that are specifically proposed to take advantage of synthetic data. We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition such as demographic bias, domain adaptation, and performance constraints in demanding situations, such as age disparities between training and testing, changes in the pose, or occlusions. Very interesting findings are obtained in this second edition, including a direct comparison with the first one, in which synthetic databases were restricted to DCFace and GANDiffFace.

Via

Access Paper or Ask Questions

SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Nov 27, 2024

Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Qifeng Chen, Zhaoxiang Zhang

Figure 1 for SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Figure 2 for SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Figure 3 for SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Figure 4 for SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Abstract:Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework, SimCMF, to study an important problem: cross-modal fine-tuning from vision foundation models trained on natural RGB images to other imaging modalities of different physical properties (e.g., polarization). In SimCMF, we conduct a thorough analysis of different basic components from the most naive design and ultimately propose a novel cross-modal alignment module to address the modality misalignment problem. We apply SimCMF to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new imaging modality. Given the absence of relevant benchmarks, we construct a benchmark for performance evaluation. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. SimCMF can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. The code is available at https://github.com/mt-cly/SimCMF

* project page: https://mt-cly.github.io/SimCMF.github.io/. arXiv admin note: substantial text overlap with arXiv:2409.08083

Via

Access Paper or Ask Questions

MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Nov 26, 2024

Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei

Abstract:Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.

Via

Access Paper or Ask Questions

Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Nov 25, 2024

Xiangyu Zhu, Chang Yu, Jiankuo Zhao, Zhaoxiang Zhang, Stan Z. Li, Zhen Lei

Figure 1 for Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Figure 2 for Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Figure 3 for Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Figure 4 for Revisiting Marr in Face: The Building of 2D--2.5D--3D Representations in Deep Neural Networks

Abstract:David Marr's seminal theory of vision proposes that the human visual system operates through a sequence of three stages, known as the 2D sketch, the 2.5D sketch, and the 3D model. In recent years, Deep Neural Networks (DNN) have been widely thought to have reached a level comparable to human vision. However, the mechanisms by which DNNs accomplish this and whether they adhere to Marr's 2D--2.5D--3D construction theory remain unexplored. In this paper, we delve into the perception task to explore these questions and find evidence supporting Marr's theory. We introduce a graphics probe, a sub-network crafted to reconstruct the original image from the network's intermediate layers. The key to the graphics probe is its flexible architecture that supports image in both 2D and 3D formats, as well as in a transitional state between them. By injecting graphics probes into neural networks, and analyzing their behavior in reconstructing images, we find that DNNs initially encode images as 2D representations in low-level layers, and finally construct 3D representations in high-level layers. Intriguingly, in mid-level layers, DNNs exhibit a hybrid state, building a geometric representation that s sur normals within a narrow depth range, akin to the appearance of a low-relief sculpture. This stage resembles the 2.5D representations, providing a view of how DNNs evolve from 2D to 3D in the perception process. The graphics probe therefore serves as a tool for peering into the mechanisms of DNN, providing empirical support for Marr's theory.

Via

Access Paper or Ask Questions

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Nov 23, 2024

Te Yang, Jian Jia, Xiangyu Zhu, Weisong Zhao, Bo Wang, Yanhua Cheng, Yan Li, Shengyuan Liu, Quan Chen, Peng Jiang(+2 more)

Figure 1 for Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Figure 2 for Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Figure 3 for Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Figure 4 for Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

Abstract:Large Language Models (LLMs) have strong instruction-following capability to interpret and execute tasks as directed by human commands. Multimodal Large Language Models (MLLMs) have inferior instruction-following ability compared to LLMs. However, there is a significant gap in the instruction-following capabilities between the MLLMs and LLMs. In this study, we conduct a pilot experiment, which demonstrates that spatially down-sampling visual tokens significantly enhances the instruction-following capability of MLLMs. This is attributed to the substantial redundancy in visual modality. However, this intuitive method severely impairs the MLLM's multimodal understanding capability. In this paper, we propose Visual-Modality Token Compression (VMTC) and Cross-Modality Attention Inhibition (CMAI) strategies to alleviate this gap between MLLMs and LLMs by inhibiting the influence of irrelevant visual tokens during content generation, increasing the instruction-following ability of the MLLMs while retaining their multimodal understanding capacity. In VMTC module, the primary tokens are retained and the redundant tokens are condensed by token clustering and merging. In CMAI process, we aggregate text-to-image attentions by text-to-text attentions to obtain a text-to-image focus score. Attention inhibition is performed on the text-image token pairs with low scores. Our comprehensive experiments over instruction-following capabilities and VQA-V2, GQA, TextVQA, MME and MMBench five benchmarks, demonstrate that proposed strategy significantly enhances the instruction following capability of MLLMs while preserving the ability to understand and process multimodal inputs.

Via

Access Paper or Ask Questions

What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Nov 23, 2024

Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

Figure 1 for What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Figure 2 for What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Figure 3 for What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Figure 4 for What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Abstract:While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.

Via

Access Paper or Ask Questions

GWQ: Gradient-Aware Weight Quantization for Large Language Models

Oct 30, 2024

Yihua Shao, Siyu Liang, Xiaolin Lin, Zijian Ling, Zixian Zhu, Minxi Yan, Haiyang Liu, Siyu Chen, Ziyang Yan, Yilan Meng(+9 more)

Figure 1 for GWQ: Gradient-Aware Weight Quantization for Large Language Models

Figure 2 for GWQ: Gradient-Aware Weight Quantization for Large Language Models

Figure 3 for GWQ: Gradient-Aware Weight Quantization for Large Language Models

Figure 4 for GWQ: Gradient-Aware Weight Quantization for Large Language Models

Abstract:Large language models (LLMs) show impressive performance in solving complex languagetasks. However, its large number of parameterspresent significant challenges for the deployment and application of the model on edge devices. Compressing large language models to low bits can enable them to run on resource-constrained devices, often leading to performance degradation. To address this problem, we propose gradient-aware weight quantization (GWQ), the first quantization approach for low-bit weight quantization that leverages gradients to localize outliers, requiring only a minimal amount of calibration data for outlier detection. GWQ retains the weights corresponding to the top 1% outliers preferentially at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format. GWQ found experimentally that utilizing the sensitive weights in the gradient localization model is more scientific compared to utilizing the sensitive weights in the Hessian matrix localization model. Compared to current quantization methods, GWQ can be applied to multiple language models and achieves lower PPL on the WikiText2 and C4 dataset. In the zero-shot task, GWQ quantized models have higher accuracy compared to other quantization methods.GWQ is also suitable for multimodal model quantization, and the quantized Qwen-VL family model is more accurate than other methods. zero-shot target detection task dataset RefCOCO outperforms the current stat-of-the-arts method SPQR. GWQ achieves 1.2x inference speedup in comparison to the original model, and effectively reduces the inference memory.

Via

Access Paper or Ask Questions

DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unspervised Dimensionality Reduction

Oct 25, 2024

Zelin Zang, Yuhao Wang, Jinlin Wu, Hong Liu, Yue Shen, Stan. Z Li, Zhen Lei

Figure 1 for DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unspervised Dimensionality Reduction

Figure 2 for DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unspervised Dimensionality Reduction

Figure 3 for DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unspervised Dimensionality Reduction

Figure 4 for DMT-HI: MOE-based Hyperbolic Interpretable Deep Manifold Transformation for Unspervised Dimensionality Reduction

Abstract:Dimensionality reduction (DR) plays a crucial role in various fields, including data engineering and visualization, by simplifying complex datasets while retaining essential information. However, the challenge of balancing DR accuracy and interpretability remains crucial, particularly for users dealing with high-dimensional data. Traditional DR methods often face a trade-off between precision and transparency, where optimizing for performance can lead to reduced interpretability, and vice versa. This limitation is especially prominent in real-world applications such as image, tabular, and text data analysis, where both accuracy and interpretability are critical. To address these challenges, this work introduces the MOE-based Hyperbolic Interpretable Deep Manifold Transformation (DMT-HI). The proposed approach combines hyperbolic embeddings, which effectively capture complex hierarchical structures, with Mixture of Experts (MOE) models, which dynamically allocate tasks based on input features. DMT-HI enhances DR accuracy by leveraging hyperbolic embeddings to represent the hierarchical nature of data, while also improving interpretability by explicitly linking input data, embedding outcomes, and key features through the MOE structure. Extensive experiments demonstrate that DMT-HI consistently achieves superior performance in both DR accuracy and model interpretability, making it a robust solution for complex data analysis. The code is available at \url{https://github.com/zangzelin/code_dmthi}.

* 14 pages, 8 figures

Via

Access Paper or Ask Questions

Long-time Integration of Nonlinear Wave Equations with Neural Operators

Oct 21, 2024

Guanhang Lei, Zhen Lei, Lei Shi

Figure 1 for Long-time Integration of Nonlinear Wave Equations with Neural Operators

Figure 2 for Long-time Integration of Nonlinear Wave Equations with Neural Operators

Figure 3 for Long-time Integration of Nonlinear Wave Equations with Neural Operators

Figure 4 for Long-time Integration of Nonlinear Wave Equations with Neural Operators

Abstract:Neural operators have shown promise in solving many types of Partial Differential Equations (PDEs). They are significantly faster compared to traditional numerical solvers once they have been trained with a certain amount of observed data. However, their numerical performance in solving time-dependent PDEs, particularly in long-time prediction of dynamic systems, still needs improvement. In this paper, we focus on solving the long-time integration of nonlinear wave equations via neural operators by replacing the initial condition with the prediction in a recurrent manner. Given limited observed temporal trajectory data, we utilize some intrinsic features of these nonlinear wave equations, such as conservation laws and well-posedness, to improve the algorithm design and reduce accumulated error. Our numerical experiments examine these improvements in the Korteweg-de Vries (KdV) equation, the sine-Gordon equation, and a semilinear wave equation on the irregular domain.

Via

Access Paper or Ask Questions