Hao Chen

Charlie

SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems

Mar 13, 2025

Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng

Abstract:The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io

* Initially released in September 2024. Project page: https://sciverse-cuhk.github.io

ACE: Concept Editing in Diffusion Models without Performance Degradation

Mar 11, 2025

Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang

Abstract:Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms the advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at https://github.com/littlelittlenine/ACE-zero.git

Robust Latent Matters: Boosting Image Generation with Sampling Error

Mar 11, 2025

Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides

Abstract:Recent image generation schemes typically capture the image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the tokenizer's performance plays an essential role in successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation quality in a discrete latent space and, based on this analysis, propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noise, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a ~400M generator. Code: https://github.com/lxa9867/ImageFolder.

* 17 pages, 13 figures, 6 tables

Weakly Supervised Convolutional Dictionary Learning with Shared and Discriminative Components for Classification

Mar 11, 2025

Hao Chen, Yusen Wu, Dayuan Tan

Abstract:In today's data-driven landscape spanning finance, government, and healthcare sectors, the exponential growth of information necessitates robust solutions for secure storage, efficient dissemination, and fine-grained access control. Convolutional dictionary learning emerges as a powerful approach for extracting meaningful representations from complex data. This paper presents a novel weakly supervised convolutional dictionary learning framework that incorporates both shared and discriminative components for classification tasks. Our approach leverages limited label information to learn dictionaries that capture common patterns across classes while simultaneously highlighting class-specific features. By decomposing the learned representations into shared and discriminative parts, we enhance both feature interpretability and classification performance. Extensive experiments across multiple datasets demonstrate that our method outperforms state-of-the-art approaches, particularly in scenarios with limited labeled data. The proposed framework offers a promising solution for applications requiring both effective feature extraction and accurate classification in weakly supervised settings.

Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation

Mar 11, 2025

Wenqiang Zu, Shenghao Xie, Hao Chen, Lei Ma

Abstract:This paper investigates the critical problem of representation similarity evolution during cross-domain transfer learning, with particular focus on understanding why pre-trained models maintain effectiveness when adapted to medical imaging tasks despite significant domain gaps. The study establishes a rigorous problem definition centered on quantifying and analyzing representation similarity trajectories throughout the fine-tuning process, while carefully delineating the scope to encompass both medical image analysis and broader cross-domain adaptation scenarios. Our empirical findings reveal three critical discoveries: the potential existence of high-performance models that preserve both task accuracy and representation similarity to their pre-trained origins; a robust linear correlation between layer-wise similarity metrics and representation quality indicators; and distinct adaptation patterns that differentiate supervised versus self-supervised pre-training paradigms. The proposed similarity space framework not only provides mechanistic insights into knowledge transfer dynamics but also raises fundamental questions about optimal utilization of pre-trained models. These results advance our understanding of neural network adaptation processes while offering practical implications for transfer learning strategies that extend beyond medical imaging applications. The code will be available once accepted.

* 10 pages, 5 figures

Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

Mar 11, 2025

Yufan Chen, Ching Ting Leung, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao

Abstract:Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with a specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at https://huggingface.co/spaces/CYF200127/RxnIM.

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Mar 11, 2025

Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen

Abstract:While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens that are decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of the policy improvement method StaR and of PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.

* CVPR2025; code will be released at https://github.com/aim-uofa/SegAgent

Accelerated Quasi-Static FEM for Real-Time Modeling of Continuum Robots with Multiple Contacts and Large Deformation

Mar 10, 2025

Hao Chen, Jian Chen, Xinran Liu, Zihui Zhang, Yuanrui Huang, Zhongkai Zhang, Hongbin Liu

Abstract:Continuum robots offer high flexibility and multiple degrees of freedom, making them ideal for navigating narrow lumens. However, accurately modeling their behavior under large deformations and frequent environmental contacts remains challenging. Current methods for solving the deformation of these robots, such as the Model Order Reduction and Gauss-Seidel (GS) methods, suffer from significant drawbacks. They experience reduced computational speed as the number of contact points increases and struggle to balance speed with model accuracy. To overcome these limitations, we introduce a novel finite element method (FEM) named Acc-FEM. Acc-FEM employs a large deformation quasi-static finite element model and integrates an accelerated solver scheme to handle multi-contact simulations efficiently. Additionally, it utilizes parallel computing with Graphics Processing Units (GPU) for real-time updates of the finite element models and collision detection. Extensive numerical experiments demonstrate that Acc-FEM significantly improves computational efficiency in modeling continuum robots with multiple contacts while achieving satisfactory accuracy, addressing the deficiencies of existing methods.

* Accepted by ICRA2025

Keeping Representation Similarity in Finetuning for Medical Image Analysis

Mar 10, 2025

Wenqiang Zu, Shenghao Xie, Hao Chen, Yiming Liang, Lei Ma

Abstract:Foundation models pretrained on large-scale natural images have been widely adapted to medical image analysis through finetuning. This is largely attributed to pretrained representations capturing universal, robust, and generalizable features, which can be reused by downstream tasks. However, these representations have been found to gradually vanish during finetuning, accompanied by a degradation of the foundation model's original abilities, e.g., generalizability. In this paper, we argue that pretrained representations can be well preserved while still effectively adapting to downstream tasks. We study this by proposing a new finetuning method, RepSim, which minimizes the distance between pretrained and finetuned representations by constraining a learnable orthogonal manifold based on similarity invariance. Compared to standard finetuning methods, e.g., full finetuning, our method improves representation similarity by over 30% while maintaining competitive accuracy, and reduces sharpness by 42% across five medical image classification datasets. The code will be released.

* 12 pages, 6 figures

PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Mar 09, 2025

Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen

Abstract:This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We hereby introduce HalFscore, a novel metric that is built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
