Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shunli Wang

Center of Ultra-precision Optoelectronic Instrument engineering, Harbin Institute of Technology, Key Lab of Ultra-precision Intelligent Instrumentation, Harbin Institute of Technology

VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

Apr 16, 2026

Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding

Abstract:In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection.The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs.To solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? AND How to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL).Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities.Specifically, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT), for critical CoT construction.In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM.In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.

Via

Access Paper or Ask Questions

Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Nov 05, 2024

Mingcheng Li, Dingkang Yang, Yang Liu, Shunli Wang, Jiawei Chen, Shuaibing Wang, Jinjie Wei, Yue Jiang, Qingyao Xu, Xiaolu Hou(+4 more)

Figure 1 for Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Figure 2 for Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Figure 3 for Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Figure 4 for Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Abstract:Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes better sentiment analysis compared to utilizing only a single modality. Nevertheless, in real-world applications, many unavoidable factors may lead to situations of uncertain modality missing, thus hindering the effectiveness of multimodal modeling and degrading the model's performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Ultimately, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.

* Accepted by NeurIPS 2024

Via

Access Paper or Ask Questions

Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Aug 27, 2024

Zhisheng Wang, Yanxu Sun, Shangyu Li, Legeng Lin, Shunli Wang, Junning Cui

Figure 1 for Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Figure 2 for Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Figure 3 for Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Figure 4 for Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Abstract:For extending CT field-of-view to perform non-destructive testing, the Symmetric Multi-Linear trajectory Computed Tomography (SMLCT) has been developed as a successful example of non-standard CT scanning modes. However, inevitable geometric errors can cause severe artifacts in the reconstructed images. The existing calibration method for SMLCT is both crude and inefficient. It involves reconstructing hundreds of images by exhaustively substituting each potential error, and then manually identifying the images with the fewest geometric artifacts to estimate the final geometric errors for calibration. In this paper, we comprehensively and efficiently address the challenging geometric artifacts in SMLCT, , and the corresponding works mainly involve theory, method, and generalization. In particular, after identifying sensitive parameters and conducting some theory analysis of geometric artifacts, we summarize several key properties between sensitive geometric parameters and artifact characteristics. Then, we further construct mathematical relationships that relate sensitive geometric errors to the pixel offsets of reconstruction images with artifact characteristics. To accurately extract pixel bias, we innovatively adapt the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm, commonly used in sound processing, for our image registration task for each paired symmetric LCT. This adaptation leads to the design of a highly efficient rigid translation registration method. Simulation and physical experiments have validated the excellent performance of this work. Additionally, our results demonstrate significant generalization to common rotated CT and a variant of SMLCT.

* 15 pages, 10 figures

Via

Access Paper or Ask Questions

MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Aug 17, 2024

Xiao Zhao, Xukun Zhang, Dingkang Yang, Mingyang Sun, Mingcheng Li, Shunli Wang, Lihua Zhang

Figure 1 for MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Figure 2 for MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Figure 3 for MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Figure 4 for MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Abstract:Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on nuScenes dataset show that compared with previous state-of-the-art MTL methods, MaskBEV achieves 1.3 NDS improvement in 3D object detection and 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly leading inference speed.

* Accepted to ACM MM 2024

Via

Access Paper or Ask Questions

Faster Diffusion Action Segmentation

Aug 04, 2024

Shuaibing Wang, Shunli Wang, Mingcheng Li, Dingkang Yang, Haopeng Kuang, Ziyun Qian, Lihua Zhang

Figure 1 for Faster Diffusion Action Segmentation

Figure 2 for Faster Diffusion Action Segmentation

Figure 3 for Faster Diffusion Action Segmentation

Figure 4 for Faster Diffusion Action Segmentation

Abstract:Temporal Action Segmentation (TAS) is an essential task in video analysis, aiming to segment and classify continuous frames into distinct action segments. However, the ambiguous boundaries between actions pose a significant challenge for high-precision segmentation. Recent advances in diffusion models have demonstrated substantial success in TAS tasks due to their stable training process and high-quality generation capabilities. However, the heavy sampling steps required by diffusion models pose a substantial computational burden, limiting their practicality in real-time applications. Additionally, most related works utilize Transformer-based encoder architectures. Although these architectures excel at capturing long-range dependencies, they incur high computational costs and face feature-smoothing issues when processing long video sequences. To address these challenges, we propose EffiDiffAct, an efficient and high-performance TAS algorithm. Specifically, we develop a lightweight temporal feature encoder that reduces computational overhead and mitigates the rank collapse phenomenon associated with traditional self-attention mechanisms. Furthermore, we introduce an adaptive skip strategy that allows for dynamic adjustment of timestep lengths based on computed similarity metrics during inference, thereby further enhancing computational efficiency. Comprehensive experiments on the 50Salads, Breakfast, and GTEA datasets demonstrated the effectiveness of the proposed algorithm.

* 25 pages, 6 figures

Via

Access Paper or Ask Questions

MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Jun 18, 2024

Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, Lihua Zhang

Figure 1 for MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Figure 2 for MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Figure 3 for MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Figure 4 for MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Abstract:When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be implemented in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. Moreover, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and apply the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, thereby proposing a method called MedThink. Our experiments on various LVLMs demonstrate that our novel data construction method tailored for the medical domain significantly improves the model's performance in medical image report generation tasks and substantially mitigates the hallucinations. All resources of this work will be released soon.

Via

Access Paper or Ask Questions

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jun 14, 2024

Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang

Figure 1 for Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Figure 2 for Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Figure 3 for Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Figure 4 for Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Abstract:Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

Via

Access Paper or Ask Questions

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

May 29, 2024

Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang(+4 more)

Figure 1 for PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Figure 2 for PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Figure 3 for PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Figure 4 for PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Abstract:Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Upon well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Immediately, the full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct doctor downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. Our model and dataset will be open-source for community development.

* A Technical Report on a Powerful Chinese Medical Large Language Model

Via

Access Paper or Ask Questions

SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

May 05, 2024

Ziyun Qian, Zeyu Xiao, Zhenyi Wu, Dingkang Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Dongliang Kou, Lihua Zhang

Figure 1 for SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

Figure 2 for SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

Figure 3 for SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

Figure 4 for SMCD: High Realism Motion Style Transfer via Mamba-based Diffusion

Abstract:Motion style transfer is a significant research direction in multimedia applications. It enables the rapid switching of different styles of the same motion for virtual digital humans, thus vastly increasing the diversity and realism of movements. It is widely applied in multimedia scenarios such as movies, games, and the Metaverse. However, most of the current work in this field adopts the GAN, which may lead to instability and convergence issues, making the final generated motion sequence somewhat chaotic and unable to reflect a highly realistic and natural style. To address these problems, we consider style motion as a condition and propose the Style Motion Conditioned Diffusion (SMCD) framework for the first time, which can more comprehensively learn the style features of motion. Moreover, we apply Mamba model for the first time in the motion style transfer field, introducing the Motion Style Mamba (MSM) module to handle longer motion sequences. Thirdly, aiming at the SMCD framework, we propose Diffusion-based Content Consistency Loss and Content Consistency Loss to assist the overall framework's training. Finally, we conduct extensive experiments. The results reveal that our method surpasses state-of-the-art methods in both qualitative and quantitative comparisons, capable of generating more realistic motion sequences.

Via

Access Paper or Ask Questions

Robust Emotion Recognition in Context Debiasing

Mar 09, 2024

Dingkang Yang, Kun Yang, Mingcheng Li, Shunli Wang, Shuaibing Wang, Lihua Zhang

Abstract:Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements, the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation, causing severe performance bottlenecks and confounding valuable context priors. In this paper, we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically, we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph, CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference, we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes, resulting in bias mitigation and robust prediction. As a model-agnostic framework, CLEF can be readily integrated into existing methods, bringing consistent performance gains.

Via

Access Paper or Ask Questions