Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yiming Sun

X-MethaneWet: A Cross-scale Global Wetland Methane Emission Benchmark Dataset for Advancing Science Discovery with AI

May 23, 2025

Yiming Sun, Shuo Chen, Shengyu Chen, Chonghao Qiu, Licheng Liu, Youmi Oh, Sparkle L. Malone, Gavin McNicol, Qianlai Zhuang, Chris Smith(+2 more)

Abstract:Methane (CH$_4$) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential. Accurately modeling CH$_4$ fluxes across the globe and at fine temporal scales is essential for understanding its spatial and temporal variability and developing effective mitigation strategies. In this work, we introduce the first-of-its-kind cross-scale global wetland methane benchmark dataset (X-MethaneWet), which synthesizes physics-based model simulation data from TEM-MDM and the real-world observation data from FLUXNET-CH$_4$. This dataset can offer opportunities for improving global wetland CH$_4$ modeling and science discovery with new AI algorithms. To set up AI model baselines for methane flux prediction, we evaluate the performance of various sequential deep learning models on X-MethaneWet. Furthermore, we explore four different transfer learning techniques to leverage simulated data from TEM-MDM to improve the generalization of deep learning models on real-world FLUXNET-CH$_4$ observations. Our extensive experiments demonstrate the effectiveness of these approaches, highlighting their potential for advancing methane emission modeling and contributing to the development of more accurate and scalable AI-driven climate models.

* 8 pages, 8 figures, 3 tables

Via

Access Paper or Ask Questions

From Biometrics to Environmental Control: AI-Enhanced Digital Twins for Personalized Health Interventions in Healing Landscapes

May 04, 2025

Yiping Meng, Yiming Sun

Abstract:The dynamic nature of human health and comfort calls for adaptive systems that respond to individual physiological needs in real time. This paper presents an AI-enhanced digital twin framework that integrates biometric signals, specifically electrocardiogram (ECG) data, with environmental parameters such as temperature, humidity, and ventilation. Leveraging IoT-enabled sensors and biometric monitoring devices, the system continuously acquires, synchronises, and preprocesses multimodal data streams to construct a responsive virtual replica of the physical environment. To validate this framework, a detailed case study is conducted using the MIT-BIH noise stress test dataset. ECG signals are filtered and segmented using dynamic sliding windows, followed by extracting heart rate variability (HRV) features such as SDNN, BPM, QTc, and LF/HF ratio. Relative deviation metrics are computed against clean baselines to quantify stress responses. A random forest classifier is trained to predict stress levels across five categories, and Shapley Additive exPlanations (SHAP) is used to interpret model behaviour and identify key contributing features. These predictions are mapped to a structured set of environmental interventions using a Five Level Stress Intervention Mapping, which activates multi-scale responses across personal, room, building, and landscape levels. This integration of physiological insight, explainable AI, and adaptive control establishes a new paradigm for health-responsive built environments. It lays the foundation for the future development of intelligent, personalised healing spaces.

Via

Access Paper or Ask Questions

Multi-Scale Graph Learning for Anti-Sparse Downscaling

May 03, 2025

Yingda Fan, Runlong Yu, Janet R. Barclay, Alison P. Appling, Yiming Sun, Yiqun Xie, Xiaowei Jia

Abstract:Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, $\leq$ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale.To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we have broken free from the mindset that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.

* AAAI-25, pages 27969-27977, 2025
* AAAI-25, Multi-scale deep learning approach for spatial downscaling of geospatial data with sparse observations

Via

Access Paper or Ask Questions

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Mar 19, 2025

Feifei Li, Mi Zhang, Yiming Sun, Min Yang

Figure 1 for Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Figure 2 for Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Figure 3 for Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Figure 4 for Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Abstract:Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.

* CVPR25

Via

Access Paper or Ask Questions

Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

Mar 14, 2025

Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei

Abstract:Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem that the decreased feature extraction capability in multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon--Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct an novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.

* 10 pages, 6 figures

Via

Access Paper or Ask Questions

Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Feb 05, 2025

Xumeng Wen, Shun Zheng, Zhen Xu, Yiming Sun, Jiang Bian

Figure 1 for Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Figure 2 for Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Figure 3 for Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Figure 4 for Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

Abstract:Recent studies have shown that large language models (LLMs), when customized with post-training on tabular data, can acquire general tabular in-context learning (TabICL) capabilities. These models are able to transfer effectively across diverse data schemas and different task domains. However, existing LLM-based TabICL approaches are constrained to few-shot scenarios due to the sequence length limitations of LLMs, as tabular instances represented in plain text consume substantial tokens. To address this limitation and enable scalable TabICL for any data size, we propose retrieval-augmented LLMs tailored to tabular data. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets and demonstrating promising scaling behavior. Extensive comparisons with state-of-the-art tabular models reveal that, while LLM-based TabICL still lags behind well-tuned numeric models in overall performance, it uncovers powerful algorithms under limited contexts, enhances ensemble diversity, and excels on specific datasets. These unique properties underscore the potential of language as a universal and accessible interface for scalable tabular data learning.

* Preprint

Via

Access Paper or Ask Questions

Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Nov 08, 2024

Linfeng He, Yiming Sun, Sihao Wu, Jiaxu Liu, Xiaowei Huang

Figure 1 for Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Figure 2 for Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Figure 3 for Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Figure 4 for Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent

Abstract:In this paper, we propose a novel framework for enhancing visual comprehension in autonomous driving systems by integrating visual language models (VLMs) with additional visual perception module specialised in object detection. We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network, addressing limitations in object detection and localisation. Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness. Experiments on the DriveLM visual question answering challenge demonstrate significant improvements over baseline models, with enhanced performance in ChatGPT scores, BLEU scores, and CIDEr metrics, indicating closeness of model answer to ground truth. Our method represents a promising step towards more capable and interpretable autonomous driving systems. Possible safety enhancement enabled by detection modality is also discussed.

* accepted by SafeGenAI workshop of NeurIPS 2024

Via

Access Paper or Ask Questions

Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Nov 07, 2024

Yiming Sun, Bing Cao, Pengfei Zhu, Qinghua Hu

Figure 1 for Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Figure 2 for Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Figure 3 for Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Figure 4 for Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion

Abstract:Infrared and visible image fusion aim to integrate modality strengths for visually enhanced, informative images. Visible imaging in real-world scenarios is susceptible to dynamic environmental brightness fluctuations, leading to texture degradation. Existing fusion methods lack robustness against such brightness perturbations, significantly compromising the visual fidelity of the fused imagery. To address this challenge, we propose the Brightness Adaptive multimodal dynamic fusion framework (BA-Fusion), which achieves robust image fusion despite dynamic brightness fluctuations. Specifically, we introduce a Brightness Adaptive Gate (BAG) module, which is designed to dynamically select features from brightness-related channels for normalization, while preserving brightness-independent structural information within the source images. Furthermore, we propose a brightness consistency loss function to optimize the BAG module. The entire framework is tuned via alternating training strategies. Extensive experiments validate that our method surpasses state-of-the-art methods in preserving multi-modal image information and visual fidelity, while exhibiting remarkable robustness across varying brightness levels. Our code is available: https://github.com/SunYM2020/BA-Fusion.

* Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence,Main Track,Pages 1317-1325, 2024
* Accepted by IJCAI 2024

Via

Access Paper or Ask Questions

ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Nov 04, 2024

Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Chenhui Li, Yang Li, Changbo Wang

Figure 1 for ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Figure 2 for ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Figure 3 for ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Figure 4 for ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Abstract:Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language~(VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate descriptions of the target with tracking feedback. To further utilize semantic information produced by MLLM, a simple yet effective VL tracking framework is proposed and can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves a performance comparable to existing methods.

Via

Access Paper or Ask Questions

Learning Multimodal Cues of Children's Uncertainty

Oct 17, 2024

Qi Cheng, Mert İnan, Rahma Mbarki, Grace Grmek, Theresa Choi, Yiming Sun, Kimele Persaud, Jenny Wang, Malihe Alikhani

Figure 1 for Learning Multimodal Cues of Children's Uncertainty

Figure 2 for Learning Multimodal Cues of Children's Uncertainty

Figure 3 for Learning Multimodal Cues of Children's Uncertainty

Figure 4 for Learning Multimodal Cues of Children's Uncertainty

Abstract:Understanding uncertainty plays a critical role in achieving common ground (Clark et al.,1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.

* SIGDIAL 2023

Via

Access Paper or Ask Questions