Enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) are two major services expected in fifth-generation (5G) mobile communication systems. Specifically, eMBB applications support extremely high data rates, while URLLC services aim to provide communications with stringent latency and high reliability. Due to their differentiated quality-of-service (QoS) requirements, spectrum sharing between URLLC and eMBB services becomes a challenging scheduling issue. In this paper, we investigate the URLLC and eMBB coscheduling/coexistence problem under a puncturing technique in multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) systems. The objective is to maximize the data rate of eMBB users while satisfying the latency requirements of URLLC users through joint user selection and power allocation scheduling. To solve this problem, we first introduce an eMBB user clustering mechanism to balance system performance and computational complexity. Thereafter, we decompose the original problem into two subproblems, namely user selection and power allocation. We apply Gale-Shapley (GS) matching theory to solve the user selection problem, and successive convex approximation (SCA) together with difference-of-convex (D.C.) programming to handle the power allocation problem. Finally, an iterative algorithm is used to find the global solution with low computational complexity. Numerical results show the effectiveness of the proposed algorithms and verify that the proposed approach outperforms other baseline methods.
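The user selection step above relies on Gale-Shapley stable matching. As a minimal, hypothetical sketch (the user/cluster names, preference lists, and the deferred-acceptance form below are illustrative, not the paper's exact formulation):

```python
# Gale-Shapley deferred acceptance: proposers could be URLLC users and
# reviewers eMBB clusters, each ranking the other side by, e.g., channel
# quality. All names and preference data here are hypothetical.

def gale_shapley(proposer_prefs, reviewer_prefs):
    """Return a stable matching {proposer: reviewer}."""
    free = list(proposer_prefs)                  # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}  # index of next proposal
    engaged = {}                                  # reviewer -> proposer
    # Precompute each reviewer's ranking of proposers for O(1) comparisons.
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}
    while free:
        p = free.pop(0)
        r = proposer_prefs[p][next_choice[p]]     # best reviewer not yet tried
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:    # reviewer prefers newcomer
            free.append(engaged[r])
            engaged[r] = p
        else:
            free.append(p)                        # rejected, tries next choice
    return {p: r for r, p in engaged.items()}

proposer_prefs = {"u1": ["c1", "c2"], "u2": ["c1", "c2"]}
reviewer_prefs = {"c1": ["u2", "u1"], "c2": ["u1", "u2"]}
print(gale_shapley(proposer_prefs, reviewer_prefs))  # {'u2': 'c1', 'u1': 'c2'}
```

The resulting matching is stable: no user-cluster pair would both prefer each other over their assigned partners, which is what makes GS attractive for low-complexity scheduling.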
Context information in search sessions has proven useful for capturing user search intent. Existing studies have explored user behavior sequences in sessions in different ways to enhance query suggestion or document ranking. However, a user behavior sequence has often been viewed as a definite and exact signal reflecting a user's behavior. In reality, it is highly variable: a user's queries for the same intent can vary, and different documents can be clicked. To learn a more robust representation of the user behavior sequence, we propose a method based on contrastive learning, which takes into account the possible variations in users' behavior sequences. Specifically, we propose three data augmentation strategies to generate similar variants of user behavior sequences and contrast them with other sequences. In so doing, the model is forced to be more robust to the possible variations. The optimized sequence representation is incorporated into document ranking. Experiments on two real query log datasets show that our proposed model significantly outperforms the state-of-the-art methods, which demonstrates the effectiveness of our method for context-aware document ranking.
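As a hedged illustration of what sequence-level augmentation for contrastive learning can look like here (the paper's three actual strategies may differ; the function names, masking rate, and sample data below are assumptions):

```python
# Two illustrative augmentations over a "behavior sequence", modeled as a
# list of (query, clicked_doc) events. Augmented views of the same sequence
# would be treated as positives in a contrastive objective.

import random

def mask_terms(sequence, p=0.3, mask="[MASK]", rng=None):
    """Randomly replace query terms with a mask token."""
    rng = rng or random.Random(0)
    out = []
    for query, doc in sequence:
        terms = [mask if rng.random() < p else t for t in query.split()]
        out.append((" ".join(terms), doc))
    return out

def drop_events(sequence, p=0.25, rng=None):
    """Randomly delete whole (query, doc) events, keeping at least one."""
    rng = rng or random.Random(0)
    kept = [e for e in sequence if rng.random() >= p]
    return kept or sequence[:1]

seq = [("cheap flights paris", "doc1"), ("paris hotels", "doc2")]
view1, view2 = mask_terms(seq), drop_events(seq)  # two views of one session
```

Contrasting such views against views of other sessions pushes the encoder to ignore incidental wording and click differences while preserving the underlying intent.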
Designing pre-training objectives that more closely resemble the downstream tasks of pre-trained language models can lead to better performance at the fine-tuning stage, especially in the ad-hoc retrieval area. Existing pre-training approaches tailored for IR have tried to incorporate weakly supervised signals, such as query-likelihood-based sampling, to construct pseudo query-document pairs from the raw textual corpus. However, these signals rely heavily on the sampling method. For example, the query likelihood model may introduce considerable noise into the constructed pre-training data. \blfootnote{$\dagger$ This work was done during an internship at Huawei.} In this paper, we propose to leverage large-scale hyperlinks and anchor texts to pre-train the language model for ad-hoc retrieval. Since anchor texts are created by webmasters and can usually summarize the target document, they can help build more accurate and reliable pre-training samples than a specific sampling algorithm. Considering different views of the downstream ad-hoc retrieval task, we devise four pre-training tasks based on the hyperlinks. We then pre-train the Transformer model to predict the pair-wise preference, jointly with the Masked Language Model objective. Experimental results on two large-scale ad-hoc retrieval datasets show that our model significantly outperforms existing methods.
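Pair-wise preference prediction of this kind is commonly trained with a RankNet-style logistic loss; the following is a generic sketch under that assumption, not necessarily the paper's exact objective:

```python
# Generic pair-wise preference loss: given model scores for a preferred and
# a less-preferred (query, document) pair (e.g. built from anchor texts),
# minimize -log sigmoid(s_pos - s_neg). Framework-free for clarity.

import math

def pairwise_logistic_loss(score_pos, score_neg):
    """RankNet-style pair-wise loss: -log sigmoid(s_pos - s_neg)."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))

# A confidently correct ordering yields a small loss; a reversed ordering
# yields a large one, pushing the model toward the observed preference.
low = pairwise_logistic_loss(3.0, 0.0)
high = pairwise_logistic_loss(0.0, 3.0)
```

In joint training, this term would simply be added to the Masked Language Model loss for each batch.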
Millimeter wave (mmWave) communication is a promising New Radio in Unlicensed (NR-U) technology to meet the ever-increasing data rate and connectivity requirements of future wireless networks. However, the development of NR-U networks must consider coexistence with the incumbent Wireless Gigabit (WiGig) networks. In this paper, we introduce a novel multiple-input multiple-output non-orthogonal multiple access (MIMO-NOMA) based mmWave NR-U and WiGig coexistence network for uplink transmission. Our aim for the proposed coexistence network is to maximize the spectral efficiency while ensuring the strict NR-U delay requirement and the WiGig transmission performance in real-time environments. A joint user grouping, hybrid beam coordination, and power control strategy is proposed, which is formulated as a Lyapunov-optimization-based mixed-integer nonlinear program (MINLP) with unit-modulus and nonconvex coupling constraints. Hence, we introduce a penalty dual decomposition (PDD) framework, which first transforms the formulated MINLP into a tractable augmented Lagrangian (AL) problem. Thereafter, we integrate both the convex-concave procedure (CCCP) and inexact block coordinate update (BCU) methods to approximately decompose the AL problem into multiple nested convex subproblems, which can be iteratively solved under the PDD framework. Numerical results illustrate the performance gains of the proposed strategy and demonstrate its effectiveness in guaranteeing the NR-U traffic delay and WiGig network performance.
A proactive dialogue system has the ability to proactively lead the conversation. Unlike general chatbots, which only react to the user, proactive dialogue systems can be used to achieve certain goals, e.g., to recommend items to the user. Background knowledge is essential to enable smooth and natural transitions in dialogue. In this paper, we propose a new multi-task learning framework for retrieval-based knowledge-grounded proactive dialogue. To determine the relevant knowledge to be used, we frame knowledge prediction as a complementary task and use explicit signals to supervise its learning. The final response is selected according to the predicted knowledge, the goal to achieve, and the context. Experimental results show that explicit modeling of knowledge prediction and goal selection can greatly improve the final response selection. Our code is available at https://github.com/DaoD/KPN/.
With the accelerated development of immersive applications and the explosive growth of internet-of-things (IoT) terminals, 6G is expected to introduce terahertz (THz) massive multiple-input multiple-output non-orthogonal multiple access (MIMO-NOMA) technologies to meet ultra-high-speed transmission and massive connectivity requirements. Nevertheless, the unreliability of THz transmissions and the extreme heterogeneity of device requirements pose critical challenges for practical applications. To address these challenges, we propose a novel smart reconfigurable THz MIMO-NOMA framework, which can realize customizable and intelligent communications by flexibly and cooperatively reconfiguring hybrid beams through cooperation between access points (APs) and reconfigurable intelligent surfaces (RISs). The optimization problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP) to maximize the network energy efficiency, while guaranteeing diversified user performance requirements, via a joint RIS element selection, coordinated discrete phase-shift control, and power allocation strategy. To solve this non-convex, strongly coupled, and highly complex mixed-integer nonlinear programming (MINLP) problem, we propose a novel multi-agent deep reinforcement learning (MADRL) algorithm, namely graph-embedded value-decomposition actor-critic (GE-VDAC), which embeds the interaction information of agents and learns a locally optimal solution through a distributed policy. Numerical results demonstrate that the proposed algorithm achieves highly customized communications and outperforms traditional MADRL algorithms.
Augmented Reality (AR) as a platform has the potential to help mitigate the cocktail party effect. Future AR headsets could leverage information from an array of sensors spanning many different modalities. Training and testing signal processing and machine learning algorithms on tasks such as beamforming and speech enhancement require high-quality, representative data. To the best of the authors' knowledge, as of publication there are no available datasets that contain synchronized egocentric multi-channel audio and video with dynamic movement and conversations in a noisy environment. In this work, we describe, evaluate, and release a dataset that contains over 5 hours of multi-modal data useful for training and testing algorithms for the application of improving conversations for an AR glasses wearer. We provide speech intelligibility, quality, and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics. The dataset we are releasing contains AR glasses egocentric multi-channel microphone array audio, wide-field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head bounding boxes, target-of-speech labels, and source identification labels. We have created and are releasing this dataset to facilitate research into multi-modal AR solutions to the cocktail party problem.
As a simple technique to accelerate inference of large-scale pre-trained models, early exiting has gained much attention in the NLP community. It allows samples to exit early at internal classifiers without passing through the entire model. Most existing work trains the internal classifiers independently and employs an exiting strategy that decides whether or not to exit based on the confidence of the current internal classifier. However, none of these works takes full advantage of the fact that the internal classifiers are trained to solve the same task and can therefore be used to construct an ensemble. In this paper, we show that a novel objective function for training the ensemble of internal classifiers can be naturally induced from the perspectives of ensemble learning and information theory. The proposed training objective consists of two terms: one for the accuracy and one for the diversity of the internal classifiers. In contrast, the objective used in prior work is exactly the accuracy term of our training objective, and therefore optimizes only accuracy, not diversity. Further, we propose a simple voting-based strategy that considers the predictions of all past internal classifiers to infer the correct label and decide whether to exit. Experimental results on various NLP tasks show that our proposed objective function and voting-based strategy achieve better accuracy-speed trade-offs.
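One way to picture the voting-based exiting idea is a count-threshold rule over the predictions of all past internal classifiers; the sketch below is illustrative (the threshold `k` and the plurality fallback are assumptions, not the paper's exact strategy):

```python
# Illustrative voting-based early exit: walk through internal classifiers in
# depth order, accumulate their predicted labels as votes, and exit as soon
# as any label has been predicted by k classifiers. If no label reaches k,
# fall back to the plurality vote over all classifiers.

from collections import Counter

def voting_exit(layer_predictions, k=2):
    """Return (predicted_label, exit_layer) for one sample."""
    votes = Counter()
    depth = 0
    for depth, label in enumerate(layer_predictions, start=1):
        votes[label] += 1
        if votes[label] >= k:          # k classifiers agree -> exit early
            return label, depth
    label, _ = votes.most_common(1)[0]  # no early exit: plurality decides
    return label, depth

print(voting_exit(["A", "B", "B", "A"], k=2))  # ('B', 3): exits at layer 3
```

Unlike confidence-only rules, this uses agreement among classifiers, so a single over-confident shallow classifier cannot trigger an exit on its own.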
Recent years have witnessed great progress in building emotional chatbots. Numerous methods have been proposed for chatbots to generate responses with given emotions. However, the emotion changes of the user during the conversation have not been fully explored. In this work, we study the problem of positive emotion elicitation, which aims to generate responses that can elicit positive emotion from the user in human-machine conversation. We propose a weakly supervised Emotion Eliciting Machine (EEM) to address this problem. Specifically, we first collect weak labels of user emotion status changes in a conversation using a pre-trained emotion classifier. Then we propose a dual encoder-decoder structure to model the generation of responses on both the positive and negative sides based on the changes of the user's emotion status in the conversation. An emotion eliciting factor is introduced on top of the dual structure to balance the positive and negative emotional impacts on the generated response during emotion elicitation. The factor also provides fine-grained control over emotion elicitation. Experimental results on a large real-world dataset show that EEM outperforms existing models in generating responses with positive emotion elicitation.
A common shortfall of supervised learning for medical imaging is its heavy reliance on human annotations, which are often expensive and time-consuming to obtain. This paper proposes a semi-supervised classification method for microscopic images of three kinds of apicomplexan parasites and non-infected host cells, which uses a small number of labeled data and a large number of unlabeled data for training. There are two challenges in microscopic image recognition. The first is that the salient structures of microscopic images are fuzzier and more intricate than those of natural images at real-world scale. The second is that insignificant textures, such as background staining, lightness, and contrast level, vary greatly across samples from different clinical scenarios. To address these challenges, we aim to learn a distinguishable and appearance-invariant representation through a contrastive learning strategy. On the one hand, macroscopic images, which share similar shape characteristics in morphology, are introduced as contrasts for structure enhancement. On the other hand, different appearance transformations, including color distortion and filtering, are utilized as contrasts for texture elimination. In the case where only 1% of microscopic images are labeled, the proposed method reaches an accuracy of 94.90% on a generalized testing set.
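For concreteness, appearance transformations of the kind described (color distortion and blur-style filtering) can be sketched as follows; the parameter values and the box-blur stand-in are assumptions, not the paper's exact augmentation pipeline:

```python
# Illustrative appearance transformations for the texture-elimination
# contrast. Images are float arrays in [0, 1] with shape (H, W, 3); two
# transformed views of the same image would form a positive pair.

import numpy as np

def color_distort(img, rng, strength=0.4):
    """Randomly jitter brightness and apply a per-channel color cast."""
    img = img * rng.uniform(1 - strength, 1 + strength)          # brightness
    img = img * rng.uniform(1 - strength, 1 + strength, size=3)  # color cast
    return np.clip(img, 0.0, 1.0)

def box_blur(img, k=3):
    """Simple k x k mean filter as a stand-in for Gaussian blurring."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
v1, v2 = color_distort(img, rng), box_blur(img)   # two "positive" views
```

Because both views keep the cell structure while perturbing staining-like appearance, a contrastive encoder trained on such pairs is pushed toward appearance-invariant features.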