Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Zhang

Alibaba Group

Joint Topology and Power Optimization for Multi-UAV Collaborative Secure Communication

Apr 23, 2025

Bin Qiu, Wenchi Cheng, Hongxiang He, Wei Zhang

Abstract:In this paper, we investigate an unmanned aerial vehicle (UAV)-enabled secure communication scenario that a cluster of UAVs performs a virtual non-uniform linear array (NULA) to communicate with a base station (BS) in the presence of eavesdroppers (Eves). Our goal is to design the UAV topology, trajectory, and precoding to maximize the system channel capacity. To this end, we convert the original problem into equivalent two-stage problems. Specifically, we first try to maximize the channel gain by meticulously designing the UAV topology. We then study the joint optimization of the trajectory and precoding for total transmit power minimization while satisfying the constraints on providing quality of service (QoS) assurance to the BS, the leakage tolerance to Eves, the per-UAV transmit power, the initial/final locations, and the cylindrical no-fly zones. For the UAV topology design, we prove that the topology follows the Fekete-point distribution. The design of trajectory and precoding is formulated as a non-convex optimization problem which is generally intractable. Subsequently, the non-convex constraints are converted into convex terms, and a double-loop search algorithm is proposed to solve the transmit power minimization problem. Introduce random rotation offsets so as to perform a dynamic stochastic channel to enhance the security. Numerical results demonstrate the superiority of the proposed method in promoting capacity.

Via

Access Paper or Ask Questions

The PHD/CPHD filter for Multiple Extended Target Tracking with Trajectory Set Theory and Explicit Shape Estimation

Apr 21, 2025

Yuanhao Cheng, Yunhe Cao, Tat-Soon Yeo, Fu Jie, Wei Zhang

Figure 1 for The PHD/CPHD filter for Multiple Extended Target Tracking with Trajectory Set Theory and Explicit Shape Estimation

Figure 2 for The PHD/CPHD filter for Multiple Extended Target Tracking with Trajectory Set Theory and Explicit Shape Estimation

Figure 3 for The PHD/CPHD filter for Multiple Extended Target Tracking with Trajectory Set Theory and Explicit Shape Estimation

Figure 4 for The PHD/CPHD filter for Multiple Extended Target Tracking with Trajectory Set Theory and Explicit Shape Estimation

Abstract:In this paper, we propose two methods for tracking multiple extended targets or unresolved group targets with elliptical extent shape. These two methods are deduced from the famous Probability Hypothesis Density (PHD) filter and the Cardinality-PHD (CPHD) filter, respectively. In these two methods, Trajectory Set Theory (TST) is combined to establish the target trajectory estimates. Moreover, by employing a decoupled shape estimation model, the proposed methods can explicitly provide the shape estimation of the target, such as the orientation of the ellipse extension and the length of its two axes. We derived the closed Bayesian recursive of these two methods with stable trajectory generation and accurate extent estimation, resulting in the TPHD-E filter and the TCPHD-E filter. In addition, Gaussian mixture implementations of our methods are provided, which are further referred to as the GM-TPHD-E filter and the GM-TCPHD-E filters. We illustrate the ability of these methods through simulations and experiments with real data. These experiments demonstrate that the two proposed algorithms have advantages over existing algorithms in target shape estimation, as well as in the completeness and accuracy of target trajectory generation.

* 16 pages, 11 figures

Via

Access Paper or Ask Questions

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Apr 20, 2025

Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu(+14 more)

Figure 1 for Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Figure 2 for Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Figure 3 for Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Figure 4 for Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Abstract:The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Via

Access Paper or Ask Questions

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Apr 20, 2025

Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

Abstract:Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context scenarios, existing pipeline parallelism methods fail to address the substantial activation memory pressure, primarily due to the peak memory consumption resulting from the accumulation of activations across multiple microbatches. Moreover, these approaches inevitably introduce considerable pipeline bubbles, further hindering efficiency. To tackle these challenges, we propose SlimPipe, a novel approach to fine-grained pipeline parallelism that employs uniform sequence slicing coupled with one-forward-one-backward (1F1B) schedule. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. Although the slices are evenly partitioned, the computation cost is not equal across slices due to causal attention. We develop a sophisticated workload redistribution technique to address this load imbalance. SlimPipe achieves (1) near-zero memory overhead and (2) minimal pipeline bubbles simultaneously. The effectiveness of SlimPipe has been proven by thorough testing with diverse model architectures, context window sizes, and SlimPipe-specific configurations. For example, on the Llama 70B model, compared to state-of-the-art methods, SlimPipe significantly boosts the Model FLOPs Utilization (MFU) to up to $1.57\times$ for a context length of 512K. More notably, for a context length of 2048K, it maintains over 45% utilization on 256 NVIDIA Hopper 80GB GPUs, while other approaches either suffer significant performance drops or fail entirely due to memory constraints.

Via

Access Paper or Ask Questions

EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

Apr 17, 2025

Wei Zhang, Miaoxin Cai, Yaqian Ning, Tong Zhang, Yin Zhuang, He Chen, Jun Li, Xuerui Mao

Abstract:Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, due to remote sensing (RS) imagery containing abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited in overly narrow interpretation levels and interaction manner, hindering their applicability in real-world scenarios. To address those challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.

Via

Access Paper or Ask Questions

Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Apr 15, 2025

Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du

Abstract:Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.

* Under review (submited in 2024.11)

Via

Access Paper or Ask Questions

CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Apr 14, 2025

Jing Chen, Zhihua Wei, Wei Zhang, Yingying Hu, Qiong Zhang

Figure 1 for CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Figure 2 for CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Figure 3 for CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Figure 4 for CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

Abstract:Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM's interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.

Via

Access Paper or Ask Questions

Debiasing 6-DOF IMU via Hierarchical Learning of Continuous Bias Dynamics

Apr 13, 2025

Ben Liu, Tzu-Yuan Lin, Wei Zhang, Maani Ghaffari

Figure 1 for Debiasing 6-DOF IMU via Hierarchical Learning of Continuous Bias Dynamics

Figure 2 for Debiasing 6-DOF IMU via Hierarchical Learning of Continuous Bias Dynamics

Figure 3 for Debiasing 6-DOF IMU via Hierarchical Learning of Continuous Bias Dynamics

Figure 4 for Debiasing 6-DOF IMU via Hierarchical Learning of Continuous Bias Dynamics

Abstract:This paper develops a deep learning approach to the online debiasing of IMU gyroscopes and accelerometers. Most existing methods rely on implicitly learning a bias term to compensate for raw IMU data. Explicit bias learning has recently shown its potential as a more interpretable and motion-independent alternative. However, it remains underexplored and faces challenges, particularly the need for ground truth bias data, which is rarely available. To address this, we propose a neural ordinary differential equation (NODE) framework that explicitly models continuous bias dynamics, requiring only pose ground truth, often available in datasets. This is achieved by extending the canonical NODE framework to the matrix Lie group for IMU kinematics with a hierarchical training strategy. The validation on two public datasets and one real-world experiment demonstrates significant accuracy improvements in IMU measurements, reducing errors in both pure IMU integration and visual-inertial odometry.

* Accepted by Robotics: Science and Systems, 2025

Via

Access Paper or Ask Questions

BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications

Apr 12, 2025

Zhe Wang, Fangtian Fu, Wei Zhang, Lige Yan, Yan Meng, Jianping Wu, Hui Wu, Gang Xu, Si Chen

Abstract:Automated extraction of chemical structures and their bioactivity data is crucial for accelerating drug discovery and enabling data-driven pharmaceutical research. Existing optical chemical structure recognition (OCSR) tools fail to autonomously associate molecular structures with their bioactivity profiles, creating a critical bottleneck in structure-activity relationship (SAR) analysis. Here, we present BioChemInsight, an open-source pipeline that integrates: (1) DECIMER Segmentation and MolVec for chemical structure recognition, (2) Qwen2.5-VL-32B for compound identifier association, and (3) PaddleOCR with Gemini-2.0-flash for bioactivity extraction and unit normalization. We evaluated the performance of BioChemInsight on 25 patents and 17 articles. BioChemInsight achieved 95% accuracy for tabular patent data (structure/identifier recognition), with lower accuracy in non-tabular patents (~80% structures, ~75% identifiers), plus 92.2 % bioactivity extraction accuracy. For articles, it attained >99% identifiers and 78-80% structure accuracy in non-tabular formats, plus 97.4% bioactivity extraction accuracy. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours while enabling applications in high-throughput screening and ML-driven drug design (https://github.com/dahuilangda/BioChemInsight).

* 20 pages, 7 figures

Via

Access Paper or Ask Questions

The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Apr 07, 2025

Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang

Figure 1 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 2 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 3 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 4 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Abstract:Motion expression video segmentation is designed to segment objects in accordance with the input motion expressions. In contrast to the conventional Referring Video Object Segmentation (RVOS), it places emphasis on motion as well as multi-object expressions, making it more arduous. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS due to their powerful vision-language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Firstly, we use Sa2VA as our baseline, which is a unified LMM for dense grounded understanding of both images and videos. Secondly, we uniformly sample the video frames during the inference process to enhance the model's understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.

Via

Access Paper or Ask Questions