Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hui Zhang

Centre for Medical Image Computing, Department of Computer Science, University College London, UK

COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

Mar 04, 2025

Gen Shi, Hui Zhang, Jie Tian

Figure 1 for COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

Figure 2 for COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

Figure 3 for COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

Figure 4 for COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation

Abstract:Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA's superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code and data will be open source at https://github.com/shigen-StoneRoot/COMMA.

Via

Access Paper or Ask Questions

Flexible Exoskeleton Control Based on Binding Alignment Strategy and Full-arm Coordination Mechanism

Mar 03, 2025

Chuang Cheng, Xinglong Zhang, Xieyuanli Chen, Wei Dai, Longwen Chen, Daoxun Zhang, Hui Zhang, Jie Jiang, Huimin Lu

Abstract:In rehabilitation, powered, and teleoperation exoskeletons, connecting the human body to the exoskeleton through binding attachments is a common configuration. However, the uncertainty of the tightness and the donning deviation of the binding attachments will affect the flexibility and comfort of the exoskeletons, especially during high-speed movement. To address this challenge, this paper presents a flexible exoskeleton control approach with binding alignment and full-arm coordination. Firstly, the sources of the force interaction caused by donning offsets are analyzed, based on which the interactive force data is classified into the major, assistant, coordination, and redundant component categories. Then, a binding alignment strategy (BAS) is proposed to reduce the donning disturbances by combining different force data. Furthermore, we propose a full-arm coordination mechanism (FCM) that focuses on two modes of arm movement intent, joint-oriented and target-oriented, to improve the flexible performance of the whole exoskeleton control during high-speed motion. In this method, we propose an algorithm to distinguish the two intentions to resolve the conflict issue of the force component. Finally, a series of experiments covering various aspects of exoskeleton performance (flexibility, adaptability, accuracy, speed, and fatigue) were conducted to demonstrate the benefits of our control framework in our full-arm exoskeleton.

Via

Access Paper or Ask Questions

SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation

Feb 08, 2025

Xingtong Yu, Zechuan Gong, Chang Zhou, Yuan Fang, Hui Zhang

Figure 1 for SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation

Figure 2 for SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation

Figure 3 for SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation

Figure 4 for SAMGPT: Text-free Graph Foundation Model for Multi-domain Pre-training and Cross-domain Adaptation

Abstract:Graphs are able to model interconnected entities in many online services, supporting a wide range of applications on the Web. This raises an important question: How can we train a graph foundational model on multiple source domains and adapt to an unseen target domain? A major obstacle is that graphs from different domains often exhibit divergent characteristics. Some studies leverage large language models to align multiple domains based on textual descriptions associated with the graphs, limiting their applicability to text-attributed graphs. For text-free graphs, a few recent works attempt to align different feature distributions across domains, while generally neglecting structural differences. In this work, we propose a novel Structure Alignment framework for text-free Multi-domain Graph Pre-Training and cross-domain adaptation (SAMGPT). It is designed to learn multi-domain knowledge from graphs originating in multiple source domains, which can then be adapted to address applications in an unseen target domain. Specifically, we introduce a set of structure tokens to harmonize structure-based aggregation across source domains during the pre-training phase. Next, for cross-domain adaptation, we design dual prompts, namely, holistic prompts and specific prompts, which adapt unified multi-domain structural knowledge and fine-grained, domain-specific information, respectively, to a target domain. Finally, we conduct comprehensive experiments on seven public datasets to evaluate and analyze the effectiveness of SAMGPT.

* Accepted by WWW2025 Main Track

Via

Access Paper or Ask Questions

MapFusion: A Novel BEV Feature Fusion Network for Multi-modal Map Construction

Feb 05, 2025

Xiaoshuai Hao, Yunfeng Diao, Mengchuan Wei, Yifan Yang, Peng Hao, Rong Yin, Hui Zhang, Weiming Li, Shu Zhao, Yu Liu

Abstract:Map construction task plays a vital role in providing precise and comprehensive static environmental information essential for autonomous driving systems. Primary sensors include cameras and LiDAR, with configurations varying between camera-only, LiDAR-only, or camera-LiDAR fusion, based on cost-performance considerations. While fusion-based methods typically perform best, existing approaches often neglect modality interaction and rely on simple fusion strategies, which suffer from the problems of misalignment and information loss. To address these issues, we propose MapFusion, a novel multi-modal Bird's-Eye View (BEV) feature fusion method for map construction. Specifically, to solve the semantic misalignment problem between camera and LiDAR BEV features, we introduce the Cross-modal Interaction Transform (CIT) module, enabling interaction between two BEV feature spaces and enhancing feature representation through a self-attention mechanism. Additionally, we propose an effective Dual Dynamic Fusion (DDF) module to adaptively select valuable information from different modalities, which can take full advantage of the inherent information between different modalities. Moreover, MapFusion is designed to be simple and plug-and-play, easily integrated into existing pipelines. We evaluate MapFusion on two map construction tasks, including High-definition (HD) map and BEV map segmentation, to show its versatility and effectiveness. Compared with the state-of-the-art methods, MapFusion achieves 3.6% and 6.2% absolute improvements on the HD map construction and BEV map segmentation tasks on the nuScenes dataset, respectively, demonstrating the superiority of our approach.

Via

Access Paper or Ask Questions

HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Dec 16, 2024

Zijian Gu, Jianwei Ma, Yan Huang, Honghao Wei, Zhanye Chen, Hui Zhang, Wei Hong

Figure 1 for HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Figure 2 for HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Figure 3 for HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Figure 4 for HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

Abstract:Millimeter-wave radar plays a vital role in 3D object detection for autonomous driving due to its all-weather and all-lighting-condition capabilities for perception. However, radar point clouds suffer from pronounced sparsity and unavoidable angle estimation errors. To address these limitations, incorporating a camera may partially help mitigate the shortcomings. Nevertheless, the direct fusion of radar and camera data can lead to negative or even opposite effects due to the lack of depth information in images and low-quality image features under adverse lighting conditions. Hence, in this paper, we present the radar-camera fusion network with Hybrid Generation and Synchronization (HGSFusion), designed to better fuse radar potentials and image features for 3D object detection. Specifically, we propose the Radar Hybrid Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA) estimation errors in radar signal processing. This module generates denser radar points through different Probability Density Functions (PDFs) with the assistance of semantic information. Meanwhile, we introduce the Dual Sync Module (DSM), comprising spatial sync and modality sync, to enhance image features with radar positional information and facilitate the fusion of distinct characteristics in different modalities. Extensive experiments demonstrate the effectiveness of our approach, outperforming the state-of-the-art methods in the VoD and TJ4DRadSet datasets by $6.53\%$ and $2.03\%$ in RoI AP and BEV AP, respectively. The code is available at https://github.com/garfield-cpp/HGSFusion.

* 12 pages, 8 figures, 7 tables. Accepted by AAAI 2025 , the 39th Annual AAAI Conference on Artificial Intelligence

Via

Access Paper or Ask Questions

CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Dec 12, 2024

Yushan Han, Hui Zhang, Honglei Zhang, Jing Wang, Yidong Li

Figure 1 for CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Figure 2 for CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Figure 3 for CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Figure 4 for CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Abstract:Current collaborative perception methods often rely on fully annotated datasets, which can be expensive to obtain in practical situations. To reduce annotation costs, some works adopt sparsely supervised learning techniques and generate pseudo labels for the missing instances. However, these methods fail to achieve an optimal confidence threshold that harmonizes the quality and quantity of pseudo labels. To address this issue, we propose an end-to-end Collaborative perception Dual Teacher-Student framework (CoDTS), which employs adaptive complementary learning to produce both high-quality and high-quantity pseudo labels. Specifically, the Main Foreground Mining (MFM) module generates high-quality pseudo labels based on the prediction of the static teacher. Subsequently, the Supplement Foreground Mining (SFM) module ensures a balance between the quality and quantity of pseudo labels by adaptively identifying missing instances based on the prediction of the dynamic teacher. Additionally, the Neighbor Anchor Sampling (NAS) module is incorporated to enhance the representation of pseudo labels. To promote the adaptive complementary learning, we implement a staged training strategy that trains the student and dynamic teacher in a mutually beneficial manner. Extensive experiments demonstrate that the CoDTS effectively ensures an optimal balance of pseudo labels in both quality and quantity, establishing a new state-of-the-art in sparsely supervised collaborative perception.

* AAAI 2025

Via

Access Paper or Ask Questions

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Dec 05, 2024

Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang

Figure 1 for CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Figure 2 for CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Figure 3 for CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Figure 4 for CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Abstract:Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To Inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them in the later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating the L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at https://creatilayout.github.io.

* 16 pages, 13 figures

Via

Access Paper or Ask Questions

FunGrasp: Functional Grasping for Diverse Dexterous Hands

Nov 24, 2024

Linyi Huang, Hui Zhang, Zijian Wu, Sammy Christen, Jie Song

Figure 1 for FunGrasp: Functional Grasping for Diverse Dexterous Hands

Figure 2 for FunGrasp: Functional Grasping for Diverse Dexterous Hands

Figure 3 for FunGrasp: Functional Grasping for Diverse Dexterous Hands

Figure 4 for FunGrasp: Functional Grasping for Diverse Dexterous Hands

Abstract:Functional grasping is essential for humans to perform specific tasks, such as grasping scissors by the finger holes to cut materials or by the blade to safely hand them over. Enabling dexterous robot hands with functional grasping capabilities is crucial for their deployment to accomplish diverse real-world tasks. Recent research in dexterous grasping, however, often focuses on power grasps while overlooking task- and object-specific functional grasping poses. In this paper, we introduce FunGrasp, a system that enables functional dexterous grasping across various robot hands and performs one-shot transfer to unseen objects. Given a single RGBD image of functional human grasping, our system estimates the hand pose and transfers it to different robotic hands via a human-to-robot (H2R) grasp retargeting module. Guided by the retargeted grasping poses, a policy is trained through reinforcement learning in simulation for dynamic grasping control. To achieve robust sim-to-real transfer, we employ several techniques including privileged learning, system identification, domain randomization, and gravity compensation. In our experiments, we demonstrate that our system enables diverse functional grasping of unseen objects using single RGBD images, and can be successfully deployed across various dexterous robot hands. The significance of the components is validated through comprehensive ablation studies. Project page: https://hly-123.github.io/FunGrasp/ .

* Project page: https://hly-123.github.io/FunGrasp/

Via

Access Paper or Ask Questions

Few-shot Model Extraction Attacks against Sequential Recommender Systems

Nov 18, 2024

Hui Zhang, Fu Liu

Abstract:Among adversarial attacks against sequential recommender systems, model extraction attacks represent a method to attack sequential recommendation models without prior knowledge. Existing research has primarily concentrated on the adversary's execution of black-box attacks through data-free model extraction. However, a significant gap remains in the literature concerning the development of surrogate models by adversaries with access to few-shot raw data (10\% even less). That is, the challenge of how to construct a surrogate model with high functional similarity within the context of few-shot data scenarios remains an issue that requires resolution.This study addresses this gap by introducing a novel few-shot model extraction framework against sequential recommenders, which is designed to construct a superior surrogate model with the utilization of few-shot data. The proposed few-shot model extraction framework is comprised of two components: an autoregressive augmentation generation strategy and a bidirectional repair loss-facilitated model distillation procedure. Specifically, to generate synthetic data that closely approximate the distribution of raw data, autoregressive augmentation generation strategy integrates a probabilistic interaction sampler to extract inherent dependencies and a synthesis determinant signal module to characterize user behavioral patterns. Subsequently, bidirectional repair loss, which target the discrepancies between the recommendation lists, is designed as auxiliary loss to rectify erroneous predictions from surrogate models, transferring knowledge from the victim model to the surrogate model effectively. Experiments on three datasets show that the proposed few-shot model extraction framework yields superior surrogate models.

Via

Access Paper or Ask Questions

Edge-Cloud Collaborative Satellite Image Analysis for Efficient Man-Made Structure Recognition

Oct 08, 2024

Kaicheng Sheng, Junxiao Xue, Hui Zhang

Figure 1 for Edge-Cloud Collaborative Satellite Image Analysis for Efficient Man-Made Structure Recognition

Figure 2 for Edge-Cloud Collaborative Satellite Image Analysis for Efficient Man-Made Structure Recognition

Figure 3 for Edge-Cloud Collaborative Satellite Image Analysis for Efficient Man-Made Structure Recognition

Figure 4 for Edge-Cloud Collaborative Satellite Image Analysis for Efficient Man-Made Structure Recognition

Abstract:The increasing availability of high-resolution satellite imagery has created immense opportunities for various applications. However, processing and analyzing such vast amounts of data in a timely and accurate manner poses significant challenges. The paper presents a new satellite image processing architecture combining edge and cloud computing to better identify man-made structures against natural landscapes. By employing lightweight models at the edge, the system initially identifies potential man-made structures from satellite imagery. These identified images are then transmitted to the cloud, where a more complex model refines the classification, determining specific types of structures. The primary focus is on the trade-off between latency and accuracy, as efficient models often sacrifice accuracy. We compare this hybrid edge-cloud approach against traditional "bent-pipe" method in virtual environment experiments as well as introduce a practical model and compare its performance with existing lightweight models for edge deployment, focusing on accuracy and latency. The results demonstrate that the edge-cloud collaborative model not only reduces overall latency due to minimized data transmission but also maintains high accuracy, offering substantial improvements over traditional approaches under this scenario.

Via

Access Paper or Ask Questions