Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chuang Zhang

V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

Jan 04, 2025

Sichao Wang, Chuang Zhang, Ming Yuan, Qing Xu, Lei He, Jianqiang Wang

Figure 1 for V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

Figure 2 for V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

Figure 3 for V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

Figure 4 for V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection

Abstract:In V2X collaborative perception, the domain gaps between heterogeneous nodes pose a significant challenge for effective information fusion. Pose errors arising from latency and GPS localization noise further exacerbate the issue by leading to feature misalignment. To overcome these challenges, we propose V2X-DGPE, a high-accuracy and robust V2X feature-level collaborative perception framework. V2X-DGPE employs a Knowledge Distillation Framework and a Feature Compensation Module to learn domain-invariant representations from multi-source data, effectively reducing the feature distribution gap between vehicles and roadside infrastructure. Historical information is utilized to provide the model with a more comprehensive understanding of the current scene. Furthermore, a Collaborative Fusion Module leverages a heterogeneous self-attention mechanism to extract and integrate heterogeneous representations from vehicles and infrastructure. To address pose errors, V2X-DGPE introduces a deformable attention mechanism, enabling the model to adaptively focus on critical parts of the input features by dynamically offsetting sampling points. Extensive experiments on the real-world DAIR-V2X dataset demonstrate that the proposed method outperforms existing approaches, achieving state-of-the-art detection performance. The code is available at https://github.com/wangsch10/V2X-DGPE.

Via

Access Paper or Ask Questions

Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

Jul 12, 2024

Chuang Zhang, Geng Sun, Jiahui Li, Qingqing Wu, Jiacheng Wang, Dusit Niyato, Yuanwei Liu

Figure 1 for Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

Figure 2 for Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

Figure 3 for Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

Figure 4 for Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

Abstract:Due to flexibility and low-cost, unmanned aerial vehicles (UAVs) are increasingly crucial for enhancing coverage and functionality of wireless networks. However, incorporating UAVs into next-generation wireless communication systems poses significant challenges, particularly in sustaining high-rate and long-range secure communications against eavesdropping attacks. In this work, we consider a UAV swarm-enabled secure surveillance network system, where a UAV swarm forms a virtual antenna array to transmit sensitive surveillance data to a remote base station (RBS) via collaborative beamforming (CB) so as to resist mobile eavesdroppers. Specifically, we formulate an aerial secure communication and energy efficiency multi-objective optimization problem (ASCEE-MOP) to maximize the secrecy rate of the system and to minimize the flight energy consumption of the UAV swarm. To address the non-convex, NP-hard and dynamic ASCEE-MOP, we propose a generative diffusion model-enabled twin delayed deep deterministic policy gradient (GDMTD3) method. Specifically, GDMTD3 leverages an innovative application of diffusion models to determine optimal excitation current weights and position decisions of UAVs. The diffusion models can better capture the complex dynamics and the trade-off of the ASCEE-MOP, thereby yielding promising solutions. Simulation results highlight the superior performance of the proposed approach compared with traditional deployment strategies and some other deep reinforcement learning (DRL) benchmarks. Moreover, performance analysis under various parameter settings of GDMTD3 and different numbers of UAVs verifies the robustness of the proposed approach.

* This paper has been submitted to IEEE Transactions on Mobile Computing

Via

Access Paper or Ask Questions

M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and Atmosphere

Jun 19, 2024

Mengqiu Xu, Ming Wu, Kaixin Chen, Yixiang Huang, Mingrui Xu, Yujia Yang, Yiqing Feng, Yiying Guo, Bin Huang, Dongliang Chang(+4 more)

Abstract:Marine fog poses a significant hazard to global shipping, necessitating effective detection and forecasting to reduce economic losses. In recent years, several machine learning (ML) methods have demonstrated superior detection accuracy compared to traditional meteorological methods. However, most of these works are developed on proprietary datasets, and the few publicly accessible datasets are often limited to simplistic toy scenarios for research purposes. To advance the field, we have collected nearly a decade's worth of multi-modal data related to continuous marine fog stages from four series of geostationary meteorological satellites, along with meteorological observations and numerical analysis, covering 15 marine regions globally where maritime fog frequently occurs. Through pixel-level manual annotation by meteorological experts, we present the most comprehensive marine fog detection and forecasting dataset to date, named M4Fog, to bridge ocean and atmosphere. The dataset comprises 68,000 "super data cubes" along four dimensions: elements, latitude, longitude and time, with a temporal resolution of half an hour and a spatial resolution of 1 kilometer. Considering practical applications, we have defined and explored three meaningful tracks with multi-metric evaluation systems: static or dynamic marine fog detection, and spatio-temporal forecasting for cloud images. Extensive benchmarking and experiments demonstrate the rationality and effectiveness of the construction concept for proposed M4Fog. The data and codes are available to whole researchers through cloud platforms to develop ML-driven marine fog solutions and mitigate adverse impacts on human activities.

Via

Access Paper or Ask Questions

Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking

Jun 04, 2024

Zefeng Zhang, Jiawei Sheng, Chuang Zhang, Yunzhi Liang, Wenyuan Zhang, Siqi Wang, Tingwen Liu

Figure 1 for Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking

Figure 2 for Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking

Figure 3 for Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking

Figure 4 for Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking

Abstract:Multimodal Entity Linking (MEL) aims to link ambiguous mentions in multimodal contexts to entities in a multimodal knowledge graph. A pivotal challenge is to fully leverage multi-element correlations between mentions and entities to bridge modality gap and enable fine-grained semantic matching. Existing methods attempt several local correlative mechanisms, relying heavily on the automatically learned attention weights, which may over-concentrate on partial correlations. To mitigate this issue, we formulate the correlation assignment problem as an optimal transport (OT) problem, and propose a novel MEL framework, namely OT-MEL, with OT-guided correlation assignment. Thereby, we exploit the correlation between multimodal features to enhance multimodal fusion, and the correlation between mentions and entities to enhance fine-grained matching. To accelerate model prediction, we further leverage knowledge distillation to transfer OT assignment knowledge to attention mechanism. Experimental results show that our model significantly outperforms previous state-of-the-art baselines and confirm the effectiveness of the OT-guided correlation assignment.

* Findings of ACL 2024

Via

Access Paper or Ask Questions

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

May 31, 2024

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, Sifa Zheng

Figure 1 for SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Figure 2 for SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Figure 3 for SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Figure 4 for SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation

Abstract:The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy , which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-arts by a large margin in performance of all tasks, while achieving much higher training and inference efficiency. Code will be avaliable at https://github.com/swc-17/SparseDrive for facilitating future research.

Via

Access Paper or Ask Questions

Multi-Scale Representations by Varying Window Attention for Semantic Segmentation

Apr 26, 2024

Haotian Yan, Ming Wu, Chuang Zhang

Abstract:Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation. A novel multi-scale learner, varying window attention (VWA), is presented to address these issues. VWA leverages the local window attention (LWA) and disentangles LWA into the query window and context window, allowing the context's scale to vary for the query to learn representations at multiple scales. However, varying the context to large-scale windows (enlarging ratio R) can significantly increase the memory footprint and computation cost (R^2 times larger than LWA). We propose a simple but professional re-scaling strategy to zero the extra induced cost without compromising performance. Consequently, VWA uses the same cost as LWA to overcome the receptive limitation of the local window. Furthermore, depending on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, like FPN and MLP decoder, but performs much better than any MSDs. For instance, using nearly half of UPerNet's computation, VWFormer outperforms it by 1.0%-2.5% mIoU on ADE20K. With little extra overhead, ~10G FLOPs, Mask2Former armed with VWFormer improves by 1.0%-1.3%. The code and models are available at https://github.com/yan-hao-tian/vw

* ICLR2024 Poster

Via

Access Paper or Ask Questions

MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation

Mar 21, 2024

Longzheng Wang, Xiaohan Xu, Lei Zhang, Jiarui Lu, Yongxiu Xu, Hongbo Xu, Chuang Zhang

Abstract:Automatic detection of multimodal misinformation has gained a widespread attention recently. However, the potential of powerful Large Language Models (LLMs) for multimodal misinformation detection remains underexplored. Besides, how to teach LLMs to interpret multimodal misinformation in cost-effective and accessible way is still an open question. To address that, we propose MMIDR, a framework designed to teach LLMs in providing fluent and high-quality textual explanations for their decision-making process of multimodal misinformation. To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline. This pipeline consists of a visual information processing module and an evidence retrieval module. Subsequently, we prompt the proprietary LLMs with processed contents to extract rationales for interpreting the authenticity of multimodal misinformation. Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs. To explore several research questions regarding the performance of LLMs in multimodal misinformation detection tasks, we construct an instruction-following multimodal misinformation dataset and conduct comprehensive experiments. The experimental findings reveal that our MMIDR exhibits sufficient detection performance and possesses the capacity to provide compelling rationales to support its assessments.

* 10 pages, 3 figures

Via

Access Paper or Ask Questions

HIP Network: Historical Information Passing Network for Extrapolation Reasoning on Temporal Knowledge Graph

Feb 19, 2024

Yongquan He, Peng Zhang, Luchen Liu, Qi Liang, Wenyuan Zhang, Chuang Zhang

Abstract:In recent years, temporal knowledge graph (TKG) reasoning has received significant attention. Most existing methods assume that all timestamps and corresponding graphs are available during training, which makes it difficult to predict future events. To address this issue, recent works learn to infer future events based on historical information. However, these methods do not comprehensively consider the latent patterns behind temporal changes, to pass historical information selectively, update representations appropriately and predict events accurately. In this paper, we propose the Historical Information Passing (HIP) network to predict future events. HIP network passes information from temporal, structural and repetitive perspectives, which are used to model the temporal evolution of events, the interactions of events at the same time step, and the known events respectively. In particular, our method considers the updating of relation representations and adopts three scoring functions corresponding to the above dimensions. Experimental results on five benchmark datasets show the superiority of HIP network, and the significant improvements on Hits@1 prove that our method can more accurately predict what is going to happen.

* IJCAI (2021) 1915-1921
* 7 pages, 3 figures

Via

Access Paper or Ask Questions

Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Jan 29, 2024

Lei Yang, Xinyu Zhang, Jun Li, Li Wang, Chuang Zhang, Li Ju, Zhiwei Li, Yang Shen

Figure 1 for Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Figure 2 for Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Figure 3 for Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Figure 4 for Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Abstract:Roadside perception can greatly increase the safety of autonomous vehicles by extending their perception ability beyond the visual range and addressing blind spots. However, current state-of-the-art vision-based roadside detection methods possess high accuracy on labeled scenes but have inferior performance on new scenes. This is because roadside cameras remain stationary after installation and can only collect data from a single scene, resulting in the algorithm overfitting these roadside backgrounds and camera poses. To address this issue, in this paper, we propose an innovative Scenario Generalization Framework for Vision-based Roadside 3D Object Detection, dubbed SGV3D. Specifically, we employ a Background-suppressed Module (BSM) to mitigate background overfitting in vision-centric pipelines by attenuating background features during the 2D to bird's-eye-view projection. Furthermore, by introducing the Semi-supervised Data Generation Pipeline (SSDG) using unlabeled images from new scenes, diverse instance foregrounds with varying camera poses are generated, addressing the risk of overfitting specific camera poses. We evaluate our method on two large-scale roadside benchmarks. Our method surpasses all previous methods by a significant margin in new scenes, including +42.57% for vehicle, +5.87% for pedestrian, and +14.89% for cyclist compared to BEVHeight on the DAIR-V2X-I heterologous benchmark. On the larger-scale Rope3D heterologous benchmark, we achieve notable gains of 14.48% for car and 12.41% for large vehicle. We aspire to contribute insights on the exploration of roadside perception techniques, emphasizing their capability for scenario generalization. The code will be available at {\url{ https://github.com/yanglei18/SGV3D}}

* 13 pages, 8 figures

Via

Access Paper or Ask Questions

UAV Swarm-enabled Collaborative Secure Relay Communications with Time-domain Colluding Eavesdropper

Oct 03, 2023

Chuang Zhang, Geng Sun, Qingqing Wu, Jiahui Li, Shuang Liang, Dusit Niyato, Victor C. M. Leung

Abstract:Unmanned aerial vehicles (UAVs) as aerial relays are practically appealing for assisting Internet of Things (IoT) network. In this work, we aim to utilize the UAV swarm to assist the secure communication between the micro base station (MBS) equipped with the planar array antenna (PAA) and the IoT terminal devices by collaborative beamforming (CB), so as to counteract the effects of collusive eavesdropping attacks in time-domain. Specifically, we formulate a UAV swarm-enabled secure relay multi-objective optimization problem (US2RMOP) for simultaneously maximizing the achievable sum rate of associated IoT terminal devices, minimizing the achievable sum rate of the eavesdropper and minimizing the energy consumption of UAV swarm, by jointly optimizing the excitation current weights of both MBS and UAV swarm, the selection of the UAV receiver, the position of UAVs and user association order of IoT terminal devices. Furthermore, the formulated US2RMOP is proved to be a non-convex, NP-hard and large-scale optimization problem. Therefore, we propose an improved multi-objective grasshopper algorithm (IMOGOA) with some specific designs to address the problem. Simulation results exhibit the effectiveness of the proposed UAV swarm-enabled collaborative secure relay strategy and demonstrate the superiority of IMOGOA.

* Submitted to IEEE Transactions on Mobile Computing

Via

Access Paper or Ask Questions