Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jing Zhang

The University of Sydney, Australia

Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

May 31, 2024

Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, Bing-Yi Jing

Figure 1 for Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Figure 2 for Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Figure 3 for Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Figure 4 for Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Abstract:In offline reinforcement learning (RL), it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. Policy-regularized methods address this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm that we alternatively train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance grounds on the theoretical solution of the KL constraint policy iteration, which prevents the learned policy from taking out-of-distribution actions. For critic training, we train a Q-ensemble to stabilize the estimation of Q-gradient. Additionally, DAC employs lower confidence bound (LCB) to address the overestimation and underestimation of value targets due to function approximation error. Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in almost all environments. Code is available at \href{https://github.com/Fang-Lin93/DAC}{\texttt{github.com/Fang-Lin93/DAC}}.

Via

Access Paper or Ask Questions

You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

May 31, 2024

Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong

Figure 1 for You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Figure 2 for You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Figure 3 for You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Figure 4 for You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Abstract:Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed. However, the inherent decay mechanism in linear attention presents challenges when applied to multi-dimensional sequence modeling tasks, such as image processing and multi-modal learning. In these scenarios, the utilization of sequential scanning to establish a global receptive field necessitates multiple scans for multi-dimensional data, thereby leading to inefficiencies. This paper identifies the inefficiency caused by a multiplicative linear recurrence and proposes an efficient alternative additive linear recurrence to avoid the issue, as it can handle multi-dimensional data within a single scan. We further develop an efficient multi-dimensional sequential modeling framework called LightNet based on the new recurrence. Moreover, we present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE to enhance the model's ability to discern positional information in multi-dimensional scenarios. Our empirical evaluations across various tasks, including image classification, image generation, bidirectional language modeling, and autoregressive language modeling, demonstrate the efficacy of LightNet, showcasing its potential as a versatile and efficient solution for multi-dimensional sequential modeling.

* Technical report. Yiran Zhong is the corresponding author. The code is available at https://github.com/OpenNLPLab/LightNet

Via

Access Paper or Ask Questions

FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

May 25, 2024

Lan Wu, Xuebin Wang, Ruijuan Chu, Guangyi Liu, Yingchun Chen, Jing Zhang, Linyu Wang

Figure 1 for FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

Figure 2 for FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

Figure 3 for FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

Figure 4 for FlightPatchNet: Multi-Scale Patch Network with Differential Coding for Flight Trajectory Prediction

Abstract:Accurate multi-step flight trajectory prediction plays an important role in Air Traffic Control, which can ensure the safety of air transportation. Two main issues limit the flight trajectory prediction performance of existing works. The first issue is the negative impact on prediction accuracy caused by the significant differences in data range. The second issue is that real-world flight trajectories involve underlying temporal dependencies, and existing methods fail to reveal the hidden complex temporal variations and only extract features from one single time scale. To address the above issues, we propose FlightPatchNet, a multi-scale patch network with differential coding for flight trajectory prediction. Specifically, FlightPatchNet first utilizes the differential coding to encode the original values of longitude and latitude into first-order differences and generates embeddings for all variables at each time step. Then, a global temporal attention is introduced to explore the dependencies between different time steps. To fully explore the diverse temporal patterns in flight trajectories, a multi-scale patch network is delicately designed to serve as the backbone. The multi-scale patch network exploits stacked patch mixer blocks to capture inter- and intra-patch dependencies under different time scales, and further integrates multi-scale temporal features across different scales and variables. Finally, FlightPatchNet ensembles multiple predictors to make direct multi-step prediction. Extensive experiments on ADS-B datasets demonstrate that our model outperforms the competitive baselines. Code is available at: https://github.com/FlightTrajectoryResearch/FlightPatchNet.

Via

Access Paper or Ask Questions

A Solution-based LLM API-using Methodology for Academic Information Seeking

May 24, 2024

Yuanchun Wang, Jifan Yu, Zijun Yao, Jing Zhang, Yuyang Xie, Shangqing Tu, Yiyang Fu, Youhe Feng, Jinkai Zhang, Jingyao Zhang(+6 more)

Abstract:Applying large language models (LLMs) for academic API usage shows promise in reducing researchers' academic information seeking efforts. However, current LLM API-using methods struggle with complex API coupling commonly encountered in academic queries. To address this, we introduce SoAy, a solution-based LLM API-using methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. The addition of the solution reduces the difficulty for the model to understand the complex relationships between APIs. Code improves the efficiency of reasoning. To evaluate SoAy, we introduce SoAyBench, an evaluation benchmark accompanied by SoAyEval, built upon a cloned environment of APIs from AMiner. Experimental results demonstrate a 34.58-75.99\% performance improvement compared to state-of-the-art LLM API-based baselines. All datasets, codes, tuned models, and deployed online services are publicly accessible at https://github.com/RUCKBReasoning/SoAy.

* 22 pages, 13 figures

Via

Access Paper or Ask Questions

SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

May 24, 2024

Xinlei Niu, Jing Zhang, Christian Walder, Charles Patrick Martin

Figure 1 for SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Figure 2 for SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Figure 3 for SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Figure 4 for SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Abstract:We present SoundLoCD, a novel text-to-sound generation framework, which incorporates a LoRA-based conditional discrete contrastive latent diffusion model. Unlike recent large-scale sound generation models, our model can be efficiently trained under limited computational resources. The integration of a contrastive learning strategy further enhances the connection between text conditions and the generated outputs, resulting in coherent and high-fidelity performance. Our experiments demonstrate that SoundLoCD outperforms the baseline with greatly reduced computational resources. A comprehensive ablation study further validates the contribution of each component within SoundLoCD. Demo page: \url{https://XinleiNIU.github.io/demo-SoundLoCD/}.

Via

Access Paper or Ask Questions

DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

May 20, 2024

Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, Kaiqi Huang

Abstract:Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video, for the precise tracking of a specified object. By leveraging high-level semantic information, VLT guides object tracking, alleviating the constraints associated with relying on a visual modality. Nevertheless, most VLT benchmarks are annotated in a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators for high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific and multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless integration into various visual tracking benchmarks. (2) We select three prominent benchmarks to deploy our approach: short-term tracking, long-term tracking, and global instance tracking. We offer four granularity combinations for these benchmarks, considering the extent and density of semantic information, thereby showcasing the practicality and versatility of DTLLM-VLT. (3) We conduct comparative experiments on VLT benchmarks with different text granularities, evaluating and analyzing the impact of diverse text on tracking performance. Conclusionally, this work leverages LLM to provide multi-granularity semantic information for VLT task from efficient and diverse perspectives, enabling fine-grained evaluation of multi-modal trackers. In the future, we believe this work can be extended to more datasets to support vision datasets understanding.

* Accepted by CVPR Workshop 2024, Oral Presentation

Via

Access Paper or Ask Questions

Team Samsung-RAL: Technical Report for 2024 RoboDrive Challenge-Robust Map Segmentation Track

May 17, 2024

Xiaoshuai Hao, Yifan Yang, Hui Zhang, Mengchuan Wei, Yi Zhou, Haimei Zhao, Jing Zhang

Figure 1 for Team Samsung-RAL: Technical Report for 2024 RoboDrive Challenge-Robust Map Segmentation Track

Figure 2 for Team Samsung-RAL: Technical Report for 2024 RoboDrive Challenge-Robust Map Segmentation Track

Figure 3 for Team Samsung-RAL: Technical Report for 2024 RoboDrive Challenge-Robust Map Segmentation Track

Figure 4 for Team Samsung-RAL: Technical Report for 2024 RoboDrive Challenge-Robust Map Segmentation Track

Abstract:In this report, we describe the technical details of our submission to the 2024 RoboDrive Challenge Robust Map Segmentation Track. The Robust Map Segmentation track focuses on the segmentation of complex driving scene elements in BEV maps under varied driving conditions. Semantic map segmentation provides abundant and precise static environmental information crucial for autonomous driving systems' planning and navigation. While current methods excel in ideal circumstances, e.g., clear daytime conditions and fully functional sensors, their resilience to real-world challenges like adverse weather and sensor failures remains unclear, raising concerns about system safety. In this paper, we explored several methods to improve the robustness of the map segmentation task. The details are as follows: 1) Robustness analysis of utilizing temporal information; 2) Robustness analysis of utilizing different backbones; and 3) Data Augmentation to boost corruption robustness. Based on the evaluation results, we draw several important findings including 1) The temporal fusion module is effective in improving the robustness of the map segmentation model; 2) A strong backbone is effective for improving the corruption robustness; and 3) Some data augmentation methods are effective in improving the robustness of map segmentation models. These novel findings allowed us to achieve promising results in the 2024 RoboDrive Challenge-Robust Map Segmentation Track.

* ICRA 2024 RoboDrive Challenge Robust Map Segmentation Track 3rd Place Technical Report. arXiv admin note: text overlap with arXiv:2205.09743 by other authors

Via

Access Paper or Ask Questions

LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

May 16, 2024

Wentao Jiang, Jing Zhang, Di Wang, Qiming Zhang, Zengmao Wang, Bo Du

Figure 1 for LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

Figure 2 for LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

Figure 3 for LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

Figure 4 for LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

Abstract:Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually involved in self-attention (SA) to reduce the overall token numbers within the calculation, avoiding the high computational cost issue in Vision Transformers. However, such methods usually obtain sparse tokens by hand-crafted or parallel-unfriendly designs, posing a challenge to reach a better balance between efficiency and performance. Different from them, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information meanwhile improving the inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they serve as query and key (value) tokens alternatively in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7 \times$ speedup, fewer parameters, and competitive performance compared to the baseline models, and achieves a better trade-off between efficiency and performance.

* Accepted by IJCAI'2024. The code is available at https://github.com/ViTAE-Transformer/LeMeViT

Via

Access Paper or Ask Questions

The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

May 14, 2024

Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Yaru Niu, Wei Tsang Ooi, Benoit R. Cottereau, Lai Xing Ng, Yuexin Ma, Wenwei Zhang, Liang Pan(+81 more)

Figure 1 for The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

Figure 2 for The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

Figure 3 for The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

Figure 4 for The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition

Abstract:In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition laid down a gauntlet to innovate and enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research in this field.

* ICRA 2024; 31 pages, 24 figures, 5 tables; Code at https://robodrive-24.github.io/

Via

Access Paper or Ask Questions

OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

May 12, 2024

Bin Lu, Ze Zhao, Luyu Han, Xiaoying Gan, Yuntao Zhou, Lei Zhou, Luoyi Fu, Xinbing Wang, Chenghu Zhou, Jing Zhang

Figure 1 for OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

Figure 2 for OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

Figure 3 for OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

Figure 4 for OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning

Abstract:Accurately reconstructing the global ocean deoxygenation over a century is crucial for assessing and protecting marine ecosystem. Existing expert-dominated numerical simulations fail to catch up with the dynamic variation caused by global warming and human activities. Besides, due to the high-cost data collection, the historical observations are severely sparse, leading to big challenge for precise reconstruction. In this work, we propose OxyGenerator, the first deep learning based model, to reconstruct the global ocean deoxygenation from 1920 to 2023. Specifically, to address the heterogeneity across large temporal and spatial scales, we propose zoning-varying graph message-passing to capture the complex oceanographic correlations between missing values and sparse observations. Additionally, to further calibrate the uncertainty, we incorporate inductive bias from dissolved oxygen (DO) variations and chemical effects. Compared with in-situ DO observations, OxyGenerator significantly outperforms CMIP6 numerical simulations, reducing MAPE by 38.77%, demonstrating a promising potential to understand the "breathless ocean" in data-driven manner.

* Accepted to ICML 2024

Via

Access Paper or Ask Questions