Yang Zhao

Frank

MMLF: Multi-modal Multi-class Late Fusion for Object Detection with Uncertainty Estimation

Oct 11, 2024

Qihang Yang, Yang Zhao, Hong Cheng

Abstract:Autonomous driving necessitates advanced object detection techniques that integrate information from multiple modalities to overcome the limitations associated with single-modal approaches. The challenges of aligning diverse data in early fusion and the complexities, along with overfitting issues introduced by deep fusion, underscore the efficacy of late fusion at the decision level. Late fusion ensures seamless integration without altering the original detector's network structure. This paper introduces a pioneering Multi-modal Multi-class Late Fusion method, designed for late fusion to enable multi-class detection. Fusion experiments conducted on the KITTI validation and official test datasets illustrate substantial performance improvements, presenting our model as a versatile solution for multi-modal object detection in autonomous driving. Moreover, our approach incorporates uncertainty analysis into the classification fusion process, rendering our model more transparent and trustworthy and providing more reliable insights into category predictions.
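
As a rough illustration of decision-level fusion with an uncertainty score, the sketch below averages the class-probability vectors that two detectors assign to the same matched box and reports the Shannon entropy of the result. It is a generic stand-in, not the MMLF method: the function names, the `weight_a` parameter, and the choice of weighted averaging and entropy are all assumptions made for this example.

```python
import math


def fuse_class_probs(p_a, p_b, weight_a=0.5):
    """Weighted average of two class-probability vectors (hypothetical helper).

    p_a and p_b are the per-class probabilities from two detectors for one
    matched box. Plain averaging is an assumption, not the paper's fusion rule.
    """
    if len(p_a) != len(p_b) or not p_a:
        raise ValueError("probability vectors must be non-empty and equal length")
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]
    total = sum(fused)
    return [f / total for f in fused]  # renormalise against rounding drift


def entropy(probs):
    """Shannon entropy in nats; higher means a less certain class prediction."""
    return sum(-p * math.log(p) for p in probs if p > 0)


# Example: a camera-based and a LiDAR-based detector score the same box
# over three classes (car, pedestrian, cyclist).
camera = [0.7, 0.2, 0.1]
lidar = [0.5, 0.4, 0.1]
fused = fuse_class_probs(camera, lidar)
uncertainty = entropy(fused)
```

The detector network itself is untouched here; only its output scores are combined, which is the property the abstract attributes to late fusion.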

Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Oct 08, 2024

Sha Guo, Zhuo Chen, Yang Zhao, Ning Zhang, Xiaotong Li, Lingyu Duan

Abstract:Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.

* in Proceedings of the 31st ACM International Conference on Multimedia, pp. 1431-1442, 2023

Ordinal Preference Optimization: Aligning Human Preferences via NDCG

Oct 06, 2024

Yang Zhao, Yixin Wang, Mingzhang Yin

Abstract:Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.
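
For readers unfamiliar with the metric named in this abstract, here is a minimal sketch of plain NDCG for one prompt with several scored responses. It computes the standard, non-differentiable metric only; the differentiable surrogate loss that OPO trains with is not reproduced. The function names and the exponential gain form `2**g - 1` are choices made for this example.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(rewards, scores):
    """NDCG of the ranking induced by model `scores`, judged by `rewards`.

    rewards: ground-truth ordinal rewards, one per response.
    scores:  model scores for the same responses (higher = ranked earlier).
    """
    if len(rewards) != len(scores):
        raise ValueError("rewards and scores must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_gains = [rewards[i] for i in order]
    ideal_dcg = dcg(sorted(rewards, reverse=True))
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Four responses with ordinal rewards 3 > 2 > 1 > 0.
rewards = [3, 2, 1, 0]
perfect = ndcg(rewards, scores=[0.9, 0.5, 0.3, 0.1])   # model order matches rewards
inverted = ndcg(rewards, scores=[0.1, 0.3, 0.5, 0.9])  # model order is reversed
```

Because the logarithmic discount weights top positions most heavily, swapping two responses near the top of the list lowers the score more than swapping two near the bottom. A pairwise comparison does not carry this positional information.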

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Oct 03, 2024

Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, Xihui Liu

Abstract:It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video.

* Project page: https://epiphqny.github.io/Loong-video/

NTIRE 2024 Challenge on Stereo Image Super-Resolution: Methods and Results

Sep 25, 2024

Longguang Wang, Yulan Guo, Juncheng Li, Hongda Liu, Yang Zhao, Yingqian Wang, Zhi Jin, Shuhang Gu, Radu Timofte

Abstract:This paper summarizes the 3rd NTIRE challenge on stereo image super-resolution (SR) with a focus on new solutions and results. The task of this challenge is to super-resolve a low-resolution stereo image pair to a high-resolution one with a magnification factor of x4 under a limited computational budget. Compared with single image SR, the main difficulty lies in how to exploit additional information from the other viewpoint and how to maintain stereo consistency in the results. This challenge has 2 tracks: one on bicubic degradation and one on real degradations. In total, 108 and 70 participants registered for the two tracks, respectively. In the test phase, 14 and 13 teams submitted valid results with PSNR (RGB) scores better than the baseline. This challenge establishes a new benchmark for stereo image SR.
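
The ranking criterion mentioned above, PSNR over RGB values, can be sketched as follows. This is the textbook definition applied to flat sequences of 8-bit channel values; the function name and the flat-list input format are assumptions for this example, and the challenge's own evaluation script (cropping, per-view averaging, and so on) is not reproduced.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized value lists.

    reference, estimate: flat sequences of channel values (e.g. R, G, B of
    every pixel, concatenated). peak is the maximum possible value.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Two RGB pixels, flattened; the estimate is off by 5 in every channel.
ground_truth = [120, 64, 200, 30, 30, 30]
super_resolved = [125, 69, 205, 35, 35, 35]
score_db = psnr(ground_truth, super_resolved)
```

A higher value means the super-resolved image is closer to the ground truth, so "better than the baseline" corresponds to a larger PSNR.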

Robust Beamforming Design for Near-Field DMA-NOMA mmWave Communications With Imperfect Position Information

Sep 24, 2024

Yue Xiu, Yang Zhao, Songjie Yang, Yufeng Zhang, Dusit Niyato, Hongyang Du, Ning Wei

Abstract:For millimeter-wave (mmWave) non-orthogonal multiple access (NOMA) communication systems, we propose an innovative near-field (NF) transmission framework based on dynamic metasurface antenna (DMA) technology. In this framework, a base station (BS) utilizes the DMA hybrid beamforming technology combined with the NOMA principle to maximize communication efficiency between near-field users (NUs) and far-field users (FUs). In conventional communication systems, obtaining channel state information (CSI) requires substantial pilot signals, significantly reducing system communication efficiency. We propose a beamforming design scheme based on position information to address this challenge. This scheme does not depend on pilot signals but indirectly obtains CSI by analyzing the geometric relationship between user position information and channel models. However, in practical applications, the accuracy of position information is challenging to guarantee and may contain errors. We propose a robust beamforming design strategy based on the worst-case scenario to tackle this issue. Faced with this multi-variable coupled non-convex problem, we employ a dual-loop iterative joint optimization algorithm to update beamforming using block coordinate descent (BCD) and derive the optimal power allocation (PA) expression. We analyze the algorithm's convergence and complexity to verify its performance and robustness. We validate the theoretical derivation of the CSI error bound through simulation experiments. Numerical results show that our proposed scheme performs better than traditional beamforming schemes. Additionally, the transmission framework exhibits strong robustness to NU and FU position errors, laying a solid foundation for the practical application of mmWave NOMA communication systems.

Supervised Fine-Tuning: An Activation Pattern Optimization Process for Attention Heads

Sep 24, 2024

Yang Zhao, Li Du, Xiao Ding, Kai Xiong, Ting Liu, Bing Qin

Abstract:Though demonstrating promising potential, LLMs' performance on complex tasks, such as advanced mathematics and complex disease diagnosis, is still unsatisfactory. A key issue is that present LLMs learn in a data-driven manner, while instruction datasets for these complex tasks are both scarce and hard to collect or construct. In contrast, a prominent phenomenon is that LLMs can learn rather quickly on simpler tasks when adequate prior knowledge has been captured during the pretraining stage. Thus, if the prerequisites and mechanism of such rapid generalization could be elucidated, this could be highly beneficial for enhancing the efficiency and effectiveness of LLMs' ability to learn complex tasks. In this paper, we employ a gradient-based method to dissect how the supervised fine-tuning (SFT) process adapts LLMs to downstream tasks from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, we conduct experiments to examine whether these conclusions could effectively enhance the efficiency and effectiveness of SFT, particularly in handling complex tasks and when instructional resources are scarce. Our research not only uncovers the underlying reasons behind LLMs' rapid learning and generalization mechanisms but also provides practical solutions for addressing data challenges in complex and specialized tasks.

* in review

Delay Minimization for Movable Antennas-Enabled Anti-Jamming Communications With Mobile Edge Computing

Sep 22, 2024

Yue Xiu, Yang Zhao, Songjie Yang, Minrui Xu, Dusit Niyato, Yueyang Li, Ning Wei

Abstract:In future 6G networks, anti-jamming will become a critical challenge, particularly with the development of intelligent jammers that can initiate malicious interference, posing a significant security threat to communication transmission. Additionally, 6G networks have introduced mobile edge computing (MEC) technology to reduce system delay for edge user equipment (UEs). Thus, one of the key challenges in wireless communications is minimizing the system delay while mitigating interference and improving the communication rate. However, the current fixed-position antenna (FPA) techniques have limited degrees of freedom (DoF) and high power consumption, making them inadequate for communication in highly interfering environments. To address these challenges, this paper proposes a novel MEC anti-jamming communication architecture supported by movable antenna (MA) technology. The core of the MA technique lies in optimizing the position of the antennas to increase DoF. The increase in DoF enhances the system's anti-jamming capabilities and reduces system delay. In this study, our goal is to reduce system delay while ensuring communication security and computational requirements. We design the position of MAs for UEs and the base station (BS), optimize the transmit beamforming at the UEs and the receive beamforming at the BS, and adjust the offloading rates and resource allocation for computation tasks at the MEC server. Since the optimization problem is a non-convex multi-variable coupled problem, we propose an algorithm based on penalty dual decomposition (PDD) combined with successive convex approximation (SCA). The simulation results demonstrate that the proposed MA architecture and the corresponding schemes offer superior anti-jamming capabilities and reduce the system delay compared to FPA.

MSDet: Receptive Field Enhanced Multiscale Detection for Tiny Pulmonary Nodule

Sep 21, 2024

Guohui Cai, Ying Cai, Zeyu Zhang, Daji Ergu, Yuanzhouhan Cao, Binbin Hu, Zhibin Liao, Yang Zhao

Abstract:Pulmonary nodules are critical indicators for the early diagnosis of lung cancer, making their detection essential for timely treatment. However, traditional CT imaging methods suffer from cumbersome procedures, low detection rates, and poor localization accuracy. The subtle differences between pulmonary nodules and surrounding tissues in complex lung CT images, combined with repeated downsampling in feature extraction networks, often lead to missed or false detections of small nodules. Existing methods such as FPN, with its fixed feature fusion and limited receptive field, struggle to overcome these issues. To address these challenges, our paper makes three key contributions. First, we propose MSDet, a multiscale attention and receptive field network for detecting tiny pulmonary nodules. Second, we propose the extended receptive domain (ERD) strategy to capture richer contextual information and reduce false positives caused by nodule occlusion. We also propose the position channel attention mechanism (PCAM) to optimize feature learning and reduce multiscale detection errors, and design the tiny object detection block (TODB) to enhance the detection of tiny nodules. Lastly, we conduct thorough experiments on the public LUNA16 dataset, achieving state-of-the-art performance, with an mAP improvement of 8.8% over the previous state-of-the-art method YOLOv8. These advancements significantly boost detection accuracy and reliability, providing a more effective solution for early lung cancer diagnosis. The code will be available at https://github.com/CaiGuoHui123/MSDet

Hybrid Cost Volume for Memory-Efficient Optical Flow

Sep 06, 2024

Yang Zhao, Gangwei Xu, Gang Wu

Abstract:Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes. However, as image resolution increases, the computational and spatial complexity of constructing these cost volumes grows at a quartic rate, making these methods impractical for high-resolution images. In this paper, we propose a novel Hybrid Cost Volume for memory-efficient optical flow, named HCV. To construct HCV, we first propose a Top-k strategy to separate the 4D cost volume into two global 3D cost volumes. These volumes significantly reduce memory usage while retaining a substantial amount of matching information. We further introduce a local 4D cost volume with a local search space to supplement the local information for HCV. Based on HCV, we design a memory-efficient optical flow network, named HCVFlow. Compared to recurrent flow methods based on all-pairs cost volumes, our HCVFlow significantly reduces memory consumption while ensuring high accuracy. We validate the effectiveness and efficiency of our method on the Sintel and KITTI datasets and real-world 4K (2160*3840) resolution images. Extensive experiments show that our HCVFlow has very low memory usage and outperforms other memory-efficient methods in terms of accuracy. The code is publicly available at https://github.com/gangweiX/HCVFlow.

* 10 pages, 6 figures
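
To make the quartic growth mentioned in this abstract concrete, the sketch below estimates the memory of a dense all-pairs cost volume, which stores one matching score for every pair of feature-map positions. The 1/8-resolution feature map and 4-byte float values are assumptions for this example (neither is stated in the abstract), and the function name is hypothetical. The sizes of HCV's own Top-k 3D volumes are not modelled.

```python
def all_pairs_cost_volume_bytes(height, width, downsample=8, bytes_per_value=4):
    """Memory of a dense 4D all-pairs cost volume, in bytes.

    The volume has shape (h, w, h, w) with h = height // downsample and
    w = width // downsample, i.e. (h * w) ** 2 entries.
    downsample=8 and bytes_per_value=4 (float32) are assumptions.
    """
    h, w = height // downsample, width // downsample
    return (h * w) ** 2 * bytes_per_value


# 4K frame from the abstract: 2160 x 3840 -> 270 x 480 feature map.
bytes_4k = all_pairs_cost_volume_bytes(2160, 3840)
gib_4k = bytes_4k / 2 ** 30

# Doubling both image dimensions multiplies the memory by 2 ** 4 = 16.
growth = all_pairs_cost_volume_bytes(4320, 7680) / bytes_4k
```

Under these assumptions a single 4K all-pairs volume needs 67,184,640,000 bytes, roughly 63 GiB, which is why the abstract calls dense all-pairs methods impractical at high resolution.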
