Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Sun

Max

An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Mar 04, 2025

Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang

Figure 1 for An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Figure 2 for An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Figure 3 for An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Figure 4 for An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning

Abstract:Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models' reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Getting Epic50k at https://github.com/xiaolizh1/EpicPRM.

Via

Access Paper or Ask Questions

Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Feb 11, 2025

Xiang Li, Pengfei Li, Yupeng Zheng, Wei Sun, Yan Wang, Yilun Chen

Figure 1 for Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Figure 2 for Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Figure 3 for Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Figure 4 for Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

Abstract:Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.

* Accepted by ICLR 2025

Via

Access Paper or Ask Questions

CT-UIO: Continuous-Time UWB-Inertial-Odometer Localization Using Non-Uniform B-spline with Fewer Anchors

Feb 10, 2025

Jian Sun, Wei Sun, Genwei Zhang, Kailun Yang, Song Li, Xiangqi Meng, Na Deng, Chongbin Tan

Abstract:Ultra-wideband (UWB) based positioning with fewer anchors has attracted significant research interest in recent years, especially under energy-constrained conditions. However, most existing methods rely on discrete-time representations and smoothness priors to infer a robot's motion states, which often struggle with ensuring multi-sensor data synchronization. In this paper, we present an efficient UWB-Inertial-odometer localization system, utilizing a non-uniform B-spline framework with fewer anchors. Unlike traditional uniform B-spline-based continuous-time methods, we introduce an adaptive knot-span adjustment strategy for non-uniform continuous-time trajectory representation. This is accomplished by adjusting control points dynamically based on movement speed. To enable efficient fusion of IMU and odometer data, we propose an improved Extended Kalman Filter (EKF) with innovation-based adaptive estimation to provide short-term accurate motion prior. Furthermore, to address the challenge of achieving a fully observable UWB localization system under few-anchor conditions, the Virtual Anchor (VA) generation method based on multiple hypotheses is proposed. At the backend, we propose a CT-UIO factor graph with an adaptive sliding window for global trajectory estimation. Comprehensive experiments conducted on corridor and exhibition hall datasets validate the proposed system's high precision and robust performance. The codebase and datasets of this work will be open-sourced at https://github.com/JasonSun623/CT-UIO.

* The codebase and datasets will be open-sourced at https://github.com/JasonSun623/CT-UIO

Via

Access Paper or Ask Questions

Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

Feb 04, 2025

Jian Liu, Wei Sun, Hui Yang, Pengchao Deng, Chongpei Liu, Nicu Sebe, Hossein Rahmani, Ajmal Mian

Figure 1 for Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

Figure 2 for Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

Figure 3 for Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

Figure 4 for Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation

Abstract:Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at https://github.com/CNJianLiu/Diff9D.

* 17 pages, 13 figures

Via

Access Paper or Ask Questions

AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Jan 30, 2025

Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai

Figure 1 for AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Figure 2 for AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Figure 3 for AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Figure 4 for AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

Abstract:Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, a LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and selects the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The project page is available at https://agav-rater.github.io.

Via

Access Paper or Ask Questions

Do We Really Need to Design New Byzantine-robust Aggregation Rules?

Jan 29, 2025

Minghong Fang, Seyedsina Nabavirazavi, Zhuqing Liu, Wei Sun, Sundararaja Sitharama Iyengar, Haibo Yang

Figure 1 for Do We Really Need to Design New Byzantine-robust Aggregation Rules?

Figure 2 for Do We Really Need to Design New Byzantine-robust Aggregation Rules?

Figure 3 for Do We Really Need to Design New Byzantine-robust Aggregation Rules?

Figure 4 for Do We Really Need to Design New Byzantine-robust Aggregation Rules?

Abstract:Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model through a server, without exchanging their private training data. However, the decentralized aspect of FL makes it susceptible to poisoning attacks, where malicious clients can manipulate the global model by sending altered local model updates. To counter these attacks, a variety of aggregation rules designed to be resilient to Byzantine failures have been introduced. Nonetheless, these methods can still be vulnerable to sophisticated attacks or depend on unrealistic assumptions about the server. In this paper, we demonstrate that there is no need to design new Byzantine-robust aggregation rules; instead, FL can be secured by enhancing the robustness of well-established aggregation rules. To this end, we present FoundationFL, a novel defense mechanism against poisoning attacks. FoundationFL involves the server generating synthetic updates after receiving local model updates from clients. It then applies existing Byzantine-robust foundational aggregation rules, such as Trimmed-mean or Median, to combine clients' model updates with the synthetic ones. We theoretically establish the convergence performance of FoundationFL under Byzantine settings. Comprehensive experiments across several real-world datasets validate the efficiency of our FoundationFL method.

* To appear in NDSS 2025

Via

Access Paper or Ask Questions

Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Jan 23, 2025

Linh Tran, Wei Sun, Stacy Patterson, Ana Milanova

Figure 1 for Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Figure 2 for Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Figure 3 for Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Figure 4 for Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

Abstract:Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank adaptation scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.

* Accepted to ICLR 2025 main conference track

Via

Access Paper or Ask Questions

Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Nov 29, 2024

Siqing Zhang, Yuchen Ding, Wei Tang, Wei Sun, Yong Liao, Peng Yuan Zhou

Figure 1 for Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Figure 2 for Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Figure 3 for Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Figure 4 for Privacy-Preserving Orthogonal Aggregation for Guaranteeing Gender Fairness in Federated Recommendation

Abstract:Under stringent privacy constraints, whether federated recommendation systems can achieve group fairness remains an inadequately explored question. Taking gender fairness as a representative issue, we identify three phenomena in federated recommendation systems: performance difference, data imbalance, and preference disparity. We discover that the state-of-the-art methods only focus on the first phenomenon. Consequently, their imposition of inappropriate fairness constraints detrimentally affects the model training. Moreover, due to insufficient sensitive attribute protection of existing works, we can infer the gender of all users with 99.90% accuracy even with the addition of maximal noise. In this work, we propose Privacy-Preserving Orthogonal Aggregation (PPOA), which employs the secure aggregation scheme and quantization technique, to prevent the suppression of minority groups by the majority and preserve the distinct preferences for better group fairness. PPOA can assist different groups in obtaining their respective model aggregation results through a designed orthogonal mapping while keeping their attributes private. Experimental results on three real-world datasets demonstrate that PPOA enhances recommendation effectiveness for both females and males by up to 8.25% and 6.36%, respectively, with a maximum overall improvement of 7.30%, and achieves optimal fairness in most cases. Extensive ablation experiments and visualizations indicate that PPOA successfully maintains preferences for different gender groups.

* accepted by WSDM 2025

Via

Access Paper or Ask Questions

Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Nov 25, 2024

Zhichao Zhang, Wei Sun, Xinyue Li, Yunhao Li, Qihang Ge, Jun Jia, Zicheng Zhang, Zhongpeng Ji, Fengyu Sun, Shangling Jui(+2 more)

Figure 1 for Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Figure 2 for Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Figure 3 for Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Figure 4 for Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Abstract:AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 3,200 AGVs derived from 8 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released in public at https://github.com/zczhang-sjtu/GHVQ.git

Via

Access Paper or Ask Questions

VQA$^2$:Visual Question Answering for Video Quality Assessment

Nov 06, 2024

Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min

Figure 1 for VQA$^2$:Visual Question Answering for Video Quality Assessment

Figure 2 for VQA$^2$:Visual Question Answering for Video Quality Assessment

Figure 3 for VQA$^2$:Visual Question Answering for Video Quality Assessment

Figure 4 for VQA$^2$:Visual Question Answering for Video Quality Assessment

Abstract:The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has significantly improved low-level visual evaluation within the image domain recently. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset the first visual question answering instruction dataset entirely focuses on video quality assessment, and based on it, we propose the VQA2 series models The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.

* 10 pages 3 figures

Via

Access Paper or Ask Questions