Abstract:Understanding spatial dynamics and semantics in point cloud is fundamental for comprehensive 3D comprehension. While reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) have recently achieved remarkable breakthroughs in large language models by incentivizing reasoning capabilities through strategic reward design, their potential remains largely unexplored in the 3D perception domain. This naturally raises a pivotal question: Can RL-based methods effectively empower 3D point cloud fine-tuning? In this paper, we propose PointRFT, the first reinforcement fine-tuning paradigm tailored specifically for point cloud representation learning. We select three prevalent 3D foundation models and devise specialized accuracy reward and dispersion reward functions to stabilize training and mitigate distribution shifts. Through comprehensive few-shot classification experiments comparing distinct training paradigms, we demonstrate that PointRFT consistently outperforms vanilla supervised fine-tuning (SFT) across diverse benchmarks. Furthermore, when organically integrated into a hybrid Pretraining-SFT-RFT paradigm, the representational capacity of point cloud foundation models is substantially unleashed, achieving state-of-the-art performance particularly under data-scarce scenarios.
Abstract:Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrast models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose \textit{PointDico}, a novel model that seamlessly integrates these methods. \textit{PointDico} learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. \textit{PointDico} achieves a new state-of-the-art in 3D representation learning, \textit{e.g.}, \textbf{94.32\%} accuracy on ScanObjectNN, \textbf{86.5\%} Inst. mIoU on ShapeNetPart.




Abstract:As an emerging paradigm of federated learning, asynchronous federated learning offers significant speed advantages over traditional synchronous federated learning. Unlike synchronous federated learning, which requires waiting for all clients to complete updates before aggregation, asynchronous federated learning aggregates the models that have arrived in realtime, greatly improving training speed. However, this mechanism also introduces the issue of client model version inconsistency. When the differences between models of different versions during aggregation become too large, it may lead to conflicts, thereby reducing the models accuracy. To address this issue, this paper proposes an asynchronous federated learning version correction algorithm based on knowledge distillation, named FedADT. FedADT applies knowledge distillation before aggregating gradients, using the latest global model to correct outdated information, thus effectively reducing the negative impact of outdated gradients on the training process. Additionally, FedADT introduces an adaptive weighting function that adjusts the knowledge distillation weight according to different stages of training, helps mitigate the misleading effects caused by the poorer performance of the global model in the early stages of training. This method significantly improves the overall performance of asynchronous federated learning without adding excessive computational overhead. We conducted experimental comparisons with several classical algorithms, and the results demonstrate that FedADT achieves significant improvements over other asynchronous methods and outperforms all methods in terms of convergence speed.