Jin Xie

FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

Oct 25, 2024

Tianyu Zhang, Guocheng Qian, Jin Xie, Jian Yang

Abstract: Point cloud frame interpolation is a challenging task that involves accurate scene flow estimation across frames and maintaining the geometry structure. Prevailing techniques often rely on pre-trained motion estimators or intensive testing-time optimization, resulting in compromised interpolation accuracy or prolonged inference. This work presents FastPCI, which introduces a Pyramid Convolution-Transformer architecture for point cloud frame interpolation. Our hybrid Convolution-Transformer improves the local and long-range feature learning, while the pyramid network offers multilevel features and reduces the computation. In addition, FastPCI proposes a unique Dual-Direction Motion-Structure block for more accurate scene flow estimation. Our design is motivated by two facts: (1) accurate scene flow preserves 3D structure, and (2) the point cloud at the previous timestep should be reconstructable using reverse motion from the future timestep. Extensive experiments show that FastPCI significantly outperforms the state-of-the-art PointINet and NeuralPCI with notable gains (e.g., 26.6% and 18.3% reduction in Chamfer Distance on KITTI), while being more than 10x and 600x faster, respectively. Code is available at https://github.com/genuszty/FastPCI

* To appear in ECCV 2024
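
The abstract reports gains as a reduction in Chamfer Distance. As a rough illustration of that metric, the sketch below computes one common variant (mean squared nearest-neighbour distance, summed over both directions) on small point lists. The exact variant and scaling used in the paper are not stated in the abstract, so this form is an assumption, and `chamfer_distance` and `relative_reduction` are hypothetical helper names.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    a, b: lists of (x, y, z) tuples. This variant averages squared
    nearest-neighbour distances in each direction and adds the two
    directions; other definitions use unsquared distances or other scaling.
    """
    def one_way(src, dst):
        # For every source point, squared distance to its nearest target point.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline value."""
    return (baseline - ours) / baseline
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a toy check but would normally be replaced by a spatial index for real LiDAR frames.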

iSeg: An Iterative Refinement-based Framework for Training-free Segmentation

Sep 05, 2024

Lin Sun, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

Abstract: Stable diffusion has demonstrated strong image synthesis ability given text descriptions, suggesting that it contains strong semantic clues for grouping objects. Inspired by this, researchers have explored employing stable diffusion for training-free segmentation. Most existing approaches either simply employ the cross-attention map or refine it with the self-attention map to generate segmentation masks. We believe that iterative refinement with the self-attention map would lead to better results. However, we empirically demonstrate that such a refinement is sub-optimal, likely because the self-attention map contains irrelevant global information that hampers accurate refinement of the cross-attention map over multiple iterations. To address this, we propose an iterative refinement framework for training-free segmentation, named iSeg, with an entropy-reduced self-attention module that uses a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, our iSeg stably improves the refined cross-attention map with iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks reveal the merits of the proposed contributions, leading to promising performance on diverse segmentation tasks. For unsupervised semantic segmentation on Cityscapes, our iSeg achieves an absolute gain of 3.8% in terms of mIoU compared to the best existing training-free approach in the literature. Moreover, our proposed iSeg can support segmentation with different kinds of images and interactions.
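
To make the idea of reducing the entropy of an attention map by gradient descent concrete, here is a minimal sketch on a single attention row. It is not the update rule from the paper, which the abstract does not spell out: it assumes the row is parameterised by softmax logits and takes plain gradient steps on the Shannon entropy. The names `reduce_entropy`, `lr` and `steps` are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(v * math.log(max(v, eps)) for v in p)


def reduce_entropy(p, lr=1.0, steps=5, eps=1e-12):
    """Gradient descent on the entropy of one attention row.

    Assumption: the row is written as softmax(z), for which
    dH/dz_i = -p_i * (log p_i + H). Stepping against this gradient
    sharpens strong responses and suppresses weak ones.
    """
    z = [math.log(max(v, eps)) for v in p]
    for _ in range(steps):
        q = softmax(z)
        h = entropy(q)
        # z_i <- z_i - lr * dH/dz_i
        z = [zi + lr * qi * (math.log(max(qi, eps)) + h) for zi, qi in zip(z, q)]
    return softmax(z)
```

Working in logit space keeps every intermediate row a valid probability distribution, so no separate clipping or renormalisation step is needed.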

Text2LiDAR: Text-guided LiDAR Point Cloud Generation via Equirectangular Transformer

Jul 29, 2024

Yang Wu, Kaihua Zhang, Jianjun Qian, Jin Xie, Jian Yang

Abstract: The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. High-quality and controllable LiDAR data generation is urgently needed. Controlling generation with text is a common practice, but there is little research in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture, utilizing the designed equirectangular attention to capture LiDAR features in a manner suited to the data characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through the global-to-focused attention mechanism. Additionally, we devise a frequency modulator to assist the model in recovering high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext, which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on the KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.
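
The equirectangular view of LiDAR data that the architecture is named after can be illustrated with a small projection routine. The sketch below maps 3D points to a range image indexed by elevation and azimuth. The image size, field of view and tie-breaking rule are assumptions chosen for illustration and are not taken from the paper; `to_range_image` is a hypothetical name.

```python
import math


def to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points onto an equirectangular range image.

    Returns a height x width grid holding the range of the closest point
    that falls in each cell, or None for empty cells.
    """
    image = [[None] * width for _ in range(height)]
    fov = fov_up_deg - fov_down_deg
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)  # azimuth in [-pi, pi]
        pitch = math.degrees(math.asin(max(-1.0, min(1.0, z / r))))
        u = int((yaw + math.pi) / (2 * math.pi) * width) % width
        v = int((fov_up_deg - pitch) / fov * height)
        if 0 <= v < height:
            # Several points can share a cell; keep the closest return.
            if image[v][u] is None or r < image[v][u]:
                image[v][u] = r
    return image
```

Each row of the grid corresponds to an elevation band and each column to an azimuth bin, which is why the left and right image borders are neighbours in 3D.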

MSfusion: A Dynamic Model Splitting Approach for Resource-Constrained Machines to Collaboratively Train Larger Models

Jul 04, 2024

Jin Xie, Songze Li

Abstract: Training large models requires a large amount of data, as well as abundant computation resources. While collaborative learning (e.g., federated learning) provides a promising paradigm to harness collective data from many participants, training large models remains a major challenge for participants with limited resources, such as mobile devices. We introduce MSfusion, an effective and efficient collaborative learning framework tailored for training larger models on resource-constrained machines through model splitting. Specifically, a double shifting model splitting scheme is designed such that in each training round, each participant is assigned a subset of model parameters to train over local data, and aggregates with sub-models of other peers on common parameters. While model splitting significantly reduces the computation and communication costs of individual participants, additional novel designs on adaptive model overlapping and contrastive loss functions help MSfusion maintain training effectiveness against model shift across participants. Extensive experiments on image and NLP tasks illustrate significant advantages of MSfusion in performance and efficiency for training large models, and its strong scalability: the computation cost of each participant reduces significantly as the number of participants increases.

* 12 pages, 9 figures
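
The two mechanics named in the abstract, assigning each participant a subset of parameters per round and aggregating on common parameters, can be sketched as follows. This is a deliberately simplified stand-in: it uses a single sliding window that shifts by one index per round, not the paper's double shifting scheme, and `assign_window`, `aggregate`, `window` and `stride` are hypothetical names.

```python
def assign_window(participant, round_idx, n_params, window, stride):
    """Parameter indices trained by one participant in one round.

    Hypothetical scheme: participants start `stride` indices apart, and the
    whole assignment shifts by one index every round, wrapping around.
    Choosing window > stride makes neighbouring windows overlap.
    """
    start = (participant * stride + round_idx) % n_params
    return [(start + k) % n_params for k in range(window)]


def aggregate(updates):
    """Average each parameter over the participants that trained it.

    updates: list of dicts mapping parameter index -> locally trained value.
    Indices trained by only one participant are passed through unchanged.
    """
    sums, counts = {}, {}
    for upd in updates:
        for idx, val in upd.items():
            sums[idx] = sums.get(idx, 0.0) + val
            counts[idx] = counts.get(idx, 0) + 1
    return {idx: sums[idx] / counts[idx] for idx in sums}
```

With more participants, each one can hold a smaller window while the union of windows still covers the model, which matches the scalability claim in the abstract at a high level.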

FedMeS: Personalized Federated Continual Learning Leveraging Local Memory

Apr 19, 2024

Jin Xie, Chenqing Zhu, Songze Li

Abstract: We focus on the problem of Personalized Federated Continual Learning (PFCL): a group of distributed clients, each with a sequence of local tasks on arbitrary data distributions, collaborate through a central server to train a personalized model at each client, with the model expected to achieve good performance on all local tasks. We propose a novel PFCL framework called Federated Memory Strengthening (FedMeS) to address the challenges of client drift and catastrophic forgetting. In FedMeS, each client stores samples from previous tasks using a small amount of local memory, and leverages this information to both 1) calibrate gradient updates in the training process; and 2) perform KNN-based Gaussian inference to facilitate personalization. FedMeS is designed to be task-oblivious, such that the same inference process is applied to samples from all tasks to achieve good performance. FedMeS is analyzed theoretically and evaluated experimentally. It is shown to outperform all baselines in average accuracy and forgetting rate, over various combinations of datasets, task distributions, and client numbers.
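
One plausible reading of KNN-based inference over a local memory is sketched below: the k stored samples nearest to a query vote for their labels, with votes weighted by a Gaussian kernel of the distance. The abstract does not define the inference rule, so the kernel weighting, the parameters `k` and `sigma`, and the name `knn_gaussian_predict` are all assumptions.

```python
import math


def knn_gaussian_predict(query, memory, k=3, sigma=1.0, n_classes=2):
    """Class probabilities from the k nearest samples in a local memory.

    memory: list of (feature_vector, label) pairs with integer labels.
    Each neighbour votes with weight exp(-d^2 / (2 * sigma^2)).
    """
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:k]
    weights = [0.0] * n_classes
    for feat, label in nearest:
        d2 = math.dist(query, feat) ** 2
        weights[label] += math.exp(-d2 / (2 * sigma ** 2))
    total = sum(weights)
    if total == 0.0:
        # All neighbours are too far away to register; fall back to uniform.
        return [1.0 / n_classes] * n_classes
    return [w / total for w in weights]
```

Because the rule depends only on the query and the stored samples, the same call can be applied to inputs from any task, in line with the task-oblivious design described above.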

VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection

Apr 15, 2024

Bonan Ding, Jin Xie, Jing Nie, Jiale Cao

Abstract: Due to its cost-effectiveness and widespread availability, monocular 3D object detection, which relies solely on a single camera during inference, holds significant importance across various applications, including autonomous driving and robotics. Nevertheless, directly predicting the coordinates of objects in 3D space from monocular images poses challenges. Therefore, an effective solution involves transforming monocular images into LiDAR-like representations and employing a LiDAR-based 3D object detector to predict the 3D coordinates of objects. The key step in this method is accurately converting the monocular image into a reliable point cloud form. In this paper, we present VFMM3D, an innovative approach that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations. VFMM3D utilizes the Segment Anything Model (SAM) and Depth Anything Model (DAM) to generate high-quality pseudo-LiDAR data enriched with foreground information. Specifically, DAM is employed to generate dense depth maps. Subsequently, SAM is utilized to differentiate foreground and background regions by predicting instance masks. These predicted instance masks and depth maps are then combined and projected into 3D space to generate pseudo-LiDAR points. Finally, any point-cloud-based object detector can be utilized to predict the 3D coordinates of objects. Comprehensive experiments are conducted on the challenging 3D object detection dataset KITTI. Our VFMM3D establishes a new state-of-the-art performance. Additionally, experimental results demonstrate the generality of VFMM3D, showcasing its seamless integration into various LiDAR-based 3D object detectors.

* 10 pages, 5 figures
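
The step that combines depth maps and instance masks and projects them into 3D can be illustrated with the standard pinhole back-projection. The sketch below is a minimal version under stated assumptions: the camera intrinsics `fx`, `fy`, `cx`, `cy` are given, the mask is attached to each point as a foreground flag (the abstract does not say how the two are combined), and the output stays in camera coordinates. `depth_to_pseudo_lidar` is a hypothetical name.

```python
def depth_to_pseudo_lidar(depth, mask, fx, fy, cx, cy):
    """Back-project a dense depth map into 3D points with a foreground flag.

    depth: H x W nested list of metric depths (values <= 0 are skipped).
    mask:  H x W nested list, truthy where a pixel belongs to an instance.
    Returns (x, y, z, is_foreground) tuples in camera coordinates
    (x right, y down, z forward).
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            # Pinhole model: pixel (u, v) at depth d maps to a 3D ray point.
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d, bool(mask[v][u])))
    return points
```

A LiDAR-based detector would typically expect these points in the sensor frame, so a camera-to-LiDAR extrinsic transform would follow in a full pipeline.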

Implicit and Explicit Language Guidance for Diffusion-based Visual Perception

Apr 11, 2024

Hefeng Wang, Jiale Cao, Jin Xie, Aiping Yang, Yanwei Pang

Abstract: Text-to-image diffusion models have shown powerful ability on conditional image synthesis. With large-scale vision-language pre-training, diffusion models are able to generate high-quality images with rich texture and reasonable structure under different text prompts. However, it is an open problem to adapt the pre-trained diffusion model for visual perception. In this paper, we propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP. Our IEDP comprises an implicit language guidance branch and an explicit language guidance branch. The implicit branch employs a frozen CLIP image encoder to directly generate implicit text embeddings that are fed to the diffusion model, without using explicit text prompts. The explicit branch utilizes the ground-truth labels of corresponding images as text prompts to condition feature extraction of the diffusion model. During training, we jointly train the diffusion model by sharing the model weights of these two branches. As a result, the implicit and explicit branches can jointly guide feature learning. During inference, we only employ the implicit branch for final prediction, which does not require any ground-truth labels. Experiments are performed on two typical perception tasks: semantic segmentation and depth estimation. Our IEDP achieves promising performance on both tasks. For semantic segmentation, our IEDP has an mIoU score of 55.9% on the ADE20K validation set, which outperforms the baseline method VPD by 2.2%. For depth estimation, our IEDP outperforms the baseline method VPD with a relative gain of 10.2%.
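
The segmentation result above is reported as mIoU. For readers unfamiliar with the metric, here is a minimal sketch that computes it from flat lists of predicted and ground-truth class ids. It assumes the common convention of skipping classes absent from both prediction and ground truth; benchmark toolkits may differ in details such as ignore labels. `mean_iou` is a hypothetical helper name.

```python
def mean_iou(pred, target, n_classes):
    """Mean intersection-over-union over classes.

    pred, target: equal-length flat lists of integer class ids.
    Classes that appear in neither list are left out of the mean.
    """
    ious = []
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        return 0.0
    return sum(ious) / len(ious)
```

Note that the 2.2% figure for segmentation is an absolute difference in mIoU, while the 10.2% figure for depth estimation is stated as a relative gain.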

Diff-Reg v1: Diffusion Matching Model for Registration Problem

Mar 29, 2024

Qianliang Wu, Haobo Jiang, Lei Luo, Jun Li, Yaqing Ding, Jin Xie, Jian Yang

Abstract: Establishing reliable correspondences is essential for registration tasks such as 3D and 2D-3D registration. Existing methods commonly leverage geometric or semantic point features to generate potential correspondences. However, these features may face challenges such as large deformation, scale inconsistency, and ambiguous matching problems (e.g., symmetry). Additionally, many previous methods, which rely on single-pass prediction, may struggle with local minima in complex scenarios. To mitigate these challenges, we introduce a diffusion matching model for robust correspondence construction. Our approach treats correspondence estimation as a denoising diffusion process within the doubly stochastic matrix space, which gradually denoises (refines) a doubly stochastic matching matrix to the ground-truth one for high-quality correspondence estimation. It involves a forward diffusion process that gradually introduces Gaussian noise into the ground-truth matching matrix and a reverse denoising process that iteratively refines the noisy matching matrix. In particular, the feature extraction from the backbone occurs only once during the inference phase. Our lightweight denoising module utilizes the same feature at each reverse sampling step. Evaluation of our method on both 3D and 2D-3D registration tasks confirms its effectiveness.

* arXiv admin note: text overlap with arXiv:2401.00436
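
A doubly stochastic matrix has non-negative entries with every row and every column summing to one. The sketch below shows one way a noised matching matrix could be brought back into that space: Gaussian noise is added to a ground-truth matching matrix, then Sinkhorn normalisation is applied. Sinkhorn is a standard technique swapped in here for illustration; the abstract does not say which projection the paper uses, and `add_gaussian_noise` and `sinkhorn` are hypothetical names.

```python
import math
import random


def add_gaussian_noise(matrix, sigma, rng):
    """Add independent N(0, sigma^2) noise to every entry of a matrix."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in matrix]


def sinkhorn(scores, iters=100):
    """Map a square score matrix to an approximately doubly stochastic one.

    Entries are exponentiated to make them positive, then rows and columns
    are normalised alternately (Sinkhorn-Knopp iteration).
    """
    m = [[math.exp(s) for s in row] for row in scores]
    n = len(m)
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]
    return m


# Usage: noise an identity matching of three points, then renormalise.
rng = random.Random(0)
ground_truth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = add_gaussian_noise(ground_truth, 0.5, rng)
matching = sinkhorn(noisy)
```

A reverse denoising step would take a matrix like `matching` and refine it toward the ground-truth correspondences; that learned module is outside the scope of this sketch.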

CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation

Mar 19, 2024

Wenqi Zhu, Jiale Cao, Jin Xie, Shuangming Yang, Yanwei Pang

Abstract: Open-vocabulary video instance segmentation strives to segment and track instances belonging to an open set of categories in a video. The vision-language model Contrastive Language-Image Pre-training (CLIP) has shown strong zero-shot classification ability on image-level open-vocabulary tasks. In this paper, we propose a simple encoder-decoder network, called CLIP-VIS, to adapt CLIP for open-vocabulary video instance segmentation. Our CLIP-VIS adopts a frozen CLIP image encoder and introduces three modules: class-agnostic mask generation, temporal topK-enhanced matching, and weighted open-vocabulary classification. Given a set of initial queries, class-agnostic mask generation employs a transformer decoder to predict query masks and corresponding object scores and mask IoU scores. Then, temporal topK-enhanced matching performs query matching across frames by using the K best-matched frames. Finally, weighted open-vocabulary classification first generates query visual features with mask pooling, and then performs weighted classification using object scores and mask IoU scores. Our CLIP-VIS does not require the annotations of instance categories and identities. The experiments are performed on various video instance segmentation datasets and demonstrate the effectiveness of our proposed method, especially on novel categories. When using ConvNeXt-B as the backbone, our CLIP-VIS achieves AP and APn scores of 32.1% and 40.3% on the validation set of the LV-VIS dataset, which outperforms OV2Seg by 11.0% and 24.0%, respectively. We will release the source code and models at https://github.com/zwq456/CLIP-VIS.git.
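
Mask pooling, the first step of the classification module described above, averages a feature map over the pixels selected by a mask. The sketch below shows the operation on nested lists with soft mask weights. How the paper then weights class scores by object and mask IoU scores is not specified in the abstract and is not sketched here; `mask_pool` is a hypothetical name.

```python
def mask_pool(features, mask):
    """Mask-weighted average of a feature map.

    features: H x W grid of C-dimensional vectors (lists of floats).
    mask:     H x W grid of weights in [0, 1], e.g. a predicted query mask.
    Returns one C-dimensional vector; all zeros if the mask is empty.
    """
    channels = len(features[0][0])
    pooled = [0.0] * channels
    total = 0.0
    for feat_row, mask_row in zip(features, mask):
        for vec, w in zip(feat_row, mask_row):
            total += w
            for c in range(channels):
                pooled[c] += w * vec[c]
    if total == 0.0:
        return pooled
    return [v / total for v in pooled]
```

The pooled vector plays the role of a per-query visual feature that can be compared against text embeddings of category names.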

Active Simultaneously Transmitting and Reflecting Surface Assisted NOMA Networks

Jan 25, 2024

Xinwei Yue, Jin Xie, Chongjun Ouyang, Yuanwei Liu, Xia Shen, Zhiguo Ding

Abstract: The novel active simultaneously transmitting and reflecting surface (ASTARS) has recently received a lot of attention due to its capability to conquer the multiplicative fading loss and achieve full-space smart radio environments. This paper introduces the ASTARS to assist non-orthogonal multiple access (NOMA) communications, where the stochastic geometry theory is used to model the spatial positions of pairing users. We design the independent reflection/transmission phase-shift controllers of ASTARS to align the phases of cascaded channels at pairing users. We derive new closed-form and asymptotic expressions of the outage probability and ergodic data rate for ASTARS-NOMA networks in the presence of perfect/imperfect successive interference cancellation (pSIC). The diversity orders and multiplexing gains for ASTARS-NOMA are derived to provide more insights. Furthermore, the system throughputs of ASTARS-NOMA are investigated in both delay-tolerant and delay-limited transmission modes. The numerical results show that: 1) ASTARS-NOMA with pSIC outperforms ASTARS-assisted orthogonal multiple access (ASTARS-OMA) in terms of outage probability and ergodic data rate; 2) the outage probability of ASTARS-NOMA can be further reduced within a certain range by increasing the power amplification factors; 3) the system throughputs of ASTARS-NOMA are superior to those of ASTARS-OMA in both delay-limited and delay-tolerant transmission modes.
