Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Raouf Hamzaoui

A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

Dec 25, 2025

Philip Xu, David Elizondo, Raouf Hamzaoui

Abstract:We introduce Uni4D, a unified framework for large scale open vocabulary 3D retrieval and controlled 4D generation based on structured three level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D text multi head attention and search model to optimize text to 3D retrieval through improved semantic alignment. The framework further strengthens cross modal alignment through three components: precise text to 3D retrieval, multi view 3D to image alignment, and image to text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.

Via

Access Paper or Ask Questions

UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds

Oct 27, 2025

Pan Zhao, Hui Yuan, Chongzhen Tian, Tian Guo, Raouf Hamzaoui, Zhigeng Pan

Abstract:Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with 56.88% BD-bitrate savings for attributes on the Y component. Additionally, it improved perceptual quality significantly.

Via

Access Paper or Ask Questions

STQE: Spatial-Temporal Quality Enhancement for G-PCC Compressed Dynamic Point Clouds

Jul 23, 2025

Tian Guo, Hui Yuan, Xiaolong Mao, Shiqi Jiang, Raouf Hamzaoui, Sam Kwong

Abstract:Very few studies have addressed quality enhancement for compressed dynamic point clouds. In particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored. Addressing this gap, we propose a spatial-temporal attribute quality enhancement (STQE) network that exploits both spatial and temporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds. Our contributions include a recoloring-based motion compensation module that remaps reference attribute information to the current frame geometry to achieve precise inter-frame geometric alignment, a channel-aware temporal attention module that dynamically highlights relevant regions across bidirectional reference frames, a Gaussian-guided neighborhood feature aggregation module that efficiently captures spatial dependencies between geometry and color attributes, and a joint loss function based on the Pearson correlation coefficient, designed to alleviate over-smoothing effects typical of point-wise mean squared error optimization. When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with Bj{\o}ntegaard Delta rate (BD-rate) reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.

Via

Access Paper or Ask Questions

LPCM: Learning-based Predictive Coding for LiDAR Point Cloud Compression

May 26, 2025

Chang Sun, Hui Yuan, Shiqi Jiang, Da Ai, Wei Zhang, Raouf Hamzaoui

Abstract:Since the data volume of LiDAR point clouds is very huge, efficient compression is necessary to reduce their storage and transmission costs. However, existing learning-based compression methods do not exploit the inherent angular resolution of LiDAR and ignore the significant differences in the correlation of geometry information at different bitrates. The predictive geometry coding method in the geometry-based point cloud compression (G-PCC) standard uses the inherent angular resolution to predict the azimuth angles. However, it only models a simple linear relationship between the azimuth angles of neighboring points. Moreover, it does not optimize the quantization parameters for residuals on each coordinate axis in the spherical coordinate system. We propose a learning-based predictive coding method (LPCM) with both high-bitrate and low-bitrate coding modes. LPCM converts point clouds into predictive trees using the spherical coordinate system. In high-bitrate coding mode, we use a lightweight Long-Short-Term Memory-based predictive (LSTM-P) module that captures long-term geometry correlations between different coordinates to efficiently predict and compress the elevation angles. In low-bitrate coding mode, where geometry correlation degrades, we introduce a variational radius compression (VRC) module to directly compress the point radii. Then, we analyze why the quantization of spherical coordinates differs from that of Cartesian coordinates and propose a differential evolution (DE)-based quantization parameter selection method, which improves rate-distortion performance without increasing coding time. Experimental results on the LiDAR benchmark \textit{SemanticKITTI} and the MPEG-specified \textit{Ford} datasets show that LPCM outperforms G-PCC and other learning-based methods.

* 13 pages long, 8 figures and over 50 references. Submitted with IEEEtran journal mode. All figures are included in PDF format and the bibliography is resolved manually

Via

Access Paper or Ask Questions

EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Mar 19, 2025

Yuanchao Yue, Hui Yuan, Qinglong Miao, Xiaolong Mao, Raouf Hamzaoui, Peter Eisert

Figure 1 for EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Figure 2 for EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Figure 3 for EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Figure 4 for EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Abstract:Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Via

Access Paper or Ask Questions

Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation

Feb 21, 2025

Shiqi Jiang, Hui Yuan, Shuai Li, Raouf Hamzaoui, Xu Wang, Junyan Huo

Abstract:In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.

Via

Access Paper or Ask Questions

CS-Net:Contribution-based Sampling Network for Point Cloud Simplification

Jan 18, 2025

Tian Guo, Chen Chen, Hui Yuan, Xiaolong Mao, Raouf Hamzaoui, Junhui Hou

Figure 1 for CS-Net:Contribution-based Sampling Network for Point Cloud Simplification

Figure 2 for CS-Net:Contribution-based Sampling Network for Point Cloud Simplification

Figure 3 for CS-Net:Contribution-based Sampling Network for Point Cloud Simplification

Figure 4 for CS-Net:Contribution-based Sampling Network for Point Cloud Simplification

Abstract:Point cloud sampling plays a crucial role in reducing computation costs and storage requirements for various vision tasks. Traditional sampling methods, such as farthest point sampling, lack task-specific information and, as a result, cannot guarantee optimal performance in specific applications. Learning-based methods train a network to sample the point cloud for the targeted downstream task. However, they do not guarantee that the sampled points are the most relevant ones. Moreover, they may result in duplicate sampled points, which requires completion of the sampled point cloud through post-processing techniques. To address these limitations, we propose a contribution-based sampling network (CS-Net), where the sampling operation is formulated as a Top-k operation. To ensure that the network can be trained in an end-to-end way using gradient descent algorithms, we use a differentiable approximation to the Top-k operation via entropy regularization of an optimal transport problem. Our network consists of a feature embedding module, a cascade attention module, and a contribution scoring module. The feature embedding module includes a specifically designed spatial pooling layer to reduce parameters while preserving important features. The cascade attention module combines the outputs of three skip connected offset attention layers to emphasize the attractive features and suppress less important ones. The contribution scoring module generates a contribution score for each point and guides the sampling process to prioritize the most important ones. Experiments on the ModelNet40 and PU147 showed that CS-Net achieved state-of-the-art performance in two semantic-based downstream tasks (classification and registration) and two reconstruction-based tasks (compression and surface reconstruction).

Via

Access Paper or Ask Questions

SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Sep 16, 2024

Xiaolong Mao, Hui Yuan, Tian Guo, Shiqi Jiang, Raouf Hamzaoui, Sam Kwong

Figure 1 for SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Figure 2 for SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Figure 3 for SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Figure 4 for SPAC: Sampling-based Progressive Attribute Compression for Dense Point Clouds

Abstract:We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the latest G-PCC test model (TMC13v23) under the MPEG common test conditions (CTCs), the proposed method achieved an average Bjontegaard delta bitrate reduction of 24.58% for the Y component (21.23% for YUV combined) on the MPEG Category Solid dataset and 22.48% for the Y component (17.19% for YUV combined) on the MPEG Category Dense dataset. This is the first instance of a learning-based codec outperforming the G-PCC standard on these datasets under the MPEG CTCs.

* 136pages, 13 figures

Via

Access Paper or Ask Questions

Feature Compression for Cloud-Edge Multimodal 3D Object Detection

Sep 06, 2024

Chongzhen Tian, Zhengxin Li, Hui Yuan, Raouf Hamzaoui, Liquan Shen, Sam Kwong

Figure 1 for Feature Compression for Cloud-Edge Multimodal 3D Object Detection

Figure 2 for Feature Compression for Cloud-Edge Multimodal 3D Object Detection

Figure 3 for Feature Compression for Cloud-Edge Multimodal 3D Object Detection

Figure 4 for Feature Compression for Cloud-Edge Multimodal 3D Object Detection

Abstract:Machine vision systems, which can efficiently manage extensive visual perception tasks, are becoming increasingly popular in industrial production and daily life. Due to the challenge of simultaneously obtaining accurate depth and texture information with a single sensor, multimodal data captured by cameras and LiDAR is commonly used to enhance performance. Additionally, cloud-edge cooperation has emerged as a novel computing approach to improve user experience and ensure data security in machine vision systems. This paper proposes a pioneering solution to address the feature compression problem in multimodal 3D object detection. Given a sparse tensor-based object detection network at the edge device, we introduce two modes to accommodate different application requirements: Transmission-Friendly Feature Compression (T-FFC) and Accuracy-Friendly Feature Compression (A-FFC). In T-FFC mode, only the output of the last layer of the network's backbone is transmitted from the edge device. The received feature is processed at the cloud device through a channel expansion module and two spatial upsampling modules to generate multi-scale features. In A-FFC mode, we expand upon the T-FFC mode by transmitting two additional types of features. These added features enable the cloud device to generate more accurate multi-scale features. Experimental results on the KITTI dataset using the VirConv-L detection network showed that T-FFC was able to compress the features by a factor of 6061 with less than a 3% reduction in detection performance. On the other hand, A-FFC compressed the features by a factor of about 901 with almost no degradation in detection performance. We also designed optional residual extraction and 3D object reconstruction modules to facilitate the reconstruction of detected objects. The reconstructed objects effectively reflected details of the original objects.

Via

Access Paper or Ask Questions

3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

Jul 24, 2024

Congrui Fu, Hui Yuan, Liquan Shen, Raouf Hamzaoui, Hao Zhang

Figure 1 for 3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

Figure 2 for 3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

Figure 3 for 3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

Figure 4 for 3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

Abstract:In many applications, including surveillance, entertainment, and restoration, there is a need to increase both the spatial resolution and the frame rate of a video sequence. The aim is to improve visual quality, refine details, and create a more realistic viewing experience. Existing space-time video super-resolution methods do not effectively use spatio-temporal information. To address this limitation, we propose a generative adversarial network for joint space-time video super-resolution. The generative network consists of three operations: shallow feature extraction, deep feature extraction, and reconstruction. It uses three-dimensional (3D) convolutions to process temporal and spatial information simultaneously and includes a novel 3D attention mechanism to extract the most important channel and spatial information. The discriminative network uses a two-branch structure to handle details and motion information, making the generated results more accurate. Experimental results on the Vid4, Vimeo-90K, and REDS datasets demonstrate the effectiveness of the proposed method. The source code is publicly available at https://github.com/FCongRui/3DAttGan.git.

Via

Access Paper or Ask Questions