Girish Chandar Ganesan

MonoCoP: Chain-of-Prediction for Monocular 3D Object Detection

May 08, 2025

Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu

Abstract: Accurately predicting 3D attributes is crucial for monocular 3D object detection (Mono3D), with depth estimation posing the greatest challenge due to the inherent ambiguity in mapping 2D images to 3D space. While existing methods leverage multiple depth cues (e.g., estimating depth uncertainty, modeling depth error) to improve depth accuracy, they overlook that accurate depth prediction requires conditioning on other 3D attributes. Because these attributes are intrinsically inter-correlated through the 3D-to-2D projection, ignoring the dependency ultimately limits overall accuracy and stability. Inspired by Chain-of-Thought (CoT) in large language models (LLMs), this paper proposes MonoCoP, which leverages a Chain-of-Prediction (CoP) to predict attributes sequentially and conditionally via three key designs. First, it employs a lightweight AttributeNet (AN) for each 3D attribute to learn attribute-specific features. Next, MonoCoP constructs an explicit chain to propagate these learned features from one attribute to the next. Finally, MonoCoP uses a residual connection to aggregate features for each attribute along the chain, ensuring that later attribute predictions are conditioned on all previously processed attributes without forgetting the features of earlier ones. Experimental results show that MonoCoP achieves state-of-the-art (SoTA) performance on the KITTI leaderboard without requiring additional data and further surpasses existing methods on the Waymo and nuScenes frontal datasets.
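
The three designs named in the abstract (one small network per attribute, an explicit chain, and residual aggregation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the attribute order, the feature dimension, the single linear-plus-ReLU layer standing in for an AttributeNet, and all function names are assumptions made for illustration.

```python
import random

# Assumed attribute order for illustration only; the abstract does not state it.
ATTRIBUTES = ["size", "angle", "depth"]


def make_attribute_net(dim, rng):
    """Hypothetical stand-in for an AttributeNet: one dim x dim weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_attribute_net(weights, feature):
    """Linear layer followed by ReLU, producing an attribute-specific feature."""
    return [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in weights]


def chain_of_prediction(query, nets):
    """Propagate features along the chain, aggregating with a residual sum.

    Each attribute sees the running feature, which already contains the
    contributions of every attribute processed before it.
    """
    running = list(query)
    chained = {}
    for name in ATTRIBUTES:
        attribute_feature = apply_attribute_net(nets[name], running)
        # Residual connection: add the new feature instead of replacing the old one.
        running = [r + a for r, a in zip(running, attribute_feature)]
        chained[name] = running
    return chained


rng = random.Random(0)
dim = 4
nets = {name: make_attribute_net(dim, rng) for name in ATTRIBUTES}
query = [rng.uniform(0.0, 1.0) for _ in range(dim)]
chained = chain_of_prediction(query, nets)
```

In this sketch the feature used for `depth` is the sum of the input query and the features of every earlier attribute, which is one simple way to condition a later prediction on earlier ones without discarding them.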

RePLAy: Remove Projective LiDAR Depthmap Artifacts via Exploiting Epipolar Geometry

Jul 27, 2024

Shengjie Zhu, Girish Chandar Ganesan, Abhinav Kumar, Xiaoming Liu

Abstract: 3D sensing is a fundamental task for Autonomous Vehicles (AVs). Its deployment often relies on aligned RGB cameras and LiDAR. Despite meticulous synchronization and calibration, systematic misalignment persists in LiDAR-projected depthmaps. This is due to the physical baseline distance between the two sensors. The artifact often appears as background LiDAR points incorrectly projected onto foreground objects, such as cars and pedestrians. The KITTI dataset uses stereo cameras as a heuristic solution to remove artifacts. However, most AV datasets, including nuScenes, Waymo, and DDAD, lack stereo images, making the KITTI solution inapplicable. We propose RePLAy, a parameter-free analytical solution to remove the projective artifacts. We construct a binocular vision system between a hypothesized virtual LiDAR camera and the RGB camera. We then remove the projective artifacts by determining the epipolar occlusion with the proposed analytical solution. We show consistent improvement in State-of-The-Art (SoTA) monocular depth estimators and 3D object detectors with the artifact-free depthmaps.
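
The cause of the artifact described in the abstract, a baseline between the LiDAR and the camera, can be reproduced with a small geometric sketch. The intrinsics, sensor offset, and point coordinates below are hypothetical, and the nearest-point filter at the end is a naive z-buffer baseline shown only for contrast; it is not the analytical epipolar-occlusion solution that RePLAy proposes.

```python
def project(point, focal=700.0, cx=640.0, cy=360.0):
    """Pinhole projection of a camera-frame 3D point to an integer pixel."""
    x, y, z = point
    return (round(focal * x / z + cx), round(focal * y / z + cy))


def direction_from(origin, point):
    """Unit direction of the ray from a sensor origin to a 3D point."""
    delta = [p - o for p, o in zip(point, origin)]
    norm = sum(c * c for c in delta) ** 0.5
    return tuple(round(c / norm, 6) for c in delta)


def keep_nearest_per_pixel(points):
    """Naive z-buffer: keep only the closest point that lands on each pixel."""
    nearest = {}
    for p in points:
        pixel = project(p)
        if pixel not in nearest or p[2] < nearest[pixel][2]:
            nearest[pixel] = p
    return nearest


# Hypothetical setup: camera at the origin, LiDAR mounted 0.5 m away along y.
lidar = (0.0, -0.5, 0.0)
foreground = (1.0, 0.2, 5.0)    # e.g. a point on a car, 5 m away
background = (6.0, 1.2, 30.0)   # on the same camera ray, 30 m away

same_pixel = project(foreground) == project(background)
different_lidar_rays = (
    direction_from(lidar, foreground) != direction_from(lidar, background)
)
filtered = keep_nearest_per_pixel([background, foreground])
```

Both points fall on the same camera pixel, yet the LiDAR observes them along different rays, so it returns a measurement for each. Projecting both into the image places a 30 m background depth on a foreground pixel, which is the kind of artifact the abstract describes.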
