Qitao Zhao

Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning

Feb 23, 2026

Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani

Abstract: Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ("flow") as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.

* CVPR 2026. Project website: https://flow3r-project.github.io/

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Dec 11, 2025

Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang

Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

* Project website: https://qitaozhao.github.io/E-RayZer

DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

May 08, 2025

Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

Abstract: Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.

* CVPR 2025. Project website: https://qitaozhao.github.io/DiffusionSfM
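
The ray origin and endpoint parameterization mentioned in the DiffusionSfM abstract can be illustrated with a small sketch. This is not the authors' code: it only shows how, for a known pinhole camera and a known depth value, one pixel maps to a ray origin (the camera center) and a ray endpoint (the observed 3D point) in a shared global frame. The function name `pixel_ray`, the world-to-camera convention `x_cam = R @ x_world + t`, and the use of z-depth are all assumptions made for illustration; the paper predicts these quantities with a diffusion model rather than computing them from ground truth.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def pixel_ray(u, v, depth, fx, fy, cx, cy, R, t):
    """Return (origin, endpoint) of the ray through pixel (u, v).

    Assumed convention (hypothetical, for illustration only):
    x_cam = R @ x_world + t, with `depth` measured along the camera z-axis.
    """
    Rt = transpose(R)
    # Camera center in the world frame: C = -R^T t.
    origin = [-x for x in matvec(Rt, t)]
    # Back-project the pixel to a 3D point in the camera frame.
    p_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the world frame: x_world = R^T (x_cam - t).
    endpoint = matvec(Rt, [p - ti for p, ti in zip(p_cam, t)])
    return origin, endpoint


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, endpoint = pixel_ray(
    u=320.0, v=240.0, depth=2.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    R=identity, t=[0.0, 0.0, 1.0],
)
print(origin, endpoint)
```

Because every pixel of a given image shares the same origin, a dense map of such origin and endpoint pairs encodes the camera position and the scene geometry together, which is consistent with the abstract's description of parameterizing both as pixel-wise rays.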

Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis

Dec 04, 2024

Qitao Zhao, Shubham Tulsiani

Abstract: Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks -- accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems' pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines.

* NeurIPS 2024. Project website: https://qitaozhao.github.io/SparseAGS

A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation

Nov 09, 2023

Qitao Zhao, Ce Zheng, Mengyuan Liu, Chen Chen

Abstract: The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues (i.e., using a daunting number of video frames) for improved accuracy, which incurs performance saturation, intractable computation and the non-causal problem. This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues. To address this issue, we propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors -- no finetuning on the 3D task is even needed. The key observation is that, while the pose detector learns to localize 2D joints, such representations (e.g., feature maps) implicitly encode the joint-centric spatial context thanks to the regional operations in backbone networks. We design a simple baseline named Context-Aware PoseFormer to showcase its effectiveness. Without access to any temporal information, the proposed method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods using up to hundreds of video frames regarding both speed and precision. Project page: https://qitaozhao.github.io/ContextAware-PoseFormer

* Accepted to NeurIPS 2023

PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation

Mar 30, 2023

Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen

Abstract: Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.

* Accepted to CVPR 2023. Project page: https://qitaozhao.github.io/PoseFormerV2
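
The idea of a compact frequency-domain representation of a long skeleton sequence, as described in the PoseFormerV2 abstract, can be sketched as follows. This is an illustrative example, not the released implementation: it assumes a discrete cosine transform (DCT) as the frequency transform, which the abstract does not specify, and the helper names `dct_ii`, `idct_ii`, and `compress_trajectory` are hypothetical. The sketch takes one coordinate of one joint over time, keeps only the first few low-frequency coefficients, and reconstructs a smoothed trajectory from them.

```python
import math


def dct_ii(signal):
    """Orthonormal DCT-II of a 1D list of floats."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(signal)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


def idct_ii(coeffs, n):
    """Inverse of dct_ii; missing high-frequency coefficients count as zero."""
    padded = list(coeffs) + [0.0] * (n - len(coeffs))
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(padded):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def compress_trajectory(signal, keep):
    """Keep only the `keep` lowest-frequency DCT coefficients."""
    return dct_ii(signal)[:keep]


# One joint coordinate over 8 frames: a smooth motion plus small jitter.
trajectory = [0.0, 1.1, 1.9, 3.05, 4.0, 4.9, 6.1, 7.0]
compact = compress_trajectory(trajectory, keep=3)
smoothed = idct_ii(compact, len(trajectory))
print(len(compact), smoothed)
```

Under these assumptions, a sequence of many frames is summarized by a handful of coefficients, and dropping the high-frequency terms discards much of the frame-to-frame jitter, which matches the abstract's stated goals of a larger receptive field at low cost and robustness to noisy 2D detections.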
