Abhishek Kar

Accelerating Neural Field Training via Soft Mining

Nov 29, 2023
Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi

We present an approach to accelerate Neural Field training by efficiently selecting sampling locations. While Neural Fields have recently become popular, they are typically trained by sampling the training domain uniformly or through handcrafted heuristics. We show that improved convergence and final training quality can be achieved with a soft mining technique based on importance sampling: rather than either considering or completely ignoring a pixel, we weight the corresponding loss by a scalar. We implement this idea with Langevin Monte Carlo sampling. We show that, as a result, regions with higher error are selected more frequently, leading to a more than 2x improvement in convergence speed. The code and related resources for this study are publicly available at https://ubc-vision.github.io/nf-soft-mining/.
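
The core loop can be pictured with a short sketch. This is a minimal PyTorch illustration, not the released code: `field` and `target` are assumed to be differentiable samplers over a 2D domain, and the Langevin update and importance weights only loosely follow the paper.

    import torch

    # Minimal sketch of soft mining for a field fit on a 2D domain (assumptions:
    # `field` and `target` are differentiable samplers mapping coordinates in
    # [0, 1]^2 to colors; the update rule and weights follow the paper loosely).
    def soft_mined_step(field, target, coords, opt, step_size=1e-3, noise_scale=1e-2):
        coords = coords.detach().requires_grad_(True)
        per_sample = (field(coords) - target(coords)).pow(2).mean(dim=-1)

        # Langevin Monte-Carlo move on the sample locations: drift toward
        # higher-error regions plus Gaussian exploration noise.
        grad = torch.autograd.grad(per_sample.sum(), coords, retain_graph=True)[0]
        new_coords = (coords + step_size * grad
                      + noise_scale * torch.randn_like(coords)).clamp(0.0, 1.0)

        # Soft mining: every sample is kept, but its loss is weighted by a
        # (detached) importance factor instead of being hard-selected or dropped.
        weights = 1.0 / (per_sample.detach() + 1e-8)
        weights = weights / weights.mean()
        loss = (weights * per_sample).mean()

        opt.zero_grad()
        loss.backward()
        opt.step()
        return new_coords.detach()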

LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

Jun 08, 2023
Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia

A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, offering an alternative to off-the-shelf SfM pipelines, which have well-understood failure modes. Existing approaches for unposed NeRF operate under limited assumptions, such as a prior pose distribution or coarse pose initialization, making them less effective in a general setting. In this work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses and neural radiance fields with relaxed assumptions on pose configuration. Our approach operates in a local-to-global manner: we first optimize over local subsets of the data, dubbed mini-scenes, for which LU-NeRF estimates local pose and geometry in this challenging few-shot setting. The mini-scene poses are then brought into a global reference frame through a robust pose synchronization step, after which a final global optimization of pose and scene can be performed. We show that our LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making restrictive assumptions on the pose prior, which allows us to operate in the general SE(3) pose setting, unlike the baselines. Our results also indicate that our model can be complementary to feature-based SfM pipelines, as it compares favorably to COLMAP on low-texture and low-resolution images.

* Project website: https://people.cs.umass.edu/~zezhoucheng/lu-nerf/ 
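
To make the local-to-global step concrete, here is a small sketch of chaining mini-scene poses into one global frame. It only illustrates the composition; the paper's synchronization is a robust estimation step, and `rel` (4x4 relative transforms between overlapping mini-scenes) is an assumed input format.

    import numpy as np
    from collections import deque

    # Illustration of composing mini-scene poses into a single global frame by
    # chaining relative transforms over a spanning tree of the mini-scene graph.
    # `rel` maps (i, j) to a 4x4 transform T_i_j taking points from mini-scene
    # j's frame into mini-scene i's frame (an assumed convention).
    def synchronize(num_scenes, rel):
        adj = {}
        for (i, j), T in rel.items():
            adj.setdefault(i, []).append((j, T))
            adj.setdefault(j, []).append((i, np.linalg.inv(T)))
        world = {0: np.eye(4)}                   # anchor everything to mini-scene 0
        queue = deque([0])
        while queue:
            i = queue.popleft()
            for j, T_i_j in adj.get(i, []):
                if j not in world:
                    world[j] = world[i] @ T_i_j  # x_world = G_i @ T_i_j @ x_j
                    queue.append(j)
        return [world.get(k) for k in range(num_scenes)]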

The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Jun 02, 2023
Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, David J. Fleet

Denoising diffusion probabilistic models have transformed image generation with their impressive fidelity and diversity. We show that they also excel at estimating optical flow and monocular depth, surprisingly without the task-specific architectures and loss functions that are predominant for these tasks. Compared to the point estimates of conventional regression-based methods, diffusion models also enable Monte Carlo inference, e.g., capturing uncertainty and ambiguity in flow and depth. With self-supervised pre-training, the combined use of synthetic and real data for supervised training, technical innovations (infilling and step-unrolled denoising diffusion training) to handle noisy-incomplete training data, and a simple form of coarse-to-fine refinement, one can train state-of-the-art diffusion models for depth and optical flow estimation. Extensive experiments focus on quantitative performance against benchmarks, ablations, and the model's ability to capture uncertainty and multimodality and to impute missing values. Our model, DDVM (Denoising Diffusion Vision Model), obtains a state-of-the-art relative depth error of 0.074 on the indoor NYU benchmark and an Fl-all outlier rate of 3.26% on the KITTI optical flow benchmark, about 25% better than the best published method. For an overview see https://diffusion-vision.github.io.
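
One of the listed ingredients, handling noisy-incomplete ground truth, can be sketched as a masked training loss with a crude infill. This is a loose PyTorch sketch under assumed names (`model`, `valid`, `alphas_cumprod`), not the paper's exact recipe.

    import torch
    import torch.nn.functional as F

    # Loose sketch of training with noisy-incomplete depth: crude infilling plus
    # a loss masked to valid pixels. `model` (predicts clean depth from a noised
    # depth map and RGB conditioning), `valid`, and `alphas_cumprod` are assumed
    # names; the paper's infilling and parameterization are more involved.
    def masked_diffusion_loss(model, rgb, depth, valid, t, alphas_cumprod):
        filled = depth.clone()
        filled[~valid] = depth[valid].mean()     # fill holes so the target is dense

        a = alphas_cumprod[t].view(-1, 1, 1, 1)  # standard forward-diffusion noising
        noisy = a.sqrt() * filled + (1 - a).sqrt() * torch.randn_like(filled)

        pred = model(noisy, rgb, t)              # predict the clean depth map
        per_pixel = F.l1_loss(pred, depth, reduction="none")
        return (per_pixel * valid).sum() / valid.sum().clamp(min=1)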

Unsupervised Semantic Correspondence Using Stable Diffusion

May 24, 2023
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real ones. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences: locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (by 20.9% relative on SPair-71k) any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k datasets.
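
The prompt-optimization idea can be sketched as follows. Here `attention_map` is a hypothetical helper returning a normalized cross-attention map of a token embedding over an image; the real pipeline aggregates attention across UNet layers and diffusion timesteps of a pre-trained diffusion model.

    import torch

    # Conceptual sketch of the prompt-optimization loop. `attention_map(image,
    # emb)` is a hypothetical helper returning a normalized (H, W) cross-attention
    # map for a token embedding; details of the actual pipeline are not shown.
    def find_correspondence(attention_map, src_img, tgt_img, src_xy, steps=100):
        emb = torch.randn(1, 768, requires_grad=True)   # one learnable token
        opt = torch.optim.Adam([emb], lr=1e-2)
        for _ in range(steps):
            attn = attention_map(src_img, emb)
            # Maximize attention mass at the query location in the source image.
            loss = -torch.log(attn[src_xy[1], src_xy[0]] + 1e-8)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                           # transfer to the other image
            tgt_attn = attention_map(tgt_img, emb)
        y, x = divmod(int(tgt_attn.argmax()), tgt_attn.shape[1])
        return x, y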

$\text{DC}^2$: Dual-Camera Defocus Control by Learning to Refocus

Apr 06, 2023
Hadi Alzayer, Abdullah Abuolaim, Leung Chun Chan, Yang Yang, Ying Chen Lou, Jia-Bin Huang, Abhishek Kar

Smartphone cameras today are increasingly approaching the versatility and quality of professional cameras through a combination of hardware and software advancements. However, a fixed aperture remains a key limitation, preventing users from controlling the depth of field (DoF) of captured images. At the same time, many smartphones now have multiple cameras with different fixed apertures: specifically, an ultra-wide camera with a wider field of view and deeper DoF, and a higher-resolution primary camera with a shallower DoF. In this work, we propose $\text{DC}^2$, a system for defocus control that synthetically varies camera aperture and focus distance and produces arbitrary defocus effects by fusing information from such a dual-camera system. Our key insight is to leverage a real-world smartphone camera dataset, using image refocus as a proxy task for learning to control defocus. Quantitative and qualitative evaluations on real-world data demonstrate our system's efficacy: we outperform the state of the art on defocus deblurring, bokeh rendering, and image refocus. Finally, we demonstrate creative post-capture defocus control enabled by our method, including tilt-shift and content-based defocus effects.

* CVPR 2023. See the project page at https://defocus-control.github.io 
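
As background for what synthetically varying aperture and focus distance means physically (and not as part of the learned system above), the thin-lens circle-of-confusion model gives the per-pixel defocus size; all camera parameters below are illustrative defaults.

    import numpy as np

    # Thin-lens circle-of-confusion (diameter, in pixels) for a scene point at
    # depth `depth_m` when the lens is focused at `focus_dist_m`. All camera
    # parameters are illustrative defaults, not those of any particular phone.
    def coc_px(depth_m, focus_dist_m, focal_len_mm=6.0, f_number=1.8,
               sensor_width_mm=7.0, image_width_px=4000):
        f = focal_len_mm * 1e-3
        aperture = f / f_number                                   # aperture diameter (m)
        s1, s2 = focus_dist_m, np.asarray(depth_m, dtype=float)
        coc_m = aperture * np.abs(s2 - s1) / s2 * f / (s1 - f)    # CoC on the sensor
        return coc_m / (sensor_width_mm * 1e-3) * image_width_px  # CoC in pixels

A smaller f-number (wider aperture) or a point farther from the focus distance yields a larger circle of confusion, i.e., a shallower apparent DoF.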

ASIC: Aligning Sparse in-the-wild Image Collections

Mar 28, 2023
Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar

We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither assumption holds for the long tail of objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object or object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) as noisy and sparse keypoint matches, and densify and refine them by optimizing a neural network that jointly maps the image collection into a learned canonical grid. Experiments on the CUB and SPair-71k benchmarks demonstrate that our method produces globally consistent and higher-quality correspondences across the image collection compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic.

* Web: https://kampta.github.io/asic 
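
The starting point, noisy keypoint matches from pre-trained ViT features, can be sketched as mutual nearest neighbors between patch descriptors; the learned canonical-grid mapping that densifies and refines them is not shown.

    import torch

    # Sketch of the noisy starting matches: mutual nearest neighbors between
    # (N, D) and (M, D) patch descriptors from a pre-trained ViT.
    def mutual_nearest_neighbors(feat_a, feat_b):
        a = torch.nn.functional.normalize(feat_a, dim=-1)
        b = torch.nn.functional.normalize(feat_b, dim=-1)
        sim = a @ b.t()                    # cosine similarity, shape (N, M)
        ab = sim.argmax(dim=1)             # best b-patch for each a-patch
        ba = sim.argmax(dim=0)             # best a-patch for each b-patch
        idx_a = torch.arange(a.shape[0])
        keep = ba[ab] == idx_a             # keep only mutual matches
        return idx_a[keep], ab[keep]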

Monocular Depth Estimation using Diffusion Models

Feb 28, 2023
Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, David J. Fleet

We formulate monocular depth estimation using denoising diffusion models, inspired by their recent success in high-fidelity image generation. To that end, we introduce innovations to address problems arising from noisy, incomplete depth maps in the training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset and near-SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance, combined with depth imputation, enables a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io
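
Step-unrolled denoising can be sketched roughly as follows: build the noisy training input from the model's own prediction rather than from the (infilled) ground truth, so training inputs better match those seen at sampling time. Names (`model`, `alphas_cumprod`) are assumptions and the details differ from the paper.

    import torch

    # Rough sketch of step-unrolled denoising: noise the (infilled) target, take
    # one denoising step without gradients, then re-noise the model's own output
    # to form the training input. The loss on a second forward pass from this
    # input is not shown.
    def step_unrolled_input(model, rgb, filled_depth, t, alphas_cumprod):
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy = a.sqrt() * filled_depth + (1 - a).sqrt() * torch.randn_like(filled_depth)
        with torch.no_grad():
            pred = model(noisy, rgb, t)    # one unrolled denoising step
        return a.sqrt() * pred + (1 - a).sqrt() * torch.randn_like(pred)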

SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections

May 31, 2022
Mark Boss, Andreas Engelhardt, Abhishek Kar, Yuanzhen Li, Deqing Sun, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani

Inverse rendering of an object under entirely unknown capture conditions is a fundamental challenge in computer vision and graphics. Neural approaches such as NeRF have achieved photorealistic results on novel view synthesis, but they require known camera poses. Solving this problem with unknown camera poses is highly challenging as it requires joint optimization over shape, radiance, and pose. The problem is exacerbated when the input images are captured in the wild with varying backgrounds and illuminations. Standard pose estimation techniques fail on such in-the-wild image collections because very few correspondences can be estimated across images. Furthermore, NeRF cannot relight a scene under novel illumination, as it operates on radiance (the product of reflectance and illumination). We propose a joint optimization framework to estimate the shape, BRDF, and per-image camera pose and illumination. Our method works on in-the-wild online image collections of an object and produces relightable 3D assets for several use cases such as AR/VR. To our knowledge, our method is the first to tackle this severely unconstrained task with minimal user interaction. Project page: https://markboss.me/publication/2022-samurai/ Video: https://youtu.be/LlYuGDjXp-8
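
At a high level, the joint optimization takes gradient steps over per-image cameras, per-image illumination codes, and the shape/BRDF network together; the sketch below shows only that outer loop, with `render` standing in for a differentiable renderer and all dimensions chosen as placeholders.

    import torch

    # Outer loop of a joint optimization over per-image cameras, per-image
    # illumination codes, and the shape/BRDF network. `render` stands in for a
    # differentiable renderer and all dimensions are placeholders.
    def fit(render, model, images, steps=10000):
        poses = torch.nn.Parameter(torch.zeros(len(images), 6))    # se(3) per image
        illum = torch.nn.Parameter(torch.zeros(len(images), 16))   # light code per image
        opt = torch.optim.Adam([
            {"params": model.parameters(), "lr": 1e-3},
            {"params": [poses, illum], "lr": 1e-4},                # gentler for cameras
        ])
        for step in range(steps):
            i = step % len(images)
            pred = render(model, poses[i], illum[i])
            loss = (pred - images[i]).abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return poses, illum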

Learned Monocular Depth Priors in Visual-Inertial Initialization

Apr 20, 2022
Yunwen Zhou, Abhishek Kar, Eric Turner, Adarsh Kowdle, Chao X. Guo, Ryan C. DuToit, Konstantine Tsotsos

Visual-inertial odometry (VIO) is the pose estimation backbone for most AR/VR and autonomous robotic systems today, in both academia and industry. However, these systems are highly sensitive to the initialization of key parameters such as sensor biases, gravity direction, and metric scale. In practical scenarios where high-parallax or variable-acceleration assumptions are rarely met (e.g., a hovering aerial robot, or a smartphone AR user not gesticulating with the phone), classical visual-inertial initialization formulations often become ill-conditioned and/or fail to meaningfully converge. In this paper we target visual-inertial initialization specifically for these low-excitation scenarios critical to in-the-wild usage. We propose to circumvent the limitations of classical visual-inertial structure-from-motion (SfM) initialization by incorporating a new learning-based measurement as a higher-level input. We leverage learned monocular depth images (mono-depth) to constrain the relative depth of features, and upgrade the mono-depth to metric scale by jointly optimizing for its scale and shift. Our experiments show a significant improvement in problem conditioning compared to a classical formulation for visual-inertial initialization, and demonstrate significant accuracy and robustness improvements relative to the state of the art on public benchmarks, particularly under motion-restricted scenarios. We further integrate our improved initialization into an existing odometry system to illustrate its impact on the resulting tracking trajectories.
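
In isolation, the scale-and-shift upgrade of mono-depth to metric scale is a small least-squares problem. A standalone sketch (the paper solves for these jointly inside the full initialization), assuming `mono` and `metric` are depths at the same tracked features:

    import numpy as np

    # Standalone least-squares fit of metric ~= s * mono + b, given up-to-scale
    # mono-depth values and sparse metric depths at the same tracked features.
    def fit_scale_shift(mono, metric):
        A = np.stack([np.asarray(mono, dtype=float), np.ones(len(mono))], axis=1)
        (s, b), *_ = np.linalg.lstsq(A, np.asarray(metric, dtype=float), rcond=None)
        return s, b

    # Usage: s, b = fit_scale_shift(mono_at_features, triangulated_depths)
    #        metric_depth_map = s * mono_depth_map + b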

SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting

Sep 02, 2021
Varun Jampani, Huiwen Chang, Kyle Sargent, Abhishek Kar, Richard Tucker, Michael Krainin, Dominik Kaeser, William T. Freeman, David Salesin, Brian Curless, Ce Liu

Single image 3D photography enables viewing a still image from novel viewpoints. Recent approaches combine monocular depth networks with inpainting networks to achieve compelling results. A drawback of these techniques is the use of hard depth layering, making them unable to model intricate appearance details such as thin hair-like structures. We present SLIDE, a modular and unified system for single image 3D photography that uses a simple yet effective soft layering strategy to better preserve appearance details in novel views. In addition, we propose a novel depth-aware training strategy for our inpainting module, better suited for the 3D photography task. The resulting SLIDE approach is modular, enabling the use of other components such as segmentation and matting for improved layering. At the same time, SLIDE uses an efficient layered depth formulation that requires only a single forward pass through the component networks to produce high-quality 3D photos. Extensive experimental analysis on three view-synthesis datasets, combined with user studies on in-the-wild image collections, demonstrates the superior performance of our technique over existing strong baselines while being conceptually much simpler. Project page: https://varunjampani.github.io/slide

* ICCV 2021 (Oral); Project page: https://varunjampani.github.io/slide ; Video: https://www.youtube.com/watch?v=RQio7q-ueY8 
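
At compositing time, the soft-layering idea reduces to a soft alpha blend of the re-projected foreground layer over the inpainted background; a toy sketch with assumed layer and matte arrays:

    import numpy as np

    # fg, bg: (H, W, 3) foreground and inpainted background layers rendered at
    # the novel view; alpha: (H, W) soft matte in [0, 1].
    def composite(fg, bg, alpha):
        a = alpha[..., None]
        return a * fg + (1.0 - a) * bg   # soft "over" blend preserves thin structures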