David F. Fouhey

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Sep 21, 2023
Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions about their environment. Existing approaches often rely on extensive labeled data or struggle to handle complex language queries; we instead propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder uses an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method requires no labeled training data and generalizes to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website: https://chat-with-nerf.github.io/
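Below is a minimal sketch of the agent loop described above, assuming an LLM call that decomposes a query and an open-vocabulary grounding tool that proposes candidate objects. The function names (parse_query_with_llm, ground_noun_phrase, score_relation), the toy candidates, and the spatial scoring are hypothetical placeholders for illustration, not the authors' implementation.

```python
import numpy as np

def parse_query_with_llm(query: str) -> dict:
    """Placeholder for the LLM call that decomposes a query into a target
    noun phrase, a landmark noun phrase, and a spatial relation."""
    # A real system would prompt an LLM here; we hard-code one example parse.
    return {"target": "chair", "landmark": "window", "relation": "near"}

def ground_noun_phrase(phrase: str) -> np.ndarray:
    """Placeholder for an open-vocabulary grounding tool (e.g. OpenScene or
    LERF) returning candidate 3D object centers, shape (N, 3)."""
    rng = np.random.default_rng(len(phrase))      # toy, deterministic candidates
    return rng.uniform(-3.0, 3.0, size=(5, 3))

def score_relation(targets: np.ndarray, landmark: np.ndarray, relation: str) -> np.ndarray:
    """Toy spatial scoring: 'near' prefers candidates close to the landmark."""
    d = np.linalg.norm(targets - landmark, axis=-1)
    return -d if relation == "near" else d

query = "the chair near the window"
plan = parse_query_with_llm(query)
targets = ground_noun_phrase(plan["target"])
landmark = ground_noun_phrase(plan["landmark"]).mean(axis=0)  # crude landmark center
scores = score_relation(targets, landmark, plan["relation"])
print("grounded target center:", targets[np.argmax(scores)])
```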

* Project website: https://chat-with-nerf.github.io/ 

Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data

Jun 14, 2023
Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey

We introduce a method that learns to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data arriving with new phones. Our system, D2-DRDF, matches and sometimes outperforms current methods that use mesh supervision, and is more robust to sparse data.
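As a rough illustration of how posed RGBD frames can supervise an implicit function without meshes, the sketch below projects a query point into a depth frame and compares its depth along the camera ray with the observed depth. This is only the generic supervision idea under assumed conventions, not the paper's exact DRDF formulation or loss.

```python
import numpy as np

def depth_supervision(point_w, K, R, t, depth_map):
    """Project a world point into a posed depth frame and return
    (observed depth - point depth), or None if it falls outside the frame.
    Positive => the point lies in observed free space; negative => behind the surface."""
    p_cam = R @ point_w + t                 # world -> camera
    if p_cam[2] <= 0:                       # behind the camera
        return None
    uvw = K @ p_cam
    u, v = int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])
    h, w = depth_map.shape
    if not (0 <= u < w and 0 <= v < h):
        return None
    return depth_map[v, u] - p_cam[2]

# Toy frame: identity pose, a flat surface 2 m in front of the camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)
print(depth_supervision(np.array([0.0, 0.0, 1.5]), K, np.eye(3), np.zeros(3), depth))  # 0.5
```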

* Project page: https://nileshkulkarni.github.io/d2drdf/ 

Understanding 3D Object Interaction from a Single Image

May 16, 2023
Shengyi Qian, David F. Fouhey

From a single image, humans can easily identify the objects it depicts and the interactions they afford. We use this skill to plan our interactions with the world and to understand new objects without having to interact with them. In this paper, we aim to endow machines with a similar ability, so that intelligent agents can better explore a 3D scene or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordances of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model performs strongly on our data and generalizes well to robotics data.
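A schematic, purely illustrative version of such a multi-head transformer predictor is sketched below; the token layout, pooling, head dimensions, and property outputs are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class InteractionPredictor(nn.Module):
    """Toy transformer encoder with separate heads for 3D location,
    physical properties, and affordances (dimensions are assumptions)."""
    def __init__(self, dim=256, n_tokens=196, n_affordances=7):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.loc_head = nn.Linear(dim, 3)              # 3D location (x, y, z)
        self.phys_head = nn.Linear(dim, 2)             # e.g. movable / fixed logits
        self.aff_head = nn.Linear(dim, n_affordances)  # affordance logits

    def forward(self, image_tokens):                   # (B, n_tokens, dim) backbone features
        x = self.encoder(image_tokens + self.pos_embed)
        pooled = x.mean(dim=1)                         # simple pooling for illustration
        return self.loc_head(pooled), self.phys_head(pooled), self.aff_head(pooled)

model = InteractionPredictor()
loc, phys, aff = model(torch.randn(2, 196, 256))       # stand-in for image features
print(loc.shape, phys.shape, aff.shape)
```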


Perspective Fields for Single Image Camera Calibration

Dec 06, 2022
Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, David F. Fouhey

Geometric camera calibration is often required for applications that understand the perspective of the image. We propose Perspective Fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages: it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations such as cropping, warping, and rotation. It is also more interpretable and better aligned with human perception. We train a neural network to predict Perspective Fields, and the predictions can easily be converted to calibration parameters. We demonstrate the robustness of our approach in various scenarios compared with camera calibration-based methods and show example applications in image compositing.
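The representation can be made concrete for a calibrated pinhole camera. The sketch below computes a perspective field (per-pixel up vector and latitude) from an assumed focal length, pitch, and roll, using a finite difference to project world up into the image; the coordinate conventions and rotation order are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def up_in_camera(pitch, roll):
    """World up expressed in camera coordinates (x right, y down, z forward)."""
    up = np.array([0.0, -1.0, 0.0])                         # up for a level camera
    cp, sp, cr, sr = np.cos(pitch), np.sin(pitch), np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch about the x-axis
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll about the z-axis
    return Rz @ Rx @ up

def perspective_field(h, w, f, pitch, roll, eps=1e-3):
    cx, cy = w / 2.0, h / 2.0
    up = up_in_camera(pitch, roll)
    v, u = np.mgrid[0:h, 0:w].astype(float)
    rays = np.stack([(u - cx) / f, (v - cy) / f, np.ones_like(u)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    latitude = np.arcsin(rays @ up)                          # angle of each ray above the horizon

    def project(p):                                          # pinhole projection to pixels
        return np.stack([f * p[..., 0] / p[..., 2] + cx,
                         f * p[..., 1] / p[..., 2] + cy], axis=-1)

    up_vec = project(rays + eps * up) - project(rays)        # where a point moves if pushed along world up
    up_vec /= np.linalg.norm(up_vec, axis=-1, keepdims=True)
    return up_vec, latitude

up_vec, lat = perspective_field(480, 640, f=500.0, pitch=0.1, roll=0.05)
print(up_vec.shape, lat.shape, np.degrees(lat[240, 320]))
```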

* Project Page: https://jinlinyi.github.io/PerspectiveFields/ 

Large-Scale Spatial Cross-Calibration of Hinode/SOT-SP and SDO/HMI

Sep 29, 2022
David F. Fouhey, Richard E. L. Higgins, Spiro K. Antiochos, Graham Barnes, Marc L. DeRosa, J. Todd Hoeksema, K. D. Leka, Yang Liu, Peter W. Schuck, Tamas I. Gombosi

We investigate the cross-calibration of the Hinode/SOT-SP and SDO/HMI instrument meta-data, specifically the correspondence of the scaling and pointing information. Accurate calibration of these datasets gives the correspondence needed by inter-instrument studies and learning-based magnetogram systems, and is required for physically meaningful photospheric magnetic field vectors. We approach the problem by robustly fitting geometric models on correspondences between images from each instrument's pipeline. This technique is common in computer vision, but several critical details are required when using scanning slit spectrograph data like Hinode/SOT-SP. We apply this technique to data spanning a decade of the Hinode mission. Our results suggest corrections to the published Level 2 Hinode/SOT-SP data. First, an analysis of approximately 2,700 scans suggests that the reported pixel size in Hinode/SOT-SP Level 2 data is incorrect by around 1%. Second, analysis of over 12,000 scans shows that the pointing information is often incorrect by dozens of arcseconds, with a strong bias. Regression of these corrections indicates that thermal effects have caused secular and cyclic drift in Hinode/SOT-SP pointing data over its mission. We offer two solutions. First, direct co-alignment with SDO/HMI data via our procedure can improve alignments for many Hinode/SOT-SP scans. Second, since the pointing errors are predictable, simple post-hoc corrections can substantially improve the pointing. We conclude by illustrating the impact of this updated calibration on derived physical data products needed for research and interpretation. Among other things, our results suggest that the pointing errors induce a hemispheric bias in estimates of radial current density.
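The sketch below shows the generic robust-fitting ingredient on toy data: a RANSAC fit of a 2D similarity transform (scale, rotation, translation) to noisy point correspondences with outliers. It illustrates the technique only; it is not the Hinode/SOT-SP-to-SDO/HMI pipeline, which must also handle scanning-slit specifics, and all thresholds and data here are made up.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity x' = a*x - b*y + tx, y' = b*x + a*y + ty."""
    A = np.zeros((2 * len(src), 4))
    A[0::2, 0], A[0::2, 1], A[0::2, 2] = src[:, 0], -src[:, 1], 1.0
    A[1::2, 0], A[1::2, 1], A[1::2, 3] = src[:, 1],  src[:, 0], 1.0
    return np.linalg.lstsq(A, dst.reshape(-1), rcond=None)[0]   # (a, b, tx, ty)

def apply_similarity(p, src):
    a, b, tx, ty = p
    return np.stack([a * src[:, 0] - b * src[:, 1] + tx,
                     b * src[:, 0] + a * src[:, 1] + ty], axis=1)

def ransac_similarity(src, dst, thresh=3.0, iters=500, seed=0):
    rng, best_inl = np.random.default_rng(seed), None
    for _ in range(iters):
        idx = rng.choice(len(src), 2, replace=False)            # two points fix a similarity
        p = fit_similarity(src[idx], dst[idx])
        err = np.linalg.norm(apply_similarity(p, src) - dst, axis=1)
        inl = err < thresh
        if best_inl is None or inl.sum() > best_inl.sum():
            best_inl = inl
    return fit_similarity(src[best_inl], dst[best_inl]), best_inl   # refit on inliers

# Toy correspondences: a 1% scale change, slight rotation, a shift, and 20% outliers.
rng = np.random.default_rng(1)
src = rng.uniform(0, 1000, (200, 2))
true = np.array([1.01 * np.cos(0.002), 1.01 * np.sin(0.002), 30.0, -12.0])
dst = apply_similarity(true, src) + rng.normal(0, 0.5, (200, 2))
dst[:40] += rng.uniform(-80, 80, (40, 2))
p, inliers = ransac_similarity(src, dst)
print("recovered (a, b, tx, ty):", np.round(p, 3), "| inliers:", int(inliers.sum()))
```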

* Under revision at ApJS 

The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs

Aug 18, 2022
Chris Rockwell, Justin Johnson, David F. Fouhey

We present a simple baseline for directly estimating the relative pose (rotation and translation, including scale) between two images. Deep methods have recently shown strong progress but often require complex or multi-stage architectures. We show that a handful of modifications can be applied to a Vision Transformer (ViT) to bring its computations close to the Eight-Point Algorithm. This inductive bias enables a simple method to be competitive in multiple settings, often substantially improving over the state of the art with strong performance gains in limited data regimes.
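For reference, the classical (unlearned) Eight-Point Algorithm the title alludes to can be written in a few lines. The sketch below recovers an essential matrix from synthetic, noise-free correspondences in normalized camera coordinates; the paper's contribution, biasing a ViT toward this computation, is not reproduced here.

```python
import numpy as np

def eight_point(x1, x2):
    """Essential matrix from >= 8 correspondences in normalized camera coordinates,
    satisfying x2^T E x1 = 0."""
    u1, v1, u2, v2 = x1[:, 0], x1[:, 1], x2[:, 0], x2[:, 1]
    A = np.stack([u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, np.ones_like(u1)], axis=1)
    E = np.linalg.svd(A)[2][-1].reshape(3, 3)        # null vector of the constraint matrix
    U, _, Vt = np.linalg.svd(E)                      # project onto the essential manifold
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Synthetic two-view setup with a known relative pose (R, t).
rng = np.random.default_rng(0)
P = rng.uniform(-1, 1, (20, 3)) + np.array([0.0, 0.0, 4.0])   # points in camera-1 frame
ang = 0.3
R = np.array([[np.cos(ang), 0, np.sin(ang)], [0, 1, 0], [-np.sin(ang), 0, np.cos(ang)]])
t = np.array([0.5, 0.1, 0.0])
P2 = P @ R.T + t                                              # same points in camera-2 frame
x1, x2 = P[:, :2] / P[:, 2:], P2[:, :2] / P2[:, 2:]           # normalized image coordinates

E_hat = eight_point(x1, x2)
E_true = np.cross(np.eye(3), t) @ R                           # [t]_x R, up to scale
cos = np.sum((E_hat / np.linalg.norm(E_hat)) * (E_true / np.linalg.norm(E_true)))
print("agreement with ground truth (|cosine|, 1.0 is perfect):", round(abs(cos), 4))
```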

* Accepted to 3DV 2022; Project Page: https://crockwell.github.io/rel_pose/ 

PlaneFormers: From Sparse View Planes to 3D Reconstruction

Aug 08, 2022
Samir Agarwala, Linyi Jin, Chris Rockwell, David F. Fouhey

We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success.
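A purely illustrative sketch of the "plane token" idea follows: each detected plane contributes a token built from its geometry (normal and offset) plus an appearance feature, a transformer mixes tokens from both views, and a dot-product head scores cross-view matches. The dimensions, layer counts, and scoring head are assumptions, not the PlaneFormer architecture.

```python
import torch
import torch.nn as nn

class PlaneTokenMatcher(nn.Module):
    """Toy cross-view plane matcher: tokens = plane geometry + appearance,
    mixed by a transformer, scored by dot products (dimensions assumed)."""
    def __init__(self, feat_dim=128, dim=256):
        super().__init__()
        self.embed = nn.Linear(4 + feat_dim, dim)   # (unit normal, offset) + appearance
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, planes1, feats1, planes2, feats2):
        t1 = self.embed(torch.cat([planes1, feats1], dim=-1))  # (B, N1, dim)
        t2 = self.embed(torch.cat([planes2, feats2], dim=-1))  # (B, N2, dim)
        x = self.encoder(torch.cat([t1, t2], dim=1))           # joint reasoning over both views
        x1, x2 = x[:, : t1.shape[1]], x[:, t1.shape[1] :]
        return torch.einsum("bnd,bmd->bnm", x1, x2)            # (B, N1, N2) match scores

m = PlaneTokenMatcher()
scores = m(torch.randn(1, 5, 4), torch.randn(1, 5, 128),       # 5 planes in view 1
           torch.randn(1, 6, 4), torch.randn(1, 6, 128))       # 6 planes in view 2
print(scores.shape)                                            # torch.Size([1, 5, 6])
```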

* Accepted to ECCV 2022 

Sound Localization by Self-Supervised Time Delay Estimation

Apr 26, 2022
Ziyang Chen, David F. Fouhey, Andrew Owens

Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/
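For context, the classical signal-processing baseline for this task, picking the peak of the cross-correlation between the two channels, fits in the short sketch below on a toy stereo pair; the paper replaces this hand-designed estimator with a learned, self-supervised one.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def interaural_delay(left, right, sr):
    """Delay (seconds) by which `right` lags `left`, from the cross-correlation peak."""
    corr = correlate(right, left, mode="full")
    lags = correlation_lags(len(right), len(left), mode="full")
    return lags[np.argmax(corr)] / sr

# Toy stereo pair: a chirp, with the right channel delayed by 12 samples.
sr, delay = 16000, 12
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * (200 + 300 * t) * t)
left = sig
right = np.concatenate([np.zeros(delay), sig[:-delay]])
print(interaural_delay(left, right, sr) * sr)   # ~12.0 samples
```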


Understanding 3D Object Articulation in Internet Videos

Mar 30, 2022
Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, David F. Fouhey

We propose to investigate detecting and characterizing the 3D planar articulation of objects from ordinary videos. While seemingly easy for humans, this problem poses many challenges for computers. We approach it by combining a top-down detection system that finds planes that can be articulated with an optimization approach that solves for a 3D plane that can explain a sequence of observed articulations. We show that this system can be trained on a combination of videos and 3D scan datasets. When tested on a dataset of challenging Internet videos and the Charades dataset, our approach obtains strong performance. Project site: https://jasonqsy.github.io/Articulation3D
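One piece of such an optimization can be illustrated on toy data: if a planar part rotates about a fixed hinge, its normal keeps a constant angle to the hinge axis, so differences of observed normals are orthogonal to that axis and the axis can be recovered as their null direction. The sketch below is a simplification for illustration, not the paper's full optimization.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about `axis`."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

rng = np.random.default_rng(0)
true_axis = np.array([0.0, 1.0, 0.2])
true_axis /= np.linalg.norm(true_axis)
n0 = np.array([1.0, 0.0, 0.0])                                  # initial plane normal
normals = np.stack([rodrigues(true_axis, th) @ n0 for th in np.linspace(0.0, 1.2, 15)])
normals += rng.normal(0, 0.01, normals.shape)                   # observation noise

D = normals[1:] - normals[0]                                    # rows ~ orthogonal to the hinge axis
axis = np.linalg.svd(D)[2][-1]                                  # smallest right singular vector
axis *= np.sign(axis @ true_axis)                               # fix the sign for comparison
print("recovered:", np.round(axis, 3), "true:", np.round(true_axis, 3))
```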

* CVPR 2022 

What's Behind the Couch? Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction

Dec 08, 2021
Nilesh Kulkarni, Justin Johnson, David F. Fouhey

We present an approach for scene-level 3D reconstruction, including occluded regions, from an unseen RGB image. Our approach is trained on real 3D scans and images. This problem has proved difficult for multiple reasons: real scans are not watertight, precluding many methods; distances in scenes require reasoning across objects, making the problem harder still; and, as we show, uncertainty about surface locations drives networks to produce outputs that lack basic distance-function properties. We propose a new distance-like function that can be computed on unstructured scans and behaves well under uncertainty about surface location. Computing this function over rays reduces the complexity further. We train a deep network to predict this function and show it outperforms other methods on Matterport3D, 3D Front, and ScanNet.
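The flavor of a ray-based distance function can be shown in a few lines: given the depths at which a ray intersects the scene, evaluate at each sample the signed distance along the ray to the nearest intersection. The exact DRDF definition and its treatment of uncertainty differ; the sketch below is only illustrative, with made-up intersection depths.

```python
import numpy as np

def ray_distance(z, hits):
    """z: (M,) sample depths along a ray; hits: (K,) depths where the ray hits a surface.
    Returns the signed distance along the ray to the nearest hit (+ if ahead, - if behind)."""
    diff = hits[None, :] - z[:, None]        # (M, K): positive if the hit lies ahead of z
    idx = np.argmin(np.abs(diff), axis=1)    # nearest intersection for each sample
    return diff[np.arange(len(z)), idx]

hits = np.array([2.0, 3.5, 6.0])             # e.g. couch front, couch back, wall behind it
z = np.linspace(0.0, 7.0, 8)
print(np.round(ray_distance(z, hits), 2))
```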

* Project Page: https://nileshkulkarni.github.io/scene_drdf 