Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yifan Jiang

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Dec 01, 2023

Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang

Figure 1 for FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Figure 2 for FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Figure 3 for FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Figure 4 for FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Abstract:Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: https://zehaozhu.github.io/FSGS/.

* Project page: https://zehaozhu.github.io/FSGS/

Via

Access Paper or Ask Questions

BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Oct 10, 2023

Yifan Jiang, Filip Ilievski, Kaixin Ma, Zhivar Sourati

Figure 1 for BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Figure 2 for BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Figure 3 for BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Figure 4 for BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

Abstract:The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.

Via

Access Paper or Ask Questions

Drag View: Generalizable Novel View Synthesis with Unposed Imagery

Oct 05, 2023

Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Hanwen Jiang, Dejia Xu, Zehao Zhu, Dilin Wang, Zhangyang Wang

Figure 1 for Drag View: Generalizable Novel View Synthesis with Unposed Imagery

Figure 2 for Drag View: Generalizable Novel View Synthesis with Unposed Imagery

Figure 3 for Drag View: Generalizable Novel View Synthesis with Unposed Imagery

Figure 4 for Drag View: Generalizable Novel View Synthesis with Unposed Imagery

Abstract:We introduce DragView, a novel and interactive framework for generating novel views of unseen scenes. DragView initializes the new view from a single source image, and the rendering is supported by a sparse set of unposed multi-view images, all seamlessly executed within a single feed-forward pass. Our approach begins with users dragging a source view through a local relative coordinate system. Pixel-aligned features are obtained by projecting the sampled 3D points along the target ray onto the source view. We then incorporate a view-dependent modulation layer to effectively handle occlusion during the projection. Additionally, we broaden the epipolar attention mechanism to encompass all source pixels, facilitating the aggregation of initialized coordinate-aligned point features from other unposed views. Finally, we employ another transformer to decode ray features into final pixel intensities. Crucially, our framework does not rely on either 2D prior models or the explicit estimation of camera poses. During testing, DragView showcases the capability to generalize to new scenes unseen during training, also utilizing only unposed support images, enabling the generation of photo-realistic new views characterized by flexible camera trajectories. In our experiments, we conduct a comprehensive comparison of the performance of DragView with recent scene representation networks operating under pose-free conditions, as well as with generalizable NeRFs subject to noisy test camera poses. DragView consistently demonstrates its superior performance in view synthesis quality, while also being more user-friendly. Project page: https://zhiwenfan.github.io/DragView/.

Via

Access Paper or Ask Questions

Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Oct 04, 2023

Yifan Jiang, Hao Tang, Jen-Hao Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao

Figure 1 for Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Figure 2 for Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Figure 3 for Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Figure 4 for Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

Abstract:The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plain image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn a single-image novel-view synthesizer. Motivated by our in-depth analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method.

Via

Access Paper or Ask Questions

FIRE: Food Image to REcipe generation

Aug 28, 2023

Prateek Chhikara, Dhiraj Chaurasia, Yifan Jiang, Omkar Masur, Filip Ilievski

Figure 1 for FIRE: Food Image to REcipe generation

Figure 2 for FIRE: Food Image to REcipe generation

Figure 3 for FIRE: Food Image to REcipe generation

Figure 4 for FIRE: Food Image to REcipe generation

Abstract:Food computing has emerged as a prominent multidisciplinary field of research in recent years. An ambitious goal of food computing is to develop end-to-end intelligent systems capable of autonomously producing recipe information for a food image. Current image-to-recipe methods are retrieval-based and their success depends heavily on the dataset size and diversity, as well as the quality of learned embeddings. Meanwhile, the emergence of powerful attention-based vision and language models presents a promising avenue for accurate and generalizable recipe generation, which has yet to be extensively explored. This paper proposes FIRE, a novel multimodal methodology tailored to recipe generation in the food computing domain, which generates the food title, ingredients, and cooking instructions based on input food images. FIRE leverages the BLIP model to generate titles, utilizes a Vision Transformer with a decoder for ingredient extraction, and employs the T5 model to generate recipes incorporating titles and ingredients as inputs. We showcase two practical applications that can benefit from integrating FIRE with large language model prompting: recipe customization to fit recipes to user preferences and recipe-to-code transformation to enable automated cooking processes. Our experimental findings validate the efficacy of our proposed approach, underscoring its potential for future advancements and widespread adoption in food computing.

* 5 figures, 4 tables

Via

Access Paper or Ask Questions

Wasserstein distributional robustness of neural networks

Jun 16, 2023

Xingjian Bai, Guangyi He, Yifan Jiang, Jan Obloj

Figure 1 for Wasserstein distributional robustness of neural networks

Figure 2 for Wasserstein distributional robustness of neural networks

Figure 3 for Wasserstein distributional robustness of neural networks

Figure 4 for Wasserstein distributional robustness of neural networks

Abstract:Deep neural networks are known to be vulnerable to adversarial attacks (AA). For an image recognition task, this means that a small perturbation of the original can result in the image being misclassified. Design of such attacks as well as methods of adversarial training against them are subject of intense research. We re-cast the problem using techniques of Wasserstein distributionally robust optimization (DRO) and obtain novel contributions leveraging recent insights from DRO sensitivity analysis. We consider a set of distributional threat models. Unlike the traditional pointwise attacks, which assume a uniform bound on perturbation of each input data point, distributional threat models allow attackers to perturb inputs in a non-uniform way. We link these more general attacks with questions of out-of-sample performance and Knightian uncertainty. To evaluate the distributional robustness of neural networks, we propose a first-order AA algorithm and its multi-step version. Our attack algorithms include Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) as special cases. Furthermore, we provide a new asymptotic estimate of the adversarial accuracy against distributional threat models. The bound is fast to compute and first-order accurate, offering new insights even for the pointwise AA. It also naturally yields out-of-sample performance guarantees. We conduct numerical experiments on the CIFAR-10 dataset using DNNs on RobustBench to illustrate our theoretical results. Our code is available at https://github.com/JanObloj/W-DRO-Adversarial-Methods.

* 23 pages, 6 figures, 8 tables

Via

Access Paper or Ask Questions

POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

May 25, 2023

Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Dejia Xu, Hanwen Jiang, Zhangyang Wang

Figure 1 for POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Figure 2 for POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Figure 3 for POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Figure 4 for POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference

Abstract:Despite the significant progress in six degrees-of-freedom (6DoF) object pose estimation, existing methods have limited applicability in real-world scenarios involving embodied agents and downstream 3D vision tasks. These limitations mainly come from the necessity of 3D models, closed-category detection, and a large number of densely annotated support views. To mitigate this issue, we propose a general paradigm for object pose estimation, called Promptable Object Pose Estimation (POPE). The proposed approach POPE enables zero-shot 6DoF object pose estimation for any target object in any scene, while only a single reference is adopted as the support view. To achieve this, POPE leverages the power of the pre-trained large-scale 2D foundation model, employs a framework with hierarchical feature representation and 3D geometry principles. Moreover, it estimates the relative camera pose between object prompts and the target object in new views, enabling both two-view and multi-view 6DoF pose estimation tasks. Comprehensive experimental results demonstrate that POPE exhibits unrivaled robust performance in zero-shot settings, by achieving a significant reduction in the averaged Median Pose Error by 52.38% and 50.47% on the LINEMOD and OnePose datasets, respectively. We also conduct more challenging testings in causally captured images (see Figure 1), which further demonstrates the robustness of POPE. Project page can be found with https://paulpanwang.github.io/POPE/.

Via

Access Paper or Ask Questions

The Weighted Möbius Score: A Unified Framework for Feature Attribution

May 16, 2023

Yifan Jiang, Shane Steinert-Threlkeld

Abstract:Feature attribution aims to explain the reasoning behind a black-box model's prediction by identifying the impact of each feature on the prediction. Recent work has extended feature attribution to interactions between multiple features. However, the lack of a unified framework has led to a proliferation of methods that are often not directly comparable. This paper introduces a parameterized attribution framework -- the Weighted M\"obius Score -- and (i) shows that many different attribution methods for both individual features and feature interactions are special cases and (ii) identifies some new methods. By studying the vector space of attribution methods, our framework utilizes standard linear algebra tools and provides interpretations in various fields, including cooperative game theory and causal mediation analysis. We empirically demonstrate the framework's versatility and effectiveness by applying these attribution methods to feature interactions in sentiment analysis and chain-of-thought prompting.

Via

Access Paper or Ask Questions

In-Context Learning Unlocked for Diffusion Models

May 01, 2023

Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou

Abstract:We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework, with code publicly available at https://github.com/Zhendong-Wang/Prompt-Diffusion, aims to facilitate research into in-context learning for computer vision.

Via

Access Paper or Ask Questions

Transferring Procedural Knowledge across Commonsense Tasks

Apr 26, 2023

Yifan Jiang, Filip Ilievski, Kaixin Ma

Abstract:Stories about everyday situations are an essential part of human communication, motivating the need to develop AI agents that can reliably understand these stories. Despite the long list of supervised methods for story completion and procedural understanding, current AI has no mechanisms to automatically track and explain procedures in unseen stories. To bridge this gap, we study the ability of AI models to transfer procedural knowledge to novel narrative tasks in a transparent manner. We design LEAP: a comprehensive framework that integrates state-of-the-art modeling architectures, training regimes, and augmentation strategies based on both natural and synthetic stories. To address the lack of densely annotated training data, we devise a robust automatic labeler based on few-shot prompting to enhance the augmented data. Our experiments with in- and out-of-domain tasks reveal insights into the interplay of different architectures, training regimes, and augmentation strategies. LEAP's labeler has a clear positive impact on out-of-domain datasets, while the resulting dense annotation provides native explainability.

Via

Access Paper or Ask Questions