"photo": models, code, and papers

Searching for Uncollected Litter with Computer Vision

Nov 27, 2022
Julian Hernandez, Dr. Clark Fitzgerald

This study combines photo metadata and computer vision to quantify where uncollected litter is present. Images from the Trash Annotations in Context (TACO) dataset were used to train a model to detect 10 categories of garbage. Although it worked well on smartphone photos, it struggled to process images from vehicle-mounted cameras; increasing the variety of perspectives and backgrounds in the dataset should help it generalize to unfamiliar situations. Detections are plotted onto a map which, as accuracy improves, could be used to evaluate waste management strategies and quantify trends.

* 17 pages, 6 figures 
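
The mapping step described above hinges on reading GPS coordinates from photo metadata. As a rough illustration (not the authors' code), the sketch below pulls latitude and longitude from a JPEG's EXIF data with Pillow; the file name in the usage line is hypothetical.

```python
from PIL import Image
from PIL.ExifTags import GPSTAGS

def extract_gps(path):
    """Return (lat, lon) in decimal degrees from a JPEG's EXIF GPS block, or None."""
    exif = Image.open(path)._getexif() or {}
    gps_raw = exif.get(34853)  # 34853 is the EXIF GPSInfo tag ID
    if not gps_raw:
        return None
    gps = {GPSTAGS.get(k, k): v for k, v in gps_raw.items()}

    def to_degrees(dms, ref):
        d, m, s = (float(v) for v in dms)
        sign = -1.0 if ref in ("S", "W") else 1.0
        return sign * (d + m / 60.0 + s / 3600.0)

    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

# Hypothetical usage: tag each litter detection with the photo's location before mapping it.
# print(extract_gps("litter_photo.jpg"))
```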

NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views

Nov 29, 2022
Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, Zhangyang Wang

Virtual reality and augmented reality (XR) bring increasing demand for 3D content. However, creating high-quality 3D content requires tedious work by a human expert. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360° views that correspond well with the given reference image. By conditioning on the reference image, our model can fulfill the long-standing goal of synthesizing novel views of objects from images. Our technique sheds light on a promising direction for easing the workflows of 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. By introducing a ranking loss, our NeuralLift-360 can be guided by rough depth estimates in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior so that it provides coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/

* Project page: https://vita-group.github.io/NeuralLift-360/ 
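
The abstract mentions a ranking loss that lets rough in-the-wild depth estimates guide the radiance field. Below is a minimal sketch of one common form of such a loss (pairwise ordinal supervision in PyTorch); it is not necessarily the paper's exact formulation.

```python
import torch

def depth_ranking_loss(pred_depth, rough_depth, n_pairs=1024, margin=1e-4):
    """Penalize pixel pairs whose predicted depth ordering disagrees with the
    ordering given by a rough monocular depth estimate (hinge with a margin)."""
    flat_pred, flat_rough = pred_depth.flatten(), rough_depth.flatten()
    idx = torch.randint(0, flat_pred.numel(), (2, n_pairs), device=pred_depth.device)
    p1, p2 = flat_pred[idx[0]], flat_pred[idx[1]]
    r1, r2 = flat_rough[idx[0]], flat_rough[idx[1]]
    order = torch.sign(r1 - r2)  # which pixel the rough estimate calls farther
    return torch.relu(margin - order * (p1 - p2)).mean()
```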

ManVatar : Fast 3D Head Avatar Reconstruction Using Motion-Aware Neural Voxels

Nov 23, 2022
Yuelang Xu, Lizhen Wang, Xiaochen Zhao, Hongwen Zhang, Yebin Liu

With NeRF widely used for facial reenactment, recent methods can recover a photo-realistic 3D head avatar from just a monocular video. Unfortunately, the training process of NeRF-based methods is quite time-consuming, as the MLP they use is inefficient and requires too many iterations to converge. To overcome this problem, we propose ManVatar, a fast 3D head avatar reconstruction method using Motion-Aware Neural Voxels. ManVatar is the first to decouple expression motion from canonical appearance for head avatars, modeling the expression motion with neural voxels. In particular, the motion-aware neural voxels are generated from the weighted concatenation of multiple 4D tensors. The 4D tensors semantically correspond one-to-one to the 3DMM expression bases and take the 3DMM expression coefficients as their weights. Benefiting from our novel representation, the proposed ManVatar can recover photo-realistic head avatars in just 5 minutes (implemented in pure PyTorch), which is significantly faster than state-of-the-art facial reenactment methods.

* Project Page: https://www.liuyebin.com/manvatar/manvatar.html 
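
The abstract describes the motion-aware voxels as a weighted concatenation of 4D tensors, with the weights taken from the tracked 3DMM expression coefficients. A toy reading of that construction is sketched below; the shapes and the concatenation axis are assumptions, not the authors' implementation.

```python
import torch

K, C, D, H, W = 10, 4, 32, 32, 32          # assumed sizes: K expression bases, C feature channels
voxel_bases = torch.randn(K, C, D, H, W)   # one learnable 4D feature-voxel tensor per expression basis
expr_coeffs = torch.randn(K)               # per-frame 3DMM expression coefficients from a face tracker

# Weighted concatenation: scale each basis voxel grid by its expression
# coefficient, then stack the results along the channel axis for the avatar network.
motion_voxels = torch.cat([w * v for w, v in zip(expr_coeffs, voxel_bases)], dim=0)
print(motion_voxels.shape)  # torch.Size([40, 32, 32, 32])
```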

PyNet-V2 Mobile: Efficient On-Device Photo Processing With Neural Networks

Nov 08, 2022
Andrey Ignatov, Grigory Malivenko, Radu Timofte, Yu Tseng, Yu-Syuan Xu, Po-Hsiang Yu, Cheng-Ming Chiang, Hsien-Kai Kuo, Min-Hung Chen, Chia-Ming Cheng, Luc Van Gool

The increased importance of mobile photography created a need for fast and performant RAW image processing pipelines capable of producing good visual results in spite of mobile camera sensor limitations. While deep learning-based approaches can efficiently solve this problem, their computational requirements usually remain too large for high-resolution on-device image processing. To address this limitation, we propose a novel PyNET-V2 Mobile CNN architecture designed specifically for edge devices, able to process RAW 12MP photos directly on mobile phones in under 1.5 seconds while producing high perceptual photo quality. To train and evaluate the proposed solution, we use the real-world Fujifilm UltraISP dataset consisting of thousands of RAW-RGB image pairs captured with a professional medium-format 102MP Fujifilm camera and a popular Sony mobile camera sensor. The results demonstrate that the PyNET-V2 Mobile model can substantially surpass the quality of traditional ISP pipelines, while outperforming previously introduced neural network-based solutions designed for fast image processing. Furthermore, we show that the proposed architecture is also compatible with the latest mobile AI accelerators such as NPUs or APUs, which can be used to further reduce the latency of the model to as little as 0.5 seconds. The dataset, code and pre-trained models used in this paper are available on the project website: https://github.com/gmalivenko/PyNET-v2
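
As context for the RAW-to-RGB task, a common first step in learned ISP pipelines (a generic convention, not necessarily PyNET-V2's exact preprocessing) is to pack the single-channel Bayer mosaic into a four-channel, half-resolution tensor before feeding it to the CNN:

```python
import torch

def pack_bayer_rggb(raw):
    """Pack an (H, W) RGGB Bayer mosaic into a (4, H/2, W/2) tensor:
    one channel each for the R, G1, G2 and B sensor sites."""
    return torch.stack([raw[0::2, 0::2],   # R
                        raw[0::2, 1::2],   # G1
                        raw[1::2, 0::2],   # G2
                        raw[1::2, 1::2]],  # B
                       dim=0)

# Hypothetical 12MP frame (4000 x 3000 pixels): the packed network input is 4 x 1500 x 2000.
packed = pack_bayer_rggb(torch.rand(3000, 4000))
print(packed.shape)  # torch.Size([4, 1500, 2000])
```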

High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization

Nov 29, 2022
Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, Qifeng Chen

We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views while preserving specific details of the input image. High-fidelity 3D GAN inversion is inherently challenging due to the geometry-texture trade-off, where overfitting to a single-view input image often damages the estimated geometry during latent optimization. To solve this challenge, we propose a novel pipeline that builds on pseudo-multi-view estimation with visibility analysis. We keep the original textures for the visible parts and utilize generative priors for the occluded parts. Extensive experiments show that our approach achieves advantageous reconstruction and novel view synthesis quality over state-of-the-art methods, even for images with out-of-distribution textures. The proposed pipeline also enables image attribute editing with the inverted latent code and 3D-aware texture modification. Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.

* Project website: https://ken-ouyang.github.io/HFGI3D/index.html ; Github link: https://github.com/jiaxinxie97/HFGI3D 
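
The visibility-based idea in the abstract, keeping the input's textures where they are visible and falling back to the generative prior elsewhere, can be pictured as a simple per-pixel composite. The sketch below illustrates that idea only; it is not the paper's pipeline.

```python
import torch

def composite_pseudo_view(warped_input_rgb, generated_rgb, visibility):
    """Blend a pseudo view: use the texture warped from the real input where the
    surface is visible (visibility near 1) and the 3D GAN's prediction where it
    is occluded (visibility near 0). RGB tensors: (3, H, W); visibility: (1, H, W)."""
    return visibility * warped_input_rgb + (1.0 - visibility) * generated_rgb
```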

3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

Dec 02, 2022
Zutao Jiang, Guangsong Lu, Xiaodan Liang, Jihua Zhu, Wei Zhang, Xiaojun Chang, Hang Xu

Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which offers a flexible way to visualize what we imagine. Although some works have been devoted to solving this challenging task, they either rely on explicit 3D representations (e.g., meshes), which lack texture and require post-processing to render photo-realistic views, or require time-consuming optimization for every single case. Here, we make the first attempt at generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior guidance, caption guidance and view contrastive learning are proposed to achieve better view consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted in the views-to-3D generation module to obtain an implicit 3D neural representation from the previously generated views. Our 3D-TOGO model generates 3D objects in the form of a neural radiance field with good texture and requires no time-consuming optimization for each caption. Moreover, 3D-TOGO can control the category, color and shape of generated 3D objects through the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) verify that 3D-TOGO better generates high-quality 3D objects according to the input captions across 98 categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.

* AAAI 2023  
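
The view contrastive learning term mentioned above is, in spirit, an InfoNCE-style loss that pulls together embeddings of views generated for the same caption and pushes apart views from other captions in the batch. A hedged sketch follows; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def view_contrastive_loss(emb_view_a, emb_view_b, temperature=0.07):
    """emb_view_a[i] and emb_view_b[i] are embeddings of two views of the same
    generated object; other rows in the batch act as negatives (InfoNCE)."""
    a = F.normalize(emb_view_a, dim=-1)
    b = F.normalize(emb_view_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```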

High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors

Nov 28, 2022
Yunpeng Bai, Yanbo Fan, Xuan Wang, Yong Zhang, Jingxiang Sun, Chun Yuan, Ying Shan

High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Fields (NeRF) have shown impressive novel view rendering results and have been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes a 3D-aware generative prior. Different from existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, formulated as a local, low-dimensional subspace in the latent space of a 3D-GAN. We propose an efficient method to construct this personalized generative prior from a small set of facial images of a given individual. After learning, it allows photo-realistic rendering from novel views, and face reenactment can be realized by navigating the latent space. Our proposed method is applicable to different driving signals, including RGB images, 3DMM coefficients and audio. Compared with existing works, we obtain superior novel view synthesis results and faithful face reenactment performance.

* 8 pages, 7 figures 
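
The "local, low-dimensional subspace" of the 3D-GAN latent space could, for illustration, be fit with PCA over latent codes inverted from the individual's few images. The paper does not state that it uses PCA, so treat this purely as a sketch of the idea.

```python
import torch

def fit_personal_subspace(latent_codes, k=10):
    """Fit a mean and up to k principal directions to latent codes (N, dim)
    inverted from a small set of one person's photos; navigating coefficients
    in this subspace then drives reenactment."""
    mean = latent_codes.mean(dim=0, keepdim=True)
    q = min(k, latent_codes.shape[0])
    _, _, basis = torch.pca_lowrank(latent_codes - mean, q=q, center=False)
    return mean, basis  # basis: (dim, q)

def decode_point(mean, basis, coeffs):
    """Map low-dimensional coefficients (q,) back to a full latent code."""
    return mean + coeffs @ basis.t()
```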

DynIBaR: Neural Dynamic Image-Based Rendering

Nov 28, 2022
Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, Noah Snavely

We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.

* Project page: dynibar.github.io 
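
The core rendering step described above aggregates features from nearby source views for each ray sample. A stripped-down version of such a weighted aggregation is sketched below; the actual method also compensates for scene motion when reprojecting, which is omitted here.

```python
import torch

def aggregate_source_views(features, scores):
    """features: (n_views, n_samples, C) features reprojected from nearby source views;
    scores: (n_views, n_samples, 1) unnormalized blending scores (e.g. derived from
    visibility and reprojection confidence). Returns aggregated features (n_samples, C)."""
    weights = torch.softmax(scores, dim=0)   # normalize across source views
    return (weights * features).sum(dim=0)
```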

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Nov 20, 2022
Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, Jiashi Feng

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. Given a text description, MagicVideo can generate photo-realistic video clips with high relevance to the text content. With the proposed efficient latent 3D U-Net design, MagicVideo can generate video clips with 256x256 spatial resolution on a single GPU card, which is 64x faster than the recent video diffusion model (VDM). Unlike previous works that train video generation from scratch in RGB space, we propose to generate video clips in a low-dimensional latent space. We further reuse the convolution operator weights of pre-trained text-to-image generative U-Net models for faster training. To achieve this, we introduce two new designs that adapt the U-Net decoder to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. The whole generation process takes place within the low-dimensional latent space of a pre-trained variational auto-encoder. We demonstrate that MagicVideo can generate both realistic video content and imaginary content in a photo-realistic style, with a trade-off between quality and computational cost. Refer to https://magicvideo.github.io/# for more examples.
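
"Directed temporal attention" suggests that each frame attends only to the current and earlier frames along the time axis. Below is a small causal-attention sketch of that idea in PyTorch; the shapes and module layout are assumptions, not the MagicVideo code.

```python
import torch
import torch.nn as nn

class DirectedTemporalAttention(nn.Module):
    """Causal self-attention over the frame axis: every spatial token attends
    only to the same spatial position in the current and earlier frames."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * n, f, d)   # attention runs along frames
        causal = torch.triu(torch.ones(f, f, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=causal)    # True entries are masked out
        return out.reshape(b, n, f, d).permute(0, 2, 1, 3)

# Hypothetical sizes: 2 clips, 8 frames, 16x16 latent tokens, 64-dim features.
attn = DirectedTemporalAttention(dim=64)
y = attn(torch.randn(2, 8, 256, 64))
```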

3D Scene Creation and Rendering via Rough Meshes: A Lighting Transfer Avenue

Nov 27, 2022
Yujie Li, Bowen Cai, Yuqin Liang, Rongfei Jia, Binqiang Zhao, Mingming Gong, Huan Fu

This paper studies how to flexibly integrate reconstructed 3D models into practical 3D modeling pipelines such as 3D scene creation and rendering. Due to technical difficulty, one can only obtain rough 3D models (R3DMs) for most real objects using existing 3D reconstruction techniques. As a result, physically-based rendering (PBR) would produce low-quality images or videos for scenes constructed from R3DMs. One promising solution is to represent real-world objects as Neural Fields such as NeRFs, which are able to generate photo-realistic renderings of an object from desired viewpoints. However, a drawback is that views synthesized through Neural Fields Rendering (NFR) cannot reflect the simulated lighting details on R3DMs in PBR pipelines, especially when object interactions in the 3D scene creation cause local shadows. To solve this dilemma, we propose a lighting transfer network (LighTNet) to bridge NFR and PBR so that they can benefit from each other. LighTNet reasons about a simplified image composition model, remedies the uneven-surface issue caused by R3DMs, and is empowered by several perceptually motivated constraints and a new Lab angle loss which enhances the contrast between lighting strength and colors. Comparisons demonstrate that LighTNet is superior in synthesizing impressive lighting and is promising for pushing NFR further into practical 3D modeling workflows. Project page: https://3d-front-future.github.io/LighTNet .
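
The "Lab angle loss" is only named in the abstract, not defined. One plausible reading, included here strictly as an assumption, penalizes the angle between predicted and reference pixel vectors in CIELAB space, so errors in lighting strength (L) and in color direction (a, b) are measured jointly:

```python
import torch
import torch.nn.functional as F

def lab_angle_loss(pred_lab, target_lab, eps=1e-6):
    """pred_lab, target_lab: (B, 3, H, W) images already converted to CIELAB.
    The loss grows with the per-pixel angle between the two Lab vectors."""
    cos = F.cosine_similarity(pred_lab, target_lab, dim=1, eps=eps)
    return (1.0 - cos).mean()
```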