Sam Tsai

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Sep 27, 2023
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh

Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images. This creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal 68.4% of the time on the standard PartiPrompts benchmark and 71.3% of the time on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

Jul 27, 2023
Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka

We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-art methods by 3.9 mAP and 3.1 mAP on the ScanNet and ARKitScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det generalizes well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det.

* Accepted by ICCV 2023 

A Practical Stereo Depth System for Smart Glasses

Nov 19, 2022
Jialiang Wang, Daniel Scharstein, Akash Bapat, Kevin Blackburn-Matzen, Matthew Yu, Jonathan Lehman, Suhib Alsisan, Yanghan Wang, Sam Tsai, Jan-Michael Frahm, Zijian He, Peter Vajda, Michael F. Cohen, Matt Uyttendaele

We present the design of a productionized end-to-end stereo depth sensing system that performs pre-processing, online stereo rectification, and stereo depth estimation, with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects using point-of-view images captured by smart glasses. All these steps are executed on-device within the stringent compute budget of a mobile phone, and because we expect users to have a wide range of smartphones, our design needs to be general and cannot depend on particular hardware or an ML accelerator such as a smartphone GPU. Although each of these steps is well studied, a description of a practical system is still lacking. In such a system, all of these steps need to work in tandem with one another and fall back gracefully on failures within the system or on less-than-ideal input data. We show how we handle unforeseen changes to calibration (e.g., due to heat), robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. We show that our trained models are fast, running in less than 1s on the CPU of a six-year-old Samsung Galaxy S8 phone. Our models generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured from the smart glasses.

Efficient Segmentation: Learning Downsampling Near Semantic Boundaries

Jul 16, 2019
Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, Yuri Boykov

Many automated processes, such as auto-piloting, rely on good semantic segmentation as a critical component. To speed up processing, it is common to downsample the input frame. However, this comes at the cost of missed small objects and reduced accuracy at semantic boundaries. To address this problem, we propose a new content-adaptive downsampling technique that learns to favor sampling locations near semantic boundaries of target classes. Cost-performance analysis shows that our method consistently outperforms uniform sampling, improving the balance between accuracy and computational efficiency. Our adaptive sampling yields segmentation with better boundary quality and more reliable support for smaller objects.

DRCD: a Chinese Machine Reading Comprehension Dataset

Jun 20, 2018
Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, Sam Tsai

In this paper, we introduce DRCD (Delta Reading Comprehension Dataset), an open-domain traditional Chinese machine reading comprehension (MRC) dataset. The dataset aims to be a standard Chinese machine reading comprehension dataset that can serve as a source dataset in transfer learning. It contains 10,014 paragraphs from 2,108 Wikipedia articles and 30,000+ questions generated by annotators. We build a baseline model that achieves an F1 score of 53.78%. Human performance reaches an F1 score of 93.30%.

* 6 pages 
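
The F1 scores quoted in the DRCD abstract are overlap scores between a predicted answer and a reference answer. The abstract does not spell out the exact scoring procedure, so the following is only a minimal sketch of one common SQuAD-style variant, assuming character-level overlap (a frequent choice for Chinese text, which has no whitespace word boundaries). The function name `char_f1` is hypothetical and is not taken from the paper or any released evaluation script.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted and a reference answer.

    Assumption: whitespace is ignored and every remaining character
    counts as one token. This illustrates the metric family only; it is
    not the official DRCD scorer.
    """
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_chars or not ref_chars:
        return float(pred_chars == ref_chars)

    # Multiset intersection counts each shared character at most as
    # often as it appears on both sides.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score under this sketch would be the mean of `char_f1` over all questions, typically taking the maximum over multiple reference answers when more than one is available.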