Jialiang Wang

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Sep 27, 2023
Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh

Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its pre-trained-only counterpart. Compared to the state-of-the-art SDXL v1.0, Emu is preferred $68.4\%$ and $71.3\%$ of the time on visual appeal on the standard PartiPrompts benchmark and on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
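
As a rough illustration of the quality-tuning recipe described above, the sketch below fine-tunes a pre-trained latent diffusion denoiser with the standard noise-prediction loss, but only on a small curated set of high-quality image-text pairs. The module interfaces (unet, vae, text_encoder), the noise schedule, and all hyperparameters are placeholder assumptions, not Emu's actual implementation.

```python
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, T=1000):
    # toy linear alpha-bar schedule, purely for illustration
    alpha_bar = (1.0 - t.float() / T).clamp(min=1e-3)
    a = alpha_bar.view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def quality_tune(unet, vae, text_encoder, curated_loader, steps=2000, lr=1e-5):
    """Fine-tune only the denoiser on a few thousand curated, highly aesthetic pairs."""
    opt = torch.optim.AdamW(unet.parameters(), lr=lr)
    unet.train()
    step = 0
    while step < steps:
        for images, captions in curated_loader:
            with torch.no_grad():
                latents = vae.encode(images)       # images -> latents (placeholder interface)
                cond = text_encoder(captions)      # captions -> conditioning (placeholder)
            noise = torch.randn_like(latents)
            t = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
            noisy = add_noise(latents, noise, t)   # forward diffusion q(x_t | x_0)
            loss = F.mse_loss(unet(noisy, t, cond), noise)  # standard noise-prediction loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return unet
```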

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

Jul 27, 2023
Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka

We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of the NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms the state of the art by 3.9 mAP and 3.1 mAP on the ScanNet and ARKitScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det generalizes well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at \url{https://github.com/facebookresearch/NeRF-Det}.
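
The sketch below illustrates, under assumed module names and sizes, the idea of connecting the NeRF and detection branches through a shared MLP so that detection features are modulated by the estimated geometry; it is a conceptual toy, not the released NeRF-Det code.

```python
import torch
import torch.nn as nn

class SharedGeometryMLP(nn.Module):
    """Toy shared MLP feeding both a NeRF density head and a detection feature head."""
    def __init__(self, feat_dim=128, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)    # NeRF branch: per-point density
        self.det_head = nn.Linear(hidden, hidden)   # detection branch: per-point feature

    def forward(self, point_feats):
        # point_feats: (N, feat_dim) features sampled from multi-view image features
        h = self.shared(point_feats)
        sigma = torch.relu(self.density_head(h))    # density for volume rendering
        opacity = 1.0 - torch.exp(-sigma)           # geometry weight in [0, 1)
        det_feat = self.det_head(h) * opacity       # geometry-aware detection feature
        return sigma, det_feat

sigma, det_feat = SharedGeometryMLP()(torch.randn(4096, 128))
```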

* Accepted by ICCV 2023 

A Practical Stereo Depth System for Smart Glasses

Nov 19, 2022
Jialiang Wang, Daniel Scharstein, Akash Bapat, Kevin Blackburn-Matzen, Matthew Yu, Jonathan Lehman, Suhib Alsisan, Yanghan Wang, Sam Tsai, Jan-Michael Frahm, Zijian He, Peter Vajda, Michael F. Cohen, Matt Uyttendaele

We present the design of a productionized end-to-end stereo depth sensing system that performs pre-processing, online stereo rectification, and stereo depth estimation, with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects from point-of-view images captured by smart glasses. All these steps are executed on-device within the stringent compute budget of a mobile phone, and because we expect users to have a wide range of smartphones, our design needs to be general and cannot depend on particular hardware or ML accelerators such as a smartphone GPU. Although each of these steps is well studied, a description of a practical system is still lacking. In such a system, each step needs to work in tandem with the others and fall back gracefully on failures within the system or less-than-ideal input data. We show how we handle unforeseen changes to calibration, e.g. due to heat, robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. We show that our trained models are fast, running in less than 1s on a six-year-old Samsung Galaxy S8 phone's CPU. Our models generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured from the smart glasses.
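
A hedged sketch of the high-level control flow the abstract describes: run online rectification, check whether it is reliable, and fall back to monocular depth estimation otherwise. All function names, calibration keys, and the error threshold are illustrative placeholders, not the production system's API.

```python
import numpy as np

def estimate_depth(left_img, right_img, calib,
                   rectify, stereo_net, mono_net, max_vertical_err_px=1.0):
    """Run stereo depth when online rectification succeeds; otherwise fall back to mono."""
    rect_left, rect_right, vertical_err = rectify(left_img, right_img, calib)
    if vertical_err <= max_vertical_err_px:
        # Rectification is reliable: epipolar lines are horizontal, so run stereo matching
        disparity = stereo_net(rect_left, rect_right)
        depth = calib["focal_px"] * calib["baseline_m"] / np.maximum(disparity, 1e-6)
    else:
        # Rectification unreliable (e.g. calibration drift due to heat): monocular fallback
        depth = mono_net(left_img)
    return depth
```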

Consistent Direct Time-of-Flight Video Depth Super-Resolution

Nov 16, 2022
Zhanghao Sun, Wei Ye, Jinhui Xiong, Gyeongmin Choe, Jialiang Wang, Shuochen Su, Rakesh Ranjan

Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, to achieve a sufficient signal-to-noise ratio (SNR) in a compact module, dToF data has limited spatial resolution (e.g., ~20x30 for the iPhone dToF sensor), and a super-resolution step is required before the data is passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-resolution dToF data with corresponding high-resolution RGB guidance. Unlike conventional RGB-guided depth enhancement approaches, which perform the fusion in a per-frame manner, we propose the first multi-frame fusion scheme to mitigate the spatial ambiguity resulting from low-resolution dToF imaging. In addition, dToF sensors provide unique depth histogram information for each local patch, and we incorporate this dToF-specific feature in our network design to further alleviate spatial ambiguity. To evaluate our models on complex dynamic indoor environments and to provide a large-scale dToF sensor dataset, we introduce DyDToF, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator following the physical imaging process. We believe the methods and dataset are beneficial to a broad community as dToF depth sensing becomes mainstream on mobile devices.
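
The per-patch depth histogram feature mentioned above could be computed along the lines of the sketch below, which bins candidate depth returns for each low-resolution dToF pixel into a normalized histogram used as a network input; resolutions, bin counts, and the candidate-depth format are assumptions rather than the paper's exact setup.

```python
import torch

def patch_depth_histograms(depth_candidates, num_bins=16, d_min=0.1, d_max=10.0):
    """depth_candidates: (H, W, K) candidate depth returns per low-resolution dToF pixel."""
    H, W, K = depth_candidates.shape
    edges = torch.linspace(d_min, d_max, num_bins + 1)
    hist = torch.zeros(H, W, num_bins)
    for b in range(num_bins):
        in_bin = (depth_candidates >= edges[b]) & (depth_candidates < edges[b + 1])
        hist[..., b] = in_bin.float().sum(dim=-1)   # count returns falling into this depth bin
    return hist / K                                 # normalized per-pixel depth histogram

# e.g. a ~20x30 dToF frame with 64 candidate returns per pixel
hist = patch_depth_histograms(torch.rand(20, 30, 64) * 9.9 + 0.1)
```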

A Practical Second-order Latent Factor Model via Distributed Particle Swarm Optimization

Aug 12, 2022
Jialiang Wang, Yurong Zhong, Weiling Li

Latent Factor (LF) models are effective in representing high-dimensional and sparse (HiDS) data via low-rank matrix approximation. Hessian-free (HF) optimization is an efficient way to utilize second-order information of an LF model's objective function, and it has been used to optimize second-order LF (SLF) models. However, the low-rank representation ability of an SLF model relies heavily on its multiple hyperparameters. Determining these hyperparameters is time-consuming and largely reduces the practicability of an SLF model. To address this issue, a practical SLF (PSLF) model is proposed in this work. It achieves hyperparameter self-adaptation with a distributed particle swarm optimizer (DPSO), which is gradient-free and parallelized. Experiments on real HiDS data sets indicate that the PSLF model has a competitive advantage over state-of-the-art models in data representation ability.
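
The sketch below shows a generic, single-machine particle swarm optimizer of the kind the abstract refers to: a gradient-free search over an SLF model's hyperparameters using validation loss as fitness. It is not the paper's distributed DPSO implementation; the bounds, PSO constants, and example fitness function are illustrative.

```python
import numpy as np

def pso_search(fitness, bounds, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize a black-box fitness over box-bounded hyperparameters with basic PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    x = rng.uniform(lo, hi, size=(n_particles, len(bounds)))   # particle positions
    v = np.zeros_like(x)                                       # particle velocities
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([fitness(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

# e.g. search a learning rate and a regularization weight (toy quadratic fitness)
best = pso_search(lambda hp: (hp[0] - 0.01) ** 2 + (hp[1] - 0.05) ** 2,
                  bounds=[(1e-4, 0.1), (1e-3, 0.5)])
```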

* 7 pages 

UMSNet: An Universal Multi-sensor Network for Human Activity Recognition

May 24, 2022
Jialiang Wang, Haotian Wei, Yi Wang, Shu Yang, Chi Li

Human activity recognition (HAR) based on multimodal sensors has become a rapidly growing branch of biometric recognition and artificial intelligence. However, how to fully mine multimodal time series data and effectively learn accurate behavioral features has always been a hot topic in this field. Practical applications also require a well-generalized framework that can quickly process a variety of raw sensor data and learn better feature representations. This paper proposes a universal multi-sensor network (UMSNet) for human activity recognition. In particular, we propose a new lightweight sensor residual block (called the LSR block), which improves performance by reducing the number of activation functions and normalization layers, and by adding an inverted bottleneck structure and grouped convolutions. A Transformer is then used to extract relationships among sequence features to realize the classification and recognition of human activities. Our framework has a clear structure and can be directly applied to various types of multi-modal Time Series Classification (TSC) tasks after simple specialization. Extensive experiments show that the proposed UMSNet outperforms other state-of-the-art methods on two popular multi-sensor human activity recognition datasets (i.e., the HHAR and MHEALTH datasets).
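
A minimal sketch of a residual block in the spirit of the described LSR block: an inverted bottleneck with a grouped temporal convolution and deliberately few activation and normalization layers. Channel counts, kernel size, and the choice of GELU are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LSRBlock(nn.Module):
    """Toy lightweight sensor residual block for (batch, channels, time) sensor features."""
    def __init__(self, channels=64, expansion=4, groups=8, kernel_size=5):
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv1d(channels, hidden, 1)                  # inverted bottleneck: expand
        self.grouped = nn.Conv1d(hidden, hidden, kernel_size,
                                 padding=kernel_size // 2, groups=groups)  # grouped temporal conv
        self.norm = nn.BatchNorm1d(hidden)                            # single normalization layer
        self.act = nn.GELU()                                          # single activation
        self.project = nn.Conv1d(hidden, channels, 1)                 # project back to input width

    def forward(self, x):
        return x + self.project(self.act(self.norm(self.grouped(self.expand(x)))))

out = LSRBlock()(torch.randn(2, 64, 128))
```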

Toward Practical Self-Supervised Monocular Indoor Depth Estimation

Dec 04, 2021
Cho-Ying Wu, Jialiang Wang, Michael Hall, Ulrich Neumann, Shuochen Su

The majority of self-supervised monocular depth estimation methods focus on driving scenarios. We show that such methods generalize poorly to unseen complex indoor scenes, where objects are cluttered and arbitrarily arranged in the near field. To obtain more robustness, we propose a structure distillation approach to learn knacks from a pretrained depth estimator that produces structured but metric-agnostic depth due to its in-the-wild mixed-dataset training. By combining distillation with the self-supervised branch that learns metrics from left-right consistency, we attain structured and metric depth for generic indoor scenes and make inferences in real time. To facilitate learning and evaluation, we collect SimSIN, a dataset from simulation with thousands of environments, and UniSIN, a dataset that contains about 500 real scan sequences of generic indoor environments. We experiment in both sim-to-real and real-to-real settings, and show improvements both qualitatively and quantitatively, as well as in downstream applications using our depth maps. This work provides a full study, covering methods, data, and applications. We believe the work lays a solid basis for practical indoor depth estimation via self-supervision.
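
One way to combine the two training signals described above is sketched below: a structure-distillation term that matches scale- and shift-normalized depth against the metric-agnostic teacher, plus a left-right photometric term that supplies metric supervision. The normalization scheme and loss weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def normalize_depth(d):
    # remove per-image scale and shift so only depth *structure* is compared to the teacher
    med = d.flatten(1).median(dim=1).values.view(-1, 1, 1, 1)
    scale = (d - med).abs().flatten(1).mean(dim=1).view(-1, 1, 1, 1)
    return (d - med) / (scale + 1e-6)

def total_loss(pred_depth, teacher_depth, left_img, right_warped_to_left, w_distill=0.5):
    """Structure distillation + left-right photometric consistency (toy combination)."""
    distill = F.l1_loss(normalize_depth(pred_depth), normalize_depth(teacher_depth))
    photometric = F.l1_loss(left_img, right_warped_to_left)   # supplies metric scale
    return photometric + w_distill * distill
```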

FBNetV5: Neural Architecture Search for Multiple Tasks in One Run

Nov 30, 2021
Bichen Wu, Chaojian Li, Hang Zhang, Xiaoliang Dai, Peizhao Zhang, Matthew Yu, Jialiang Wang, Yingyan Lin, Peter Vajda

Neural Architecture Search (NAS) has been widely adopted to design accurate and efficient image classification models. However, applying NAS to a new computer vision task still requires a huge amount of effort. This is because 1) previous NAS research has over-prioritized image classification while largely ignoring other tasks; 2) many NAS works focus on optimizing task-specific components that do not transfer favorably to other tasks; and 3) existing NAS methods are typically designed to be "proxyless" and require significant effort to be integrated with each new task's training pipeline. To tackle these challenges, we propose FBNetV5, a NAS framework that can search for neural architectures for a variety of vision tasks with much reduced computational cost and human effort. Specifically, we design 1) a search space that is simple yet inclusive and transferable; 2) a multitask search process that is disentangled from the target tasks' training pipelines; and 3) an algorithm to simultaneously search for architectures for multiple tasks with a computational cost agnostic to the number of tasks. We evaluate the proposed FBNetV5 on three fundamental vision tasks -- image classification, object detection, and semantic segmentation. Models found by FBNetV5 in a single run of search outperform the previous state-of-the-art in all three tasks: image classification (e.g., +1.3% ImageNet top-1 accuracy under the same FLOPs compared to FBNetV3), semantic segmentation (e.g., +1.8% higher ADE20K val. mIoU than SegFormer with 3.6x fewer FLOPs), and object detection (e.g., +1.1% COCO val. mAP with 1.2x fewer FLOPs compared to YOLOX).
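
The sketch below is a generic weight-sharing, differentiable-NAS-style toy illustrating the single-search, multi-task idea: one shared supernet of candidate blocks with separate learnable architecture logits per task. It is not FBNetV5's actual search algorithm or search space; block choices, sizes, and the Gumbel-softmax relaxation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single shared pool of candidate blocks (the "supernet" weights are trained once)
candidate_blocks = nn.ModuleList([
    nn.Conv2d(16, 16, 3, padding=1),
    nn.Conv2d(16, 16, 5, padding=2),
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1, groups=16), nn.Conv2d(16, 16, 1)),
])

tasks = ["classification", "detection", "segmentation"]
# Per-task architecture distributions: the only task-specific search state
arch_logits = {t: nn.Parameter(torch.zeros(len(candidate_blocks))) for t in tasks}

def forward_supernet(x, task, tau=1.0):
    # sample a soft (Gumbel-softmax) mixture over candidate blocks for this task
    weights = F.gumbel_softmax(arch_logits[task], tau=tau)
    return sum(w * block(x) for w, block in zip(weights, candidate_blocks))

x = torch.randn(2, 16, 32, 32)
outputs = {t: forward_supernet(x, t) for t in tasks}   # one shared supernet, per-task choices
```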

Level Set Binocular Stereo with Occlusions

Sep 08, 2021
Jialiang Wang, Todd Zickler

Localizing stereo boundaries and predicting nearby disparities are difficult because stereo boundaries induce occluded regions where matching cues are absent. Most modern computer vision algorithms treat occlusions secondarily (e.g., via left-right consistency checks after matching) or rely on high-level cues to improve nearby disparities (e.g., via deep networks and large training sets). They ignore the geometry of stereo occlusions, which dictates that the spatial extent of occlusion must equal the amplitude of the disparity jump that causes it. This paper introduces an energy and level-set optimizer that improves boundaries by encoding occlusion geometry. Our model applies to two-layer, figure-ground scenes, and it can be implemented cooperatively using messages that pass predominantly between parents and children in an undecimated hierarchy of multi-scale image patches. In a small collection of figure-ground scenes curated from the Middlebury and Falling Things stereo datasets, our model provides more accurate boundaries than previous occlusion-handling stereo techniques. This suggests new directions for creating cooperative stereo systems that incorporate occlusion cues in a human-like manner.
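
The geometric fact the model encodes, that the width of an occluded band equals the disparity jump that causes it, can be checked numerically with the toy example below (a synthetic one-dimensional figure-ground scene; all numbers are arbitrary).

```python
import numpy as np

width, d_bg, d_fg, fg_left_edge = 200, 5, 20, 100
disparity = np.full(width, d_bg)
disparity[fg_left_edge:] = d_fg                    # foreground occupies the right half (left view)

# Map every left-view pixel to the right view: x_right = x_left - disparity.
x_right = np.arange(width) - disparity
occluded = np.array([                              # background pixels whose right-view target
    d == d_bg and np.any(x_right[disparity == d_fg] == xr)   # is also claimed by a nearer pixel
    for xr, d in zip(x_right, disparity)
])
print(occluded.sum(), "occluded pixels ==", d_fg - d_bg, "pixel disparity jump")
```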

* extended journal version of arXiv:2006.16094 