Yuan Liu

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Sep 07, 2023
Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, Wenping Wang


In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, the recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors across the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, making it well-suited for various 3D generation tasks such as novel view synthesis, text-to-3D, and image-to-3D.
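As a concrete picture of the synchronization idea, the toy PyTorch sketch below shows one way the intermediate latents of several views could be correlated with a shared cross-view attention at each reverse step. Shapes, the module name, and the update rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CrossViewSync(nn.Module):
    """Toy stand-in for a 3D-aware feature attention: every view's latent
    attends to the latents of all other views, keeping the views synchronized."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, views, tokens, dim) -- intermediate diffusion states
        b, v, t, d = latents.shape
        x = latents.reshape(b, v * t, d)      # pool tokens from all views
        synced, _ = self.attn(x, x, x)        # each view attends to every view
        x = self.norm(x + synced)             # residual update keeps per-view content
        return x.reshape(b, v, t, d)

def reverse_step(denoiser, sync, latents, t):
    """One simplified reverse step: denoise each view, then synchronize."""
    eps = denoiser(latents, t)                # per-view noise prediction (placeholder)
    latents = latents - 0.1 * eps             # toy update rule, not real DDPM math
    return sync(latents)

# Dummy usage: 4 views, 16 latent tokens of width 64.
sync = CrossViewSync(dim=64)
lat = torch.randn(2, 4, 16, 64)
dummy_denoiser = lambda z, t: torch.randn_like(z)
print(reverse_step(dummy_denoiser, sync, lat, t=0).shape)  # torch.Size([2, 4, 16, 64])
```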

* Project page: https://liuyuan-pal.github.io/SyncDreamer/ 

Local Structure-aware Graph Contrastive Representation Learning

Aug 07, 2023
Kai Yang, Yuan Liu, Zijuan Zhao, Peijin Ding, Wenqian Zhao

Traditional Graph Neural Networks (GNNs), as graph representation learning methods, are constrained by label information. Graph Contrastive Learning (GCL) methods tackle the label problem effectively, but they mainly focus on the feature information of the global graph or of small subgraph structures (e.g., the first-order neighborhood). In this paper, we propose a Local Structure-aware Graph Contrastive representation Learning method (LS-GCL) to model the structural information of nodes from multiple views. Specifically, we construct semantic subgraphs that are not limited to the first-order neighbors. For the local view, the semantic subgraph of each target node is fed into a shared GNN encoder to obtain target node embeddings at the subgraph level; a pooling function then generates subgraph-level graph embeddings. For the global view, since the original graph preserves indispensable semantic information about the nodes, we leverage the shared GNN encoder to learn target node embeddings at the global graph level. The proposed LS-GCL model is optimized to maximize the common information among similar instances across these three perspectives through a multi-level contrastive loss function. Experimental results on five datasets show that our method outperforms state-of-the-art graph representation learning approaches on both node classification and link prediction tasks.
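As a rough illustration of the multi-level contrastive objective described above, the sketch below contrasts three embeddings of the same node (local subgraph, pooled subgraph, and global graph views) with a standard InfoNCE loss. The function names, temperature, and equal weighting are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """Contrast matching rows of z1/z2 (positives) against all other rows."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                 # (N, N) similarity matrix
    labels = torch.arange(z1.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def multi_level_loss(z_local, z_subgraph, z_global):
    # Pull the three views of each node together, push different nodes apart.
    return (info_nce(z_local, z_subgraph)
            + info_nce(z_local, z_global)
            + info_nce(z_subgraph, z_global)) / 3.0

# Example with random embeddings for 8 nodes, 64 dimensions each.
z_l, z_s, z_g = (torch.randn(8, 64) for _ in range(3))
print(multi_level_loss(z_l, z_s, z_g))
```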


Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Aug 01, 2023
Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin


There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain.
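The sketch below illustrates the general idea of fusing low-level features from shallow ViT blocks into the pixel-reconstruction branch. Layer indices, the learned fusion weights, and the module name are illustrative assumptions; the released code at the linked repository is the authoritative implementation.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Toy fusion of selected shallow/intermediate ViT block outputs."""
    def __init__(self, dim: int, fuse_layers=(2, 5, 8, 11)):
        super().__init__()
        self.fuse_layers = fuse_layers
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in fuse_layers)
        self.weights = nn.Parameter(torch.zeros(len(fuse_layers)))  # learned fusion weights

    def forward(self, layer_outputs):
        # layer_outputs: list of (B, N, dim) tensors, one per transformer block
        w = torch.softmax(self.weights, dim=0)
        fused = sum(w[i] * p(layer_outputs[l])
                    for i, (l, p) in enumerate(zip(self.fuse_layers, self.proj)))
        return fused  # fed to the pixel-reconstruction decoder alongside the last layer

# Dummy usage: 12 fake block outputs at a ViT-S-like width of 384.
outs = [torch.randn(2, 49, 384) for _ in range(12)]
print(MultiLevelFusion(384)(outs).shape)  # torch.Size([2, 49, 384])
```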

* Accepted by ICCV 2023 

MMBench: Is Your Multi-modal Model an All-around Player?

Jul 26, 2023
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin


Large vision-language models have recently achieved remarkable progress, exhibiting impressive perception and reasoning abilities on visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and exhibit significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline comprising two elements. The first is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second introduces a novel CircularEval strategy and incorporates the use of ChatGPT to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model's predictions. MMBench is a systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating their models and encourage future advancements in this domain. Project page: https://opencompass.org.cn/mmbench.
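A minimal sketch of the CircularEval idea follows: a multiple-choice question is posed once per rotation of its options, and the model is credited only if it picks the correct option under every rotation. This is a simplified stand-in, not the official evaluation pipeline (which additionally uses ChatGPT to map free-form answers to choices).

```python
from collections import deque

def circular_eval(model_answer, question, options, correct_idx):
    """model_answer(question, options) -> index of the chosen option."""
    opts = deque(options)
    for shift in range(len(options)):
        rotated = list(opts)
        target = (correct_idx - shift) % len(options)  # where the answer moved to
        if model_answer(question, rotated) != target:
            return False                               # failing any rotation counts as wrong
        opts.rotate(-1)                                # shift options left by one
    return True

# Toy model that always answers the first option: passes at most one rotation.
always_first = lambda q, o: 0
print(circular_eval(always_first, "2+2=?", ["4", "3", "5", "6"], correct_idx=0))  # False
```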


A Blender-based channel simulator for FMCW Radar

Jul 18, 2023
Yuan Liu, Moein Ahmadi, Johann Fuchs, Mohammad Alaee-Kerahroodi, M. R. Bhavani Shankar


Radar simulation is a promising way to provide data cubes efficiently and accurately for AI-based approaches to radar applications. This paper develops a channel simulator that generates frequency-modulated continuous-wave (FMCW) multiple-input multiple-output (MIMO) radar signals. In the proposed simulation framework, the open-source animation tool Blender is used to model the scenarios and render animations. The embedded ray-tracing (RT) engine traces the radar propagation paths, i.e., the distance and signal strength of each path. The beat-signal models of time-division-multiplexing (TDM) MIMO are then adapted to the RT outputs. Finally, environment-based models are simulated to validate the framework.
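To make the pipeline concrete, the sketch below maps per-path delays and amplitudes (as a ray tracer would report them) to a dechirped FMCW beat signal for a single chirp. All radar parameters are illustrative placeholders, not values from the paper, and the TDM-MIMO extension is omitted.

```python
import numpy as np

def fmcw_beat_signal(delays_s, amplitudes, fc=77e9, bandwidth=1e9,
                     chirp_time=50e-6, fs=10e6):
    """Sum the per-path beat tones produced after dechirping one chirp."""
    slope = bandwidth / chirp_time                 # chirp slope in Hz/s
    t = np.arange(0, chirp_time, 1.0 / fs)
    signal = np.zeros_like(t, dtype=complex)
    for tau, a in zip(delays_s, amplitudes):
        f_beat = slope * tau                       # range-dependent beat frequency
        phase = 2 * np.pi * (f_beat * t + fc * tau)
        signal += a * np.exp(1j * phase)
    return t, signal

# Two ray-traced paths: a target at ~15 m and a weaker reflection at ~22 m.
c = 3e8
t, s = fmcw_beat_signal([2 * 15 / c, 2 * 22 / c], [1.0, 0.3])
print(np.abs(np.fft.fft(s)).argmax())  # FFT bin of the strongest beat tone
```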

* Presented at ISCS23 

Evaluating AI systems under uncertain ground truth: a case study in dermatology

Jul 05, 2023
David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam


For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed to be certain. However, this assumption often does not hold: the ground truth may itself be uncertain. This is largely ignored in the standard evaluation of AI models but can have severe consequences, such as overestimating future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty, which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame the aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images, where annotations are provided in the form of differential diagnoses. The deterministic adjudication process from previous work, called inverse rank normalization (IRN), ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and that standard IRN-based evaluation severely overestimates performance without providing uncertainty estimates.
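As a simplified, hedged illustration of the general idea (not the paper's probabilistic IRN or Plackett-Luce models), the sketch below keeps a Dirichlet posterior over class plausibilities per case instead of a single adjudicated label, and reports an uncertainty-adjusted accuracy by sampling the ground truth from that posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

def plausibility_posterior(votes, num_classes, prior=1.0):
    """Dirichlet posterior over classes given annotator votes for one case."""
    counts = np.bincount(votes, minlength=num_classes)
    return counts + prior                             # Dirichlet concentration parameters

def uncertainty_adjusted_accuracy(predictions, all_votes, num_classes, samples=1000):
    accs = []
    for _ in range(samples):
        correct = 0
        for pred, votes in zip(predictions, all_votes):
            theta = rng.dirichlet(plausibility_posterior(votes, num_classes))
            label = rng.choice(num_classes, p=theta)  # sampled ground truth
            correct += int(pred == label)
        accs.append(correct / len(predictions))
    return np.mean(accs), np.std(accs)                # spread reflects GT uncertainty

# Three cases, three annotators each; the model predicts classes 0, 1, 2.
print(uncertainty_adjusted_accuracy([0, 1, 2],
                                    [[0, 0, 1], [1, 1, 1], [2, 0, 1]],
                                    num_classes=3))
```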


NeRO: Neural Geometry and BRDF Reconstruction of Reflective Objects from Multiview Images

May 27, 2023
Yuan Liu, Peng Wang, Cheng Lin, Xiaoxiao Long, Jiepeng Wang, Lingjie Liu, Taku Komura, Wenping Wang


We present a neural rendering-based method called NeRO for reconstructing the geometry and the BRDF of reflective objects from multiview images captured in an unknown environment. Multiview reconstruction of reflective objects is extremely challenging because specular reflections are view-dependent and thus violate multiview consistency, the cornerstone of most multiview reconstruction methods. Recent neural rendering techniques can model the interaction between environment lights and object surfaces to fit view-dependent reflections, making it possible to reconstruct reflective objects from multiview images. However, accurately modeling environment lights in neural rendering is intractable, especially when the geometry is unknown. Most existing neural rendering methods that can model environment lights only consider direct lights and rely on object masks to reconstruct objects with weak specular reflections. Therefore, these methods fail on reflective objects, especially when the object mask is not available and the object is illuminated by indirect lights. We propose a two-step approach to tackle this problem. First, by applying the split-sum approximation and the integrated directional encoding to approximate the shading effects of both direct and indirect lights, we accurately reconstruct the geometry of reflective objects without any object masks. Then, with the object geometry fixed, we use more accurate sampling to recover the environment lights and the BRDF of the object. Extensive experiments demonstrate that our method accurately reconstructs the geometry and the BRDF of reflective objects from only posed RGB images, without knowing the environment lights or the object masks. Code and datasets are available at https://github.com/liuyuan-pal/NeRO.
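The toy sketch below illustrates the split-sum flavour of shading the method builds on: a prefiltered light term queried at the reflection direction and roughness, multiplied by an analytic reflectance term. The MLP, the Schlick Fresnel approximation, and all shapes are illustrative assumptions rather than NeRO's actual shading model.

```python
import torch
import torch.nn as nn

class SplitSumSpecular(nn.Module):
    """Toy split-sum specular term: prefiltered light (MLP) times Fresnel."""
    def __init__(self, hidden=64):
        super().__init__()
        # Stand-in for the integrated light network: input = reflection dir + roughness.
        self.light_mlp = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 3), nn.Softplus())

    def forward(self, view_dir, normal, roughness, f0):
        # Reflect the view direction about the surface normal.
        refl = view_dir - 2 * (view_dir * normal).sum(-1, keepdim=True) * normal
        light = self.light_mlp(torch.cat([refl, roughness], dim=-1))   # prefiltered light
        n_dot_v = (normal * -view_dir).sum(-1, keepdim=True).clamp(min=1e-4)
        fresnel = f0 + (1 - f0) * (1 - n_dot_v) ** 5                   # Schlick approximation
        return light * fresnel                                         # split-sum product

m = SplitSumSpecular()
v = torch.tensor([[0.0, 0.0, -1.0]]); n = torch.tensor([[0.0, 0.0, 1.0]])
print(m(v, n, torch.tensor([[0.2]]), torch.tensor([[0.04]])).shape)  # torch.Size([1, 3])
```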

* Accepted to SIGGRAPH 2023. Project page: https://liuyuan-pal.github.io/NeRO/ Codes: https://github.com/liuyuan-pal/NeRO 

Hyperspectral Image Super-Resolution via Dual-domain Network Based on Hybrid Convolution

Apr 20, 2023
Tingting Liu, Yuan Liu, Chuncheng Zhang, Yuan Liyin, Xiubao Sui, Qian Chen


Since the amount of incident energy is limited, it is difficult to directly acquire hyperspectral images (HSI) with high spatial resolution. Given the high dimensionality and correlation of HSI, super-resolution (SR) of HSI remains a challenge in the absence of auxiliary high-resolution images, and it is crucial to extract spatial features effectively and make full use of the spectral information. This paper proposes a novel HSI super-resolution algorithm, a dual-domain network based on hybrid convolution termed SRDNet. Specifically, a dual-domain network is designed to fully exploit the spatial-spectral and frequency information of the hyperspectral data. To capture inter-spectral self-similarity, a self-attention learning mechanism (HSL) is devised in the spatial domain, and a pyramid structure is applied to enlarge the receptive field of the attention, further reinforcing the feature representation ability of the network. Moreover, to improve the perceptual quality of the HSI, a frequency loss (HFL) is introduced to optimize the model in the frequency domain; a dynamic weighting mechanism drives the network to gradually refine the generated frequency components and to counteract the excessive smoothing caused by the spatial loss. Finally, to better capture the mapping between the high-resolution and low-resolution spaces, a hybrid module of 2D and 3D units with a progressive upsampling strategy is used. Experiments on a widely used benchmark dataset show that the proposed SRDNet enhances the texture information of HSI and is superior to state-of-the-art methods.
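As a rough sketch of a hybrid 2D/3D unit with progressive upsampling (module names, channel sizes, and layout are assumptions, not the paper's architecture), the block below mixes neighbouring bands with a 3D convolution, refines spatial detail with a residual 2D convolution, and upsamples spatially by 2x via pixel shuffle.

```python
import torch
import torch.nn as nn

class Hybrid2D3DBlock(nn.Module):
    """Toy hybrid unit for hyperspectral SR: 3D spectral-spatial mixing,
    2D spatial refinement, then x2 progressive upsampling."""
    def __init__(self, bands: int):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)           # spectral-spatial mixing
        self.conv2d = nn.Conv2d(bands, bands, kernel_size=3, padding=1)   # per-band spatial refinement
        self.up = nn.Sequential(nn.Conv2d(bands, bands * 4, 3, padding=1),
                                nn.PixelShuffle(2))                       # x2 spatial upsampling
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (B, bands, H, W) hyperspectral cube with bands as channels
        y = self.conv3d(x.unsqueeze(1)).squeeze(1)   # treat bands as a 3D depth axis
        y = self.act(self.conv2d(y) + x)             # residual 2D refinement
        return self.up(y)                            # (B, bands, 2H, 2W)

lr = torch.randn(1, 31, 32, 32)                      # 31-band low-resolution patch
print(Hybrid2D3DBlock(bands=31)(lr).shape)           # torch.Size([1, 31, 64, 64])
```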
