Zhu Liu

PanoViT: Vision Transformer for Room Layout Estimation from a Single Panoramic Image

Dec 23, 2022
Weichao Shen, Yuan Dong, Zonghao Chen, Zhengyi Zhao, Yang Gao, Zhu Liu

In this paper, we propose PanoViT, a panorama vision transformer that estimates the room layout from a single panoramic image. Compared to CNN models, PanoViT is more proficient at learning global information from the panoramic image for estimating complex room layouts. Considering the difference between a perspective image and an equirectangular image, we design a novel recurrent position embedding and a patch sampling method for processing panoramic images. In addition to extracting global information, PanoViT also includes a frequency-domain edge enhancement module and a 3D loss to extract local geometric features from the panoramic image. Experimental results on several datasets demonstrate that our method outperforms state-of-the-art solutions in room layout prediction accuracy.
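
The abstract does not spell out the exact form of the recurrent position embedding, so the following is only a hypothetical sketch of the general idea: because an equirectangular panorama wraps around horizontally, the horizontal half of the position code is made periodic so that the first and last patch columns are treated as neighbours. All names and dimensions here are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a wrap-around ("recurrent") position embedding for
# panorama patches; the horizontal channels are periodic in the column index.
import math
import torch

def circular_position_embedding(num_rows: int, num_cols: int, dim: int) -> torch.Tensor:
    """Return a (num_rows * num_cols, dim) embedding: half of the channels encode
    the row sinusoidally, the other half encode the column as an angle on a circle,
    so column 0 and column num_cols-1 receive nearly identical codes."""
    assert dim % 4 == 0
    half = dim // 2
    freqs = torch.exp(-torch.arange(0, half, 2) / half * math.log(10000.0))

    cols = torch.arange(num_cols).float() / num_cols * 2 * math.pi  # wraps at 2*pi
    rows = torch.arange(num_rows).float()

    col_emb = torch.cat([torch.sin(cols[:, None] * freqs), torch.cos(cols[:, None] * freqs)], dim=1)
    row_emb = torch.cat([torch.sin(rows[:, None] * freqs), torch.cos(rows[:, None] * freqs)], dim=1)

    emb = torch.cat([row_emb[:, None, :].expand(num_rows, num_cols, half),
                     col_emb[None, :, :].expand(num_rows, num_cols, half)], dim=-1)
    return emb.reshape(num_rows * num_cols, dim)

pe = circular_position_embedding(num_rows=16, num_cols=32, dim=64)  # one vector per patch
```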

Breaking Free from Fusion Rule: A Fully Semantic-driven Infrared and Visible Image Fusion

Nov 22, 2022
Yuhui Wu, Zhu Liu, Jinyuan Liu, Xin Fan, Risheng Liu

Infrared and visible image fusion plays a vital role in computer vision. Previous approaches devote effort to designing various fusion rules in the loss functions, but these empirically designed fusion rules make the methods increasingly complex. Besides, most of them focus only on boosting visual quality and therefore perform poorly on follow-up high-level vision tasks. To address these challenges, in this letter we develop a semantic-level fusion network that fully exploits semantic guidance, freeing the method from empirically designed fusion rules. In addition, to achieve a better semantic understanding of the feature fusion process, a transformer-based fusion block is presented in a multi-scale manner. Moreover, we devise a regularization loss function, together with a training strategy, to make full use of semantic guidance from high-level vision tasks. Compared with state-of-the-art methods, our method does not depend on a hand-crafted fusion loss function, yet it achieves superior visual quality along with better performance on the follow-up high-level vision tasks.
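
As a rough illustration of what a fully semantic-driven objective can look like (assumed, not the authors' released code), the sketch below scores the fused image with a frozen downstream segmentation network instead of a hand-crafted fusion rule; the small L1 term is only a stand-in regularizer.

```python
# Hedged sketch of semantic-driven supervision for image fusion: the loss is
# computed by a downstream segmentation network, so no explicit fusion rule
# (e.g. max/average of source intensities) appears in the objective.
import torch
import torch.nn.functional as F

def semantic_fusion_loss(fused, ir, vis, seg_net, seg_labels, reg_weight=0.1):
    """fused/ir/vis: tensors of the same shape (B, C, H, W);
    seg_net: high-level task network with frozen weights (gradients still flow
    back to the fused image); seg_labels: (B, H, W) integer class map."""
    seg_logits = seg_net(fused)                      # semantic guidance from the task network
    task_loss = F.cross_entropy(seg_logits, seg_labels)
    # mild regularizer keeping the fused image anchored to both sources
    reg = F.l1_loss(fused, ir) + F.l1_loss(fused, vis)
    return task_loss + reg_weight * reg
```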

Semantic-Aware Pretraining for Dense Video Captioning

Apr 13, 2022
Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo

This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021. We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts. Diverse video features of different modalities are fed into an event captioning module to generate accurate and meaningful sentences. Our final ensemble model achieves a 10.00 METEOR score on the test set.

* The 2nd place solution to ActivityNet Event Dense-Captioning Challenge 2021 
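
The report summarizes semantic-aware pretraining only at a high level; a minimal, assumed sketch of the idea is a multi-label concept-classification head placed on pooled video features during pretraining, as below. The head and label format are illustrative, not the authors' exact design.

```python
# Assumed sketch of a semantic-aware pretraining objective: the video encoder's
# pooled features are trained to predict high-level semantic concepts before
# being reused by the event captioning module.
import torch
import torch.nn as nn

class SemanticPretrainHead(nn.Module):
    def __init__(self, feat_dim: int, num_concepts: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_concepts)

    def forward(self, clip_features, concept_labels):
        # clip_features: (B, feat_dim) pooled video features
        # concept_labels: (B, num_concepts) multi-hot float tensor of concepts
        logits = self.classifier(clip_features)
        return nn.functional.binary_cross_entropy_with_logits(logits, concept_labels)
```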

ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation

Dec 06, 2021
Isabella Liu, Edward Yang, Jianyu Tao, Rui Chen, Xiaoshuai Zhang, Qing Ran, Zhu Liu, Hao Su

Traditional depth sensors generate accurate real-world depth estimates that surpass even the most advanced learning approaches trained only on simulation domains. Since ground-truth depth is readily available in the simulation domain but quite difficult to obtain in the real domain, we propose a method that leverages the best of both worlds. In this paper we present a new framework, ActiveZero, a mixed-domain learning solution for active stereovision systems that requires no real-world depth annotation. First, we demonstrate the transferability of our method to out-of-distribution real data by using a mixed-domain learning strategy. In the simulation domain, we use a combination of supervised disparity loss and self-supervised losses on a shape-primitives dataset. By contrast, in the real domain, we use only self-supervised losses on a dataset that is out-of-distribution from both the simulation training data and the real test data. Second, our method introduces a novel self-supervised loss, called temporal IR reprojection, to increase the robustness and accuracy of our reprojections in hard-to-perceive regions. Finally, we show how the method can be trained end-to-end and that each module is important for attaining the final result. Extensive qualitative and quantitative evaluations on real data demonstrate state-of-the-art results that can even beat a commercial depth sensor.
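
A hedged sketch of the kind of loss the abstract describes (not the released ActiveZero code): the right IR frame is warped into the left view with the predicted disparity and compared photometrically, and averaging this loss over several temporally captured IR pattern frames of the same static scene gives a "temporal" reprojection term.

```python
# Illustrative disparity-based reprojection loss for an active IR stereo pair.
import torch
import torch.nn.functional as F

def reprojection_loss(left_ir, right_ir, disparity):
    """left_ir/right_ir: (B, 1, H, W); disparity: (B, 1, H, W) in pixels
    (standard rectified convention: x_left = x_right + disparity)."""
    b, _, h, w = left_ir.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs[None].float().to(left_ir.device) - disparity[:, 0]   # sample right image at x - d
    ys = ys[None].float().to(left_ir.device).expand(b, h, w)
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    warped = F.grid_sample(right_ir, grid, align_corners=True)
    return F.l1_loss(warped, left_ir)

def temporal_reprojection_loss(left_frames, right_frames, disparity):
    # left_frames/right_frames: lists of IR captures of the same static scene
    losses = [reprojection_loss(l, r, disparity) for l, r in zip(left_frames, right_frames)]
    return torch.stack(losses).mean()
```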

Triple-level Model Inferred Collaborative Network Architecture for Video Deraining

Nov 08, 2021
Pan Mu, Zhu Liu, Yaohua Liu, Risheng Liu, Xin Fan

Video deraining is an important issue for outdoor vision systems and has been investigated extensively. However, designing optimal architectures that jointly account for the model formulation and the data distribution remains challenging for video deraining. In this paper, we develop a model-guided triple-level optimization framework, named Triple-level Model Inferred Cooperating Searching (TMICS), that deduces the network architecture through cooperative optimization and an automatic search mechanism to handle various video rain conditions. In particular, to mitigate the problem that existing methods cannot cover various rain-streak distributions, we first design a hyper-parameter optimization model over the task variables and hyper-parameters. Based on the proposed optimization model, we design a collaborative structure for video deraining. This structure includes a Dominant Network Architecture (DNA) and a Companionate Network Architecture (CNA), which cooperate through an Attention-based Averaging Scheme (AAS). To better exploit inter-frame information from videos, we introduce a macroscopic structure-search scheme that searches over an Optical Flow Module (OFM) and a Temporal Grouping Module (TGM) to help restore the latent frame. In addition, we apply differentiable neural architecture search over a compact candidate set of task-specific operations to automatically discover desirable rain-removal architectures. Extensive experiments on various datasets demonstrate that our model shows significant improvements in fidelity and temporal consistency over state-of-the-art works. Source code is available at https://github.com/vis-opt-group/TMICS.

* Accepted at IEEE Transactions on Image Processing 
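
The abstract names an Attention-based Averaging Scheme (AAS) that combines the DNA and CNA branches; the exact design is in the paper, so the block below is only an assumed illustration in which a small convolutional network predicts a per-pixel blending weight between the two branch outputs.

```python
# Assumed sketch of attention-based averaging of two derained predictions.
import torch
import torch.nn as nn

class AttentionAveraging(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, dna_out, cna_out):
        # dna_out, cna_out: (B, C, H, W) derained frames from the two branches
        w = self.weight_net(torch.cat([dna_out, cna_out], dim=1))
        return w * dna_out + (1 - w) * cna_out
```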

Learning Optimization-inspired Image Propagation with Control Mechanisms and Architecture Augmentations for Low-level Vision

Dec 10, 2020
Risheng Liu, Zhu Liu, Pan Mu, Zhouchen Lin, Xin Fan, Zhongxuan Luo

In recent years, building deep learning models from an optimization perspective has become a promising direction for solving low-level vision problems. The main idea of most existing approaches is to straightforwardly combine numerical iterations with manually designed network architectures to generate image propagations for specific kinds of optimization models. However, these heuristic learning models often lack mechanisms to control the propagation and rely heavily on architecture engineering. To mitigate these issues, this paper proposes a unified optimization-inspired deep image propagation framework that aggregates Generative, Discriminative and Corrective (GDC for short) principles for a variety of low-level vision tasks. Specifically, we first formulate low-level vision tasks using a generic optimization objective and construct our fundamental propagative modules from three different viewpoints, i.e., the solution can be obtained/learned 1) in a generative manner, 2) based on a discriminative metric, and 3) with domain-knowledge correction. By designing control mechanisms to guide the image propagations, we then obtain convergence guarantees of GDC for both fully- and partially-defined optimization formulations. Furthermore, we introduce two architecture augmentation strategies (i.e., normalization and automatic search) to enhance, respectively, the propagation stability and the task/data-adaptation ability. Extensive experiments on different low-level vision applications demonstrate the effectiveness and flexibility of GDC.

* 15 pages 
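
As a schematic (and assumed) rendering of one GDC propagation step: a learned generative module proposes an update, a gradient step on a data-fidelity term plays the discriminative role, and a projection onto the valid image range stands in for domain-knowledge correction. The precise modules and control mechanisms are defined in the paper.

```python
# Assumed sketch of a single Generative -> Discriminative -> Corrective step.
import torch

def gdc_step(x, y, generative_net, data_fidelity, step_size=0.1):
    """x: current estimate; y: observation; generative_net: learned prior module;
    data_fidelity: callable (x, y) -> scalar loss."""
    x = generative_net(x)                                  # generative propagation
    x = x.detach().requires_grad_(True)
    loss = data_fidelity(x, y)                             # discriminative metric
    grad, = torch.autograd.grad(loss, x)
    x = x - step_size * grad                               # descend the data-fidelity term
    return x.clamp(0.0, 1.0).detach()                      # corrective projection to valid range
```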

Estimates of daily ground-level NO2 concentrations in China based on big data and machine learning approaches

Nov 18, 2020
Xinyu Dou, Cuijuan Liao, Hengqi Wang, Ying Huang, Ying Tu, Xiaomeng Huang, Yiran Peng, Biqing Zhu, Jianguang Tan, Zhu Deng, Nana Wu, Taochun Sun, Piyu Ke, Zhu Liu

Nitrogen dioxide (NO2) is one of the most important atmospheric pollutants. However, current ground-level NO2 concentration data lack either high resolution or full nationwide coverage, owing to the quality of the source data and the computational limits of existing models. To our knowledge, this study is the first to estimate ground-level NO2 concentrations in China with national coverage and relatively high spatiotemporal resolution (0.25 degree; daily intervals) over the most recent six years (2013-2018). We developed a Random Forest model integrated with K-means (RF-K) to produce the estimates from multi-source parameters. Besides meteorological and satellite-retrieval parameters, we also, for the first time, introduce socio-economic parameters to assess the impact of human activities. The results show that: (1) the RF-K model we developed shows better prediction performance than other models, with cross-validation R2 = 0.64 (MAPE = 34.78%); (2) the annual average NO2 concentration in China showed a weak increasing trend, whereas in economic zones such as the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Pearl River Delta, the NO2 concentration decreased or remained unchanged, especially in spring. Our dataset verifies that pollution-control targets have been met in these areas. By mapping daily nationwide ground-level NO2 concentrations, this study provides timely, high-quality data for air quality management in China. We also provide a general modeling framework, based on improved machine learning methods, for quickly generating timely national maps of atmospheric pollutant concentrations at high spatiotemporal resolution.
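
A minimal sketch of the RF-K idea with scikit-learn (the number of clusters and hyper-parameters here are illustrative, not those tuned in the study): samples are clustered with K-means on the predictor variables, one random forest is fitted per cluster, and prediction routes each sample to its cluster's forest.

```python
# Illustrative K-means + per-cluster Random Forest regressor (RF-K pattern).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

class RFK:
    def __init__(self, n_clusters=5, **rf_kwargs):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.forests = [RandomForestRegressor(**rf_kwargs) for _ in range(n_clusters)]

    def fit(self, X, y):
        # X: (n_samples, n_features) numpy array of predictors; y: (n_samples,) NO2 values
        labels = self.kmeans.fit_predict(X)
        for k, rf in enumerate(self.forests):
            mask = labels == k
            rf.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        labels = self.kmeans.predict(X)
        out = np.empty(len(X))
        for k, rf in enumerate(self.forests):
            mask = labels == k
            if mask.any():
                out[mask] = rf.predict(X[mask])
        return out
```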

Automatic Question-Answering Using A Deep Similarity Neural Network

Aug 05, 2017
Shervin Minaee, Zhu Liu

Automatic question-answering is a classical problem in natural language processing, which aims at designing systems that can automatically answer a question in the same way a human does. In this work, we propose a deep learning based model for automatic question-answering. First, the questions and answers are embedded using neural probabilistic modeling. A deep similarity neural network is then trained to compute the similarity score of a question-answer pair. For each question, the best answer is selected as the candidate with the highest similarity score. We first train this model on a large-scale public question-answering database and then fine-tune it to transfer to customer-care chat data. We have also tested our framework on a public question-answering database and achieved very good performance.
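
An illustrative sketch (assumed architecture, not the paper's exact network) of a similarity network over pre-computed question and answer embeddings; the best answer is simply the candidate with the highest score.

```python
# Assumed sketch: a shared encoder maps question and answer embeddings into a
# common space, and cosine similarity ranks candidate answers.
import torch
import torch.nn as nn

class SimilarityNet(nn.Module):
    def __init__(self, emb_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))

    def forward(self, q_emb, a_emb):
        # q_emb, a_emb: (B, emb_dim) pre-computed embeddings
        q, a = self.encoder(q_emb), self.encoder(a_emb)
        return torch.cosine_similarity(q, a, dim=-1)       # higher = better match

def best_answer(model, q_emb, candidate_embs):
    # q_emb: (1, emb_dim); candidate_embs: (N, emb_dim)
    scores = model(q_emb.expand(len(candidate_embs), -1), candidate_embs)
    return int(scores.argmax())
```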

Deep Hashing: A Joint Approach for Image Signature Learning

Aug 12, 2016
Yadong Mu, Zhu Liu

Similarity-based image hashing is a crucial technique for reducing visual data storage and expediting image search. Conventional hashing schemes typically feed hand-crafted features into hash functions, which separates the procedures of feature extraction and hash-function learning. In this paper, we propose a novel algorithm that concurrently performs feature learning and non-linear supervised hashing-function learning. Our technical contributions are two-fold: 1) deep network optimization is usually achieved by gradient propagation, which critically requires a smooth objective function, and the discrete nature of hash codes makes them not amenable to gradient-based optimization; to address this issue, we propose an exponentiated hashing loss function and its bilinear smooth approximation, thereby enabling effective gradient calculation and propagation; 2) pre-training is an important trick in supervised deep learning, yet its impact on hash-code quality has never been discussed in the deep hashing literature; we propose a pre-training scheme inspired by recent advances in deep-network-based image classification and experimentally demonstrate its effectiveness. Comprehensive quantitative evaluations are conducted on several widely used image benchmarks. On all benchmarks, our proposed deep hashing algorithm outperforms all state-of-the-art competitors by significant margins. In particular, our algorithm achieves a near-perfect 0.99 Hamming ranking accuracy with only 12 bits on MNIST, and a new record of 0.74 on the CIFAR10 dataset. In comparison, the best accuracies obtained on CIFAR10 by existing hashing algorithms without and with deep networks are known to be 0.36 and 0.58, respectively.

* 10 pages 
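
The exponentiated hashing loss and its bilinear smooth approximation are defined in the paper; the hedged sketch below only shows the common pattern they address: sign() binarization is relaxed with tanh so gradients flow, and pairs are pulled toward small or large code distance depending on the similarity label.

```python
# Assumed sketch of a smooth pairwise hashing surrogate (not the paper's exact loss).
import torch

def smooth_pairwise_hash_loss(codes_a, codes_b, similar, margin=None):
    """codes_a/codes_b: (B, bits) pre-binarization network outputs;
    similar: (B,) float tensor in {0, 1}, 1 if the pair shares a label."""
    ha, hb = torch.tanh(codes_a), torch.tanh(codes_b)      # smooth relaxation of sign()
    bits = ha.shape[1]
    if margin is None:
        margin = bits / 2                                  # dissimilar pairs should differ in half the bits
    dist = 0.5 * (bits - (ha * hb).sum(dim=1))             # soft Hamming distance in [0, bits]
    pos = similar * dist                                   # pull similar pairs together
    neg = (1 - similar) * torch.clamp(margin - dist, min=0)  # push dissimilar pairs apart
    return (pos + neg).mean()
```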