Dayan Wu

Digging into Depth Priors for Outdoor Neural Radiance Fields

Aug 08, 2023
Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, Liangjun Zhang

Neural Radiance Fields (NeRF) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse-viewpoint setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate this issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated, and the relative merits of different ways of applying those priors likewise remain unexplored. In this paper, we provide a comprehensive study and evaluation of employing depth priors in outdoor neural radiance fields, covering common depth sensing technologies and most ways of applying them. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can benefit practitioners and researchers in training their NeRF models with depth priors. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth
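The basic mechanism the study evaluates is a depth term added to the NeRF objective. As a hedged, minimal PyTorch-style sketch only: the function name, tensor shapes, L1 form, and loss weighting below are illustrative assumptions, not the specific losses the paper compares.

```python
import torch

def depth_supervised_loss(weights, z_vals, rgb_pred, rgb_gt,
                          depth_prior, depth_valid, lambda_depth=0.1):
    """Photometric NeRF loss plus an L1 depth term on rays with a valid prior.

    weights:     (num_rays, num_samples) volume-rendering weights
    z_vals:      (num_rays, num_samples) sample distances along each ray
    rgb_pred:    (num_rays, 3) rendered colors
    rgb_gt:      (num_rays, 3) ground-truth pixel colors
    depth_prior: (num_rays,) depth from LiDAR / stereo / monocular estimation
    depth_valid: (num_rays,) boolean mask where the prior is trusted
    """
    # Expected ray termination depth under the rendering weights.
    depth_pred = (weights * z_vals).sum(dim=-1)

    photo_loss = ((rgb_pred - rgb_gt) ** 2).mean()

    # Only supervise rays whose depth prior is valid (e.g. actually hit by LiDAR).
    if depth_valid.any():
        depth_loss = (depth_pred[depth_valid] - depth_prior[depth_valid]).abs().mean()
    else:
        depth_loss = torch.tensor(0.0, device=weights.device)

    return photo_loss + lambda_depth * depth_loss
```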

* Accepted to ACM MM 2023. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth 

Joint Plasticity Learning for Camera Incremental Person Re-Identification

Oct 18, 2022
Zexian Yang, Dayan Wu, Bo Li, Weiping Wang

Recently, incremental learning for person re-identification has received increasing attention, as it is considered a more practical setting for real-world applications. However, existing works make the strong assumption that the cameras are fixed and that newly emerging data is class-disjoint from previous classes. In this paper, we focus on a new and more practical task, namely Camera Incremental person ReID (CIP-ReID). CIP-ReID requires ReID models to continuously learn informative representations using only the data from newly installed cameras, without forgetting previously learned ones. This is challenging because the new data have only local supervision from the new cameras, with no access to the old data due to privacy issues, and they may also contain persons already seen by previous cameras. To address this problem, we propose a non-exemplar-based framework, named JPL-ReID. JPL-ReID first adopts a one-vs-all detector to discover persons who have appeared in previous cameras. To maintain the learned representations, JPL-ReID utilizes a similarity distillation strategy that requires no previous training data. Simultaneously, JPL-ReID is able to learn new knowledge and improve generalization using a Joint Plasticity Learning objective. Comprehensive experimental results on two datasets demonstrate that our proposed method significantly outperforms comparable methods and achieves state-of-the-art results by a remarkable margin.
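The abstract names a similarity distillation strategy that works without any old data. A common way to realize that idea, sketched here purely under assumptions (function name, cosine-similarity matrices, MSE penalty), is to preserve the pairwise similarity structure the frozen old model induces on the new-camera batch:

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(feat_new, feat_old):
    """Hypothetical similarity-distillation term.

    feat_new: (batch, dim) embeddings from the current (incremental) model
    feat_old: (batch, dim) embeddings of the same images from the frozen old model

    Instead of replaying old data, penalize changes in the pairwise
    similarity structure of the old model measured on the new batch.
    """
    feat_new = F.normalize(feat_new, dim=1)
    feat_old = F.normalize(feat_old, dim=1)

    sim_new = feat_new @ feat_new.t()   # (batch, batch) cosine similarities
    sim_old = feat_old @ feat_old.t()

    return F.mse_loss(sim_new, sim_old.detach())
```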

Imagine by Reasoning: A Reasoning-Based Implicit Semantic Data Augmentation for Long-Tailed Classification

Dec 15, 2021
Xiaohua Chen, Yucan Zhou, Dayan Wu, Wanqian Zhang, Yu Zhou, Bo Li, Weiping Wang

Real-world data often follows a long-tailed distribution, which heavily degrades the performance of existing classification algorithms. A key issue is that samples in tail categories fail to depict their intra-class diversity. Humans can imagine a sample in new poses, scenes, and view angles using prior knowledge, even when seeing a category for the first time. Inspired by this, we propose a novel reasoning-based implicit semantic data augmentation method that borrows transformation directions from other classes. Since the covariance matrix of each category represents its feature transformation directions, we can sample new directions from similar categories to generate meaningfully different instances. Specifically, the long-tailed data is first used to train a backbone and a classifier. Then, a covariance matrix is estimated for each category, and a knowledge graph is constructed to store the relations between categories. Finally, tail samples are adaptively enhanced by propagating information from their similar categories in the knowledge graph. Experimental results on CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 demonstrate the effectiveness of our proposed method compared with state-of-the-art methods.
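The covariance-borrowing step can be made concrete with a small NumPy sketch. Everything below (function name, top-k neighbor selection, the weighted mixing of covariances) is an illustrative assumption around the idea described above, not the paper's exact procedure:

```python
import numpy as np

def augment_tail_feature(feat, class_id, covariances, similarity, top_k=3, scale=1.0):
    """Perturb a tail-class feature along directions borrowed from the
    covariance matrices of its most similar classes (hypothetical sketch).

    feat:         (dim,) feature of a tail-class sample
    class_id:     index of the tail class
    covariances:  (num_classes, dim, dim) per-class feature covariances
    similarity:   (num_classes, num_classes) class-relation weights (knowledge graph)
    """
    # Pick the top-k most related classes, excluding the class itself.
    sims = similarity[class_id].copy()
    sims[class_id] = -np.inf
    neighbors = np.argsort(sims)[-top_k:]

    # Mix neighbor covariances, weighted by how related each neighbor is.
    weights = similarity[class_id, neighbors]
    weights = weights / weights.sum()
    mixed_cov = np.einsum("k,kij->ij", weights, covariances[neighbors])

    # Sample a semantic transformation direction and apply it to the feature.
    direction = np.random.multivariate_normal(np.zeros(len(feat)), mixed_cov)
    return feat + scale * direction
```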

* 9 pages, 5 figures 

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

Sep 08, 2021
Xugong Qin, Yu Zhou, Youhui Guo, Dayan Wu, Zhihong Tian, Ning Jiang, Hongbin Wang, Weiping Wang

Owing to its great success in object detection and instance segmentation, Mask R-CNN has attracted considerable attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is the dense text case, which is easily neglected yet quite practical. There may be multiple instances in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades performance. In this work, we argue that this degradation results from a learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and significantly improves robustness. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classify each pixel as text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second issue is that, due to large variations in scale and aspect ratio, the RPN requires complicated anchor settings, which are hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method, named MAYOR, achieves state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.
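A minimal sketch of the proposed decoder swap, replacing the "deconv-conv" mask head with an MLP that predicts the whole instance mask from the flattened RoI feature, might look as follows; the layer sizes and names are illustrative assumptions:

```python
import torch.nn as nn

class MLPMaskDecoder(nn.Module):
    """Hypothetical MLP mask decoder: flatten the RoI feature and predict
    every mask pixel jointly with fully connected layers, instead of the
    usual "deconv-conv" upsampling head."""

    def __init__(self, in_channels=256, roi_size=14, mask_size=28, hidden=1024):
        super().__init__()
        self.mask_size = mask_size
        self.mlp = nn.Sequential(
            nn.Flatten(),                                    # (N, C * roi * roi)
            nn.Linear(in_channels * roi_size * roi_size, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, mask_size * mask_size),        # one logit per mask pixel
        )

    def forward(self, roi_feats):
        # roi_feats: (num_rois, in_channels, roi_size, roi_size)
        logits = self.mlp(roi_feats)
        return logits.view(-1, 1, self.mask_size, self.mask_size)
```

Because every output pixel sees the entire RoI feature, the head can commit to a single instance shape rather than classifying pixels independently, which is the intuition behind the confusion-reduction claim above.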

* Accepted by ACM MM 2021 

Rescuing Deep Hashing from Dead Bits Problem

Feb 01, 2021
Shu Zhao, Dayan Wu, Yucan Zhou, Bo Li, Weiping Wang

Deep hashing methods have shown great retrieval accuracy and efficiency in large-scale image retrieval. How to optimize discrete hash bits has always been the focus of deep hashing methods. A common strategy is to adopt an activation function, e.g. $\operatorname{sigmoid}(\cdot)$ or $\operatorname{tanh}(\cdot)$, and minimize a quantization loss to approximate discrete values. However, this paradigm may leave more and more hash bits stuck in the wrong saturated region of the activation function, never to escape. We call this the Dead Bits Problem (DBP). Moreover, the existing quantization loss aggravates DBP as well. In this paper, we propose a simple but effective gradient amplifier that acts before the activation function to alleviate DBP. We further devise an error-aware quantization loss that avoids the negative effect of the quantization loss based on the similarity between two images. The proposed gradient amplifier and error-aware quantization loss are compatible with a variety of deep hashing methods. Experimental results on three datasets demonstrate the effectiveness of the proposed gradient amplifier and error-aware quantization loss.
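One way to realize a gradient amplifier that acts before the activation is a custom autograd function that rescales gradients for saturated pre-activations. The sketch below is an assumption-laden illustration (threshold, gain, placement before tanh), not the paper's exact formulation:

```python
import torch

class GradientAmplifier(torch.autograd.Function):
    """Hypothetical gradient amplifier placed before tanh.

    The forward pass is the identity; the backward pass scales up the
    gradient for pre-activations sitting deep in tanh's saturated region,
    giving "dead" bits a chance to escape."""

    THRESHOLD = 2.0   # |pre-activation| beyond this counts as saturated
    GAIN = 5.0        # how much to amplify gradients of saturated bits

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clone()   # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        scale = torch.where(x.abs() > GradientAmplifier.THRESHOLD,
                            torch.full_like(x, GradientAmplifier.GAIN),
                            torch.ones_like(x))
        return grad_output * scale


def hash_head(features, fc_layer):
    # e.g. fc_layer = torch.nn.Linear(2048, 64); amplify gradients before tanh.
    pre_act = fc_layer(features)
    return torch.tanh(GradientAmplifier.apply(pre_act))
```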

Exploring Relations in Untrimmed Videos for Self-Supervised Learning

Aug 06, 2020
Dezhao Luo, Bo Fang, Yu Zhou, Yucan Zhou, Dayan Wu, Weiping Wang

Existing video self-supervised learning methods mainly rely on trimmed videos for model training. However, trimmed datasets are manually annotated from untrimmed videos, so in this sense these methods are not truly self-supervised. In this paper, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed (genuinely unlabeled) videos to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. A designed sampling strategy is then used to model relations between video clips, and these relations serve as our self-supervision signals. Finally, the network learns representations by predicting the category of the relation between video clips. ERUV is thereby able to compare the differences and similarities of videos, which is also an essential procedure for action- and video-related tasks. We validate the learned models on action recognition and video retrieval tasks with three kinds of 3D CNNs. Experimental results show that ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
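The sampling strategy that turns clip relations into supervision can be sketched roughly as follows; the three-way relation taxonomy and the frame offsets are illustrative assumptions, not the paper's actual design:

```python
import random

def sample_clip_pair(num_frames, clip_len=16):
    """Hypothetical ERUV-style relation sampling within one single-shot video.

    Assumes the shot has at least 3 * clip_len frames. Returns two
    (start, end) frame ranges and a relation label that later serves as
    the self-supervision target for a relation classifier.
    """
    relation = random.choice(["overlapping", "adjacent", "distant"])
    start_a = random.randint(0, num_frames - 3 * clip_len)
    clip_a = (start_a, start_a + clip_len)

    if relation == "overlapping":
        start_b = start_a + clip_len // 2
    elif relation == "adjacent":
        start_b = start_a + clip_len
    else:  # distant: leave at least one clip-length gap
        start_b = random.randint(start_a + 2 * clip_len, num_frames - clip_len)

    clip_b = (start_b, start_b + clip_len)
    label = ["overlapping", "adjacent", "distant"].index(relation)
    return clip_a, clip_b, label
```

Both clips would then be encoded by a 3D CNN and the network trained to predict the relation label.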

Two-Level Residual Distillation based Triple Network for Incremental Object Detection

Jul 27, 2020
Dongbao Yang, Yu Zhou, Dayan Wu, Can Ma, Fei Yang, Weiping Wang

Modern object detection methods based on convolutional neural networks suffer from severe catastrophic forgetting when learning new classes without the original data. Due to time consumption, storage burden, and the privacy of old data, it is inadvisable to train the model from scratch with both old and new data when new object classes emerge after the model has been trained. In this paper, we propose a novel incremental object detector based on Faster R-CNN that continuously learns new object classes without using old data. It is a triple network in which an old model and a residual model act as assistants that help the incremental model learn new classes without forgetting previously learned knowledge. To better maintain the discrimination between features of old and new classes, the residual model is jointly trained on the new classes during the incremental learning procedure. In addition, a corresponding distillation scheme is designed to guide the training process, consisting of a two-level residual distillation loss and a joint classification distillation loss. Extensive experiments on VOC2007 and COCO demonstrate that the proposed method effectively learns to incrementally detect objects of new classes while mitigating catastrophic forgetting.
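The abstract describes a residual distillation loss combined with a classification distillation loss. As a hedged sketch of how such a combination might be wired up (the residual-as-correction feature target and the temperature-scaled KL term are assumptions, not the paper's exact losses):

```python
import torch.nn.functional as F

def incremental_distillation_loss(feat_inc, feat_old, feat_res,
                                  logits_inc_old, logits_old,
                                  alpha=1.0, beta=1.0, temperature=2.0):
    """Hypothetical distillation scheme for incremental detection.

    feat_inc:        features from the incremental model
    feat_old:        features from the frozen old model
    feat_res:        features from the residual assistant model
    logits_inc_old:  incremental model's logits restricted to the OLD classes
    logits_old:      frozen old model's logits over the old classes
    """
    # Residual distillation: keep the incremental feature close to the
    # old feature plus the residual model's correction.
    feat_target = (feat_old + feat_res).detach()
    feat_loss = F.mse_loss(feat_inc, feat_target)

    # Classification distillation on old classes with soft targets.
    soft_old = F.softmax(logits_old.detach() / temperature, dim=1)
    log_inc = F.log_softmax(logits_inc_old / temperature, dim=1)
    cls_loss = F.kl_div(log_inc, soft_old, reduction="batchmean") * temperature ** 2

    return alpha * feat_loss + beta * cls_loss
```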

FC2RN: A Fully Convolutional Corner Refinement Network for Accurate Multi-Oriented Scene Text Detection

Jul 10, 2020
Xugong Qin, Yu Zhou, Dayan Wu, Yinliang Yue, Weiping Wang

Recent scene text detection works mainly focus on curved text. However, in real applications, curved text is scarcer than multi-oriented text. Accurate detection of multi-oriented text with large variations in scale, orientation, and aspect ratio is therefore of great significance. Among multi-oriented detection methods, direct regression of scene text geometry offers a simple yet powerful pipeline and is popular in academic and industrial communities, but it may produce imperfect detections, especially for long text, due to the limited receptive field. In this work, we aim to improve on this while keeping the pipeline simple. A fully convolutional corner refinement network (FC2RN) is proposed for accurate multi-oriented text detection, in which an initial corner prediction and a refined corner prediction are obtained in one pass. With a novel quadrilateral RoI convolution operation tailored for multi-oriented scene text, the initial quadrilateral prediction is encoded into the feature maps, which are further used to predict the offsets between the initial prediction and the ground truth as well as to output a refined confidence score. Experimental results on four public datasets, including MSRA-TD500, ICDAR2017-RCTW, ICDAR2015, and COCO-Text, demonstrate that FC2RN outperforms state-of-the-art methods. The ablation study shows the effectiveness of corner refinement and scoring for accurate text localization.
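The refinement step itself, adding regressed per-corner offsets to the one-pass quadrilateral prediction, is simple to sketch; the function names and the smooth-L1 supervision below are illustrative assumptions rather than the paper's exact formulation:

```python
import torch.nn.functional as F

def refine_quadrilateral(initial_quads, predicted_offsets):
    """Refined corners = initial one-pass prediction + per-corner offsets
    regressed from features conditioned on the initial quadrilateral.

    initial_quads:     (num_boxes, 8) -- x1, y1, ..., x4, y4 from the first pass
    predicted_offsets: (num_boxes, 8) -- offsets from the refinement branch
    """
    return initial_quads + predicted_offsets


def corner_refinement_loss(initial_quads, predicted_offsets, gt_quads):
    # Supervise the refined corners against the ground-truth quadrilateral.
    refined = refine_quadrilateral(initial_quads, predicted_offsets)
    return F.smooth_l1_loss(refined, gt_quads)
```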
