Lu Fang

OccuSeg: Occupancy-aware 3D Instance Segmentation

Mar 14, 2020
Lei Han, Tian Zheng, Lan Xu, Lu Fang

3D instance segmentation, with a variety of applications in robotics and augmented reality, is in high demand. Unlike 2D images, which are projective observations of the environment, 3D models provide a metric reconstruction of the scene without occlusion or scale ambiguity. In this paper, we define the "3D occupancy size" as the number of voxels occupied by each instance. The occupancy size can be predicted robustly, and on this basis we propose OccuSeg, an occupancy-aware 3D instance segmentation scheme. Our multi-task learning produces both an occupancy signal and embedding representations, where the spatial and feature embeddings are trained differently owing to their difference in scale-awareness. Our clustering scheme benefits from a reliable comparison between the predicted occupancy size and the clustered occupancy size, which encourages hard samples to be clustered correctly and avoids over-segmentation. The proposed approach achieves state-of-the-art performance on three real-world datasets, i.e., ScanNetV2, S3DIS and SceneNN, while maintaining high efficiency.

* CVPR 2020, video: https://youtu.be/co7y6LQ7Kqc 
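
To make the occupancy-guided clustering idea concrete, below is a minimal sketch, assuming a per-point predicted occupancy size and per-cluster voxel counts; the function names and the 0.5 threshold are hypothetical, and this is an interpretation of the abstract, not the authors' implementation.

```python
import numpy as np

def occupancy_merge_score(pred_occupancy, cluster_sizes):
    """Compare the summed voxel count of a set of clusters with the (averaged)
    predicted per-instance occupancy size; a ratio near 1 suggests the clusters
    already cover roughly one instance."""
    clustered_size = float(np.sum(cluster_sizes))
    predicted_size = float(np.mean(pred_occupancy))
    return min(clustered_size, predicted_size) / max(clustered_size, predicted_size)

def should_merge(pred_occ_a, size_a, pred_occ_b, size_b, threshold=0.5):
    """Merge two candidate clusters only if the merged voxel count stays
    consistent with the predicted occupancy: this fights over-segmentation
    without gluing two distinct instances together (threshold is hypothetical)."""
    merged_pred = np.concatenate([pred_occ_a, pred_occ_b])
    merged_size = np.array([size_a + size_b])
    return occupancy_merge_score(merged_pred, merged_size) > threshold
```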

PANDA: A Gigapixel-level Human-centric Video Dataset

Mar 10, 2020
Xueyang Wang, Xiya Zhang, Yinheng Zhu, Yuchen Guo, Xiaoyun Yuan, Liuyu Xiang, Zerun Wang, Guiguang Ding, David J Brady, Qionghai Dai, Lu Fang

We present PANDA, the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world scenes with both a wide field of view (~1 square kilometer area) and high-resolution details (~gigapixel-level per frame). The scenes may contain 4k head counts with over 100x scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions. We benchmark the human detection and tracking tasks. Due to the vast variance in pedestrian pose, scale, occlusion and trajectory, existing approaches are challenged in both accuracy and efficiency. Given the uniqueness of PANDA, with both wide FoV and high resolution, a new task of interaction-aware group detection is introduced. We design a 'global-to-local zoom-in' framework, in which global trajectories and local interactions are encoded simultaneously, yielding promising results. We believe PANDA will contribute to the community of artificial intelligence and praxeology by enabling the understanding of human behaviors and interactions in large-scale real-world scenes. PANDA website: http://www.panda-dataset.com.

* Accepted by the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020 

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Dec 03, 2019
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

* 12 pages, 3 figures, NeurIPS 2019 
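
To illustrate the imperative, define-by-run style the paper describes, here is a minimal training loop (an unofficial illustration, not code from the paper): the model is a plain Python object, any tensor can be inspected mid-loop with an ordinary debugger, and moving tensors to a CUDA device is all that is needed to use a GPU.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# The model is a regular Python object; its forward pass is plain Python code.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10, device=device)
y = torch.randn(64, 1, device=device)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # eager execution: print or inspect any tensor here
    loss.backward()               # reverse-mode autodiff over the recorded operations
    optimizer.step()
```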

EventCap: Monocular 3D Capture of High-Speed Human Motions using an Event Camera

Aug 30, 2019
Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, Christian Theobalt

A high frame rate is a critical requirement for capturing fast human motions. In this setting, existing markerless image-based methods are constrained by the lighting requirement, the high data bandwidth and the consequent high computation overhead. In this paper, we propose EventCap --- the first approach for 3D capture of high-speed human motions using a single event camera. Our method combines model-based optimization and CNN-based human pose detection to capture high-frequency motion details and to reduce drift in the tracking. As a result, we can capture fast motions at millisecond resolution with significantly higher data efficiency than using high-frame-rate videos. Experiments on our new event-based fast human motion dataset demonstrate the effectiveness and accuracy of our method, as well as its robustness to challenging lighting conditions.

* 10 pages, 11 figures, 2 tables 

LapEPI-Net: A Laplacian Pyramid EPI structure for Learning-based Dense Light Field Reconstruction

Feb 17, 2019
Gaochang Wu, Yebin Liu, Lu Fang, Tianyou Chai

For the densely sampled light field (LF) reconstruction problem, existing approaches focus on a depth-free framework to achieve non-Lambertian performance. However, they are trapped in the "either aliasing or blurring" trade-off, i.e., pre-filtering the aliasing components (caused by the angular sparsity of the input LF) always leads to a blurry result. In this paper, we intend to solve this challenge by introducing an elaborately designed epipolar plane image (EPI) structure within a learning-based framework. Specifically, we start by analytically showing that decreasing the spatial scale of an EPI is more effective at addressing the aliasing problem than simply adopting pre-filtering. Accordingly, we design a Laplacian Pyramid EPI (LapEPI) structure that contains both a low-spatial-scale EPI (for aliasing) and high-frequency residuals (for blurring) to solve the trade-off problem. We then propose a novel network architecture for the LapEPI structure, termed LapEPI-Net. To ensure non-Lambertian performance, we adopt a transfer-learning strategy by first pre-training the network with natural images and then fine-tuning it with unstructured LFs. Extensive experiments demonstrate the high performance and robustness of the proposed approach in tackling the aliasing-or-blurring problem as well as non-Lambertian reconstruction.

* 10 pages, 8 figures, 4 tables 
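
The LapEPI idea of pairing a low-spatial-scale EPI with high-frequency residuals can be sketched as a standard Laplacian pyramid decomposition; the snippet below is only an illustrative sketch (the function name, two-level setting, and bilinear resampling are assumptions, not the paper's exact pre-processing).

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid_epi(epi, levels=2):
    """Decompose an EPI of shape [1, C, H, W] into a low-spatial-scale base
    (which eases aliasing) plus per-level high-frequency residuals (which
    retain the detail that would otherwise be blurred away)."""
    residuals = []
    current = epi
    for _ in range(levels):
        h, w = current.shape[-2:]
        down = F.interpolate(current, size=(h // 2, w // 2),
                             mode="bilinear", align_corners=False)
        up = F.interpolate(down, size=(h, w),
                           mode="bilinear", align_corners=False)
        residuals.append(current - up)   # high-frequency residual at this level
        current = down
    return current, residuals            # low-scale EPI + residual stack
```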

SPI-Optimizer: an integral-Separated PI Controller for Stochastic Optimization

Dec 29, 2018
Dan Wang, Mengqi Ji, Yong Wang, Haoqian Wang, Lu Fang

To overcome the oscillation problem in the classical momentum-based optimizer, recent work associates it with a proportional-integral (PI) controller and artificially adds a derivative (D) term, producing a PID controller. This suppresses oscillation at the cost of introducing an extra hyper-parameter. In this paper, we start by analyzing why momentum-based methods oscillate about the optimal point, and answer that the fluctuation relates to the lag effect of the integral (I) term. Inspired by the conditional-integration idea from the classical control community, we propose SPI-Optimizer, an integral-separated PI-controller-based optimizer that introduces no extra hyper-parameter. It adaptively separates the momentum term whenever the current and historical gradient directions are inconsistent. Extensive experiments demonstrate that SPI-Optimizer generalizes well across popular network architectures to eliminate the oscillation, and achieves competitive performance with faster convergence (up to a 40% reduction in epochs) and more accurate classification results on MNIST, CIFAR-10, and CIFAR-100 (up to a 27.5% error reduction) compared with state-of-the-art methods.
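
The conditional-integration idea can be sketched in a few lines: per coordinate, keep the momentum (integral) term only where the current gradient agrees in sign with the accumulated velocity. The snippet is a toy interpretation of the abstract, not the released optimizer; the function name and default hyper-parameters are hypothetical.

```python
import numpy as np

def spi_like_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One toy integral-separated momentum update: wherever the current
    gradient direction disagrees with the accumulated velocity, the
    historical (integral) term is dropped and only the proportional
    gradient step is taken, which suppresses overshoot-driven oscillation."""
    consistent = np.sign(grad) == np.sign(velocity)      # per-coordinate agreement mask
    velocity = momentum * velocity * consistent + lr * grad
    return w - velocity, velocity
```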

RegNet: Learning the Optimization of Direct Image-to-Image Pose Registration

Dec 26, 2018
Lei Han, Mengqi Ji, Lu Fang, Matthias Nießner

Direct image-to-image alignment that relies on the optimization of photometric error metrics suffers from a limited convergence range and sensitivity to lighting conditions. Deep learning approaches have been applied to address this problem by learning better feature representations using convolutional neural networks, yet they still require a good initialization. In this paper, we demonstrate that the inaccurate numerical Jacobian limits the convergence range, which can be improved greatly using learned approaches. Based on this observation, we propose a novel end-to-end network, RegNet, to learn the optimization of image-to-image pose registration. By jointly learning a feature representation for each pixel and partial derivatives that replace handcrafted ones (e.g., numerical differentiation) in the optimization step, the neural network facilitates end-to-end optimization. The energy landscape is constrained by both the feature representation and the learned Jacobian, which provides more flexibility for the optimization and consequently leads to more robust and faster convergence. In a series of experiments, including a broad ablation study, we demonstrate that RegNet is able to converge for large-baseline image pairs with fewer iterations.

* 8 pages, 6 figures 
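
For context, the optimization step being learned is essentially a damped Gauss-Newton update on a 6-DoF pose; the sketch below shows that generic solver step, with the understanding that in a RegNet-style pipeline the Jacobian would come from a learned branch rather than numerical differentiation (this is a generic illustration, not the paper's network).

```python
import torch

def gauss_newton_step(residual, jacobian, damping=1e-6):
    """One damped Gauss-Newton pose increment.
    residual: [N] feature-metric errors; jacobian: [N, 6] derivatives of each
    residual w.r.t. the 6-DoF pose (in a learned pipeline this matrix would be
    predicted instead of computed by finite differences)."""
    JtJ = jacobian.t() @ jacobian + damping * torch.eye(6)
    Jtr = jacobian.t() @ residual
    delta = torch.linalg.solve(JtJ, Jtr)   # 6-DoF pose increment
    return -delta
```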

CrossNet: An End-to-end Reference-based Super Resolution Network using Cross-scale Warping

Jul 27, 2018
Haitian Zheng, Mengqi Ji, Haoqian Wang, Yebin Liu, Lu Fang

Reference-based super-resolution (RefSR) super-resolves a low-resolution (LR) image given an external high-resolution (HR) reference image, where the reference image and the LR image share a similar viewpoint but differ by a significant resolution gap (x8). Existing RefSR methods work in a cascaded way, such as patch matching followed by a synthesis pipeline, with two independently defined objective functions, leading to inter-patch misalignment, grid artifacts and inefficient optimization. To resolve these issues, we present CrossNet, an end-to-end, fully convolutional deep neural network using cross-scale warping. Our network contains image encoders, cross-scale warping layers, and a fusion decoder: the encoders extract multi-scale features from both the LR and the reference images; the cross-scale warping layers spatially align the reference feature maps with the LR feature maps; and the decoder finally aggregates feature maps from both domains to synthesize the HR output. Using cross-scale warping, our network is able to perform spatial alignment at the pixel level in an end-to-end fashion, which improves on existing schemes in both precision (around 2-4 dB) and efficiency (more than 100 times faster).

* To appear in ECCV 2018 
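
A cross-scale warping layer boils down to resampling reference features with a dense correspondence field; the sketch below shows such a flow-based warp using grid_sample, as a generic illustration (the function name and pixel-offset flow convention are assumptions, not CrossNet's exact layer).

```python
import torch
import torch.nn.functional as F

def warp_reference_features(ref_feat, flow):
    """Warp reference-view features toward the LR view with a dense flow field.
    ref_feat: [B, C, H, W]; flow: [B, 2, H, W] per-pixel offsets in pixels."""
    b, _, h, w = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(ref_feat)  # [1,2,H,W]
    coords = base + flow
    # Normalize target coordinates to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                        # [B,H,W,2]
    return F.grid_sample(ref_feat, grid, mode="bilinear", align_corners=True)
```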

Beyond SIFT using Binary features for Loop Closure Detection

Sep 18, 2017
Lei Han, Guyue Zhou, Lan Xu, Lu Fang

In this paper, a binary-feature-based loop closure detection (LCD) method is proposed, which for the first time achieves higher precision-recall (PR) performance than state-of-the-art SIFT-feature-based approaches. The proposed system originates from our previous work, Multi-Index hashing for Loop closure Detection (MILD), which employs Multi-Index Hashing (MIH) (Greene et al., 1994) for Approximate Nearest Neighbor (ANN) search of binary features. As the accuracy of MILD is limited by repeating textures and inaccurate image similarity measurement, burstiness handling is introduced to solve this problem and achieves a considerable accuracy improvement. Additionally, a comprehensive theoretical analysis of the MIH used in MILD is conducted to further explore the potential of hashing methods for ANN search of binary features from a probabilistic perspective. This analysis provides more freedom in choosing the best MIH parameters for different application scenarios. Experiments on popular public datasets show that the proposed approach achieves the highest accuracy compared with the state of the art while running at 30 Hz on databases containing thousands of images.

* IROS 2017 paper for loop closure detection 
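
The multi-index hashing idea behind MILD can be illustrated with a toy index: split each binary descriptor into substrings, index each substring in its own table, and rely on the pigeonhole principle so that descriptors within a small Hamming radius collide in at least one table. The class below is an illustrative sketch under those assumptions, not the MILD implementation.

```python
from collections import defaultdict

class MultiIndexHash:
    """Toy multi-index hashing for binary descriptors stored as Python ints.
    A descriptor of n_bits is split into m substrings; any descriptor within
    Hamming distance r < m of a query matches it exactly in some substring."""

    def __init__(self, n_bits=256, m=8):
        self.m = m
        self.chunk = n_bits // m
        self.tables = [defaultdict(list) for _ in range(m)]

    def _substrings(self, desc):
        mask = (1 << self.chunk) - 1
        return [(desc >> (i * self.chunk)) & mask for i in range(self.m)]

    def insert(self, desc, image_id):
        for table, sub in zip(self.tables, self._substrings(desc)):
            table[sub].append((desc, image_id))

    def query(self, desc, max_dist=16):
        """Return ids of stored descriptors within max_dist Hamming distance."""
        candidates = set()
        for table, sub in zip(self.tables, self._substrings(desc)):
            for cand_desc, image_id in table.get(sub, []):
                if bin(cand_desc ^ desc).count("1") <= max_dist:
                    candidates.add(image_id)
        return candidates
```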

SurfaceNet: An End-to-end 3D Neural Network for Multiview Stereopsis

Aug 05, 2017
Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, Lu Fang

This paper proposes an end-to-end learning framework for multiview stereopsis. We term the network SurfaceNet. It takes a set of images and their corresponding camera parameters as input and directly infers the 3D model. The key advantage of the framework is that both photo-consistency and the geometric relations of the surface structure can be learned directly for the purpose of multiview stereopsis in an end-to-end fashion. SurfaceNet is a fully 3D convolutional network, which is achieved by encoding the camera parameters together with the images in a 3D voxel representation. We evaluate SurfaceNet on the large-scale DTU benchmark.

* ICCV 2017 poster 
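
The encoding of camera parameters together with images can be pictured as painting a voxel grid with the colors each image observes along its projection rays; the sketch below shows that generic projection step (an illustrative, assumption-laden sketch, not SurfaceNet's exact colored-voxel construction).

```python
import numpy as np

def colored_voxel_cube(image, K, Rt, voxel_centers):
    """Assign each voxel the image color seen along its projection.
    image: [H, W, 3]; K: [3, 3] intrinsics; Rt: [3, 4] extrinsics;
    voxel_centers: [N, 3] world coordinates. Returns one RGB sample per voxel,
    the kind of volume a 3D CNN can consume jointly with camera geometry."""
    h, w, _ = image.shape
    homo = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (K @ (Rt @ homo.T)).T                        # project to pixel coordinates
    uv = cam[:, :2] / np.clip(cam[:, 2:3], 1e-6, None)
    colors = np.zeros((len(voxel_centers), 3), dtype=image.dtype)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (cam[:, 2] > 0)
    colors[valid] = image[v[valid], u[valid]]
    return colors
```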