Houwen Peng

Stephen

PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies

Jun 09, 2022

Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Abed Al Kader Hammoud, Mohamed Elhoseiny, Bernard Ghanem

Abstract: PointNet++ is one of the most influential neural architectures for point cloud understanding. Although the accuracy of PointNet++ has been largely surpassed by recent networks such as PointMLP and Point Transformer, we find that a large portion of the performance gain is due to improved training strategies, i.e. data augmentation and optimization techniques, and increased model sizes rather than architectural innovations. Thus, the full potential of PointNet++ has yet to be explored. In this work, we revisit the classical PointNet++ through a systematic study of model training and scaling strategies, and offer two major contributions. First, we propose a set of improved training strategies that significantly improve PointNet++ performance. For example, we show that, without any change in architecture, the overall accuracy (OA) of PointNet++ on ScanObjectNN object classification can be raised from 77.9% to 86.1%, even outperforming state-of-the-art PointMLP. Second, we introduce an inverted residual bottleneck design and separable MLPs into PointNet++ to enable efficient and effective model scaling and propose PointNeXt, the next version of PointNets. PointNeXt can be flexibly scaled up and outperforms state-of-the-art methods on both 3D classification and segmentation tasks. For classification, PointNeXt reaches an overall accuracy of 87.7% on ScanObjectNN, surpassing PointMLP by 2.3%, while being 10× faster in inference. For semantic segmentation, PointNeXt establishes a new state-of-the-art performance with 74.9% mean IoU on S3DIS (6-fold cross-validation), being superior to the recent Point Transformer. The code and models are available at https://github.com/guochengqian/pointnext.
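
The two metrics quoted in this abstract, overall accuracy (OA) and mean IoU, are usually derived from a confusion matrix. The following is a minimal sketch assuming the standard definitions; the function names and the toy matrix are illustrative and are not taken from the PointNeXt code base.

```python
def overall_accuracy(conf):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def mean_iou(conf):
    """Mean over classes of TP / (TP + FP + FN); classes with no support are skipped."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # row c: true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp  # column c: predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy 3-class confusion matrix (rows = ground truth, columns = prediction).
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
print(f"OA = {overall_accuracy(toy):.3f}, mIoU = {mean_iou(toy):.3f}")
```

OA rewards getting frequent classes right, while mean IoU weights every class equally, which is one reason segmentation benchmarks such as S3DIS tend to report the latter.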

* Code and models are available at https://github.com/guochengqian/pointnext

MiniViT: Compressing Vision Transformers with Weight Multiplexing

Apr 14, 2022

Jinnian Zhang, Houwen Peng, Kan Wu, Mengchen Liu, Bin Xiao, Jianlong Fu, Lu Yuan

Abstract: Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability. However, ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited memory. To alleviate this problem, we propose MiniViT, a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance. The central idea of MiniViT is to multiplex the weights of consecutive transformer blocks. More specifically, we make the weights shared across layers, while imposing a transformation on the weights to increase diversity. Weight distillation over self-attention is also applied to transfer knowledge from large-scale ViT models to weight-multiplexed compact models. Comprehensive experiments demonstrate the efficacy of MiniViT, showing that it can reduce the size of the pre-trained Swin-B transformer by 48%, while achieving an increase of 1.0% in Top-1 accuracy on ImageNet. Moreover, using a single layer of parameters, MiniViT is able to compress DeiT-B by 9.7 times from 86M to 9M parameters, without seriously compromising the performance. Finally, we verify the transferability of MiniViT by reporting its performance on downstream benchmarks. Code and models are available here.
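
The idea of sharing weights across layers while keeping a small layer-specific transformation can be illustrated with a toy stack of linear layers. This is only a sketch: the class name, the ReLU, and the elementwise scale used as the per-layer transformation are assumptions made for illustration, since the abstract does not specify MiniViT's actual transformations.

```python
import random


class SharedLinearStack:
    """Layers that reuse one dim x dim weight matrix, each with its own small scale vector."""

    def __init__(self, dim, num_layers, seed=0):
        rng = random.Random(seed)
        # One weight matrix shared by every layer.
        self.shared = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
        # Hypothetical per-layer transformation: an elementwise output scale.
        self.scales = [[rng.uniform(0.5, 1.5) for _ in range(dim)] for _ in range(num_layers)]

    def forward(self, x):
        for scale in self.scales:
            y = [sum(w * xi for w, xi in zip(row, x)) for row in self.shared]
            x = [s * max(v, 0.0) for s, v in zip(scale, y)]  # ReLU, then layer-specific scale
        return x

    def num_params(self):
        dim = len(self.shared)
        return dim * dim + len(self.scales) * dim


def unshared_params(dim, num_layers):
    """Parameter count if every layer owned its own dim x dim matrix."""
    return num_layers * dim * dim


stack = SharedLinearStack(dim=64, num_layers=12)
out = stack.forward([1.0] * 64)
print(stack.num_params(), "shared vs", unshared_params(64, 12), "unshared")
```

In this toy setting the shared stack stores one matrix plus twelve short vectors, so the parameter count grows only slowly with depth.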

* Accepted by CVPR 2022

Searching the Search Space of Vision Transformer

Nov 29, 2021

Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling

Abstract: Vision Transformer has shown great visual representation power in many vision tasks such as recognition and detection, and has thus attracted fast-growing efforts to manually design more effective architectures. In this paper, we propose to use neural architecture search to automate this process, by searching not only the architecture but also the search space. The central idea is to gradually evolve different search dimensions guided by their E-T Error computed using a weight-sharing supernet. Moreover, we provide design guidelines for general vision transformers with extensive analysis according to the space searching process, which could promote the understanding of vision transformers. Remarkably, the searched models, named S3 (short for Searching the Search Space), from the searched space achieve superior performance to recently proposed models, such as Swin, DeiT and ViT, when evaluated on ImageNet. The effectiveness of S3 is also illustrated on object detection, semantic segmentation and visual question answering, demonstrating its generality to downstream vision and vision-language tasks. Code and models will be available at https://github.com/microsoft/Cream.

* Accepted to NeurIPS 2021

Learning to Track Objects from Unlabeled Videos

Aug 28, 2021

Jilai Zheng, Chao Ma, Houwen Peng, Xiaokang Yang

Abstract: In this paper, we propose to learn an Unsupervised Single Object Tracker (USOT) from scratch. We identify that three major challenges, i.e., moving object discovery, rich temporal variation exploitation, and online update, are the central causes of the performance bottleneck of existing unsupervised trackers. To narrow the gap between unsupervised trackers and supervised counterparts, we propose an effective unsupervised learning approach composed of three stages. First, we sample sequentially moving objects with unsupervised optical flow and dynamic programming, instead of random cropping. Second, we train a naive Siamese tracker from scratch using single-frame pairs. Third, we continue training the tracker with a novel cycle memory learning scheme, which is conducted in longer temporal spans and also enables our tracker to update online. Extensive experiments show that the proposed USOT learned from unlabeled videos outperforms state-of-the-art unsupervised trackers by large margins, and performs on par with recent supervised deep trackers. Code is available at https://github.com/VISION-SJTU/USOT.

* Accepted to ICCV 2021

Rethinking and Improving Relative Position Encoding for Vision Transformer

Jul 29, 2021

Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

Abstract: Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work as well as absolute position encoding. In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in the self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments demonstrate that solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
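
To make the notion of a directional 2D relative position concrete, the sketch below builds a lookup index from signed patch offsets and adds a scalar bias to attention logits. It is a generic bias-style relative position lookup written for illustration, not the iRPE implementation; the function names and the toy bias table are assumptions.

```python
def relative_position_index(height, width):
    """Map every (query, key) patch pair on a height x width grid to a bias-table index.

    The index depends on the signed offsets (dy, dx), so "left of" and "right of"
    get different entries, i.e. the encoding is directional.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    table_width = 2 * width - 1
    index = []
    for qy, qx in coords:
        row = []
        for ky, kx in coords:
            dy = ky - qy + (height - 1)  # shift to 0 .. 2*height-2
            dx = kx - qx + (width - 1)   # shift to 0 .. 2*width-2
            row.append(dy * table_width + dx)
        index.append(row)
    return index


def add_relative_bias(logits, bias_table, index):
    """Add a scalar bias, looked up by relative offset, to each attention logit."""
    return [
        [logit + bias_table[idx] for logit, idx in zip(logit_row, idx_row)]
        for logit_row, idx_row in zip(logits, index)
    ]


idx = relative_position_index(2, 3)
num_entries = (2 * 2 - 1) * (2 * 3 - 1)  # distinct signed offsets on a 2 x 3 grid
bias_table = [0.01 * i for i in range(num_entries)]  # stand-in for learned values
logits = [[0.0] * 6 for _ in range(6)]
biased = add_relative_bias(logits, bias_table, idx)
```

Because the table is indexed by offset rather than by absolute position, its size depends only on the grid extent, and all patch pairs with the same offset share one entry.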

* Accepted by ICCV 2021

AutoFormer: Searching Transformers for Visual Recognition

Jul 01, 2021

Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

Abstract: Recently, pure transformer-based models have shown great potential for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles the weights of different blocks in the same layers during supernet training. Benefiting from the strategy, the trained supernet allows thousands of subnets to be very well-trained. Specifically, the performance of these subnets with weights inherited from the supernet is comparable to those retrained from scratch. Besides, the searched models, which we refer to as AutoFormers, surpass recent state-of-the-art models such as ViT and DeiT. In particular, AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters, respectively. Lastly, we verify the transferability of AutoFormer by providing the performance on downstream benchmarks and distillation experiments. Code and models are available at https://github.com/microsoft/AutoML.
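
One way to picture weights being entangled across candidate blocks in the same layer is to let every candidate width read a slice of a single largest weight matrix. The sketch below assumes this slicing scheme for illustration; the class name, the candidate widths, and the sampling loop are hypothetical and are not taken from the AutoFormer code.

```python
import random


class EntangledLinear:
    """Holds one max-size weight matrix; smaller candidates reuse its top-left slice."""

    def __init__(self, max_in, max_out, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(max_in)] for _ in range(max_out)]

    def sub_weight(self, in_dim, out_dim):
        # Candidates read from the shared matrix; none owns separate weights.
        return [row[:in_dim] for row in self.weight[:out_dim]]

    def forward(self, x, out_dim):
        w = self.sub_weight(len(x), out_dim)
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


layer = EntangledLinear(max_in=8, max_out=8)
rng = random.Random(1)
for _ in range(3):
    dim = rng.choice([4, 6, 8])  # sample one candidate width per step
    y = layer.forward([1.0] * dim, dim)
    print(dim, len(y))
```

Under this scheme, training any sampled candidate updates entries that the other candidates also use, which is why a single supernet can serve many subnets.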

* Github: https://github.com/microsoft/AutoML

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Jun 28, 2021

Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo

Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to the local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such a design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Image-Text Retrieval, Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric.

LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search

Apr 29, 2021

Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, Huchuan Lu

Abstract: Object tracking has achieved significant progress over the past few years. However, state-of-the-art trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications. In this work, we present LightTrack, which uses neural architecture search (NAS) to design more lightweight and efficient object trackers. Comprehensive experiments show that our LightTrack is effective. It can find trackers that achieve superior performance compared to handcrafted SOTA trackers, such as SiamRPN++ and Ocean, while using far fewer model FLOPs and parameters. Moreover, when deployed on resource-constrained mobile chipsets, the discovered trackers run much faster. For example, on Snapdragon 845 Adreno GPU, LightTrack runs 12× faster than Ocean, while using 13× fewer parameters and 38× fewer FLOPs. Such improvements might narrow the gap between academic models and industrial deployments in object tracking. LightTrack is released at https://github.com/researchmm/LightTrack.

* Accepted by CVPR 2021, Github: https://github.com/researchmm/LightTrack

One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking

Apr 01, 2021

Minghao Chen, Houwen Peng, Jianlong Fu, Haibin Ling

Abstract: Despite the remarkable progress achieved, most neural architecture search (NAS) methods focus on searching for one single accurate and robust architecture. To further build models with better generalization capability and performance, model ensemble is usually adopted and performs better than stand-alone models. Inspired by the merits of model ensemble, we propose to search for multiple diverse models simultaneously as an alternative way to find powerful models. Searching for ensembles is non-trivial and has two key challenges: enlarged search space and potentially more complexity for the searched model. In this paper, we propose a one-shot neural ensemble architecture search (NEAS) solution that addresses the two challenges. For the first challenge, we introduce a novel diversity-based metric to guide search space shrinking, considering both the potentiality and diversity of candidate operators. For the second challenge, we enable a new search dimension to learn layer sharing among different models for efficiency purposes. The experiments on ImageNet clearly demonstrate that our solution can improve the supernet's capacity of ranking ensemble architectures, and further lead to better search results. The discovered architectures achieve superior performance compared with state-of-the-art models such as the MobileNetV3 and EfficientNet families under aligned settings. Moreover, we evaluate the generalization ability and robustness of our searched architecture on the COCO detection benchmark and achieve a 3.1% improvement on AP compared with MobileNetV3. Code and models are available at https://github.com/researchmm/NEAS.

* Accepted to CVPR 2021

Learning Spatio-Temporal Transformer for Visual Tracking

Mar 31, 2021

Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, Huchuan Lu

Abstract: In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, being 6x faster than Siam R-CNN. Code and models are open-sourced at https://github.com/researchmm/Stark.
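
Direct corner estimation can be illustrated by reading a box off two corner score maps. The sketch below assumes a soft-argmax readout (the expected coordinate under a softmax), which the abstract does not specify; the function names and the toy score maps are hypothetical stand-ins for a network's output.

```python
import math


def soft_argmax(score_map):
    """Expected (x, y) location under a softmax over a 2D score map."""
    flat = [v for row in score_map for v in row]
    m = max(flat)
    exps = [[math.exp(v - m) for v in row] for row in score_map]  # numerically stable softmax
    z = sum(sum(row) for row in exps)
    x = sum(p * col for row in exps for col, p in enumerate(row)) / z
    y = sum(p * r for r, row in enumerate(exps) for p in row) / z
    return x, y


def box_from_corner_maps(top_left_map, bottom_right_map):
    """Turn two corner score maps into one box (x1, y1, x2, y2) in grid units."""
    x1, y1 = soft_argmax(top_left_map)
    x2, y2 = soft_argmax(bottom_right_map)
    return x1, y1, x2, y2


def peaked(height, width, row, col, peak=50.0):
    """Toy score map with a single sharp peak (stand-in for a network output)."""
    return [[peak if (r, c) == (row, col) else 0.0 for c in range(width)] for r in range(height)]


box = box_from_corner_maps(peaked(8, 8, 1, 2), peaked(8, 8, 5, 6))
print([round(v, 3) for v in box])
```

A readout of this kind yields one box per frame without anchors or proposal ranking, which matches the abstract's framing of tracking as direct bounding box prediction.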
