Jiaya Jia

Spherical Transformer for LiDAR-based 3D Recognition

Mar 22, 2023
Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, Jiaya Jia

LiDAR-based 3D point cloud recognition has benefited various applications. Because most current methods do not specifically account for the LiDAR point distribution, they suffer from information disconnection and a limited receptive field, especially for sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts performance on sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both the nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. We also achieve 3rd place on the nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at https://github.com/dvlab-research/SphereFormer.git.

* Accepted to CVPR 2023. Code is available at https://github.com/dvlab-research/SphereFormer.git
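
The radial window idea in the abstract above can be illustrated with a small sketch. It assumes the windows are formed by binning the azimuth and elevation angles of each point while ignoring its distance from the sensor; the function names and the 2-degree bin sizes are hypothetical choices for illustration, not taken from the SphereFormer code.

```python
import math
from collections import defaultdict


def radial_window_index(x, y, z, d_theta_deg=2.0, d_phi_deg=2.0):
    """Map a LiDAR point to a (theta_bin, phi_bin) radial window.

    The radial distance is deliberately ignored, so all points along
    roughly the same viewing direction share one long, narrow window.
    Bin sizes are illustrative assumptions.
    """
    # Azimuth angle in [0, 360) degrees.
    theta = math.degrees(math.atan2(y, x)) % 360.0
    # Elevation angle in [-90, 90] degrees.
    phi = math.degrees(math.atan2(z, math.hypot(x, y)))
    return (int(theta // d_theta_deg), int((phi + 90.0) // d_phi_deg))


def group_by_window(points, **kwargs):
    """Group point indices by their radial window."""
    windows = defaultdict(list)
    for i, (x, y, z) in enumerate(points):
        windows[radial_window_index(x, y, z, **kwargs)].append(i)
    return dict(windows)


points = [
    (1.0, 0.02, 0.0),    # close point, looking roughly along +x
    (50.0, 1.0, 0.5),    # distant point in nearly the same direction
    (-3.0, 10.0, 0.0),   # point in a clearly different direction
]
windows = group_by_window(points)
for key, members in sorted(windows.items()):
    print(key, members)
```

In this toy input the close point and the distant point fall into the same window, so attention restricted to that window could pass information between them, while the third point lands in a different window.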

Learning Context-aware Classifier for Semantic Segmentation

Mar 21, 2023
Zhuotao Tian, Jiequan Cui, Li Jiang, Xiaojuan Qi, Xin Lai, Yixin Chen, Shu Liu, Jiaya Jia

Semantic segmentation remains challenging because it must parse diverse contexts in different scenes, so a fixed classifier may not adapt well to varying feature distributions during testing. Unlike the mainstream literature, where the efficacy of strong backbones and effective decoder heads has been well studied, this paper exploits additional contextual hints by learning a context-aware classifier whose content is data-conditioned and adapts to different latent distributions. Since only the classifier is dynamically altered, our method is model-agnostic and can be easily applied to generic segmentation models. Notably, with negligible additional parameters and +2% inference time, it achieves a decent performance gain on both small and large models on challenging benchmarks, which shows the practical merit of our simple yet effective method. The implementation is available at https://github.com/tianzhuotao/CAC.

* AAAI 2023. Code and models are available at https://github.com/tianzhuotao/CAC

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Mar 20, 2023
Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, Jiaya Jia

3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly from sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects entirely through voxel features. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LiDAR 3D object detection and tracking. Extensive experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LiDAR methods on the nuScenes tracking test benchmark.

* In CVPR 2023. Code and models are available at https://github.com/dvlab-research/VoxelNeXt

Video-P2P: Video Editing with Cross-attention Control

Mar 08, 2023
Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, Jiaya Jia

This paper presents Video-P2P, a novel framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models, there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically, we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability, while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications, including word swap, prompt refinement, and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches.

* 10 pages, 9 figures. Project page: https://video-p2p.github.io/

StraIT: Non-autoregressive Generation with Stratified Image Transformer

Mar 01, 2023
Shengju Qian, Huiwen Chang, Yuanzhen Li, Zizhao Zhang, Jiaya Jia, Han Zhang

We propose Stratified Image Transformer (StraIT), a pure non-autoregressive (NAR) generative model that demonstrates superiority in high-quality image synthesis over existing autoregressive (AR) and diffusion models (DMs). In contrast to existing vision tokenizers, which under-exploit visual characteristics, we leverage the hierarchical nature of images to encode visual tokens into stratified levels with emergent properties. Through the proposed image stratification, which obtains an interlinked token pair, we alleviate the modeling difficulty and lift the generative power of NAR models. Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing DMs and AR methods while being an order of magnitude faster, achieving an FID score of 3.96 at 256×256 resolution on ImageNet without leveraging any guidance in sampling or auxiliary image classifiers. When equipped with classifier-free guidance, our method achieves an FID of 3.36 and an IS of 259.3. In addition, we illustrate the decoupled modeling process of StraIT generation, showing its compelling properties in applications including domain transfer.

Rethinking Out-of-distribution (OOD) Detection: Masked Image Modeling is All You Need

Feb 06, 2023
Jingyao Li, Pengguang Chen, Shaozuo Yu, Zexin He, Shu Liu, Jiaya Jia

The core of out-of-distribution (OOD) detection is to learn an in-distribution (ID) representation that is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find, surprisingly, that simply using reconstruction-based methods can boost the performance of OOD detection significantly. We deeply explore the main contributors to OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which helps the model learn the intrinsic data distribution of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). Without bells and whistles, MOOD outperforms the previous SOTA in one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection.

Understanding Imbalanced Semantic Segmentation Through Neural Collapse

Jan 03, 2023
Zhisheng Zhong, Jiequan Cui, Yibo Yang, Xiaoyang Wu, Xiaojuan Qi, Xiangyu Zhang, Jiaya Jia

A recent study has identified a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.

* Technical Report
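
The simplex equiangular tight frame mentioned in the abstract above can be made concrete with a short sketch. It uses the standard construction in which class k gets the vector sqrt(K/(K-1)) * (e_k - 1/K), with the rotation taken to be the identity as a simplifying assumption. This only illustrates the target geometry; it is not the regularizer proposed in the paper, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K forming a simplex equiangular tight frame.

    Row k is sqrt(K / (K - 1)) * (e_k - (1 / K) * ones), i.e. the
    standard construction with an identity rotation (an assumption
    made here for simplicity).
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


etf = simplex_etf(4)
for i in range(4):
    # Each vector has unit norm; every distinct pair has cosine -1/(K-1).
    print([round(dot(etf[i], etf[j]), 4) for j in range(4)])
```

Every pair of distinct vectors has the same inner product, -1/(K-1), which is the most negative value that K unit vectors can share pairwise. This is the "equiangular and maximally separated" structure the abstract refers to.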

What Makes for Good Tokenizers in Vision Transformer?

Dec 21, 2022
Shengju Qian, Yi Zhu, Wenbo Li, Mu Li, Jiaya Jia

The architecture of transformers, which has recently seen booming applications in vision tasks, has pivoted away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are capable of extracting their pairwise relationships using self-attention. Although the tokenizer is the stem building block of transformers, what makes for a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and understanding existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is added to the standard training regime. Through extensive experiments on various transformer architectures, we observe both improved performance and intriguing properties of these two plug-and-play designs with negligible computational overhead. These observations further indicate the importance of the commonly omitted designs of tokenizers in vision transformers.

* To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence

General Adversarial Defense Against Black-box Attacks via Pixel Level and Feature Level Distribution Alignments

Dec 11, 2022
Xiaogang Xu, Hengshuang Zhao, Philip Torr, Jiaya Jia

Deep Neural Networks (DNNs) are vulnerable to black-box adversarial attacks that are highly transferable. This threat comes from the distribution gap between adversarial and clean samples in the feature space of the target DNNs. In this paper, we use Deep Generative Networks (DGNs) with a novel training mechanism to eliminate the distribution gap. The trained DGNs align the distribution of adversarial samples with that of clean ones for the target DNNs by translating pixel values. Different from previous work, we propose a more effective pixel-level training constraint to make this achievable, thus enhancing robustness on adversarial samples. Further, a class-aware feature-level constraint is formulated for integrated distribution alignment. Our approach is general and applicable to multiple tasks, including image classification, semantic segmentation, and object detection. We conduct extensive experiments on different datasets. Our strategy demonstrates its unique effectiveness and generality against black-box attacks.

SDM: Spatial Diffusion Model for Large Hole Image Inpainting

Dec 06, 2022
Wenbo Li, Xin Yu, Kun Zhou, Yibing Song, Zhe Lin, Jiaya Jia

Generative adversarial networks (GANs) have achieved great success in image inpainting yet still have difficulty tackling large missing regions. In contrast, iterative algorithms, such as autoregressive and denoising diffusion models, have to be deployed with massive computing resources to achieve decent results. To overcome the respective limitations, we present a novel spatial diffusion model (SDM) that uses a few iterations to gradually deliver informative pixels to the entire image, largely enhancing the inference efficiency. Also, thanks to the proposed decoupled probabilistic modeling and spatial diffusion scheme, our method achieves high-quality large-hole completion. On multiple benchmarks, we achieve new state-of-the-art performance. Code is released at https://github.com/fenglinglwb/SDM.

* 18 pages, 14 figures
