Shihao Ji

MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification

Aug 25, 2023
Hui Ye, Rajshekhar Sunderraman, Shihao Ji

eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns a text sample relevant labels from an extremely large-scale label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec to train dense, semantically meaningful label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. When fine-tuning the pre-trained Transformer encoder, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we also extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five of the six datasets, and outperforms the competing methods in speed on all six. Our source code is publicly available at https://github.com/huiyegit/MatchXML.
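
As a rough illustration of the label2vec step (a sketch only, with made-up label IDs and hyperparameters, not the MatchXML implementation), one can treat each training sample's label set as a "sentence", train Skip-gram embeddings over label co-occurrences, and cluster the resulting vectors toward a Hierarchical Label Tree:

# Hedged sketch of the label2vec idea; names and values are illustrative.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# One list of label IDs (as strings) per training document.
label_lists = [["L12", "L873", "L5"], ["L873", "L44"], ["L5", "L44", "L991"]]

# Skip-gram (sg=1) over label co-occurrences yields dense label embeddings.
l2v = Word2Vec(sentences=label_lists, vector_size=128, window=10,
               min_count=1, sg=1, workers=4, epochs=20)

# Cluster the dense label embeddings; applying this recursively would give a
# Hierarchical Label Tree as described in the abstract.
labels = list(l2v.wv.index_to_key)
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(l2v.wv[labels])
print(dict(zip(labels, clusters)))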

FLSL: Feature-level Self-supervised Learning

Jun 09, 2023
Qing Su, Anton Netchaev, Hai Li, Shihao Ji

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MoCo v3) focus primarily on instance-level representations and do not generalize well to dense prediction tasks such as object detection and segmentation. Towards aligning SSL with dense prediction, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuff). By employing the transformer for joint embedding and clustering, we propose a two-level feature-clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present a formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbones, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV object detection on UAVDT and video instance segmentation on DAVIS 2017. We conclude with visualizations and various ablation studies to better understand the success of FLSL.
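
A minimal sketch of the connection the paper draws between self-attention and mean-shift (an illustration only, not the FLSL training code; the feature dimensions below are arbitrary): a self-attention update can be read as moving each patch feature toward an attention-weighted mean of similar features.

# Hedged sketch: one attention update over ViT patch features, written to
# highlight its resemblance to a mean-shift step with a cosine kernel.
import torch
import torch.nn.functional as F

def attention_as_mean_shift(x, temperature=0.1):
    # x: (num_patches, dim) patch features
    xn = F.normalize(x, dim=-1)                      # cosine similarities as the kernel
    w = F.softmax(xn @ xn.t() / temperature, dim=-1) # normalized kernel weights
    return w @ x                                     # shift each feature toward its weighted mean

x = torch.randn(196, 384)                            # e.g. ViT-S/16 on a 224x224 image
for _ in range(5):                                   # a few "mean-shift" iterations
    x = attention_as_mean_shift(x)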

M-EBM: Towards Understanding the Manifolds of Energy-Based Models

Mar 08, 2023
Xiulong Yang, Shihao Ji

Energy-based models (EBMs) exhibit a variety of desirable properties in predictive tasks, such as generality, simplicity and compositionality. However, training EBMs on high-dimensional datasets remains unstable and expensive. In this paper, we present a Manifold EBM (M-EBM) to boost the overall performance of unconditional EBMs and the Joint Energy-based Model (JEM). Despite its simplicity, M-EBM significantly improves unconditional EBMs in training stability and speed on a host of benchmark datasets, such as CIFAR10, CIFAR100, CelebA-HQ, and ImageNet 32x32. Once class labels are available, the label-incorporated M-EBM (M-JEM) further surpasses M-EBM in image generation quality with an over 40% FID improvement, while also enjoying improved classification accuracy. The code can be found at https://github.com/sndnyang/mebm.
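
For context on the energy parameterization that JEM-style models build on (a generic sketch of the standard JEM formulation, not the M-EBM training procedure; the manifold-based construction the paper proposes is not reproduced here), the energy of an input can be derived from classifier logits and trained by contrasting real samples with negatives:

# Hedged sketch: E(x) = -logsumexp_y f(x)[y], with plain noise standing in
# for the negative samples purely for brevity.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier

def energy(x):
    return -torch.logsumexp(f(x), dim=-1)    # energy derived from the logits

x_real = torch.rand(8, 3, 32, 32)
x_neg = torch.rand(8, 3, 32, 32)             # stand-in negative samples
loss = energy(x_real).mean() - energy(x_neg).mean()
loss.backward()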

* Accepted to PAKDD 2023 

Accounting for Temporal Variability in Functional Magnetic Resonance Imaging Improves Prediction of Intelligence

Nov 11, 2022
Yang Li, Xin Ma, Raj Sunderraman, Shihao Ji, Suprateek Kundu

Neuroimaging-based prediction methods for intelligence and cognitive abilities have developed rapidly, with prediction based on functional connectivity (FC) showing great promise. The overwhelming majority of the literature has focused on static FC, with extremely limited results available on dynamic FC or region-level fMRI time series. Unlike static FC, the latter features capture the temporal variability in the fMRI data. In this project, we propose a novel bi-LSTM approach that incorporates an $L_0$ regularization for feature selection. The proposed pipeline is applied to prediction based on region-level fMRI time series as well as dynamic FC, and is implemented via an efficient algorithm. We undertake a detailed comparison of prediction performance for different intelligence measures based on fMRI features acquired from the Adolescent Brain Cognitive Development (ABCD) study. Our analysis illustrates that static FC consistently performs worse than region-level fMRI time series or dynamic FC for unimodal rest and task fMRI experiments, as well as in almost all cases for multi-task analysis. The proposed pipeline based on region-level time series identifies several important brain regions that drive fluctuations in intelligence measures. Strong test-retest reliability of the selected features is reported, pointing to reproducible findings. Given the large sample size of the ABCD study, our results provide conclusive evidence that superior intelligence prediction can be achieved by considering temporal variations in the fMRI data, either at the region level or based on dynamic FC, which is one of the first such findings in the literature. These results are particularly noteworthy given the low dimensionality of the region-level time series, easier interpretability, and extremely quick computation times compared to network-based analysis.
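
A rough sketch of the modeling idea (illustrative names and dimensions, not the authors' pipeline; a simple L1 penalty stands in here for the paper's $L_0$ regularization, which relies on a more involved relaxation): a bidirectional LSTM over region-level time series with a learned per-region gate.

# Hedged sketch of a gated bi-LSTM regressor for region-level fMRI time series.
import torch
import torch.nn as nn

class GatedBiLSTM(nn.Module):
    def __init__(self, n_regions=360, hidden=64):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_regions))       # per-region feature gate
        self.lstm = nn.LSTM(n_regions, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)                  # predicted intelligence score

    def forward(self, x):                                     # x: (batch, time, regions)
        x = x * self.gate                                     # select/suppress regions
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)

model = GatedBiLSTM()
x = torch.randn(4, 100, 360)                                  # 4 subjects, 100 time points
y_hat = model(x)
loss = ((y_hat - torch.randn(4)) ** 2).mean() + 1e-3 * model.gate.abs().sum()
loss.backward()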

APSNet: Attention Based Point Cloud Sampling

Oct 11, 2022
Yang Ye, Xiulong Yang, Shihao Ji

Processing large point clouds is a challenging task. Therefore, the data is often downsampled to a smaller size so that it can be stored, transmitted, and processed more efficiently without incurring significant performance degradation. Traditional task-agnostic sampling methods, such as farthest point sampling (FPS), do not consider downstream tasks when sampling point clouds, and thus often sample points that are uninformative for those tasks. This paper explores task-oriented sampling for 3D point clouds, and aims to sample a subset of points that is tailored specifically to a downstream task of interest. Similar to FPS, we assume that the point to be sampled next should depend heavily on the points that have already been sampled. We thus formulate point cloud sampling as a sequential generation process, and develop an attention-based point cloud sampling network (APSNet) to tackle this problem. At each time step, APSNet attends to all the points in a cloud by utilizing the history of previously sampled points, and samples the most informative one. Both supervised learning and knowledge distillation-based self-supervised learning of APSNet are proposed. Moreover, joint training of APSNet over multiple sample sizes is investigated, leading to a single APSNet that can generate samples of arbitrary length with strong performance. Extensive experiments demonstrate the superior performance of APSNet against state-of-the-art methods in various downstream tasks, including 3D point cloud classification, reconstruction, and registration.
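
A minimal sketch of the sequential, attention-driven sampling idea (illustrative architecture and dimensions, not APSNet itself; the greedy argmax is non-differentiable and does not prevent duplicates, so actual training would need a relaxation or supervision): at each step, a query summarizing the points sampled so far attends over all points, and the most attended point is sampled next.

# Hedged sketch of sequential, attention-based point sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialSampler(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(3, dim)                # per-point features from xyz
        self.rnn = nn.GRUCell(dim, dim)               # history of sampled points

    def forward(self, pts, k):                        # pts: (N, 3)
        feats = self.embed(pts)                       # (N, dim)
        h = feats.mean(0, keepdim=True)               # (1, dim) initial query
        chosen = []
        for _ in range(k):
            attn = F.softmax(feats @ h.squeeze(0), dim=0)  # attention over all points
            idx = int(attn.argmax())                  # pick the most attended point
            chosen.append(idx)
            h = self.rnn(feats[idx:idx + 1], h)       # update history with the new sample
        return pts[chosen]

sampler = SequentialSampler()
subset = sampler(torch.randn(1024, 3), k=32)          # 32 task-oriented samples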

* Published as a conference paper at BMVC 2022

Towards Bridging the Performance Gaps of Joint Energy-based Models

Sep 16, 2022
Xiulong Yang, Qing Su, Shihao Ji

Can we train a hybrid discriminative-generative model within a single network? This question has recently been answered in the affirmative, introducing the field of Joint Energy-based Models (JEM), which achieve high classification accuracy and image generation quality simultaneously. Despite recent advances, there remain two performance gaps: the accuracy gap to the standard softmax classifier, and the generation quality gap to state-of-the-art generative models. In this paper, we introduce a variety of training techniques to bridge the accuracy gap and the generation quality gap of JEM. 1) We incorporate a recently proposed sharpness-aware minimization (SAM) framework to train JEM, which promotes the smoothness of the energy landscape and the generalizability of JEM. 2) We exclude data augmentation from the maximum likelihood estimation pipeline of JEM, mitigating the negative impact of data augmentation on image generation quality. Extensive experiments on multiple datasets demonstrate that our SADA-JEM achieves state-of-the-art performance and outperforms JEM in image classification, image generation, calibration, out-of-distribution detection and adversarial robustness by a notable margin.
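
As a generic illustration of the SAM step mentioned above (standard two-pass SAM on a toy classifier, not the SADA-JEM training loop; the model, data, and rho below are placeholders): ascend to a nearby worst-case weight perturbation, then descend using the gradient computed there.

# Hedged sketch of one sharpness-aware minimization (SAM) update.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
rho = 0.05                                            # perturbation radius

loss_fn(model(x), y).backward()                       # first pass: gradient at w
grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
eps = []
with torch.no_grad():
    for p in model.parameters():                      # climb to w + eps (sharpest direction)
        e = rho * p.grad / (grad_norm + 1e-12)
        p.add_(e)
        eps.append(e)
model.zero_grad()
loss_fn(model(x), y).backward()                       # second pass: gradient at w + eps
with torch.no_grad():
    for p, e in zip(model.parameters(), eps):         # return to w before updating
        p.sub_(e)
opt.step()                                            # descend with the SAM gradient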

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model

Aug 16, 2022
Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, Shihao Ji

Denoising Diffusion Probabilistic Models (DDPM) and Vision Transformers (ViT) have demonstrated significant progress in generative and discriminative tasks, respectively, and thus far these models have largely been developed in their own domains. In this paper, we establish a direct connection between DDPM and ViT by integrating the ViT architecture into DDPM, and introduce a new generative model called Generative ViT (GenViT). The modeling flexibility of ViT enables us to further extend GenViT to hybrid discriminative-generative modeling, and we introduce a Hybrid ViT (HybViT). Our work is among the first to explore a single ViT for image generation and classification jointly. We conduct a series of experiments to analyze the performance of the proposed models and demonstrate their superiority over prior state-of-the-art methods in both generative and discriminative tasks. Our code and pre-trained models can be found at https://github.com/sndnyang/Diffusion_ViT.
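
A toy sketch of the core GenViT idea (a stand-in with arbitrary sizes, a simplified patchify step, and a simplified noising step rather than a proper DDPM schedule; not the released architecture): a ViT-style encoder over image patches, conditioned on the diffusion timestep, predicts the per-patch noise.

# Hedged sketch of a ViT-as-denoiser for DDPM-style training.
import torch
import torch.nn as nn

class TinyGenViT(nn.Module):
    def __init__(self, img=32, patch=4, dim=128):
        super().__init__()
        self.patch = patch
        n = (img // patch) ** 2
        self.proj = nn.Linear(3 * patch * patch, dim)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        self.time = nn.Embedding(1000, dim)               # timestep embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.out = nn.Linear(dim, 3 * patch * patch)      # predicted noise per patch

    def forward(self, x_t, t):                            # x_t: (B, 3, 32, 32), t: (B,)
        B, C = x_t.shape[:2]
        p = x_t.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.reshape(B, C, -1, self.patch * self.patch)  # (B, C, n_patches, patch*patch)
        p = p.permute(0, 2, 1, 3).reshape(B, -1, C * self.patch ** 2)
        h = self.proj(p) + self.pos + self.time(t)[:, None]
        return self.out(self.encoder(h))                  # noise prediction, per patch

model = TinyGenViT()
x0, t = torch.rand(2, 3, 32, 32), torch.randint(0, 1000, (2,))
noise = torch.randn_like(x0)
# A proper DDPM schedule would mix x0 and noise by alpha_bar(t); simplified here.
eps_pred = model(0.5 * x0 + 0.5 * noise, t)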

ChiTransformer: Towards Reliable Stereo from Cues

Mar 29, 2022
Qing Su, Shihao Ji

Current stereo matching techniques are challenged by restricted search space, occluded regions, and sheer size. While single-image depth estimation is spared from these challenges and can achieve satisfactory results with extracted monocular cues, the lack of a stereoscopic relationship renders the monocular prediction less reliable on its own, especially in highly dynamic or cluttered environments. To address these issues in both scenarios, we present an optic-chiasm-inspired self-supervised binocular depth estimation method, wherein a vision transformer (ViT) with gated positional cross-attention (GPCA) layers is designed to enable feature-sensitive pattern retrieval between views, while retaining the extensive context information aggregated through self-attention. Monocular cues from a single view are thereafter conditionally rectified by a blending layer with the retrieved pattern pairs. This crossover design is biologically analogous to the optic-chiasma structure in the human visual system, hence the name ChiTransformer. Our experiments show that this architecture outperforms state-of-the-art self-supervised stereo approaches by 11%, and can be used on both rectilinear and non-rectilinear (e.g., fisheye) images.
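
A minimal sketch of the gated cross-view retrieval idea (illustrative module, not the ChiTransformer implementation; the dimensions and gating form are assumptions): queries from the master view attend over the reference view, and a learned gate blends the retrieved cues with the monocular features.

# Hedged sketch of gated cross-attention between two views.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, master, reference):           # (B, tokens, dim) per view
        retrieved, _ = self.attn(master, reference, reference)
        g = self.gate(torch.cat([master, retrieved], dim=-1))
        return g * retrieved + (1 - g) * master     # conditionally rectify monocular cues

layer = GatedCrossAttention()
left, right = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
fused = layer(left, right)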

* 11 pages, 3 figures, CVPR 2022

Generative Dynamic Patch Attack

Nov 15, 2021
Xiang Li, Shihao Ji

Adversarial patch attacks are a family of attack algorithms that perturb part of an image to fool a deep neural network model. Existing patch attacks mostly consider injecting adversarial patches at input-agnostic locations: either a predefined location or a random location. This attack setup may be sufficient for attacking, but has considerable limitations when used for adversarial training. Thus, robust models trained with existing patch attacks cannot effectively defend against other adversarial attacks. In this paper, we first propose an end-to-end patch attack algorithm, Generative Dynamic Patch Attack (GDPA), which generates both the patch pattern and the patch location adversarially for each input image. We show that GDPA is a generic attack framework that can produce dynamic/static and visible/invisible patches with a few configuration changes. Second, GDPA can be readily integrated into adversarial training to improve model robustness to various adversarial attacks. Extensive experiments on VGGFace, Traffic Sign and ImageNet show that GDPA achieves higher attack success rates than state-of-the-art patch attacks, while models adversarially trained with GDPA demonstrate superior robustness to adversarial patch attacks compared to competing methods. Our source code can be found at https://github.com/lxuniverse/gdpa.
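
A rough sketch of the per-image pattern-plus-location generation idea (toy networks and sizes, not the released GDPA code; the hard placement below drops gradients to the location, whereas a fully end-to-end version would use a differentiable pasting such as an affine grid):

# Hedged sketch: a generator predicts a patch pattern and location per image,
# and the patch is pasted so the pattern can be trained to raise the
# classifier's loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchGenerator(nn.Module):
    def __init__(self, img=64, patch=16):
        super().__init__()
        self.img, self.patch = img, patch
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * img * img, 256), nn.ReLU())
        self.pattern = nn.Linear(256, 3 * patch * patch)        # per-image patch pattern
        self.loc = nn.Linear(256, 2)                            # per-image patch location

    def forward(self, x):
        h = self.backbone(x)
        patch = torch.sigmoid(self.pattern(h)).view(-1, 3, self.patch, self.patch)
        loc = torch.sigmoid(self.loc(h)) * (self.img - self.patch)  # top-left corner
        return patch, loc

def paste(x, patch, loc):
    # Gradients flow to the pattern; the rounded location is not differentiable here.
    out = x.clone()
    for i in range(x.size(0)):
        r, c = int(loc[i, 0].round()), int(loc[i, 1].round())
        out[i, :, r:r + patch.size(-1), c:c + patch.size(-1)] = patch[i]
    return out

gen = PatchGenerator()
clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))   # toy target classifier
x, y = torch.rand(2, 3, 64, 64), torch.randint(0, 10, (2,))
patch, loc = gen(x)
adv_loss = -F.cross_entropy(clf(paste(x, patch, loc)), y)       # generator maximizes classifier loss
adv_loss.backward()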

* Published as a conference paper at BMVC 2021 