Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator that conducts only the most basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing the well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 48%/60% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to introduce the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. Code is available at https://github.com/sail-sg/poolformer
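As a rough illustration of the idea above, the following minimal sketch swaps a token mixer into a residual block and uses plain average pooling as that mixer. It is not the released PoolFormer code: the function names `avg_pool_mixer` and `metaformer_block`, the single-channel grid, the 3 x 3 window, and the border handling are all assumptions made for illustration, and the normalisation and channel MLP of a full block are omitted.

```python
def avg_pool_mixer(grid, k=3):
    """Average each token with its k x k spatial neighbours (stride 1).

    `grid` is an H x W list of lists of floats (a single channel).
    Windows at the border average only the positions inside the grid
    (an assumption of this sketch).
    """
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [
                grid[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            out[i][j] = sum(window) / len(window)
    return out


def metaformer_block(grid, token_mixer):
    """Residual wrapper: output = input + token_mixer(input).

    Any callable can be passed as `token_mixer`, which is the point of
    the abstraction; normalisation and the channel MLP are left out.
    """
    mixed = token_mixer(grid)
    return [[a + b for a, b in zip(row, mrow)] for row, mrow in zip(grid, mixed)]


if __name__ == "__main__":
    tokens = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
    for row in metaformer_block(tokens, avg_pool_mixer):
        print(row)
```

The mixer here has no learnable parameters, which is what makes it a useful probe of how much the surrounding architecture contributes.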
Building on modern clustering algorithms, Unsupervised Person Re-identification (U-ReID) with pseudo labeling has recently reached performance competitive with fully supervised ReID methods. However, such a clustering-based scheme becomes computationally prohibitive for large-scale datasets. How to efficiently leverage endless unlabeled data with limited computing resources for better U-ReID is under-explored. In this paper, we make the first attempt at large-scale U-ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL). MCL pseudo-labels only a subset of the entire unlabeled data via clustering, which saves computation in the first-phase training. After that, the learned cluster centroids, termed meta-prototypes in our MCL, are regarded as a proxy annotator that softly annotates the remaining unlabeled data to further polish the model. To alleviate the potential noisy labeling issue in the polishing phase, we enforce two well-designed loss constraints to ensure intra-identity consistency and strong inter-identity correlation. On multiple widely used U-ReID benchmarks, our method significantly reduces computational cost while achieving comparable or even better performance than prior works.
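The soft annotation step described above can be pictured with a small sketch: a feature is compared with each cluster centroid and the similarities are turned into a probability vector. The abstract does not specify the similarity measure or the normalisation, so the cosine similarity, the softmax, the `temperature` value, and the names `cosine` and `soft_annotate` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_annotate(feature, prototypes, temperature=0.1):
    """Return a soft label: softmax over similarities to each prototype.

    `prototypes` stands in for the cluster centroids learned in the
    first phase; `temperature` is a hypothetical sharpening parameter.
    """
    sims = [cosine(feature, p) / temperature for p in prototypes]
    m = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    centroids = [[1.0, 0.0], [0.0, 1.0]]
    print(soft_annotate([0.9, 0.1], centroids))
```

A lower temperature pushes the soft label toward a one-hot assignment, while a higher one keeps more of the uncertainty for ambiguous samples.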
Knowledge distillation~(KD) aims to craft a compact student model that imitates the behavior of a pre-trained teacher in a target domain. Prior KD approaches, despite their gratifying results, have largely relied on the premise that \emph{in-domain} data is available to carry out the knowledge transfer. Such an assumption, unfortunately, in many cases violates the practical setting, since the original training data or even the data domain is often unreachable due to privacy or copyright reasons. In this paper, we attempt to tackle an ambitious task, termed \emph{out-of-domain} knowledge distillation~(OOD-KD), which allows us to conduct KD using only OOD data that can be readily obtained at a very low cost. Admittedly, OOD-KD is by nature a highly challenging task due to the agnostic domain gap. To this end, we introduce a handy yet surprisingly efficacious approach, dubbed~\textit{MosaicKD}. The key insight behind MosaicKD is that samples from various domains share common local patterns, even though their global semantics may vary significantly; these shared local patterns, in turn, can be re-assembled, analogous to mosaic tiling, to approximate the in-domain data and to further alleviate the domain discrepancy. In MosaicKD, this is achieved through a four-player min-max game, in which a generator, a discriminator, and a student network are collectively trained in an adversarial manner, partially under the guidance of a pre-trained teacher. We validate MosaicKD over classification and semantic segmentation tasks across various benchmarks, and demonstrate that it yields results much superior to the state-of-the-art counterparts on OOD data. Our code is available at \url{https://github.com/zju-vipa/MosaicKD}.
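For readers unfamiliar with the basic distillation objective that the abstract builds on, the sketch below computes the generic temperature-softened KL divergence between teacher and student outputs. This is the standard KD loss only, not the MosaicKD four-player game; the function names and the default temperature are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions.

    The loss is zero when the student reproduces the teacher's
    distribution and grows as the two distributions diverge.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(kd_loss([1.0, 2.0, 3.0], teacher))  # student matches teacher
    print(kd_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```

In the out-of-domain setting the inputs fed to both networks would come from OOD data, which is exactly where this plain objective becomes unreliable and motivates the adversarial re-assembly of local patterns.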
Research on multimodal emotion recognition is hindered by the lack of labelled corpora of sufficient scale and diversity, due to the high annotation cost and label ambiguity. In this paper, we propose a pre-training model \textbf{MEmoBERT} for multimodal emotion recognition, which learns multimodal joint representations through self-supervised learning from large-scale unlabeled video data that are available in sheer volume. Furthermore, unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction one, bringing the downstream task closer to the pre-training. Extensive experiments on two benchmark datasets, IEMOCAP and MSP-IMPROV, show that our proposed MEmoBERT significantly enhances emotion recognition performance.
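The prompt-based reformulation can be sketched in a few lines: the input is wrapped in a template containing a mask token, and the class is read off from the scores a masked model assigns to a small set of label words. The template text, the label words, the `VERBALIZER` mapping, and both function names are hypothetical, and the scores are supplied by hand rather than produced by any real model.

```python
# Hypothetical prompt template and label-word mapping, for illustration only.
TEMPLATE = "{utterance} I am [MASK]."
VERBALIZER = {
    "happy": "happiness",
    "sad": "sadness",
    "angry": "anger",
    "fine": "neutral",
}


def build_prompt(utterance):
    """Wrap an utterance so that classification becomes filling in [MASK]."""
    return TEMPLATE.format(utterance=utterance)


def classify_from_mask_scores(mask_scores):
    """Map scores at the [MASK] position to an emotion class.

    `mask_scores` is a dict from vocabulary word to score, standing in
    for a masked-prediction head. Only label words are considered, so
    high-scoring unrelated words are ignored.
    """
    candidates = {w: mask_scores.get(w, float("-inf")) for w in VERBALIZER}
    best_word = max(candidates, key=candidates.get)
    return VERBALIZER[best_word]


if __name__ == "__main__":
    print(build_prompt("We won the game!"))
    print(classify_from_mask_scores({"happy": 2.0, "sad": 0.1, "the": 5.0}))
```

Because the prediction head is the same one used during pre-training, no new classification layer has to be trained from scratch for the downstream task.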
In this paper, we study a novel meta aggregation scheme for binarizing graph neural networks (GNNs). We begin by developing a vanilla 1-bit GNN framework that binarizes both the GNN parameters and the graph features. Despite its lightweight architecture, we observe that this vanilla framework suffers from insufficient discriminative power in distinguishing graph topologies, leading to a dramatic drop in performance. This discovery motivates us to devise meta aggregators that improve the expressive power of vanilla binarized GNNs and whose aggregation schemes can be adaptively changed in a learnable manner based on the binarized features. To this end, we propose two dedicated forms of meta neighborhood aggregators: an exclusive meta aggregator termed Greedy Gumbel Neighborhood Aggregator (GNA), and a diffused meta aggregator termed Adaptable Hybrid Neighborhood Aggregator (ANA). GNA learns to exclusively pick one single optimal aggregator from a pool of candidates, while ANA learns a hybrid aggregation behavior to simultaneously retain the benefits of several individual aggregators. Furthermore, the proposed meta aggregators may readily serve as a generic plugin module for existing full-precision GNNs. Experiments across various domains demonstrate that the proposed method yields results superior to the state of the art.
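The contrast between the exclusive and the diffused aggregator can be sketched as follows: one picks a single candidate aggregator through the Gumbel-max trick, the other blends all candidates with softmax weights. This is not the paper's implementation: the candidate pool (sum, mean, max), the sign binarization, the fixed logits (which would be learned in practice), and all function names are assumptions.

```python
import math
import random

# Hypothetical pool of candidate aggregators, applied per feature dimension.
AGGREGATORS = [sum, lambda col: sum(col) / len(col), max]


def binarize(vec):
    """1-bit features: map each entry to +1 or -1 by its sign."""
    return [1.0 if x >= 0 else -1.0 for x in vec]


def aggregate(neighbors, fn):
    """Apply one aggregator across neighbours, dimension by dimension."""
    return [float(fn(col)) for col in zip(*neighbors)]


def exclusive_aggregate(neighbors, logits, rng):
    """Pick exactly one aggregator with the Gumbel-max trick."""
    noisy = [
        l - math.log(-math.log(max(rng.random(), 1e-12)))  # logit + Gumbel noise
        for l in logits
    ]
    idx = max(range(len(logits)), key=noisy.__getitem__)
    return idx, aggregate(neighbors, AGGREGATORS[idx])


def hybrid_aggregate(neighbors, logits):
    """Blend all aggregators with softmax weights over the logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    weights = [e / sum(exps) for e in exps]
    outs = [aggregate(neighbors, fn) for fn in AGGREGATORS]
    dims = len(outs[0])
    return [sum(w * o[d] for w, o in zip(weights, outs)) for d in range(dims)]


if __name__ == "__main__":
    nbrs = [binarize([1.0, -1.0]), binarize([2.0, 3.0])]
    print(exclusive_aggregate(nbrs, [100.0, 0.0, 0.0], random.Random(0)))
    print(hybrid_aggregate(nbrs, [0.0, 0.0, 0.0]))
```

With binarized inputs, sum, mean, and max respond differently to the same neighbourhood, which is why letting the model choose or mix among them can recover some of the discriminative power lost to 1-bit features.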
Zero-Shot Learning (ZSL) aims to recognize unseen categories by leveraging auxiliary information, such as attribute embeddings. Despite the encouraging results achieved, prior ZSL approaches focus on improving the discriminative power of seen-class features, yet have largely overlooked the geometric structure of the samples and the prototypes. The subsequent attribute-based generative adversarial network (GAN), as a result, also neglects the topological information in sample generation and further yields inferior performance in classifying the visual features of unseen classes. In this paper, we introduce a novel structure-aware feature generation scheme, termed SA-GAN, to explicitly account for the topological structure in learning both the latent space and the generative networks. Specifically, we introduce a constraint loss to preserve the initial geometric structure when learning a discriminative latent space, and carry out our GAN training with additional supervising signals from a structure-aware discriminator and a reconstruction module. The former supervision distinguishes fake and real samples based on their affinity to class prototypes, while the latter aims to reconstruct the original feature space from the generated latent space. This topology-preserving mechanism enables our method to significantly enhance the generalization capability on unseen classes and consequently improve the classification performance. Experiments on four benchmarks demonstrate that the proposed approach consistently outperforms the state of the art. Our code can be found in the supplementary material and will also be made publicly available.
Given a reference object of an unknown type in an image, human observers can effortlessly find the objects of the same category in another image and precisely tell their visual boundaries. Such visual cognition capability of humans seems absent from the current research spectrum of computer vision. Existing segmentation networks, for example, rely on an enormous amount of labeled data, which is laborious and costly to collect and annotate; besides, the performance of segmentation networks tends to degrade as the number of categories increases. In this paper, we introduce a novel Reference semantic segmentation Network (Ref-Net) to conduct visual boundary knowledge translation. Ref-Net contains a Reference Segmentation Module (RSM) and a Boundary Knowledge Translation Module (BKTM). Inspired by the human recognition mechanism, RSM is devised to segment only the objects of the same category, based on the features of the reference objects. BKTM, on the other hand, introduces two boundary discriminator branches to conduct inner and outer boundary segmentation of the target object in an adversarial manner, and translates the annotated boundary knowledge of open-source datasets into the segmentation network. Exhaustive experiments demonstrate that, with tens of finely annotated samples as guidance, Ref-Net achieves results on par with fully supervised methods on six datasets.
Microvascular invasion (MVI) is a major prognostic factor in hepatocellular carcinoma, which is one of the malignant tumors with the highest mortality rate. Diagnosing MVI requires finding the vessels that contain hepatocellular carcinoma cells and counting the cells in each vessel, a process that depends heavily on the doctor's experience and is largely subjective and time-consuming. However, no algorithm has yet been tailored for MVI detection from pathological images. This paper collects the first pathological liver image dataset containing 522 whole slide images with labels of vessels, MVI, and hepatocellular carcinoma grades. The first and essential step for the automatic diagnosis of MVI is the accurate segmentation of vessels. The unique characteristics of pathological liver images, such as super-large size, multi-scale vessels, and blurred vessel edges, make accurate vessel segmentation challenging. Based on the collected dataset, we propose an Edge-competing Vessel Segmentation Network (EVS-Net), which contains a segmentation network and two edge segmentation discriminators. The segmentation network, combined with an edge-aware self-supervision mechanism, is devised to conduct vessel segmentation with limited labeled patches. Meanwhile, two discriminators are introduced to distinguish, in an adversarial manner, whether the segmented vessel and background contain residual features. In the training stage, the two discriminators are devised to compete for the predicted position of edges. Exhaustive experiments demonstrate that, with only limited labeled patches, EVS-Net achieves performance close to that of fully supervised methods, which provides a convenient tool for pathological liver vessel segmentation. Code is publicly available at https://github.com/zju-vipa/EVS-Net.
When confronted with objects of unknown types in an image, humans can effortlessly and precisely tell their visual boundaries. This recognition mechanism and underlying generalization capability seem to contrast with state-of-the-art image segmentation networks that rely on large-scale category-aware annotated training samples. In this paper, we make an attempt towards building models that explicitly account for visual boundary knowledge, in the hope of reducing the training effort on segmenting unseen categories. Specifically, we investigate a new task termed Boundary Knowledge Translation (BKT). Given a set of fully labeled categories, BKT aims to translate the visual boundary knowledge learned from the labeled categories to a set of novel categories, each of which is provided with only a few labeled samples. To this end, we propose a Translation Segmentation Network (Trans-Net), which comprises a segmentation network and two boundary discriminators. The segmentation network, combined with a boundary-aware self-supervised mechanism, is devised to conduct foreground segmentation, while the two discriminators work together in an adversarial manner to ensure an accurate segmentation of the novel categories under light supervision. Exhaustive experiments demonstrate that, with only tens of labeled samples as guidance, Trans-Net achieves results on par with fully supervised methods.