Source-free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to the unlabeled target domain without accessing the well-labeled source data, which is a much more practical setting given data privacy, security, and transmission issues. To make up for the absence of source data, most existing methods introduce feature-prototype-based pseudo-labeling strategies to realize self-training model adaptation. However, these feature prototypes are obtained by clustering features according to instance-level predictions, which is category-biased and tends to result in noisy labels, since the visual domain gap between source and target usually differs across categories. In addition, we found that a monocentric feature prototype may be insufficient to represent each category and may introduce negative transfer, especially for hard-to-transfer data. To address these issues, we propose a general class-Balanced Multicentric Dynamic prototype (BMD) strategy for the SFDA task. Specifically, for each target category, we first introduce a global inter-class balanced sampling strategy to aggregate potential representative target samples. Then, we design an intra-class multicentric clustering strategy to generate more robust and representative prototypes. In contrast to existing strategies that update the pseudo labels at a fixed training period, we further introduce a dynamic pseudo-labeling strategy to incorporate network update information during model adaptation. Extensive experiments show that the proposed model-agnostic BMD strategy significantly improves representative SFDA methods to yield new state-of-the-art results, e.g., improving SHOT from 82.9% to 85.8% on VisDA-C and NRC from 52.6% to 57.0% on PointDA. The code is available at https://github.com/ispc-lab/BMD.
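The class-balanced sampling and multicentric clustering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `top_m` and `k` parameters, the use of cosine similarity, and the plain k-means routine are all hypothetical choices, and the dynamic pseudo-label update is omitted.

```python
import math

def _normalize(v):
    # Scale a vector to unit length (zero vectors are returned unchanged).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def _cosine(a, b):
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))

def _mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def _kmeans(points, k, iters=10):
    """Plain Lloyd-style k-means with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        # Add the point that is least similar to every existing centre.
        centers.append(min(points, key=lambda p: max(_cosine(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            best = max(range(len(centers)), key=lambda j: _cosine(p, centers[j]))
            groups[best].append(p)
        # Keep the old centre if a cluster ends up empty.
        centers = [_mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def build_prototypes(features, probs, top_m, k):
    """Return, for every class, a list of up to k prototype vectors.

    features: list of feature vectors, one per target sample.
    probs:    list of per-sample class probability lists (same order).
    top_m:    samples drawn per class (assumed hyperparameter).
    k:        prototypes per class (assumed hyperparameter).
    """
    num_classes = len(probs[0])
    prototypes = []
    for c in range(num_classes):
        # Class-balanced sampling: every class gets its own top_m samples,
        # ranked over the whole target set by the predicted probability of c.
        ranked = sorted(range(len(features)), key=lambda i: probs[i][c], reverse=True)
        chosen = [features[i] for i in ranked[:top_m]]
        # Multicentric clustering: several centres per class instead of one.
        prototypes.append(_kmeans(chosen, k))
    return prototypes

def pseudo_label(feature, prototypes):
    """Assign the class whose closest prototype is most similar to the feature."""
    scores = [max(_cosine(feature, p) for p in protos) for protos in prototypes]
    return max(range(len(scores)), key=scores.__getitem__)
```

With several prototypes per class, a sample only needs to be close to one mode of its category, which is the intuition behind preferring multicentric over monocentric prototypes for hard-to-transfer data.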
Most of the latest top semantic segmentation approaches are based on vision Transformers, particularly DETR-like frameworks, which employ a set of queries in the Transformer decoder. Each query is composed of a content query that preserves semantic information and a positional query that provides positional guidance for aggregating the query-specific context. However, the positional queries in the Transformer decoder layers are typically represented as fixed learnable weights, which often encode dataset statistics for segments and can be inaccurate for individual samples. Therefore, in this paper, we propose to generate positional queries dynamically conditioned on the cross-attention scores and the localization information of the preceding layer. By doing so, each query is aware of its previous focus, thus providing more accurate positional guidance and encouraging the cross-attention consistency across the decoder layers. In addition, we also propose an efficient way to deal with high-resolution cross-attention by dynamically determining the contextual tokens based on the low-resolution cross-attention maps to perform local relation aggregation. Our overall framework termed FASeg (Focus-Aware semantic Segmentation) provides a simple yet effective solution for semantic segmentation. Extensive experiments on ADE20K and Cityscapes show that our FASeg achieves state-of-the-art performance, e.g., obtaining 48.3% and 49.6% mIoU respectively for single-scale inference on ADE20K validation set with ResNet-50 and Swin-T backbones, and barely increases the computation consumption from Mask2former. Source code will be made publicly available at https://github.com/zip-group/FASeg.
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have reported a range of results in constructing BMs and applying them in many fields. However, there is currently a lack of work that surveys the overall progress of BMs and guides follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and the applications of BMs, dividing the review into four parts: Resource, Models, Key Technologies and Application. Across these four parts we introduce 16 specific BM-related topics: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory & Interpretability, Commonsense Reasoning, Reliability & Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. For each topic, we summarize the current studies and propose future research directions. At the end of the paper, we discuss the further development of BMs from a more general perspective.
Recently, there has been an increasing concern about the privacy issue raised by using personally identifiable information in machine learning. However, previous portrait matting methods were all based on identifiable portrait images. To fill the gap, we present P3M-10k in this paper, which is the first large-scale anonymized benchmark for Privacy-Preserving Portrait Matting (P3M). P3M-10k consists of 10,000 high-resolution face-blurred portrait images along with high-quality alpha mattes. We systematically evaluate both trimap-free and trimap-based matting methods on P3M-10k and find that existing matting methods show different generalization abilities under the privacy preserving training setting, i.e., training only on face-blurred images while testing on arbitrary images. Based on the gained insights, we propose a unified matting model named P3M-Net consisting of three carefully designed integration modules that can perform privacy-insensitive semantic perception and detail-reserved matting simultaneously. We further design multiple variants of P3M-Net with different CNN and transformer backbones and identify the difference in their generalization abilities. To further mitigate this issue, we devise a simple yet effective Copy and Paste strategy (P3M-CP) that can borrow facial information from public celebrity images without privacy concerns and direct the network to reacquire the face context at both data and feature levels. P3M-CP only brings a few additional computations during training, while enabling the matting model to process both face-blurred and normal images without extra effort during inference. Extensive experiments on P3M-10k demonstrate the superiority of P3M-Net over state-of-the-art methods and the effectiveness of P3M-CP in improving the generalization ability of P3M-Net, implying a great significance of P3M for future research and real-world applications.
We present a new particle tracking algorithm to accurately resolve large deformation and rotational motion fields, which takes advantage of both local and global particle tracking algorithms. We call this method the ScalE and Rotation Invariant Augmented Lagrangian Particle Tracking (SerialTrack). This method builds an iterative scale and rotation invariant topology-based feature for each particle within a multi-scale tracking algorithm. The global kinematic compatibility condition is applied as a global augmented Lagrangian constraint to enhance the tracking accuracy. An open source software package implementing this numerical approach to track both 2D and 3D, incremental and cumulative deformation fields is provided.
Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small-size datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/DE-DETRs.
Cell-free massive multiple-input multiple-output (MIMO) provides more uniform spectral efficiency (SE) for users (UEs) than cellular technology. The main challenge in achieving the benefits of cell-free massive MIMO is to realize signal processing in a scalable way. In this paper, we consider scalable full-pilot zero-forcing (FZF), partial FZF (PFZF), protective weak PFZF (PWPFZF), and local regularized ZF (LRZF) combining by exploiting channel statistics. We derive closed-form expressions of the uplink SE for FZF, PFZF, and PWPFZF combining with large-scale fading decoding over independent Rayleigh fading channels, taking channel estimation errors and pilot contamination into account. Moreover, we investigate the impact of the number of pilot sequences, the number of antennas per AP, and the number of APs on the performance. Numerical results show that LRZF provides the highest SE. However, PWPFZF is preferable when the number of pilot sequences is large and the number of antennas per AP is small, because PWPFZF has lower computational complexity and its SE expression can be computed in closed form. Furthermore, we investigate the performance of PWPFZF combining with fractional power control, and the numerical results show that it improves the performance of weak UEs and realizes uniformly good service for all UEs in a scalable fashion.
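To make the idea of fractional power control concrete, the sketch below uses one commonly seen form in which each UE's transmit power is scaled inversely with a power of its total large-scale fading gain. The abstract does not state which rule is used, so the formula, the function name, and the parameters `p_max` and `exponent` are assumptions for illustration only.

```python
def fractional_power_control(gains, p_max, exponent):
    """Assign uplink transmit powers with an assumed fractional rule.

    gains:    gains[k] is the total large-scale fading gain from UE k to its
              serving APs (hypothetical input; must be strictly positive).
    p_max:    maximum transmit power per UE.
    exponent: value in [0, 1]; 0 gives full power to everyone, 1 fully
              compensates the gain differences.
    """
    if not gains or any(g <= 0 for g in gains):
        raise ValueError("gains must be a non-empty list of positive numbers")
    weakest = min(gains)
    # The weakest UE transmits at p_max; stronger UEs back off.
    return [p_max * (weakest / g) ** exponent for g in gains]


# Example: three UEs whose gains differ by factors of four.
powers = fractional_power_control([1.0, 4.0, 16.0], p_max=100.0, exponent=0.5)
```

Under this rule the weakest UE always keeps full power while stronger UEs reduce theirs, which is one way to trade some sum SE for more uniform service across UEs.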
Recently, medical report generation, which aims to automatically generate a long and coherent descriptive paragraph for a given medical image, has received growing research interest. Unlike general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) serious data bias: the normal visual regions dominate the dataset over the abnormal visual regions, and 2) the very long output sequence. To alleviate these two problems, we propose the AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules: 1) the AHA module first predicts the disease tags from the input image and then learns multi-grained visual features by hierarchically aligning the visual regions and the disease tags. The acquired disease-grounded visual features can better represent the abnormal regions of the input image, which could alleviate the data bias problem; 2) the MGT module effectively uses the multi-grained features and the Transformer framework to generate the long medical report. Experiments on the public IU-Xray and MIMIC-CXR datasets show that AlignTransformer achieves results competitive with state-of-the-art methods on both datasets. Moreover, a human evaluation conducted by professional radiologists further confirms the effectiveness of our approach.
Affordance grounding, the task of grounding (i.e., localizing) action possibility regions in objects, faces the challenge of establishing an explicit link with object parts due to the diversity of interactive affordance. Humans have the ability to transform various exocentric interactions into invariant egocentric affordance, thereby countering the impact of interactive diversity. To empower an agent with such an ability, this paper proposes the task of affordance grounding from the exocentric view, i.e., given exocentric human-object interaction images and egocentric object images, learning the affordance knowledge of the object and transferring it to the egocentric image using only the affordance label as supervision. To this end, we devise a cross-view knowledge transfer framework that extracts affordance-specific features from exocentric interactions and enhances the perception of affordance regions by preserving affordance correlation. Specifically, an Affordance Invariance Mining module is devised to extract specific clues by minimizing the intra-class differences originating from interaction habits in exocentric images. In addition, an Affordance Co-relation Preserving strategy is presented to perceive and localize affordance by aligning the co-relation matrices of the predicted results between the two views. Furthermore, an affordance grounding dataset named AGD20K is constructed by collecting and labeling over 20K images from 36 affordance categories. Experimental results demonstrate that our method outperforms the representative models in terms of objective metrics and visual quality. Code: github.com/lhc1224/Cross-View-AG.
In this paper, we propose a unified information-theoretic framework for learning-based odometry estimation, a crucial component of many robotics and vision tasks, such as navigation and virtual reality, where relative camera poses are required in real time. We formulate this problem as optimizing a variational information bottleneck objective function, which eliminates pose-irrelevant information from the latent representation. The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language. Specifically, we bound the generalization errors of the deep information bottleneck framework and the predictability of the latent representation. These bounds provide not only a performance guarantee but also practical guidance for model design, sample collection, and sensor selection. Furthermore, the stochastic latent representation provides a natural uncertainty measure without the need for extra structures or computations. Experiments on two well-known odometry datasets demonstrate the effectiveness of our method.
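For reference, the variational information bottleneck objective mentioned above is commonly written in the generic form below. This is the standard textbook formulation, not necessarily the exact objective used in this work, and every symbol is assumed notation introduced here for illustration.

```latex
% Assumed notation (not taken from the abstract):
%   x = sensor observations, y = relative pose, z = latent representation,
%   p_\theta(z \mid x) = stochastic encoder, q_\phi(y \mid z) = pose decoder,
%   r(z) = fixed prior over the latent, \beta > 0 = trade-off weight.
\begin{align}
  \max_{\theta}\; & I(Z;Y) - \beta\, I(Z;X), \\
  \mathcal{L}(\theta,\phi) &=
    \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\theta(z\mid x)}\!\left[-\log q_\phi(y\mid z)\right]
    + \beta\,\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p_\theta(z\mid x)\,\|\,r(z)\right)\right].
\end{align}
```

Minimizing the loss in the second line maximizes a lower bound on the first (up to the constant entropy of the pose), since the decoder term bounds I(Z;Y) from below and the KL term bounds I(Z;X) from above. The KL penalty is what discourages the latent from retaining pose-irrelevant information.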