Neoadjuvant therapy (NAT) for breast cancer is a common treatment option in clinical practice. Tumor cellularity (TC), which represents the percentage of invasive tumors in the tumor bed, has been widely used to quantify the response of breast cancer to NAT. Therefore, automatic TC estimation is significant in clinical practice. However, existing state-of-the-art methods usually take it as a TC score regression problem, which ignores the ambiguity of TC labels caused by subjective assessment or multiple raters. In this paper, to efficiently leverage the label ambiguities, we proposed an Uncertainty-aware Label disTRibution leArning (ULTRA) framework for automatic TC estimation. The proposed ULTRA first converted the single-value TC labels to discrete label distributions, which effectively models the ambiguity among all possible TC labels. Furthermore, the network learned TC label distributions by minimizing the Kullback-Leibler (KL) divergence between the predicted and ground-truth TC label distributions, which better supervised the model to leverage the ambiguity of TC labels. Moreover, the ULTRA mimicked the multi-rater fusion process in clinical practice with a multi-branch feature fusion module to further explore the uncertainties of TC labels. We evaluated the ULTRA on the public BreastPathQ dataset. The experimental results demonstrate that the ULTRA outperformed the regression-based methods for a large margin and achieved state-of-the-art results. The code will be available from https://github.com/PerceptionComputingLab/ULTRA
Following SimCSE, contrastive learning based methods have achieved the state-of-the-art (SOTA) performance in learning sentence embeddings. However, the unsupervised contrastive learning methods still lag far behind the supervised counterparts. We attribute this to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation to flip the case of the first letter of randomly selected words in a sentence. This is to counteract the intrinsic bias of pre-trained token embeddings to frequency, word cases and subwords. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining the above two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
In this paper, we investigate secure transmissions in integrated satellite-terrestrial communications and the green interference based symbiotic security scheme is proposed. Particularly, the co-channel interference induced by the spectrum sharing between satellite and terrestrial networks and the inter-beam interference due to frequency reuse among satellite multi-beam serve as the green interference to assist the symbiotic secure transmission, where the secure transmissions of both satellite and terrestrial links are guaranteed simultaneously. Specifically, to realize the symbiotic security, we formulate a problem to maximize the sum secrecy rate of satellite users by cooperatively beamforming optimizing and a constraint of secrecy rate of each terrestrial user is guaranteed. Since the formulated problem is non-convex and intractable, the Taylor expansion and semi-definite relaxation (SDR) are adopted to further reformulate this problem, and the successive convex approximation (SCA) algorithm is designed to solve it. Finally, the tightness of the relaxation is proved. In addition, numerical results verify the efficiency of our proposed approach.
Domain generalization aims to learn a predictive model from multiple different but related source tasks that can generalize well to a target task without the need of accessing any target data. Existing domain generalization methods ignore the relationship between tasks, implicitly assuming that all the tasks are sampled from a stationary environment. Therefore, they can fail when deployed in an evolving environment. To this end, we formulate and study the \emph{evolving domain generalization} (EDG) scenario, which exploits not only the source data but also their evolving pattern to generate a model for the unseen task. Our theoretical result reveals the benefits of modeling the relation between two consecutive tasks by learning a globally consistent directional mapping function. In practice, our analysis also suggests solving the DDG problem in a meta-learning manner, which leads to \emph{directional prototypical network}, the first method for the DDG problem. Empirical evaluation of both synthetic and real-world data sets validates the effectiveness of our approach.
With the increasingly complex and changeable electromagnetic environment, wireless communication systems are facing jamming and abnormal signal injection, which significantly affects the normal operation of a communication system. In particular, the abnormal signals may emulate the normal signals, which makes it very challenging for abnormal signal recognition. In this paper, we propose a new abnormal signal recognition scheme, which combines time-frequency analysis with deep learning to effectively identify synthetic abnormal communication signals. Firstly, we emulate synthetic abnormal communication signals including seven jamming patterns. Then, we model an abnormal communication signals recognition system based on the communication protocol between the transmitter and the receiver. To improve the performance, we convert the original signal into the time-frequency spectrogram to develop an image classification algorithm. Simulation results demonstrate that the proposed method can effectively recognize the abnormal signals under various parameter configurations, even under low signal-to-noise ratio (SNR) and low jamming-to-signal ratio (JSR) conditions.
Adversarial attacks seriously threaten the high accuracy of face anti-spoofing models. Little adversarial noise can perturb their classification of live and spoofing. The existing adversarial attacks fail to figure out which part of the target face anti-spoofing model is vulnerable, making adversarial analysis tricky. So we propose fine-grained attacks for exposing adversarial vulnerability of face anti-spoofing models. Firstly, we propose Semantic Feature Augmentation (SFA) module, which makes adversarial noise semantic-aware to live and spoofing features. SFA considers the contrastive classes of data and texture bias of models in the context of face anti-spoofing, increasing the attack success rate by nearly 40% on average. Secondly, we generate fine-grained adversarial examples based on SFA and the multitask network with auxiliary information. We evaluate three annotations (facial attributes, spoofing types and illumination) and two geometric maps (depth and reflection), on four backbone networks (VGG, Resnet, Densenet and Swin Transformer). We find that facial attributes annotation and state-of-art networks fail to guarantee that models are robust to adversarial attacks. Such adversarial attacks can be generalized to more auxiliary information and backbone networks, to help our community handle the trade-off between accuracy and adversarial robustness.
The Fine-Grained Visual Categorization (FGVC) is challenging because the subtle inter-class variations are difficult to be captured. One notable research line uses the Global Covariance Pooling (GCP) layer to learn powerful representations with second-order statistics, which can effectively model inter-class differences. In our previous conference paper, we show that truncating small eigenvalues of the GCP covariance can attain smoother gradient and improve the performance on large-scale benchmarks. However, on fine-grained datasets, truncating the small eigenvalues would make the model fail to converge. This observation contradicts the common assumption that the small eigenvalues merely correspond to the noisy and unimportant information. Consequently, ignoring them should have little influence on the performance. To diagnose this peculiar behavior, we propose two attribution methods whose visualizations demonstrate that the seemingly unimportant small eigenvalues are crucial as they are in charge of extracting the discriminative class-specific features. Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues. Without introducing any additional parameters, this branch simply amplifies the small eigenvalues and achieves state-of-the-art performances of GCP methods on three fine-grained benchmarks. Furthermore, the performance is also competitive against other FGVC approaches on larger datasets. Code is available at \href{https://github.com/KingJamesSong/DifferentiableSVD}{https://github.com/KingJamesSong/DifferentiableSVD}.
User privacy is of great concern in Federated Learning, while Vision Transformers (ViTs) have been revealed to be vulnerable to gradient-based inversion attacks. We show that the learned low-dimensional spatial prior in position embeddings (PEs) accelerates the training of ViTs. As a side effect, it makes the ViTs tend to be position sensitive and at high risk of privacy leakage. We observe that enhancing the position-insensitive property of a ViT model is a promising way to protect data privacy against these gradient attacks. However, simply removing the PEs may not only harm the convergence and accuracy of ViTs but also places the model at more severe privacy risk. To deal with the aforementioned contradiction, we propose a simple yet efficient Masked Jigsaw Puzzle (MJP) method to break the chain of gradient leakage in ViTs. MJP can be easily plugged into existing ViTs and their derived variants. Extensive experiments demonstrate that our proposed MJP method not only boosts the performance on large-scale datasets (i.e., ImageNet-1K), but can also improve the privacy preservation capacity in the typical gradient attacks by a large margin. Our code is available at: https://github.com/yhlleo/MJP.
Large-scale pretrained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which creates inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.