We introduce a privacy auditing scheme for ML models that relies on membership inference attacks using generated data as "non-members". This scheme, which we call PANORAMIA, quantifies the privacy leakage for large-scale ML models without control of the training process or model re-training and only requires access to a subset of the training data. To demonstrate its applicability, we evaluate our auditing scheme across multiple ML domains, ranging from image and tabular data classification to large-scale language models.
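The idea of treating generated data as "non-members" can be illustrated with a simple loss-threshold membership test. This is only a sketch under stated assumptions: the abstract does not say which attack PANORAMIA uses or how attack performance is converted into a leakage measure, and the function name, the per-example losses, and the threshold are all hypothetical.

```python
def threshold_mia_accuracy(member_losses, generated_losses, threshold):
    """Balanced accuracy of the rule 'loss < threshold => member'.

    member_losses: target-model losses on known training examples.
    generated_losses: target-model losses on synthetic examples, which
        stand in for non-members (assumption: they were never trained on).
    """
    if not member_losses or not generated_losses:
        raise ValueError("both loss lists must be non-empty")
    # True positive rate: members correctly flagged as members.
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    # True negative rate: generated samples correctly flagged as non-members.
    tnr = sum(loss >= threshold for loss in generated_losses) / len(generated_losses)
    return 0.5 * (tpr + tnr)
```

A balanced accuracy near 0.5 would mean this particular attack cannot tell members from generated samples; values well above 0.5 suggest the model treats its training data differently.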
Recent advancements in neural compression have surpassed traditional codecs in PSNR and MS-SSIM measurements. However, at low bit-rates, these methods can introduce visually displeasing artifacts, such as blurring, color shifting, and texture loss, thereby compromising the perceptual quality of images. To address these issues, this study presents an enhanced neural compression method designed for optimal visual fidelity. We have trained our model with a sophisticated semantic ensemble loss, integrating Charbonnier loss, perceptual loss, style loss, and a non-binary adversarial loss, to enhance the perceptual quality of image reconstructions. Additionally, we have implemented a latent refinement process to generate content-aware latent codes. These codes adhere to bit-rate constraints, balance the trade-off between distortion and fidelity, and prioritize bit allocation to regions of greater importance. Our empirical findings demonstrate that this approach significantly improves the statistical fidelity of neural image compression. On the CLIC2024 validation set, our approach achieves a 62% bitrate saving compared to MS-ILLM under the FID metric.
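Of the loss terms listed above, the Charbonnier loss is the simplest to write down: it is a smooth approximation of the L1 loss, sqrt(d^2 + eps^2) averaged over pixels. The sketch below assumes flat lists of pixel values and an eps of 1e-3; neither the value of eps nor the weighting of this term within the ensemble loss is given in the abstract, and the function name is hypothetical.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt((p - t)^2 + eps^2) over paired values.

    eps is an assumed smoothing constant; it keeps the gradient finite
    where the prediction matches the target exactly.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)
```

For large errors the penalty grows like |p - t|, so it is less sensitive to outliers than a squared error, while remaining differentiable at zero.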
The primary goal of the L3DAS23 Signal Processing Grand Challenge at ICASSP 2023 is to promote and support collaborative research on machine learning for 3D audio signal processing, with a specific emphasis on 3D speech enhancement and 3D Sound Event Localization and Detection in Extended Reality applications. As part of our latest competition, we provide a brand-new dataset, which retains the general characteristics of the L3DAS21 and L3DAS22 datasets, but with first-order Ambisonics recordings from multiple reverberant simulated environments. Moreover, we start exploring an audio-visual scenario by providing images of these environments, as perceived by the different microphone positions and orientations. We also propose updated baseline models for both tasks that can now support audio-image pairs as input and a supporting API to replicate our results. Finally, we present the results of the participants. Further details about the challenge are available at https://www.l3das.com/icassp2023.
While score-based generative models (SGMs) have achieved remarkable success in a wide range of image generation tasks, their mathematical foundations are still limited. In this paper, we analyze the approximation and generalization of SGMs in learning a family of sub-Gaussian probability distributions. We introduce a notion of complexity for probability distributions in terms of their relative density with respect to the standard Gaussian measure. We prove that if the log-relative density can be locally approximated by a neural network whose parameters can be suitably bounded, then the distribution generated by empirical score matching approximates the target distribution in total variation with a dimension-independent rate. We illustrate our theory through examples, which include certain mixtures of Gaussians. An essential ingredient of our proof is to derive a dimension-free deep neural network approximation rate for the true score function associated with the forward process, which is interesting in its own right.
Machine learning model bias can arise from dataset composition: sensitive features correlated with the learning target disturb the model decision rule and lead to performance differences along the features. Existing de-biasing work captures prominent and delicate image features which are traceable in model latent space, like colors of digits or background of animals. However, using the latent space is not sufficient to understand all dataset feature correlations. In this work, we propose a framework to extract feature clusters in a dataset based on image descriptions, allowing us to capture both subtle and coarse features of the images. The feature co-occurrence pattern is formulated and correlation is measured, utilizing a human-in-the-loop for examination. The analyzed features and correlations are human-interpretable, so we name the method Common-Sense Bias Discovery (CSBD). Having exposed sensitive correlations in a dataset, we demonstrate that downstream model bias can be mitigated by adjusting image sampling weights, without requiring sensitive group label supervision. Experiments show that our method discovers novel biases on multiple classification tasks for two benchmark image datasets, and the intervention outperforms state-of-the-art unsupervised bias mitigation methods.
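One way to picture "adjusting image sampling weights" is the classic reweighing rule w(y, f) = P(y) P(f) / P(y, f), which makes a discovered feature f statistically independent of the target y under the weighted distribution. This is a swapped-in, generic technique used purely for illustration; the abstract does not state the weighting formula CSBD uses, and the function and argument names are hypothetical.

```python
from collections import Counter


def decorrelating_weights(labels, features):
    """Per-sample weights w = P(y) * P(f) / P(y, f).

    labels: target label of each image.
    features: value of one discovered feature for each image (for example,
        1 if the description mentions it, else 0).
    Over-represented (label, feature) pairs get weights below 1 and
    under-represented pairs get weights above 1.
    """
    if len(labels) != len(features):
        raise ValueError("labels and features must have the same length")
    n = len(labels)
    joint = Counter(zip(labels, features))
    label_counts = Counter(labels)
    feature_counts = Counter(features)
    return [
        label_counts[y] * feature_counts[f] / (n * joint[(y, f)])
        for y, f in zip(labels, features)
    ]
```

The resulting list could be passed to any weighted sampler; when the label and the feature are already independent, every weight equals 1.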
Image registration is an essential process for aligning features of interest from multiple images. With the recent development of deep learning techniques, image registration approaches have advanced to a new level. In this work, we present 'Rotation-Equivariant network and Transformers for Image Registration' (RoTIR), a deep-learning-based method for the alignment of fish scale images captured by light microscopy. This approach overcomes the challenge of arbitrary rotation and translation detection, as well as the absence of ground truth data. We employ feature-matching approaches based on Transformers and general E(2)-equivariant steerable CNNs for model creation. In addition, an artificial training dataset is employed for semi-supervised learning. Results show RoTIR successfully achieves the goal of fish scale image registration.
Histologic examination plays a crucial role in oncology research and diagnostics. The adoption of digital scanning of whole slide images (WSI) has created an opportunity to leverage deep learning-based image classification methods to enhance diagnosis and risk stratification. Technical limitations of current approaches to training deep convolutional neural networks (DCNN) result in suboptimal model performance and make training and deployment of comprehensive classification models impractical. In this study, we introduce a novel approach that addresses the main limitations of traditional histopathology classification model training. Our method, termed Learned Resizing with Efficient Training (LRET), couples efficient training techniques with image resizing to facilitate seamless integration of larger histology image patches into state-of-the-art classification models while preserving important structural information. We used the LRET method coupled with two distinct resizing techniques to train on three diverse histology image datasets using multiple diverse DCNN architectures. Our findings demonstrate a significant enhancement in classification performance and training efficiency. Across the spectrum of experiments, LRET consistently outperforms existing methods, yielding a substantial improvement of 15-28% in accuracy for a large-scale, multiclass tumor classification task consisting of 74 distinct brain tumor types. LRET not only elevates classification accuracy but also substantially reduces training times, unlocking the potential for faster model development and iteration. The implications of this work extend to broader applications within medical imaging and beyond, where efficient integration of high-resolution images into deep learning pipelines is paramount for driving advancements in research and clinical practice.
The growth of generative adversarial network (GAN) models has expanded the capabilities of image processing and provides numerous industries with the technology to produce realistic image transformations. However, because the field is still young, new evaluation metrics can further this research. Previous research has shown the Fréchet Inception Distance (FID) to be an effective metric when testing image-to-image GANs in real-world applications. Signed Inception Distance (SID), a metric introduced in 2023, expands on FID by allowing unsigned distances. This paper applies Pix2Pix and CycleGAN models to public datasets of façades, cityscapes, and maps. After training, these models are evaluated on both inception distance metrics, which measure the generative performance of the trained models. Our findings indicate that SID is an efficient and effective metric that complements, or even exceeds, FID for evaluating image-to-image GANs.
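For reference, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below is a simplified version that assumes diagonal covariances, so the trace term reduces to a sum of (sqrt(var_r) - sqrt(var_g))^2 and the code needs only the standard library. A real FID computation uses full covariance matrices of Inception-network features; the function name and the toy feature lists are hypothetical. SID is not sketched because the abstract does not define it.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(real_feats, fake_feats):
    """Squared Fréchet distance between two diagonal-covariance Gaussians.

    real_feats, fake_feats: lists of equal-length feature vectors.
    Uses population variance per dimension (an assumption; common FID
    implementations use the unbiased sample covariance instead).
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [row[d] for row in real_feats]
        fake_col = [row[d] for row in fake_feats]
        mu_r, mu_f = fmean(real_col), fmean(fake_col)
        var_r = pvariance(real_col, mu_r)
        var_f = pvariance(fake_col, mu_f)
        # Mean term plus the diagonal form of the trace term.
        total += (mu_r - mu_f) ** 2 + (math.sqrt(var_r) - math.sqrt(var_f)) ** 2
    return total
```

Lower values mean the two feature distributions are closer; identical feature sets give a distance of zero.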
Person re-identification (re-ID) continues to pose a significant challenge, particularly in scenarios involving occlusions. Prior approaches aimed at tackling occlusions have predominantly focused on aligning physical body features using external semantic cues. However, these methods tend to be intricate and susceptible to noise. To address the aforementioned challenges, we present an innovative end-to-end solution known as the Dynamic Patch-aware Enrichment Transformer (DPEFormer). This model effectively distinguishes human body information from occlusions automatically and dynamically, eliminating the need for external detectors or precise image alignment. Specifically, we introduce a dynamic patch token selection module (DPSM). DPSM utilizes a label-guided proxy token as an intermediary to identify informative occlusion-free tokens. These tokens are then selected for deriving subsequent local part features. To facilitate the seamless integration of global classification features with the finely detailed local features selected by DPSM, we introduce a novel feature blending module (FBM). FBM enhances feature representation through the complementary nature of information and the exploitation of part diversity. Furthermore, to ensure that DPSM and the entire DPEFormer can effectively learn with only identity labels, we also propose a Realistic Occlusion Augmentation (ROA) strategy. This strategy leverages the recent advances in the Segment Anything Model (SAM). As a result, it generates occlusion images that closely resemble real-world occlusions, greatly enhancing the subsequent contrastive learning process. Experiments on occluded and holistic re-ID benchmarks show that DPEFormer substantially outperforms existing state-of-the-art approaches. The code will be made publicly available.
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because they avoid additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Building on these findings and aiming to match the learning capacity of FT, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.
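The magnitude/direction split can be sketched in a few lines. The code below assumes the merged weight takes the form W' = m * (W0 + B A) / ||W0 + B A||_c, where ||.||_c is the per-column Euclidean norm, m is a trainable magnitude vector initialized to ||W0||_c, and B A is the low-rank LoRA update to the direction. The abstract does not spell out this formula, so treat it as an assumption; the helper names are hypothetical, and plain Python lists stand in for a tensor library.

```python
import math


def matmul(a, b):
    """Plain-list matrix product of a (n x r) and b (r x k)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]


def dora_weight(w0, magnitude, lora_b, lora_a):
    """Merged weight: magnitude[j] * (W0 + B A)[:, j] / ||(W0 + B A)[:, j]||.

    Assumes no column of W0 + B A is all zeros (otherwise the direction
    is undefined and this divides by zero).
    """
    delta = matmul(lora_b, lora_a)  # low-rank directional update
    v = [
        [w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]
    norms = column_norms(v)
    return [
        [magnitude[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
        for i in range(len(v))
    ]
```

With B set to zeros and the magnitude set to the column norms of W0, the merged weight equals W0, so fine-tuning starts from the pre-trained model. Because the result is a single matrix of the original shape, it can replace W0 after training, which is consistent with the claim of no additional inference overhead.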