In neural network binarization, BinaryConnect (BC) and its variants are considered the standard. These methods apply the sign function in their forward pass and their respective gradients are backpropagated to update the weights. However, the derivative of the sign function is zero whenever defined, which consequently freezes training. Therefore, implementations of BC (e.g., BNN) usually replace the derivative of sign in the backward computation with identity or other approximate gradient alternatives. Although such practice works well empirically, it is largely a heuristic or ''training trick.'' We aim at shedding some light on these training tricks from the optimization perspective. Building from existing theory on ProxConnect (PC, a generalization of BC), we (1) equip PC with different forward-backward quantizers and obtain ProxConnect++ (PC++) that includes existing binarization techniques as special cases; (2) derive a principled way to synthesize forward-backward quantizers with automatic theoretical guarantees; (3) illustrate our theory by proposing an enhanced binarization algorithm BNN++; (4) conduct image classification experiments on CNNs and vision transformers, and empirically verify that BNN++ generally achieves competitive results on binarizing these models.
While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module - a novel, training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process encompasses two main steps: i) Object generation, which adjusts the latent encoding to guarantee object generation and directs it within specified bounding boxes, and ii) attribute binding, guaranteeing that generated objects adhere to their specified attributes in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I models, markedly enhancing model performance in addressing the key challenges. We evaluate our technique using the established CompBench and TIFA score benchmarks, demonstrating significant performance improvements compared to existing methods. The source code will be made publicly available at https://github.com/nextaistudio/BoxIt2BindIt.
In machine learning fairness, training models which minimize disparity across different sensitive groups often leads to diminished accuracy, a phenomenon known as the fairness-accuracy trade-off. The severity of this trade-off fundamentally depends on dataset characteristics such as dataset imbalances or biases. Therefore using a uniform fairness requirement across datasets remains questionable and can often lead to models with substantially low utility. To address this, we present a computationally efficient approach to approximate the fairness-accuracy trade-off curve tailored to individual datasets, backed by rigorous statistical guarantees. By utilizing the You-Only-Train-Once (YOTO) framework, our approach mitigates the computational burden of having to train multiple models when approximating the trade-off curve. Moreover, we quantify the uncertainty in our approximation by introducing confidence intervals around this curve, offering a statistically grounded perspective on the acceptable range of fairness violations for any given accuracy threshold. Our empirical evaluation spanning tabular, image and language datasets underscores that our approach provides practitioners with a principled framework for dataset-specific fairness decisions across various data modalities.
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in target image. It is note that the VRP encoder can support a variety of annotation formats for reference images, including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths, thus enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation.
Diffusion and Poisson flow models have demonstrated remarkable success for a wide range of generative tasks. Nevertheless, their iterative nature results in computationally expensive sampling and the number of function evaluations (NFE) required can be orders of magnitude larger than for single-step methods. Consistency models are a recent class of deep generative models which enable single-step sampling of high quality data without the need for adversarial training. In this paper, we introduce a novel image denoising technique which combines the flexibility afforded in Poisson flow generative models (PFGM)++ with the, high quality, single step sampling of consistency models. The proposed method first learns a trajectory between a noise distribution and the posterior distribution of interest by training PFGM++ in a supervised fashion. These pre-trained PFGM++ are subsequently "distilled" into Poisson flow consistency models (PFCM) via an updated version of consistency distillation. We call this approach posterior sampling Poisson flow consistency models (PS-PFCM). Our results indicate that the added flexibility of tuning the hyperparameter D, the dimensionality of the augmentation variables in PFGM++, allows us to outperform consistency models, a current state-of-the-art diffusion-style model with NFE=1 on clinical low-dose CT images. Notably, PFCM is in itself a novel family of deep generative models and we provide initial results on the CIFAR-10 dataset.
Radiometric accuracy of data is crucial in quantitative precision agriculture, to produce reliable and repeatable data for modeling and decision making. The effect of exposure time and gain settings on the radiometric accuracy of multispectral images was not explored enough. The goal of this study was to determine if having a fixed exposure (FE) time during image acquisition improved radiometric accuracy of images, compared to the default auto-exposure (AE) settings. This involved quantifying the errors from auto-exposure and determining ideal exposure values within which radiometric mean absolute percentage error (MAPE) were minimal (< 5%). The results showed that FE orthomosaic was closer to ground-truth (higher R2 and lower MAPE) than AE orthomosaic. An ideal exposure range was determined for capturing canopy and soil objects, without loss of information from under-exposure or saturation from over-exposure. A simulation of errors from AE showed that MAPE < 5% for the blue, green, red, and NIR bands and < 7% for the red edge band for exposure settings within the determined ideal ranges and increased exponentially beyond the ideal exposure upper limit. Further, prediction of total plant nitrogen uptake (g/plant) using vegetation indices (VIs) from two different growing seasons were closer to the ground truth (mostly, R2 > 0.40, and MAPE = 12 to 14%, p < 0.05) when FE was used, compared to the prediction from AE images (mostly, R2 < 0.13, MAPE = 15 to 18%, p >= 0.05).
Dithering is a technique that can improve human perception of low-resolution data by reducing quantization artifacts. In this work we formalize and analytically justify two metrics for quantization artifact prominence, using them to design a novel dithering method for distortion-controlled data compression. We present theoretical entropy calculations for this dither and experimentally validate its performance on a low-rate image compression task. The result is a drastic improvement in the perceptual quality of quantized images with a lower recompression entropy than any state-of-the-art dither technique, achieving 45 points lower PIQUE at the same rate or 40% lower rate at the same PIQUE. The proposed dither is an adaptable tool applicable for use in any lossy compression system, permitting precise control of rate-distortion characteristics for both compression and recompression.
This article presents a novel undersampled magnetic resonance imaging (MRI) technique that leverages the concept of Neural Radiance Field (NeRF). With radial undersampling, the corresponding imaging problem can be reformulated into an image modeling task from sparse-view rendered data; therefore, a high dimensional MR image is obtainable from undersampled $k$-space data by taking advantage of implicit neural representation. A multi-layer perceptron, which is designed to output an image intensity from a spatial coordinate, learns the MR physics-driven rendering relation between given measurement data and desired image. Effective undersampling strategies for high-quality neural representation are investigated. The proposed method serves two benefits: (i) The learning is based fully on single undersampled $k$-space data, not a bunch of measured data and target image sets. It can be used potentially for diagnostic MR imaging, such as fetal MRI, where data acquisition is relatively rare or limited against diversity of clinical images while undersampled reconstruction is highly demanded. (ii) A reconstructed MR image is a scan-specific representation highly adaptive to the given $k$-space measurement. Numerous experiments validate the feasibility and capability of the proposed approach.
This study develops and evaluates a novel multimodal medical image zero-shot segmentation algorithm named Text-Visual-Prompt SAM (TV-SAM) without any manual annotations. TV-SAM incorporates and integrates large language model GPT-4, Vision Language Model GLIP, and Segment Anything Model (SAM), to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing SAM for zero-shot segmentation. Comprehensive evaluations are implemented on seven public datasets encompassing eight imaging modalities to demonstrate that TV-SAM can effectively segment unseen targets across various modalities without additional training, significantly outperforming SAM AUTO and GSAM, closely matching the performance of SAM BBOX with gold standard bounding box prompts, and surpassing the state-of-the-art on specific datasets like ISIC and WBC. The study indicates that TV-SAM serves as an effective multimodal medical image zero-shot segmentation algorithm, highlighting the significant contribution of GPT-4 to zero-shot segmentation. By integrating foundational models such as GPT-4, GLIP, and SAM, it could enhance the capability to address complex problems in specialized domains. The code is available at: https://github.com/JZK00/TV-SAM.
Precise situational awareness is required for the safe decision-making of assisted and automated driving (AAD) functions. Panoptic segmentation is a promising perception technique to identify and categorise objects, impending hazards, and driveable space at a pixel level. While segmentation quality is generally associated with the quality of the camera data, a comprehensive understanding and modelling of this relationship are paramount for AAD system designers. Motivated by such a need, this work proposes a unifying pipeline to assess the robustness of panoptic segmentation models for AAD, correlating it with traditional image quality. The first step of the proposed pipeline involves generating degraded camera data that reflects real-world noise factors. To this end, 19 noise factors have been identified and implemented with 3 severity levels. Of these factors, this work proposes novel models for unfavourable light and snow. After applying the degradation models, three state-of-the-art CNN- and vision transformers (ViT)-based panoptic segmentation networks are used to analyse their robustness. The variations of the segmentation performance are then correlated to 8 selected image quality metrics. This research reveals that: 1) certain specific noise factors produce the highest impact on panoptic segmentation, i.e. droplets on lens and Gaussian noise; 2) the ViT-based panoptic segmentation backbones show better robustness to the considered noise factors; 3) some image quality metrics (i.e. LPIPS and CW-SSIM) correlate strongly with panoptic segmentation performance and therefore they can be used as predictive metrics for network performance.