Large-scale text-to-image diffusion models have shown impressive capabilities across various generative tasks, enabled by strong vision-language alignment obtained through pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding.
Analog to Digital Converters (ADCs) are a major contributor to the power consumption of multiple-input multiple-output (MIMO) receivers with large antenna arrays operating in the millimeter wave and terahertz carrier frequencies. This is especially the case in large bandwidth terahertz communication systems, due to the sudden drop in energy-efficiency of ADCs as the sampling rate is increased above 100MHz. Two mitigating energy-efficient approaches which have received significant recent interest are i) to reduce the number of ADCs via analog and hybrid beamforming architectures, and ii) to reduce the resolution of the ADCs which in turn decreases power consumption. However, decreasing the number and resolution of ADCs leads to performance loss -- in terms of achievable rates -- due to increased quantization error. In this work, we study the application of practically implementable nonlinear analog operators such as envelop detectors and polynomial operators, prior to sampling and quantization at the ADCs, as a way to mitigate the aforementioned rate-loss. A receiver architecture consisting of linear analog combiners, nonlinear analog operators, and few-bit ADCs is designed. The fundamental information theoretic performance limits of the resulting communication system, in terms of achievable rates, are investigated under various assumptions on the set of implementable analog operators. Various numerical evaluations and simulations of the communication system are provided to compare the set of achievable rates under different architecture designs and parameters. Circuit simulations {in a 65 nm CMOS technology} exhibiting the generation of envelope detectors and polynomial operators are provided, and their power consumption is compared.