Jeff
Abstract:In this study, we introduce a novel cross-modal retrieval task involving speaker descriptions and their corresponding audio samples. Utilizing pre-trained speaker and text encoders, we present a simple learning framework based on contrastive learning. Additionally, we explore the impact of incorporating speaker labels into the training process. Our findings establish the effectiveness of linking speaker and text information for the task for both English and Japanese languages, across diverse data configurations. Additional visual analysis unveils potential nuanced associations between speaker clustering and retrieval performance.
Abstract:Quantum Markov chains generalize classical Markov chains for random variables to the quantum realm and exhibit unique inherent properties, making them an important feature in quantum information theory. In this work, we propose the concept of virtual quantum Markov chains (VQMCs), focusing on scenarios where subsystems retain classical information about global systems from measurement statistics. As a generalization of quantum Markov chains, VQMCs characterize states where arbitrary global shadow information can be recovered from subsystems through local quantum operations and measurements. We present an algebraic characterization for virtual quantum Markov chains and show that the virtual quantum recovery is fully determined by the block matrices of a quantum state on its subsystems. Notably, we find a distinction between two classes of tripartite entanglement by showing that the W state is a VQMC while the GHZ state is not. Furthermore, we establish semidefinite programs to determine the optimal sampling overhead and the robustness of virtual quantum Markov chains. We demonstrate the optimal sampling overhead is additive, indicating no free lunch to further reduce the sampling cost of recovery from parallel calls of the VQMC states. Our findings elucidate distinctions between quantum Markov chains and virtual quantum Markov chains, extending our understanding of quantum recovery to scenarios prioritizing classical information from measurement statistics.
Abstract:Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundary of specific events. In this paper, we solve this issue via proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundary. Specifically, our VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Besides, benefits from the fine-grained temporal understanding of the videos further enable VTimeLLM to beat existing Video LLMs in video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities.
Abstract:Human albumin is essential for indicating the body's overall health. Accurately predicting plasma albumin levels and determining appropriate doses are urgent clinical challenges, particularly in critically ill patients, to maintain optimal blood levels. However, human albumin prediction is non-trivial that has to leverage the dynamics of biochemical markers as well as the experience of treating patients. Moreover, the problem of distribution shift is often encountered in real clinical data, which may lead to a decline in the model prediction performance and reduce the reliability of the model's application. In this paper, we propose a framework named Out-of-Distribution Generalized Dynamic Graph Neural Network for Human Albumin Prediction (DyG-HAP), which is able to provide accurate albumin predictions for Intensity Care Unit (ICU) patients during hospitalization. We first model human albumin prediction as a dynamic graph regression problem to model the dynamics and patient relationship. Then, we propose a disentangled dynamic graph attention mechanism to capture and disentangle the patterns whose relationship to labels under distribution shifts is invariant and variant respectively. Last, we propose an invariant dynamic graph regression method to encourage the model to rely on invariant patterns to make predictions. Moreover, we propose a dataset named Albumin level testing and nutritional dosing data for Intensive Care (ANIC) for evaluation. Extensive experiments demonstrate the superiority of our method compared to several baseline methods in human albumin prediction.
Abstract:Federated learning (FL) can suffer from a communication bottleneck when deployed in mobile networks, limiting participating clients and deterring FL convergence. The impact of practical air interfaces with discrete modulations on FL has not previously been studied in depth. This paper proposes a new paradigm of flexible aggregation-based FL (F$^2$L) over orthogonal frequency division multiple-access (OFDMA) air interface, termed as ``OFDMA-F$^2$L'', allowing selected clients to train local models for various numbers of iterations before uploading the models in each aggregation round. We optimize the selections of clients, subchannels and modulations, adapting to channel conditions and computing powers. Specifically, we derive an upper bound on the optimality gap of OFDMA-F$^2$L capturing the impact of the selections, and show that the upper bound is minimized by maximizing the weighted sum rate of the clients per aggregation round. A Lagrange-dual based method is developed to solve this challenging mixed integer program of weighted sum rate maximization, revealing that a ``winner-takes-all'' policy provides the almost surely optimal client, subchannel, and modulation selections. Experiments on multilayer perceptrons and convolutional neural networks show that OFDMA-F$^2$L with optimal selections can significantly improve the training convergence and accuracy, e.g., by about 18\% and 5\%, compared to potential alternatives.




Abstract:Dynamic graph neural networks (DyGNNs) have demonstrated powerful predictive abilities by exploiting graph structural and temporal dynamics. However, the existing DyGNNs fail to handle distribution shifts, which naturally exist in dynamic graphs, mainly because the patterns exploited by DyGNNs may be variant with respect to labels under distribution shifts. In this paper, we propose Disentangled Intervention-based Dynamic graph Attention networks with Invariance Promotion (I-DIDA) to handle spatio-temporal distribution shifts in dynamic graphs by discovering and utilizing invariant patterns, i.e., structures and features whose predictive abilities are stable across distribution shifts. Specifically, we first propose a disentangled spatio-temporal attention network to capture the variant and invariant patterns. By utilizing the disentangled patterns, we design a spatio-temporal intervention mechanism to create multiple interventional distributions and an environment inference module to infer the latent spatio-temporal environments, and minimize the variance of predictions among these intervened distributions and environments, so that our model can make predictions based on invariant patterns with stable predictive abilities under distribution shifts. Extensive experiments demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts. Our work is the first study of spatio-temporal distribution shifts in dynamic graphs, to the best of our knowledge.
Abstract:With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code will be available upon publication of the paper.




Abstract:Diffusion models have achieved great success due to their remarkable generation ability. However, their high computational overhead is still a troublesome problem. Recent studies have leveraged post-training quantization (PTQ) to compress diffusion models. However, most of them only focus on unconditional models, leaving the quantization of widely used large pretrained text-to-image models, e.g., Stable Diffusion, largely unexplored. In this paper, we propose a novel post-training quantization method PCR (Progressive Calibration and Relaxing) for text-to-image diffusion models, which consists of a progressive calibration strategy that considers the accumulated quantization error across timesteps, and an activation relaxing strategy that improves the performance with negligible cost. Additionally, we demonstrate the previous metrics for text-to-image diffusion model quantization are not accurate due to the distribution gap. To tackle the problem, we propose a novel QDiffBench benchmark, which utilizes data in the same domain for more accurate evaluation. Besides, QDiffBench also considers the generalization performance of the quantized model outside the calibration dataset. Extensive experiments on Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our method and benchmark. Moreover, we are the first to achieve quantization for Stable Diffusion XL while maintaining the performance.
Abstract:In the field of clinical medicine, computed tomography (CT) is an effective medical imaging modality for the diagnosis of various pathologies. Compared with X-ray images, CT images can provide more information, including multi-planar slices and three-dimensional structures for clinical diagnosis. However, CT imaging requires patients to be exposed to large doses of ionizing radiation for a long time, which may cause irreversible physical harm. In this paper, we propose an Uncertainty-aware MedNeRF (UMedNeRF) network based on generated radiation fields. The network can learn a continuous representation of CT projections from 2D X-ray images by obtaining the internal structure and depth information and using adaptive loss weights to ensure the quality of the generated images. Our model is trained on publicly available knee and chest datasets, and we show the results of CT projection rendering with a single X-ray and compare our method with other methods based on generated radiation fields.
Abstract:Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which can achieve on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50\% MACs and Parameter reduction.