Text-to-image diffusion models are nothing short of a revolution, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person after learning from just a few of that person's reference images. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, with a severe negative social impact. In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system aims to add a subtle noise perturbation to each user's image before publishing in order to disrupt the generation quality of any DreamBooth model trained on these perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets over various text-to-image model versions. Despite the complicated formulation of DreamBooth and diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatching between training and testing. Our code will be available at \href{https://github.com/VinAIResearch/Anti-DreamBooth.git}{https://github.com/VinAIResearch/Anti-DreamBooth.git}.
Any-scale image synthesis offers an efficient and scalable solution to synthesize photo-realistic images at any scale, even going beyond 2K resolution. However, existing GAN-based solutions depend excessively on convolutions and a hierarchical architecture, which introduce inconsistency and the ``texture sticking'' issue when scaling the output resolution. From another perspective, INR-based generators are scale-equivariant by design, but their huge memory footprint and slow inference hinder these networks from being adopted in large-scale or real-time systems. In this work, we propose $\textbf{C}$olumn-$\textbf{R}$ow $\textbf{E}$ntangled $\textbf{P}$ixel $\textbf{S}$ynthesis ($\textbf{CREPS}$), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To reduce the memory footprint and make the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate ``thick'' column and row encodings. Experiments on various datasets, including FFHQ, LSUN-Church, MetFaces, and Flickr-Scenery, confirm CREPS' ability to synthesize scale-consistent and alias-free images at any arbitrary resolution with proper training and inference speed. Code is available at https://github.com/VinAIResearch/CREPS.
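The memory argument behind the bi-line representation in the CREPS abstract above can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, and composing the map as an inner product between row and column encodings is an assumption made here for illustration. The point is only that storing H row vectors and W column vectors of thickness D costs (H + W) * D numbers, whereas a dense feature map of the same depth costs H * W * D.

```python
# Hypothetical sketch: compose a 2D feature map from per-row and per-column
# encodings. The inner-product composition is an assumption for illustration.

def compose_feature_map(row_enc, col_enc):
    """row_enc: H vectors of length D; col_enc: W vectors of length D.

    Returns an H x W grid whose entry (y, x) is <row_enc[y], col_enc[x]>.
    """
    depth = len(row_enc[0])
    if any(len(vec) != depth for vec in row_enc + col_enc):
        raise ValueError("all encodings must share the same thickness D")
    return [
        [sum(r * c for r, c in zip(row_vec, col_vec)) for col_vec in col_enc]
        for row_vec in row_enc
    ]


def stored_values(height, width, depth):
    """Numbers stored by the bi-line form versus a dense H x W x D tensor."""
    return (height + width) * depth, height * width * depth


row_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # H = 3, D = 2
col_enc = [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]  # W = 4, D = 2
feature_map = compose_feature_map(row_enc, col_enc)
bi_line_cost, dense_cost = stored_values(2048, 2048, 64)
```

Because the encodings are indexed per row and per column, changing the output resolution in this sketch only changes the number of row and column vectors, not the shape of any dense tensor.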
Medical phrase grounding (MPG) aims to locate the most relevant region in a medical image, given a phrase query describing certain medical findings, which is an important task for medical image analysis and radiological diagnosis. However, existing visual grounding methods rely on general visual features for identifying objects in natural images and are not capable of capturing the subtle and specialized features of medical findings, leading to sub-optimal performance in MPG. In this paper, we propose MedRPG, an end-to-end approach for MPG. MedRPG is built on a lightweight vision-language transformer encoder and directly predicts the box coordinates of mentioned medical findings, which can be trained with limited medical data, making it a valuable tool in medical image analysis. To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo). TaCo seeks context alignment to pull both the features and attention outputs of relevant region-phrase pairs close together while pushing those of irrelevant regions far away. This ensures that the final box prediction depends more on its finding-specific regions and phrases. Experimental results on three MPG datasets demonstrate that our MedRPG outperforms state-of-the-art visual grounding approaches by a large margin. Additionally, the proposed TaCo strategy is effective in enhancing finding localization ability and reducing spurious region-phrase correlations.
Ranking systems are ubiquitous in modern Internet services, including online marketplaces, social media, and search engines. Traditionally, ranking systems focus only on obtaining better relevance estimation. When relevance estimation is available, they usually adopt a user-centric optimization strategy where ranked lists are generated by sorting items according to their estimated relevance. However, such user-centric optimization ignores the fact that item providers also draw utility from ranking systems. Existing research has shown that such user-centric optimization is unfair to item providers, leading to unfair opportunities and unfair economic gains for them. To address ranking fairness, many fair ranking methods have been proposed. However, as we show in this paper, these methods could be suboptimal as they directly rely on the relevance estimation without being aware of the uncertainty (i.e., the variance of the estimated relevance). To address this uncertainty, we propose a novel Marginal-Certainty-aware Fair algorithm named MCFair. MCFair jointly optimizes fairness and user utility, while relevance estimation is constantly updated in an online manner. In MCFair, we first develop a ranking objective that includes uncertainty, fairness, and user utility. Then we directly use the gradient of the ranking objective as the ranking score. We theoretically prove that MCFair based on gradients is optimal for the aforementioned ranking objective. Empirically, we find that on semi-synthesized datasets, MCFair is effective and practical and can deliver superior performance compared to state-of-the-art fair ranking methods. To facilitate reproducibility, we release our code at https://github.com/Taosheng-ty/WSDM22-MCFair.
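The user-centric baseline that the MCFair abstract above criticizes (sorting items by estimated relevance) can be sketched in a few lines, along with the provider-side imbalance it can produce. This is not MCFair itself. The item data, the provider labels, and the position weight 1 / log2(rank + 1) used as a stand-in for exposure are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def rank_by_relevance(items):
    """items: list of (item_id, provider, estimated_relevance) tuples.

    Returns the items sorted by estimated relevance, highest first.
    """
    return sorted(items, key=lambda item: item[2], reverse=True)


def provider_exposure(ranked):
    """Sum an assumed position weight 1 / log2(rank + 1) per provider."""
    exposure = defaultdict(float)
    for rank, (_, provider, _) in enumerate(ranked, start=1):
        exposure[provider] += 1.0 / math.log2(rank + 1)
    return dict(exposure)


# Hypothetical items: relevance estimates are close, providers differ.
items = [
    ("c", "P2", 0.84),
    ("a", "P1", 0.90),
    ("d", "P2", 0.80),
    ("b", "P1", 0.85),
]
ranked = rank_by_relevance(items)
exposure = provider_exposure(ranked)
```

In this toy list the relevance estimates differ by at most 0.10, yet provider P1 takes both top positions and collects noticeably more of the assumed exposure than P2. If the estimates are also noisy, small estimation errors can decide which provider ends up on top.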
Diffusion models are rising as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but their speed still lags well behind that of their GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion structure. We extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on the CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets show that our solution is a stepping stone toward real-time and high-fidelity diffusion models. Our code and pre-trained checkpoints will be available at \url{https://github.com/VinAIResearch/WaveDiff.git}.
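The wavelet decomposition mentioned in the abstract above can be illustrated with a single-level 2D transform. The abstract does not say which wavelet is used, so the Haar wavelet here is an assumption, and the function and sub-band names are hypothetical. The sketch splits an image into one low-frequency band and three high-frequency detail bands, each at half the spatial resolution, and shows that the split is exactly invertible.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an H x W grid (H and W even).

    Returns (low, detail_x, detail_y, detail_xy), each of size H/2 x W/2.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    low, detail_x, detail_y, detail_xy = [], [], [], []
    for y in range(0, height, 2):
        rows = ([], [], [], [])
        for x in range(0, width, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rows[0].append((a + b + c + d) / 2)  # local average (low frequency)
            rows[1].append((a - b + c - d) / 2)  # change along x
            rows[2].append((a + b - c - d) / 2)  # change along y
            rows[3].append((a - b - c + d) / 2)  # diagonal change
        low.append(rows[0])
        detail_x.append(rows[1])
        detail_y.append(rows[2])
        detail_xy.append(rows[3])
    return low, detail_x, detail_y, detail_xy


def haar_idwt2(low, detail_x, detail_y, detail_xy):
    """Invert haar_dwt2 and recover the original grid."""
    image = []
    for i in range(len(low)):
        top, bottom = [], []
        for j in range(len(low[0])):
            s, dx = low[i][j], detail_x[i][j]
            dy, dxy = detail_y[i][j], detail_xy[i][j]
            top += [(s + dx + dy + dxy) / 2, (s - dx + dy - dxy) / 2]
            bottom += [(s + dx - dy - dxy) / 2, (s - dx - dy + dxy) / 2]
        image += [top, bottom]
    return image


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
```

Each band holds a quarter of the original pixels, so a network that operates on the four bands sees a four times smaller spatial grid, which is one plausible source of the speed-up the abstract describes.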
Ranking lies at the core of many Information Retrieval (IR) tasks. While existing research on Learning to Rank (LTR) using Deep Neural Networks (DNNs) has achieved great success, it is somewhat limited because of its dependence on fine-grained labels. In practice, fine-grained labels are often expensive to acquire (e.g., explicit relevance judgements) or suffer from biases (e.g., click logs). Compared to fine-grained labels, coarse-grained labels are easier and cheaper to collect. Some recent works propose utilizing only coarse-grained labels for LTR tasks. The most representative line of work introduces Reinforcement Learning (RL) algorithms. RL can help train the LTR model with little reliance on fine-grained labels compared to Supervised Learning. To study the effectiveness of RL-based LTR algorithms on coarse-grained labels, in this paper, we implement four different RL paradigms and conduct extensive experiments on two well-established LTR datasets. The results on simulated coarse-grained labeled datasets show that while using coarse-grained labels to train an RL model for LTR tasks still cannot outperform traditional approaches using fine-grained labels, it still achieves somewhat promising results and is potentially helpful for future research in LTR. Our code implementations will be released after this work is accepted.
State-of-the-art recommender systems (RS) mostly rely on complex deep neural network (DNN) model structures, which makes it difficult to provide explanations along with RS decisions. Previous research has shown that providing explanations along with recommended items can help users make informed decisions and improve their trust in the uninterpretable black-box system. In model-agnostic explainable recommendation, system designers deploy a separate explanation model that takes input from the decision model and generates explanations to meet the goal of persuasiveness. In this work, we explore the task of ranking textual rationales (supporting evidence) for model-agnostic explainable recommendation. Most existing rationale ranking algorithms only utilize the rationale IDs and interaction matrices to build latent factor representations, and the semantic information within the textual rationales is not learned effectively. We argue that such a design is suboptimal, as the important semantic information within the textual rationales may be used to better profile user preferences and item features. Seeing this gap, we propose a model named Semantic-Enhanced Bayesian Personalized Explanation Ranking (SE-BPER) to effectively combine the interaction information and semantic information. SE-BPER first initializes the latent factor representations with contextualized embeddings generated by a transformer model, then optimizes them with the interaction data. Extensive experiments show that this methodology improves rationale ranking performance while simplifying the model training process (fewer hyperparameters and faster convergence). We conclude that the optimal way to combine semantic and interaction information remains an open question in the task of rationale ranking.
Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even infeasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model in delineating the surgically targetable tumor, and a 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.
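The idea of sharing only numerical model updates, as described in the federated learning abstract above, can be sketched as a single aggregation round. The abstract does not state how the updates are combined, so the sample-weighted averaging below is an assumption made for illustration, and the function name and toy numbers are hypothetical. No patient data appears in the sketch: each site contributes only a weight vector and a sample count.

```python
def federated_average(site_updates):
    """Combine per-site model weights into one consensus vector.

    site_updates: list of (num_samples, weights) pairs, where weights is a
    list of floats of equal length across sites. Each site is weighted by
    its share of the total sample count (an assumed aggregation rule).
    """
    total = sum(num_samples for num_samples, _ in site_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    size = len(site_updates[0][1])
    consensus = [0.0] * size
    for num_samples, weights in site_updates:
        if len(weights) != size:
            raise ValueError("all sites must send weights of the same length")
        share = num_samples / total
        for i, value in enumerate(weights):
            consensus[i] += share * value
    return consensus


# Two hypothetical sites: the second holds three times as many samples.
updates = [(100, [0.0, 2.0]), (300, [4.0, 2.0])]
consensus = federated_average(updates)
```

In a full training loop the consensus vector would be sent back to every site as the starting point for the next local training round.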
Domain adaptation (DA) benefits from rigorous theoretical work that studies its characteristics and various aspects, e.g., learning domain-invariant representations and the associated trade-off. However, this does not seem to be the case for the multiple-source DA and domain generalization (DG) settings, which are considerably more complicated due to the involvement of multiple source domains and the potential unavailability of the target domain during training. In this paper, we develop novel upper bounds for the target general loss, which motivate us to define two kinds of domain-invariant representations. We further study the pros and cons as well as the trade-offs of enforcing the learning of each domain-invariant representation. Finally, we conduct experiments to inspect the trade-off of these representations, offering practical hints on how to use them and exploring other interesting properties of our developed theory.
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended for use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than that of the baseline transformers with 8 heads.
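One way to read "a mixture of keys at each head" in the abstract above is that every sequence position carries several keys, and the attention score of a query against that position is a Gaussian mixture likelihood. The sketch below follows that reading. It is not the authors' formulation: the function name, the shared bandwidth `sigma`, the position-independent mixture weights `priors`, and the toy inputs are all assumptions made for illustration.

```python
import math


def mgk_attention(query, key_mixtures, values, priors, sigma=1.0):
    """Single-query attention with several Gaussian keys per position.

    query: list of floats.
    key_mixtures[j]: list of M key vectors for position j.
    values[j]: value vector for position j.
    priors: M mixture weights that sum to 1 (assumed shared by all positions).
    Returns (attention_weights, output_vector).
    """
    scores = []
    for keys in key_mixtures:
        score = 0.0
        for prior, key in zip(priors, keys):
            sq_dist = sum((q - k) ** 2 for q, k in zip(query, key))
            score += prior * math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores.append(score)
    total = sum(scores)
    weights = [score / total for score in scores]  # normalize over positions
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy input: two positions, two keys each; the query matches a key of position 0.
query = [0.0, 0.0]
key_mixtures = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[5.0, 5.0], [6.0, 6.0]],
]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, output = mgk_attention(query, key_mixtures, values, priors=[0.5, 0.5])
```

With a single key per position and `priors=[1.0]`, the score reduces to one Gaussian kernel between query and key, so the extra keys are what let one head respond to more than one pattern at the same position.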