Qingming Huang

University of Chinese Academy of Sciences, Key Lab of Intell. Info. Process., Inst. of Comput. Tech., Chinese Academy of Sciences, Peng Cheng Laboratory

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Oct 26, 2023

Jiayu Xiao, Liang Li, Henglei Lv, Shuhui Wang, Qingming Huang

Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images from text prompts. However, these models fail to convey the spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images that follow the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of the diffusion model during the generative process and helps the model synthesize images that (1) have high fidelity, (2) are highly compatible with the textual input, and (3) interpret layout instructions accurately. Specifically, we leverage discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generated layout during the diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin, both qualitatively and quantitatively, on several benchmarks.

* Preprint. Under review. Project page: https://sagileo.github.io/Region-and-Boundary

Open-Set Knowledge-Based Visual Question Answering with Inference Paths

Oct 12, 2023

Jingru Gan, Xinzhe Han, Shuhui Wang, Qingming Huang

Abstract: Given an image and an associated textual question, the purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. Prior KB-VQA models are usually formulated as a retriever-classifier framework, where a pre-trained retriever extracts textual or visual information from knowledge graphs and a classifier then makes a prediction among the candidates. Despite promising progress, existing models have two drawbacks. First, modeling question answering as multi-class classification limits the answer space to a preset corpus and precludes flexible reasoning. Second, the classifier considers only "what is the answer" and not "how to get the answer", so it cannot ground the answer in explicit reasoning paths. In this paper, we confront the challenge of explainable open-set KB-VQA, where the system is required to answer questions with entities in the wild and retain an explainable reasoning path. To resolve these issues, we propose a new retriever-ranker paradigm for KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it comprises graph construction, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process. To comprehensively evaluate our model, we reformulate the benchmark dataset OK-VQA with manually corrected entity-level annotations and release it as ConceptVQA. Extensive experiments on real-world questions demonstrate that our framework not only performs open-set question answering across the whole knowledge base but also provides explicit reasoning paths.

Towards Demystifying the Generalization Behaviors When Neural Collapse Emerges

Oct 12, 2023

Peifeng Gao, Qianqian Xu, Yibo Yang, Peisong Wen, Huiyang Shao, Zhiyong Yang, Bernard Ghanem, Qingming Huang

Abstract: Neural Collapse (NC) is a well-known phenomenon of deep neural networks in the terminal phase of training (TPT). It is characterized by the collapse of features and classifier into a symmetrical structure known as a simplex equiangular tight frame (ETF). While there have been extensive studies on optimization characteristics showing the global optimality of neural collapse, little research has been done on generalization behavior when NC occurs. In particular, the important phenomenon of generalization improvement during TPT has remained an empirical observation without a rigorous theoretical explanation. In this paper, we establish the connection between the minimization of cross-entropy (CE) and a multi-class SVM during TPT, and then derive a multi-class margin generalization bound, which provides a theoretical explanation for why continued training can still improve accuracy on the test set even after the training accuracy has reached 100%. Additionally, our further theoretical results indicate that different alignments between labels and features in a simplex ETF can result in varying degrees of generalization improvement, despite all models reaching NC and demonstrating similar optimization performance on the training set. We refer to this newly discovered property as "non-conservative generalization". In experiments, we also provide empirical observations that verify the implications of our theoretical results.

* 20 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:2304.08914
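
The simplex ETF named in the abstract has a standard closed-form construction, which can be sketched as follows. This is an illustrative sketch only, not code from the paper: the function name `simplex_etf`, the random orthonormal basis, and the requirement that the feature dimension be at least the number of classes are assumptions made for this example.

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF.

    Uses the textbook form M = sqrt(K / (K - 1)) * U @ (I - 11^T / K),
    where U has orthonormal columns. Hypothetical helper for illustration.
    """
    if dim < num_classes:
        raise ValueError("this construction assumes dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Reduced QR of a random matrix gives dim x K orthonormal columns.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * u @ centering

etf = simplex_etf(num_classes=4, dim=8)
gram = etf.T @ etf  # pairwise inner products of the class vectors
```

In the resulting Gram matrix every column has unit norm and every pair of distinct columns has inner product -1/(K-1), which is the equal-norm, maximally separated geometry that neural collapse describes.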

A Unified Generalization Analysis of Re-Weighting and Logit-Adjustment for Imbalanced Learning

Oct 07, 2023

Zitai Wang, Qianqian Xu, Zhiyong Yang, Yuan He, Xiaochun Cao, Qingming Huang

Abstract: Real-world datasets are typically imbalanced in the sense that only a few classes have numerous samples, while many classes are associated with only a few samples. As a result, a naïve ERM learning process will be biased towards the majority classes, making it difficult to generalize to the minority classes. To address this issue, one simple but effective approach is to modify the loss function to emphasize learning on minority classes, for example by re-weighting the losses or adjusting the logits via class-dependent terms. However, existing generalization analysis of such losses is still coarse-grained and fragmented, failing to explain some empirical results. To bridge this gap, we propose a novel technique named data-dependent contraction to capture how these modified losses handle different classes. On top of this technique, a fine-grained generalization bound is established for imbalanced learning, which helps explain re-weighting and logit-adjustment in a unified manner. Furthermore, a principled learning algorithm is developed based on the theoretical insights. Finally, the empirical results on benchmark datasets not only validate the theoretical results but also demonstrate the effectiveness of the proposed method.
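
The two families of modified losses mentioned in the abstract can be sketched in their generic textbook forms as follows. This is a minimal sketch and not the algorithm proposed in the paper: the inverse-frequency weights, the `tau * log(prior)` offset, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def reweighted_ce(logits, labels, class_counts) -> float:
    """Cross-entropy with per-sample weights inversely proportional to class frequency."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)  # equals 1 for balanced data
    nll = -log_softmax(logits)[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * nll))

def logit_adjusted_ce(logits, labels, class_counts, tau: float = 1.0) -> float:
    """Cross-entropy after adding a class-dependent offset tau * log(prior) to the logits."""
    counts = np.asarray(class_counts, dtype=float)
    prior = counts / counts.sum()
    adjusted = logits + tau * np.log(prior)
    nll = -log_softmax(adjusted)[np.arange(len(labels)), labels]
    return float(np.mean(nll))

logits = np.array([[2.0, 0.5], [0.1, 1.2]])
labels = np.array([0, 1])
loss_rw = reweighted_ce(logits, labels, class_counts=[90, 10])
loss_la = logit_adjusted_ce(logits, labels, class_counts=[90, 10])
```

Both variants reduce to ordinary cross-entropy when the class counts are balanced; with imbalanced counts they penalize errors on minority-class samples more heavily.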

Self-supervised Cross-view Representation Reconstruction for Change Captioning

Sep 28, 2023

Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming Huang

Abstract: Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise cross-modal backward reasoning to improve the quality of the caption. This module reversely models a "hallucination" representation from the caption and the "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show that our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.

* Accepted by ICCV 2023

PUGAN: Physical Model-Guided Underwater Image Enhancement Using GAN with Dual-Discriminators

Jun 15, 2023

Runmin Cong, Wenyu Yang, Wei Zhang, Chongyi Li, Chun-Le Guo, Qingming Huang, Sam Kwong

Abstract: Due to the light absorption and scattering induced by the water medium, underwater images usually suffer from degradation problems such as low contrast, color distortion, and blurred details, which make downstream underwater understanding tasks more difficult. Obtaining clear and visually pleasing images has therefore become a common concern, and the task of underwater image enhancement (UIE) has emerged in response. Among existing UIE methods, Generative Adversarial Network (GAN) based methods perform well in visual aesthetics, while physical model-based methods have better scene adaptability. Inheriting the advantages of both types of models, we propose a physical model-guided GAN model for UIE in this paper, referred to as PUGAN. The entire network is under the GAN architecture. On the one hand, we design a Parameters Estimation subnetwork (Par-subnet) to learn the parameters for physical model inversion, and use the generated color enhancement image as auxiliary information for the Two-Stream Interaction Enhancement sub-network (TSIE-subnet). Meanwhile, we design a Degradation Quantization (DQ) module in TSIE-subnet to quantize scene degradation, thereby reinforcing the enhancement of key regions. On the other hand, we design the Dual-Discriminators for the style-content adversarial constraint, promoting the authenticity and visual aesthetics of the results. Extensive experiments on three benchmark datasets demonstrate that our PUGAN outperforms state-of-the-art methods in both qualitative and quantitative metrics.

* 8 pages, 4 figures, Accepted by IEEE Transactions on Image Processing 2023
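
The abstract does not spell out the physical model being inverted. As a rough illustration of what "physical model inversion" can look like, the sketch below assumes the widely used simplified image formation model I = J * t + A * (1 - t), with clear scene J, transmission map t, and background light A. The function name, the `t_min` clamp, and the assumption that t and A are already known are all hypothetical; in PUGAN such parameters are described as being learned by the Par-subnet.

```python
import numpy as np

def invert_formation_model(observed, transmission, background_light, t_min=0.1):
    """Recover the scene J from I = J * t + A * (1 - t), given t and A.

    observed:         H x W x 3 degraded image with values in [0, 1]
    transmission:     H x W x 1 transmission map t
    background_light: length-3 vector A
    Assumed model and hypothetical helper, for illustration only.
    """
    # Clamp t away from zero so the division stays well conditioned.
    t = np.clip(transmission, t_min, 1.0)
    restored = (observed - background_light * (1.0 - t)) / t
    return np.clip(restored, 0.0, 1.0)

# Synthetic round trip: degrade a random scene, then invert the degradation.
rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(4, 4, 3))
t_map = rng.uniform(0.3, 0.9, size=(4, 4, 1))
light = np.array([0.1, 0.5, 0.7])
degraded = scene * t_map + light * (1.0 - t_map)
recovered = invert_formation_model(degraded, t_map, light)
```

With exact parameters the inversion recovers the synthetic scene; in practice the estimated parameters are imperfect, which is one motivation for pairing the physical model with a learned enhancement network.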

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

May 13, 2023

Ke Zhang, Hanliang Jiang, Jian Zhang, Qingming Huang, Jianping Fan, Jun Yu, Weidong Han

Abstract: In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. Existing medical vision-language pre-training (Med-VLP) methods offer a solution for automated medical image analysis: they learn universal representations from large-scale medical images and reports, which benefits downstream tasks without requiring fine-grained annotations. However, existing methods based on joint image-text reconstruction neglect the importance of cross-modal alignment in conjunction with joint reconstruction, resulting in inadequate cross-modal interaction. In this paper, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA) that integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction, while a global and local alignment (GLA) module is designed to help the self-supervised paradigm obtain semantic representations with rich domain knowledge. To achieve more comprehensive cross-modal fusion, we also propose a Memory-Augmented Cross-Modal Fusion (MA-CMF) module that fully integrates visual features to assist report reconstruction. Experimental results show that our approach outperforms previous methods on all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.

A Study of Neural Collapse Phenomenon: Grassmannian Frame, Symmetry, Generalization

Apr 18, 2023

Peifeng Gao, Qianqian Xu, Peisong Wen, Huiyang Shao, Zhiyong Yang, Qingming Huang

Abstract: In this paper, we extend the original Neural Collapse phenomenon by proving the Generalized Neural Collapse hypothesis. We obtain the Grassmannian Frame structure from the optimization and generalization of classification. This structure maximally separates the features of every two classes on a sphere and does not require a feature dimension larger than the number of classes. Out of curiosity about the symmetry of the Grassmannian Frame, we conduct experiments to explore whether models with different Grassmannian Frames perform differently. As a result, we discover the Symmetric Generalization phenomenon. We provide a theorem to explain the Symmetric Generalization of permutation. However, the question of why different feature directions can lead to such different generalization remains open for future investigation.

* 25 pages, 2 figures

Neighborhood Contrastive Transformer for Change Captioning

Mar 06, 2023

Yunbin Tu, Liang Li, Li Su, Ke Lu, Qingming Huang

Abstract: Change captioning aims to describe the semantic change between a pair of similar images in natural language. It is more challenging than general image captioning because it requires capturing fine-grained change information while being immune to irrelevant viewpoint changes, and resolving syntactic ambiguity in change descriptions. In this paper, we propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes under different scenes and to understand complex syntactic structure. Concretely, we first design neighboring feature aggregation to integrate neighboring context into each feature, which helps quickly locate inconspicuous changes under the guidance of conspicuous referents. Then, we devise common feature distillation to compare two images at the neighborhood level and extract common properties from each image, so as to learn effective contrastive information between them. Finally, we introduce explicit dependencies between words to calibrate the transformer decoder, which helps it better understand complex syntactic structure during training. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on three public datasets with different change scenarios. The code is available at https://github.com/tuyunbin/NCT.

* Accepted by IEEE TMM

Stable Attribute Group Editing for Reliable Few-shot Image Generation

Feb 01, 2023

Guanqi Ding, Xinzhe Han, Shuhui Wang, Xin Jin, Dandan Tu, Qingming Huang

Abstract: Few-shot image generation aims to generate data of an unseen category based on only a few samples. Apart from basic content generation, a number of downstream applications could benefit from this task, such as low-data detection and few-shot classification. To achieve this goal, the generated images should guarantee category retention for classification, beyond visual quality and diversity. In our preliminary work, we presented an "editing-based" framework, Attribute Group Editing (AGE), for reliable few-shot image generation, which largely improves generation performance. Nevertheless, AGE's performance on downstream classification is not as satisfactory as expected. This paper investigates the class inconsistency problem and proposes Stable Attribute Group Editing (SAGE) for more stable class-relevant image generation. SAGE makes use of all given few-shot images and estimates a class center embedding based on the category-relevant attribute dictionary. Meanwhile, according to the projection weights on the category-relevant attribute dictionary, we can select category-irrelevant attributes from similar seen categories. Consequently, SAGE injects the whole distribution of the novel class into StyleGAN's latent space, thus largely preserving the category retention and stability of the generated images. Going one step further, we find that class inconsistency is a common problem in GAN-generated images for downstream classification. Even though the generated images look photo-realistic and require no category-relevant editing, they are usually of limited help for downstream classification. We systematically discuss this issue from both the generative model and classification model perspectives, and propose to boost the downstream classification performance of SAGE by enhancing the pixel and frequency components.
