Xuming He

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Apr 06, 2024

Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He

Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLMs) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with a VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLMs for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing vision-language (VL) tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.

* Accepted by CVPR 2024
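
The step of constructing a scene graph from a generated sequence can be illustrated with a minimal sketch. The abstract does not specify the sequence format, so the parenthesized `(subject, predicate, object)` layout and the helper name `parse_scene_graph` below are assumptions made purely for illustration, not the authors' implementation.

```python
import re

# Hypothetical format: "(subject, predicate, object) (subject, predicate, object) ..."
TRIPLET_PATTERN = re.compile(
    r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)"
)

def parse_scene_graph(sequence):
    """Turn a generated text sequence into (nodes, edges) of a scene graph."""
    nodes, edges = [], []
    for subj, pred, obj in TRIPLET_PATTERN.findall(sequence):
        for name in (subj, obj):
            if name not in nodes:        # keep first-seen order, no duplicates
                nodes.append(name)
        edges.append((subj, pred, obj))  # directed edge labelled by the predicate
    return nodes, edges

nodes, edges = parse_scene_graph("(man, riding, horse) (horse, standing on, grass)")
print(nodes)
print(edges)
```

Because the predicate is kept as free text rather than an index into a fixed label set, a parser of this kind places no restriction on the relation vocabulary.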

SP$^2$OT: Semantic-Regularized Progressive Partial Optimal Transport for Imbalanced Clustering

Apr 04, 2024

Chuyu Zhang, Hui Ren, Xuming He

Abstract: Deep clustering, which learns representation and semantic clustering without label information, poses a great challenge for deep learning-based approaches. Despite significant progress in recent years, most existing methods focus on uniformly distributed datasets, significantly limiting the practical applicability of their methods. In this paper, we propose a more practical problem setting named deep imbalanced clustering, where the underlying classes exhibit an imbalanced distribution. To address this challenge, we introduce a novel optimal transport-based pseudo-label learning framework. Our framework formulates pseudo-label generation as a Semantic-regularized Progressive Partial Optimal Transport (SP$^2$OT) problem, which progressively transports each sample to imbalanced clusters under several prior distribution and semantic relation constraints, thus generating high-quality and imbalance-aware pseudo-labels. To solve SP$^2$OT, we develop a Majorization-Minimization-based optimization algorithm. More precisely, we employ the strategy of majorization to reformulate the SP$^2$OT problem into a Progressive Partial Optimal Transport problem, which can be transformed into an unbalanced optimal transport problem with augmented constraints and can be solved efficiently by a fast matrix scaling algorithm. Experiments on various datasets, including a human-curated long-tailed CIFAR100, challenging ImageNet-R, and large-scale subsets of the fine-grained iNaturalist2018 dataset, demonstrate the superiority of our method.

* Under review. arXiv admin note: substantial text overlap with arXiv:2401.09266

Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning

Apr 01, 2024

Rongjie Li, Yu Wu, Xuming He

Abstract: Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However, improving their zero-shot reasoning typically requires second-stage instruction tuning, which relies heavily on human-labeled or large language model-generated annotations, incurring high labeling costs. To tackle this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts, thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser, we construct data samples for the ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning.

* Accepted by CVPR 2024

DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation

Mar 21, 2024

Zeeshan Hayder, Xuming He

Abstract: Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image, which is challenging due to incomplete labelling, long-tailed relationship categories, and relational semantic overlap. Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets and hence often suffer from limited capacity in learning low-frequency relationships. In this paper, we present a new Transformer-based method, called DSGG, that views scene graph detection as a direct graph prediction problem based on a unique set of graph-aware queries. In particular, each graph-aware query encodes a compact representation of both the node and all of its relations in the graph, acquired through a relaxed sub-graph matching during training. Moreover, to address the problem of relational semantic overlap, we utilize a relation distillation strategy, aiming to efficiently learn multiple instances of semantic relationships. Extensive experiments on the VG and PSG datasets show that our model achieves state-of-the-art results, with a significant improvement of 3.5% and 6.7% in mR@50 and mR@100 for the scene graph generation task and an even more substantial improvement of 8.5% and 10.3% in mR@50 and mR@100 for the panoptic scene graph generation task. Code is available at https://github.com/zeeshanhayder/DSGG.

* Accepted by CVPR 2024
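
For readers unfamiliar with the mR@K metric reported above, the following is a simplified sketch: recall is computed separately for each predicate class over the top-K ranked predictions and then averaged over classes, so rare predicates count as much as frequent ones. It assumes a single image and matches triplets by exact label only (no bounding-box overlap test), which is a simplification of the standard benchmark protocol; the function name and sample data are hypothetical.

```python
from collections import defaultdict

def mean_recall_at_k(gt_triplets, ranked_predictions, k):
    """Average of per-predicate recall over the top-k predicted triplets."""
    top_k = set(ranked_predictions[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1
    if not totals:                      # no ground truth: define the score as 0
        return 0.0
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls)

gt = [("man", "riding", "horse"), ("man", "wearing", "hat"),
      ("horse", "on", "grass"), ("hat", "on", "man")]
preds = [("man", "riding", "horse"), ("horse", "on", "grass"),
         ("man", "on", "horse"), ("man", "wearing", "hat")]

print(mean_recall_at_k(gt, preds, 2))   # riding 1/1, wearing 0/1, on 1/2 -> 0.5
print(mean_recall_at_k(gt, preds, 4))   # riding 1/1, wearing 1/1, on 1/2 -> 5/6
```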

RealDex: Towards Human-like Grasping for Robotic Dexterous Hand

Feb 21, 2024

Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu (+2 more)

Abstract: In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data. Utilizing a teleoperation system, we seamlessly synchronize human-robot hand poses in real time. This collection of human-like motions is crucial for training dexterous hands to mimic human movements more naturally and precisely. RealDex holds immense promise in advancing humanoid robots for automated perception, cognition, and manipulation in real-world scenarios. Moreover, we introduce a cutting-edge dexterous grasping motion generation framework, which aligns with human experience and enhances real-world applicability by effectively utilizing Multimodal Large Language Models. Extensive experiments have demonstrated the superior performance of our method on RealDex and other open datasets. The complete dataset and code will be made available upon the publication of this work.

SGTR+: End-to-end Scene Graph Generation with Transformer

Jan 23, 2024

Rongjie Li, Songyang Zhang, Xuming He

Abstract: Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up, two-stage or point-based, one-stage approach, which often suffers from high time complexity or suboptimal designs. In this work, we propose a novel SGG method to address the aforementioned issues, formulating the task as a bipartite graph construction problem. We create a transformer-based end-to-end framework to generate the entity and entity-aware predicate proposal set, and infer directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Based on the bipartite graph assembling paradigm, we further propose a new technical design to address the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves optimal performance and time complexity. Extensive experimental results show that our design is able to achieve state-of-the-art or comparable performance on three challenging benchmarks, surpassing most of the existing approaches and enjoying higher efficiency in inference. Code is available at https://github.com/Scarecrow0/SGTR.

* Accepted by TPAMI: https://ieeexplore.ieee.org/document/10315230

P$^2$OT: Progressive Partial Optimal Transport for Deep Imbalanced Clustering

Jan 17, 2024

Chuyu Zhang, Hui Ren, Xuming He

Abstract: Deep clustering, which learns representation and semantic clustering without label information, poses a great challenge for deep learning-based approaches. Despite significant progress in recent years, most existing methods focus on uniformly distributed datasets, significantly limiting the practical applicability of their methods. In this paper, we first introduce a more practical problem setting named deep imbalanced clustering, where the underlying classes exhibit an imbalanced distribution. To tackle this problem, we propose a novel pseudo-labeling-based learning framework. Our framework formulates pseudo-label generation as a progressive partial optimal transport problem, which progressively transports each sample to imbalanced clusters under prior distribution constraints, thus generating imbalance-aware pseudo-labels and learning from high-confidence samples. In addition, we transform the initial formulation into an unbalanced optimal transport problem with augmented constraints, which can be solved efficiently by a fast matrix scaling algorithm. Experiments on various datasets, including a human-curated long-tailed CIFAR100, challenging ImageNet-R, and large-scale subsets of the fine-grained iNaturalist2018 dataset, demonstrate the superiority of our method.

* Accepted by ICLR 2024
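
To give a feel for what a "matrix scaling algorithm" for optimal transport looks like, here is a minimal sketch of the classic Sinkhorn iteration for balanced, entropy-regularized optimal transport. It is not the progressive partial or unbalanced formulation described in the abstract; the cost matrix, the imbalanced cluster prior, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, epsilon=0.5, n_iters=200):
    """Balanced entropic OT: alternately rescale rows and columns of a kernel."""
    kernel = np.exp(-cost / epsilon)              # Gibbs kernel
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)           # match the row sums
        v = col_marginal / (kernel.T @ u)         # match the column sums
    return u[:, None] * kernel * v[None, :]       # transport plan

rng = np.random.default_rng(0)
cost = rng.random((6, 3))                         # stand-in sample-to-cluster costs
row_marginal = np.full(6, 1 / 6)                  # each sample carries equal mass
col_marginal = np.array([0.6, 0.3, 0.1])          # assumed imbalanced cluster prior

plan = sinkhorn(cost, row_marginal, col_marginal)
pseudo_labels = plan.argmax(axis=1)               # hard pseudo-label per sample
print(plan.sum(axis=0))                           # column sums follow the prior
print(pseudo_labels)
```

The column marginal is what encodes the cluster-size prior: replacing the uniform vector used in many clustering pipelines with an imbalanced one changes how much mass each cluster is allowed to receive.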

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Jan 04, 2024

Longtian Qiu, Shan Ning, Xuming He

Abstract: Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space, which leads to two findings. First, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. Second, we show that the modality gap between a paired image and text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching the text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k, and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.

* AAAI 2024. Open-sourced; code and model available
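
The noise-injection idea can be sketched in a few lines: if the gap between paired image and text embeddings behaves like zero-mean Gaussian noise, then perturbing text embeddings with such noise during text-only training makes them resemble image embeddings. The function name, the `sigma` value, and the re-normalization step below are assumptions for illustration and are not taken from the MacCap code.

```python
import numpy as np

def inject_noise(text_features, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to unit-norm embeddings, then re-normalize."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=text_features.shape)
    noisy = text_features + noise
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
text_features = rng.normal(size=(4, 512))         # stand-in for CLIP text embeddings
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

noisy_features = inject_noise(text_features, sigma=0.1, rng=rng)
print(noisy_features.shape)
print(np.linalg.norm(noisy_features, axis=-1))    # each row has unit norm again
```

A decoder trained to reconstruct captions from `noisy_features` sees inputs spread around each text embedding, which is the intuition behind making it tolerant to the image-text gap at inference time.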

GenEM: Physics-Informed Generative Cryo-Electron Microscopy

Dec 04, 2023

Jiakai Zhang, Qihe Chen, Yan Zeng, Wenyuan Gao, Xuming He, Zhijie Liu, Jingyi Yu

Abstract: In the past decade, deep conditional generative models have revolutionized the generation of realistic images, extending their application from entertainment to scientific domains. Single-particle cryo-electron microscopy (cryo-EM) is crucial in resolving near-atomic resolution 3D structures of proteins, such as the SARS-CoV-2 spike protein. To achieve high-resolution reconstruction, AI models for particle picking and pose estimation have been adopted. However, their performance is still limited as they lack high-quality annotated datasets. To address this, we introduce physics-informed generative cryo-electron microscopy (GenEM), which for the first time integrates physics-based cryo-EM simulation with a generative unpaired noise translation to generate physically correct synthetic cryo-EM datasets with realistic noise. Initially, GenEM simulates the cryo-EM imaging process based on a virtual specimen. To generate realistic noise, we leverage an unpaired noise translation via contrastive learning with a novel mask-guided sampling scheme. Extensive experiments show that GenEM is capable of generating realistic cryo-EM images. The generated dataset can further enhance particle picking and pose estimation models, eventually improving the reconstruction resolution. We will release our code and annotated synthetic datasets.

Gradient-Map-Guided Adaptive Domain Generalization for Cross Modality MRI Segmentation

Nov 16, 2023

Bingnan Li, Zhitong Gao, Xuming He

Abstract: Cross-modal MRI segmentation is of great value for computer-aided medical diagnosis, enabling flexible data acquisition and model generalization. However, most existing methods have difficulty handling local variations in domain shift and typically require a significant amount of data for training, which hinders their usage in practice. To address these problems, we propose a novel adaptive domain generalization framework, which integrates a learning-free cross-domain representation based on image gradient maps and a class prior-informed test-time adaptation strategy for mitigating local domain shift. We validate our approach on two multi-modal MRI datasets with six cross-modal segmentation tasks. Across all the task settings, our method consistently outperforms competing approaches and shows stable performance even with limited training data.

* 9 pages, Machine Learning for Health (ML4H) 2023
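
One way to see why a gradient map can serve as a learning-free cross-domain representation is that gradient magnitude depends on where intensities change rather than on their absolute values. The sketch below uses a plain finite-difference gradient on a toy image; the exact gradient-map construction used in the paper is not given in the abstract, so this is only an assumed stand-in.

```python
import numpy as np

def gradient_magnitude(image):
    """Per-pixel gradient magnitude from finite differences."""
    gy, gx = np.gradient(image.astype(np.float64))  # derivatives along rows, columns
    return np.hypot(gx, gy)

# Toy "scan": a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0
inverted = 1.0 - image                              # same anatomy, inverted contrast

edges_a = gradient_magnitude(image)
edges_b = gradient_magnitude(inverted)
print(np.allclose(edges_a, edges_b))                # contrast inversion leaves the map unchanged
```

Two modalities that render the same structure with opposite contrast therefore share the same gradient map, even though their raw intensities differ everywhere.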
