Xi Zhang

Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold

Aug 26, 2024

Lazar Atanackovic, Xi Zhang, Brandon Amos, Mathieu Blanchette, Leo J. Lee, Yoshua Bengio, Alexander Tong, Kirill Neklyudov

Abstract:Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynamics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions unlike previously proposed methods. We demonstrate the ability of MFM to improve prediction of individual treatment responses on a large scale multi-patient single-cell drug screen dataset.
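
As a rough illustration of the training signal behind a population-conditioned flow matching model, the sketch below computes a regression target and loss for one pair of populations. It is not the authors' implementation: mean pooling stands in for the GNN embedding, the straight-line interpolation path and the index-wise pairing of source and target samples are assumptions, and every function name is hypothetical.

```python
import random


def population_embedding(population):
    """Mean-pool a population of equal-length points.

    Assumption: a crude stand-in for the GNN embedding named in the abstract.
    """
    n = len(population)
    dim = len(population[0])
    return [sum(p[d] for p in population) / n for d in range(dim)]


def flow_matching_example(x0, x1, t):
    """Return (x_t, u_t) for the straight-line path from x0 to x1 at time t."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]  # constant velocity along the line
    return x_t, u_t


def fm_loss(model, source, target, rng):
    """Monte Carlo estimate of the squared-error flow matching loss.

    `model(x_t, t, emb)` is any callable returning a velocity vector; the
    embedding of the source population is passed in as conditioning.
    """
    emb = population_embedding(source)
    total = 0.0
    for x0, x1 in zip(source, target):  # assumed pairing: same index
        t = rng.random()
        x_t, u_t = flow_matching_example(x0, x1, t)
        v = model(x_t, t, emb)
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u_t))
    return total / len(source)


# Toy usage: every source point moves by (+1, 0), so a model that always
# predicts that velocity has zero loss.
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[1.0, 0.0], [2.0, 1.0]]
perfect_model = lambda x_t, t, emb: [1.0, 0.0]
loss = fm_loss(perfect_model, source, target, random.Random(0))
```

Conditioning on an embedding of the whole source population is what lets a single model be reused across different initial distributions; a learned permutation-invariant encoder would replace the mean pooling used here.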

Fast Point Cloud Geometry Compression with Context-based Residual Coding and INR-based Refinement

Aug 06, 2024

Hao Xu, Xi Zhang, Xiaolin Wu

Abstract:Compressing a set of unordered points is far more challenging than compressing images/videos of regular sample grids, because of the difficulties in characterizing neighboring relations in an irregular layout of points. Many researchers resort to voxelization to introduce regularity, but this approach suffers from quantization loss. In this research, we use the KNN method to determine the neighborhoods of raw surface points. This gives us a means to determine the spatial context in which the latent features of 3D points are compressed by arithmetic coding. As such, the conditional probability model is adaptive to local geometry, leading to significant rate reduction. Additionally, we propose a dual-layer architecture where a non-learning base layer reconstructs the main structures of the point cloud at low complexity, while a learned refinement layer focuses on preserving fine details. This design leads to reductions in model complexity and coding latency by two orders of magnitude compared to SOTA methods. Moreover, we incorporate an implicit neural representation (INR) into the refinement layer, allowing the decoder to sample points on the underlying surface at arbitrary densities. This work is the first to effectively exploit content-aware local contexts for compressing irregular raw point clouds, achieving high rate-distortion performance, low complexity, and the ability to function as an arbitrary-scale upsampling network simultaneously.
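
The neighborhood step mentioned in the abstract can be sketched with a brute-force k-nearest-neighbor search over raw points. This is only a minimal illustration under stated assumptions: the function name is hypothetical, the quadratic search is chosen for clarity rather than speed, and nothing here reflects how the paper's codec actually builds or uses its contexts.

```python
import math


def knn_neighborhoods(points, k):
    """For each point, return the indices of its k nearest other points.

    Brute force, O(n^2 log n); a spatial index would be used for real
    point clouds.
    """
    neighborhoods = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighborhoods.append(others[:k])
    return neighborhoods


# Toy usage on four 3D points.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (5, 5, 5)]
neighbors = knn_neighborhoods(points, k=2)
```

Because the neighborhoods are computed on the raw coordinates, no voxel grid is needed, which is the property the abstract contrasts with voxelization.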

* Accepted by ECCV 2024

Beyond Entity Alignment: Towards Complete Knowledge Graph Alignment via Entity-Relation Synergy

Jul 25, 2024

Xiaohan Fang, Chaozhuo Li, Yi Zhao, Qian Zang, Litian Zhang, Jiquan Peng, Xi Zhang, Jibing Gong

Abstract:Knowledge Graph Alignment (KGA) aims to integrate knowledge from multiple sources to address the limitations of individual Knowledge Graphs (KGs) in terms of coverage and depth. However, current KGA models fall short in achieving a "complete" knowledge graph alignment. Existing models primarily emphasize the linkage of cross-graph entities but overlook aligning relations across KGs, thereby providing only a partial solution to KGA. The semantic correlations embedded in relations are largely overlooked, potentially restricting a comprehensive understanding of cross-KG signals. In this paper, we propose to conceptualize relation alignment as an independent task and conduct KGA by decomposing it into two distinct but highly correlated sub-tasks: entity alignment and relation alignment. To capture the mutually reinforcing correlations between these objectives, we propose a novel Expectation-Maximization-based model, EREM, which iteratively optimizes both sub-tasks. Experimental results on real-world datasets demonstrate that EREM consistently outperforms state-of-the-art models in both entity alignment and relation alignment tasks.

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Jul 21, 2024

Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li (+1 more)

Abstract:Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks across multiple benchmarks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs on realistic multi-image inputs underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. Therefore, in this paper, we propose a new benchmark, MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as confused fine-grained perception, limited multi-image reasoning, and unstable in-context learning. The annotated data in MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
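
The multiple-choice construction described for MII and MKS (one annotated correct option plus distractors) could look roughly like the sketch below. The function name, the dictionary layout, and the example question are all hypothetical; the real MIBench data format should be taken from the dataset itself.

```python
import random


def build_multiple_choice(question, correct, distractors, rng):
    """Shuffle the correct option among distractors and record its letter."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    return {
        "question": question,
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(correct)],
    }


# Toy usage with a made-up question; a seeded RNG keeps the shuffle repeatable.
item = build_multiple_choice(
    "Which animal appears in both images?",
    "cat",
    ["dog", "bird", "fish"],
    random.Random(0),
)
```

Shuffling the position of the correct option avoids rewarding a model that always picks the same letter.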

* 10 pages, 4 figures

GPT4Rec: Graph Prompt Tuning for Streaming Recommendation

Jun 12, 2024

Peiyan Zhang, Yuchen Yan, Xi Zhang, Liying Kang, Chaozhuo Li, Feiran Huang, Senzhang Wang, Sunghun Kim

Abstract:In the realm of personalized recommender systems, the challenge of adapting to evolving user preferences and the continuous influx of new users and items is paramount. Conventional models, typically reliant on a static training-test approach, struggle to keep pace with these dynamic demands. Streaming recommendation, particularly through continual graph learning, has emerged as a novel solution. However, existing methods in this area either rely on historical data replay, which is increasingly impractical due to stringent data privacy regulations; or are unable to effectively address the over-stability issue; or depend on model-isolation and expansion strategies. To tackle these difficulties, we present GPT4Rec, a Graph Prompt Tuning method for streaming Recommendation. Given the evolving user-item interaction graph, GPT4Rec first disentangles the graph patterns into multiple views. After isolating specific interaction patterns and relationships in different views, GPT4Rec utilizes lightweight graph prompts to efficiently guide the model across varying interaction patterns within the user-item graph. Firstly, node-level prompts are employed to instruct the model to adapt to changes in the attributes or properties of individual nodes within the graph. Secondly, structure-level prompts guide the model in adapting to broader patterns of connectivity and relationships within the graph. Finally, view-level prompts are innovatively designed to facilitate the aggregation of information from multiple disentangled views. These prompt designs allow GPT4Rec to synthesize a comprehensive understanding of the graph, ensuring that all vital aspects of the user-item interactions are considered and effectively integrated. Experiments on four diverse real-world datasets demonstrate the effectiveness and efficiency of our proposal.

* Accepted by SIGIR 2024. arXiv admin note: text overlap with arXiv:2303.11700 by other authors

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

Jun 03, 2024

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang

Abstract:Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: planning agent, decision agent, and reflection agent. The planning agent generates task progress, making the navigation of history operations more efficient. To retain focus content, we design a memory unit that updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

* 22 pages, 11 figures, 10 Tables

Uncertainty Quantification on Graph Learning: A Survey

Apr 23, 2024

Chao Chen, Chenghua Guo, Rui Xu, Xiangwen Liao, Xi Zhang, Sihong Xie, Hui Xiong, Philip Yu

Abstract:Graphical models, including Graph Neural Networks (GNNs) and Probabilistic Graphical Models (PGMs), have demonstrated their exceptional capabilities across numerous fields. These models necessitate effective uncertainty quantification to ensure reliable decision-making amid the challenges posed by model training discrepancies and unpredictable testing scenarios. This survey examines recent works that address uncertainty quantification within the model architectures, training, and inference of GNNs and PGMs. We aim to provide an overview of the current landscape of uncertainty in graphical models by organizing the recent methods into uncertainty representation and handling. By summarizing state-of-the-art methods, this survey seeks to deepen the understanding of uncertainty quantification in graphical models, thereby increasing their effectiveness and safety in critical applications.

Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

Apr 08, 2024

Tongkun Su, Jun Li, Xi Zhang, Haibo Jin, Hao Chen, Qiong Wang, Faqin Lv, Baoliang Zhao, Yin Hu

Abstract:Multimodal pre-training, which learns medical visual representations from paired medical reports, has demonstrated its potential in the medical domain. However, many pre-training tasks require extra annotations from clinicians, and most of them fail to explicitly guide the model to learn the desired features of different pathologies. To the best of our knowledge, we are the first to utilize Visual Question Answering (VQA) for multimodal pre-training to guide the framework to focus on targeted pathological features. In this work, we leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. Our framework is applied to four downstream tasks: report generation, classification, segmentation, and detection across five datasets. Extensive experiments demonstrate the superiority of our framework compared to other state-of-the-art methods. Our code will be released upon acceptance.

FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection

Mar 30, 2024

Ziyi Zhou, Xiaoming Zhang, Litian Zhang, Jiacheng Liu, Xi Zhang, Chaozhuo Li

Abstract:Existing benchmarks for fake news detection have significantly contributed to the advancement of models in assessing the authenticity of news content. However, these benchmarks typically focus solely on news pertaining to a single semantic topic or originating from a single platform, thereby failing to capture the diversity of multi-domain news in real scenarios. In order to understand fake news across various domains, external knowledge and fine-grained annotations are indispensable to provide precise evidence and uncover the diverse underlying strategies for fabrication, which are also ignored by existing benchmarks. To address this gap, we introduce a novel multi-domain knowledge-enhanced benchmark with fine-grained annotations, named FineFake. FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms. Each news item is enriched with multi-modal content, potential social context, semi-manually verified common knowledge, and fine-grained annotations that surpass conventional binary labels. Furthermore, we formulate three challenging tasks based on FineFake and propose a knowledge-enhanced domain adaptation network. Extensive experiments are conducted on FineFake under various scenarios, providing accurate and reliable benchmarks for future endeavors. The entire FineFake project is publicly accessible as an open-source repository at https://github.com/Accuser907/FineFake.

A Differential Geometric View and Explainability of GNN on Evolving Graphs

Mar 11, 2024

Yazheng Liu, Xi Zhang, Sihong Xie

Abstract:Graphs are ubiquitous in social networks and biochemistry, where Graph Neural Networks (GNN) are the state-of-the-art models for prediction. Graphs can be evolving and it is vital to formally model and understand how a trained GNN responds to graph evolution. We propose a smooth parameterization of the GNN predicted distributions using axiomatic attribution, where the distributions are on a low-dimensional manifold within a high-dimensional embedding space. We exploit the differential geometric viewpoint to model distributional evolution as smooth curves on the manifold. We reparameterize families of curves on the manifold and design a convex optimization problem to find a unique curve that concisely approximates the distributional evolution for human interpretation. Extensive experiments on node classification, link prediction, and graph classification tasks with evolving graphs demonstrate the better sparsity, faithfulness, and intuitiveness of the proposed method over the state-of-the-art methods.

* Accepted into ICLR 2023
