Alert button
Picture for Chen Cai

Chen Cai

Alert button

Local-to-global Perspectives on Graph Neural Networks

Jun 18, 2023
Chen Cai

Figure 1 for Local-to-global Perspectives on Graph Neural Networks
Figure 2 for Local-to-global Perspectives on Graph Neural Networks
Figure 3 for Local-to-global Perspectives on Graph Neural Networks
Figure 4 for Local-to-global Perspectives on Graph Neural Networks

This thesis presents a local-to-global perspective on graph neural networks (GNN), the leading architecture to process graph-structured data. After categorizing GNN into local Message Passing Neural Networks (MPNN) and global Graph transformers, we present three pieces of work: 1) study the convergence property of a type of global GNN, Invariant Graph Networks, 2) connect the local MPNN and global Graph Transformer, and 3) use local MPNN for graph coarsening, a standard subroutine used in global modeling.

* Ph.D. thesis from UC San Diego that includes three works arXiv:2201.10129, arXiv:2301.11956, arXiv:2102.01350 
Viaarxiv icon

Top-Down Viewing for Weakly Supervised Grounded Image Captioning

Jun 15, 2023
Chen Cai, Suchen Wang, Kim-hui Yap

Figure 1 for Top-Down Viewing for Weakly Supervised Grounded Image Captioning
Figure 2 for Top-Down Viewing for Weakly Supervised Grounded Image Captioning
Figure 3 for Top-Down Viewing for Weakly Supervised Grounded Image Captioning
Figure 4 for Top-Down Viewing for Weakly Supervised Grounded Image Captioning

Weakly supervised grounded image captioning (WSGIC) aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) first apply an off-the-shelf object detector to encode the input image into multiple region features; (2) and then leverage a soft-attention mechanism for captioning and grounding. However, object detectors are mainly designed to extract object semantics (i.e., the object category). Besides, they break down the structural images into pieces of individual proposals. As a result, the subsequent grounded captioner is often overfitted to find the correct object words, while overlooking the relation between objects (e.g., what is the person doing?), and selecting incompatible proposal regions for grounding. To address these difficulties, we propose a one-stage weakly supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. In addition, we explicitly inject a relation module into our one-stage framework to encourage the relation understanding through multi-label classification. The relation semantics aid the prediction of relation words in the caption. We observe that the relation words not only assist the grounded captioner in generating a more accurate caption but also improve the grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flick30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.

Viaarxiv icon

On the Connection Between MPNN and Graph Transformer

Feb 03, 2023
Chen Cai, Truong Son Hy, Rose Yu, Yusu Wang

Figure 1 for On the Connection Between MPNN and Graph Transformer
Figure 2 for On the Connection Between MPNN and Graph Transformer
Figure 3 for On the Connection Between MPNN and Graph Transformer
Figure 4 for On the Connection Between MPNN and Graph Transformer

Graph Transformer (GT) recently has emerged as a new paradigm of graph learning algorithms, outperforming the previously popular Message Passing Neural Network (MPNN) on multiple benchmarks. Previous work (Kim et al., 2022) shows that with proper position embedding, GT can approximate MPNN arbitrarily well, implying that GT is at least as powerful as MPNN. In this paper, we study the inverse connection and show that MPNN with virtual node (VN), a commonly used heuristic with little theoretical understanding, is powerful enough to arbitrarily approximate the self-attention layer of GT. In particular, we first show that if we consider one type of linear transformer, the so-called Performer/Linear Transformer (Choromanski et al., 2020; Katharopoulos et al., 2020), then MPNN + VN with only O(1) depth and O(1) width can approximate a self-attention layer in Performer/Linear Transformer. Next, via a connection between MPNN + VN and DeepSets, we prove the MPNN + VN with O(n^d) width and O(1) depth can approximate the self-attention layer arbitrarily well, where d is the input feature dimension. Lastly, under some assumptions, we provide an explicit construction of MPNN + VN with O(1) width and O(n) depth approximating the self-attention layer in GT arbitrarily well. On the empirical side, we demonstrate that 1) MPNN + VN is a surprisingly strong baseline, outperforming GT on the recently proposed Long Range Graph Benchmark (LRGB) dataset, 2) our MPNN + VN improves over early implementation on a wide range of OGB datasets and 3) MPNN + VN outperforms Linear Transformer and MPNN on the climate modeling task.

Viaarxiv icon

Traffic4cast at NeurIPS 2021 -- Temporal and Spatial Few-Shot Transfer Learning in Gridded Geo-Spatial Processes

Apr 01, 2022
Christian Eichenberger, Moritz Neun, Henry Martin, Pedro Herruzo, Markus Spanring, Yichao Lu, Sungbin Choi, Vsevolod Konyakhin, Nina Lukashina, Aleksei Shpilman, Nina Wiedemann, Martin Raubal, Bo Wang, Hai L. Vu, Reza Mohajerpoor, Chen Cai, Inhi Kim, Luca Hermes, Andrew Melnik, Riza Velioglu, Markus Vieth, Malte Schilling, Alabi Bojesomo, Hasan Al Marzouqi, Panos Liatsis, Jay Santokhi, Dylan Hillier, Yiming Yang, Joned Sarwar, Anna Jordan, Emil Hewage, David Jonietz, Fei Tang, Aleksandra Gruca, Michael Kopp, David Kreil, Sepp Hochreiter

Figure 1 for Traffic4cast at NeurIPS 2021 -- Temporal and Spatial Few-Shot Transfer Learning in Gridded Geo-Spatial Processes
Figure 2 for Traffic4cast at NeurIPS 2021 -- Temporal and Spatial Few-Shot Transfer Learning in Gridded Geo-Spatial Processes
Figure 3 for Traffic4cast at NeurIPS 2021 -- Temporal and Spatial Few-Shot Transfer Learning in Gridded Geo-Spatial Processes
Figure 4 for Traffic4cast at NeurIPS 2021 -- Temporal and Spatial Few-Shot Transfer Learning in Gridded Geo-Spatial Processes

The IARAI Traffic4cast competitions at NeurIPS 2019 and 2020 showed that neural networks can successfully predict future traffic conditions 1 hour into the future on simply aggregated GPS probe data in time and space bins. We thus reinterpreted the challenge of forecasting traffic conditions as a movie completion task. U-Nets proved to be the winning architecture, demonstrating an ability to extract relevant features in this complex real-world geo-spatial process. Building on the previous competitions, Traffic4cast 2021 now focuses on the question of model robustness and generalizability across time and space. Moving from one city to an entirely different city, or moving from pre-COVID times to times after COVID hit the world thus introduces a clear domain shift. We thus, for the first time, release data featuring such domain shifts. The competition now covers ten cities over 2 years, providing data compiled from over 10^12 GPS probe data. Winning solutions captured traffic dynamics sufficiently well to even cope with these complex domain shifts. Surprisingly, this seemed to require only the previous 1h traffic dynamic history and static road graph as input.

* Pre-print under review, submitted to Proceedings of Machine Learning Research 
Viaarxiv icon

Generative Coarse-Graining of Molecular Conformations

Jan 28, 2022
Wujie Wang, Minkai Xu, Chen Cai, Benjamin Kurt Miller, Tess Smidt, Yusu Wang, Jian Tang, Rafael Gómez-Bombarelli

Figure 1 for Generative Coarse-Graining of Molecular Conformations
Figure 2 for Generative Coarse-Graining of Molecular Conformations
Figure 3 for Generative Coarse-Graining of Molecular Conformations
Figure 4 for Generative Coarse-Graining of Molecular Conformations

Coarse-graining (CG) of molecular simulations simplifies the particle representation by grouping selected atoms into pseudo-beads and therefore drastically accelerates simulation. However, such CG procedure induces information losses, which makes accurate backmapping, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmapping transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we further provide three comprehensive benchmarks based on molecular dynamics trajectories. Extensive experiments show that our approach always recovers more realistic structures and outperforms existing data-driven methods with a significant margin.

* 23 pages, 11 figures 
Viaarxiv icon

Convergence of Invariant Graph Networks

Jan 25, 2022
Chen Cai, Yusu Wang

Figure 1 for Convergence of Invariant Graph Networks
Figure 2 for Convergence of Invariant Graph Networks
Figure 3 for Convergence of Invariant Graph Networks
Figure 4 for Convergence of Invariant Graph Networks

Although theoretical properties such as expressive power and over-smoothing of graph neural networks (GNN) have been extensively studied recently, its convergence property is a relatively new direction. In this paper, we investigate the convergence of one powerful GNN, Invariant Graph Network (IGN) over graphs sampled from graphons. We first prove the stability of linear layers for general $k$-IGN (of order $k$) based on a novel interpretation of linear equivariant layers. Building upon this result, we prove the convergence of $k$-IGN under the model of \citet{ruiz2020graphon}, where we access the edge weight but the convergence error is measured for graphon inputs. Under the more natural (and more challenging) setting of \citet{keriven2020convergence} where one can only access 0-1 adjacency matrix sampled according to edge probability, we first show a negative result that the convergence of any IGN is not possible. We then obtain the convergence of a subset of IGNs, denoted as IGN-small, after the edge probability estimation. We show that IGN-small still contains function class rich enough that can approximate spectral GNNs arbitrarily well. Lastly, we perform experiments on various graphon models to verify our statements.

* 28 pages, 11 figures 
Viaarxiv icon

Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet

Nov 10, 2021
Bo Wang, Reza Mohajerpoor, Chen Cai, Inhi Kim, Hai L. Vu

Figure 1 for Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet
Figure 2 for Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet
Figure 3 for Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet
Figure 4 for Traffic4cast -- Large-scale Traffic Prediction using 3DResNet and Sparse-UNet

The IARAI competition Traffic4cast 2021 aims to predict short-term city-wide high-resolution traffic states given the static and dynamic traffic information obtained previously. The aim is to build a machine learning model for predicting the normalized average traffic speed and flow of the subregions of multiple large-scale cities using historical data points. The model is supposed to be generic, in a way that it can be applied to new cities. By considering spatiotemporal feature learning and modeling efficiency, we explore 3DResNet and Sparse-UNet approaches for the tasks in this competition. The 3DResNet based models use 3D convolution to learn the spatiotemporal features and apply sequential convolutional layers to enhance the temporal relationship of the outputs. The Sparse-UNet model uses sparse convolutions as the backbone for spatiotemporal feature learning. Since the latter algorithm mainly focuses on non-zero data points of the inputs, it dramatically reduces the computation time, while maintaining a competitive accuracy. Our results show that both of the proposed models achieve much better performance than the baseline algorithms. The codes and pretrained models are available at https://github.com/resuly/Traffic4Cast-2021.

Viaarxiv icon

Equivariant Subgraph Aggregation Networks

Oct 06, 2021
Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath Balamurugan, Michael M. Bronstein, Haggai Maron

Figure 1 for Equivariant Subgraph Aggregation Networks
Figure 2 for Equivariant Subgraph Aggregation Networks
Figure 3 for Equivariant Subgraph Aggregation Networks
Figure 4 for Equivariant Subgraph Aggregation Networks

Message-passing neural networks (MPNNs) are the leading architecture for deep learning on graph-structured data, in large part due to their simplicity and scalability. Unfortunately, it was shown that these architectures are limited in their expressive power. This paper proposes a novel framework called Equivariant Subgraph Aggregation Networks (ESAN) to address this issue. Our main observation is that while two graphs may not be distinguishable by an MPNN, they often contain distinguishable subgraphs. Thus, we propose to represent each graph as a set of subgraphs derived by some predefined policy, and to process it using a suitable equivariant architecture. We develop novel variants of the 1-dimensional Weisfeiler-Leman (1-WL) test for graph isomorphism, and prove lower bounds on the expressiveness of ESAN in terms of these new WL variants. We further prove that our approach increases the expressive power of both MPNNs and more expressive architectures. Moreover, we provide theoretical results that describe how design choices such as the subgraph selection policy and equivariant neural architecture affect our architecture's expressive power. To deal with the increased computational cost, we propose a subgraph sampling scheme, which can be viewed as a stochastic version of our framework. A comprehensive set of experiments on real and synthetic datasets demonstrates that our framework improves the expressive power and overall performance of popular GNN architectures.

* 42 pages 
Viaarxiv icon