Abstract:Graph contrastive learning (GCL), standing as the dominant paradigm in the realm of graph pre-training, has yielded considerable progress. Nonetheless, its capacity for out-of-distribution (OOD) generalization has been relatively underexplored. In this work, we point out that the traditional optimization of InfoNCE in GCL restricts the cross-domain pairs only to be negative samples, which inevitably enlarges the distribution gap between different domains. This violates the requirement of domain invariance under OOD scenario and consequently impairs the model's OOD generalization performance. To address this issue, we propose a novel strategy "Negative as Positive", where the most semantically similar cross-domain negative pairs are treated as positive during GCL. Our experimental results, spanning a wide array of datasets, confirm that this method substantially improves the OOD generalization performance of GCL.
Abstract:Test-time adaptation (TTA) aims to improve model generalizability when test data diverges from training distribution, offering the distinct advantage of not requiring access to training data and processes, especially valuable in the context of large pre-trained models. However, current TTA methods fail to address the fundamental issue: covariate shift, i.e., the decreased generalizability can be attributed to the model's reliance on the marginal distribution of the training data, which may impair model calibration and introduce confirmation bias. To address this, we propose a novel energy-based perspective, enhancing the model's perception of target data distributions without requiring access to training data or processes. Building on this perspective, we introduce $\textbf{T}$est-time $\textbf{E}$nergy $\textbf{A}$daptation ($\textbf{TEA}$), which transforms the trained classifier into an energy-based model and aligns the model's distribution with the test data's, enhancing its ability to perceive test distributions and thus improving overall generalizability. Extensive experiments across multiple tasks, benchmarks and architectures demonstrate TEA's superior generalization performance against state-of-the-art methods. Further in-depth analyses reveal that TEA can equip the model with a comprehensive perception of test distribution, ultimately paving the way toward improved generalization and calibration.
Abstract:The generalization of neural networks is a central challenge in machine learning, especially concerning the performance under distributions that differ from training ones. Current methods, mainly based on the data-driven paradigm such as data augmentation, adversarial training, and noise injection, may encounter limited generalization due to model non-smoothness. In this paper, we propose to investigate generalization from a Partial Differential Equation (PDE) perspective, aiming to enhance it directly through the underlying function of neural networks, rather than focusing on adjusting input data. Specifically, we first establish the connection between neural network generalization and the smoothness of the solution to a specific PDE, namely ``transport equation''. Building upon this, we propose a general framework that introduces adaptive distributional diffusion into transport equation to enhance the smoothness of its solution, thereby improving generalization. In the context of neural networks, we put this theoretical framework into practice as PDE+ (\textbf{PDE} with \textbf{A}daptive \textbf{D}istributional \textbf{D}iffusion) which diffuses each sample into a distribution covering semantically similar inputs. This enables better coverage of potentially unobserved distributions in training, thus improving generalization beyond merely data-driven methods. The effectiveness of PDE+ is validated in extensive settings, including clean samples and various corruptions, demonstrating its superior performance compared to SOTA methods.
Abstract:Graph contrastive learning (GCL) emerges as the most representative approach for graph representation learning, which leverages the principle of maximizing mutual information (InfoMax) to learn node representations applied in downstream tasks. To explore better generalization from GCL to downstream tasks, previous methods heuristically define data augmentation or pretext tasks. However, the generalization ability of GCL and its theoretical principle are still less reported. In this paper, we first propose a metric named GCL-GE for GCL generalization ability. Considering the intractability of the metric due to the agnostic downstream task, we theoretically prove a mutual information upper bound for it from an information-theoretic perspective. Guided by the bound, we design a GCL framework named InfoAdv with enhanced generalization ability, which jointly optimizes the generalization metric and InfoMax to strike the right balance between pretext task fitting and the generalization ability on downstream tasks. We empirically validate our theoretical findings on a number of representative benchmarks, and experimental results demonstrate that our model achieves state-of-the-art performance.
Abstract:Improving the scalability of GNNs is critical for large graphs. Existing methods leverage three sampling paradigms including node-wise, layer-wise and subgraph sampling, then design unbiased estimator for scalability. However, the high variance still severely hinders GNNs' performance. On account that previous studies either lacks variance analysis or only focus on a particular sampling paradigm, we firstly propose an unified node sampling variance analysis framework and analyze the core challenge "circular dependency" for deriving the minimum variance sampler, i. e., sampling probability depends on node embeddings while node embeddings can not be calculated until sampling is finished. Existing studies either ignore the node embeddings or introduce external parameters, resulting in the lack of a both efficient and effective variance reduction methods. Therefore, we propose the \textbf{H}ierarchical \textbf{E}stimation based \textbf{S}ampling GNN (HE-SGNN) with first level estimating the node embeddings in sampling probability to break circular dependency, and second level employing sampling GNN operator to estimate the nodes' representations on the entire graph. Considering the technical difference, we propose different first level estimator, i.e., a time series simulation for layer-wise sampling and a feature based simulation for subgraph sampling. The experimental results on seven representative datasets demonstrate the effectiveness and efficiency of our method.