Entity alignment is to find identical entities in different knowledge graphs. Although embedding-based entity alignment has recently achieved remarkable progress, training data insufficiency remains a critical challenge. Conventional semi-supervised methods also suffer from the incorrect entity alignment in newly proposed training data. To resolve these issues, we design an iterative cycle-teaching framework for semi-supervised entity alignment. The key idea is to train multiple entity alignment models (called aligners) simultaneously and let each aligner iteratively teach its successor the proposed new entity alignment. We propose a diversity-aware alignment selection method to choose reliable entity alignment for each aligner. We also design a conflict resolution mechanism to resolve the alignment conflict when combining the new alignment of an aligner and that from its teacher. Besides, considering the influence of cycle-teaching order, we elaborately design a strategy to arrange the optimal order that can maximize the overall performance of multiple aligners. The cycle-teaching process can break the limitations of each model's learning capability and reduce the noise in new training data, leading to improved performance. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed cycle-teaching framework, which significantly outperforms the state-of-the-art models when the training data is insufficient and the new entity alignment has much noise.
Of theories for why large-scale machine learning models generalize despite being vastly overparameterized, which of their assumptions are needed to capture the qualitative phenomena of generalization in the real world? On one hand, we find that most theoretical analyses fall short of capturing these qualitative phenomena even for kernel regression, when applied to kernels derived from large-scale neural networks (e.g., ResNet-50) and real data (e.g., CIFAR-100). On the other hand, we find that the classical GCV estimator (Craven and Wahba, 1978) accurately predicts generalization risk even in such overparameterized settings. To bolster this empirical finding, we prove that the GCV estimator converges to the generalization risk whenever a local random matrix law holds. Finally, we apply this random matrix theory lens to explain why pretrained representations generalize better as well as what factors govern scaling laws for kernel regression. Our findings suggest that random matrix theory, rather than just being a toy model, may be central to understanding the properties of neural representations in practice.
Query-based video grounding is an important yet challenging task in video understanding, which aims to localize the target segment in an untrimmed video according to a sentence query. Most previous works achieve significant progress by addressing this task in a fully-supervised manner with segment-level labels, which require high labeling cost. Although some recent efforts develop weakly-supervised methods that only need the video-level knowledge, they generally match multiple pre-defined segment proposals with query and select the best one, which lacks fine-grained frame-level details for distinguishing frames with high repeatability and similarity within the entire video. To alleviate the above limitations, we propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting. Firstly, instead of utilizing redundant segment proposals, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantic to predict the possible foreground frames by only using the video-level annotations. Secondly, since some predicted frames (i.e., boundary frames) are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm to learn more discriminative frame-wise representations for distinguishing the false positive frames. In particular, we iteratively explore multi-scale hard negative samples that are close to positive samples in the representation space for distinguishing fine-grained frame-wise details, thus enforcing more accurate segment grounding. Extensive experiments on two challenging benchmarks demonstrate the superiority of our proposed method compared with the state-of-the-art methods.
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query. Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information. Such frame-level feature extraction leads to the obstacles of these methods in distinguishing ambiguous video frames with complicated contents and subtle appearance differences, thus limiting their performance. In order to differentiate fine-grained appearance similarities among consecutive frames, some state-of-the-art methods additionally employ a detection model like Faster R-CNN to obtain detailed object-level features in each frame for filtering out the redundant background contents. However, these methods suffer from missing motion analysis since the object detection module in Faster R-CNN lacks temporal modeling. To alleviate the above limitations, in this paper, we propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features to better reason the spatial-temporal object relations for accurately modelling the activity among consecutive frames. Specifically, we first develop three individual branches for motion, appearance, and 3D encoding separately to learn fine-grained motion-guided, appearance-guided, and 3D-aware object features, respectively. Then, both motion and appearance information from corresponding branches are associated to enhance the 3D-aware features for the final precise grounding. Extensive experiments on three challenging datasets (ActivityNet Caption, Charades-STA and TACoS) demonstrate that the proposed MA3SRN model achieves a new state-of-the-art.
It is now known that the expressive power of graph convolutional neural nets (GCN) does not grow infinitely with the number of layers. Instead, the GCN output approaches a subspace spanned by the first eigenvector of the normalized graph Laplacian matrix with the convergence rate characterized by the "eigen-gap": the difference between the Laplacian's first two distinct eigenvalues. To promote a deeper GCN architecture with sufficient expressiveness, in this paper, given an empirical covariance matrix $\bar{C}$ computed from observable data, we learn a sparse graph Laplacian matrix $L$ closest to $\bar{C}^{-1}$ while maintaining a desirable eigen-gap that slows down convergence. Specifically, we first define a sparse graph learning problem with constraints on the first eigenvector (the most common signal) and the eigen-gap. We solve the corresponding dual problem greedily, where a locally optimal eigen-pair is computed one at a time via a fast approximation of a semi-definite programming (SDP) formulation. The computed $L$ with the desired eigen-gap is normalized spectrally and used for supervised training of GCN for a targeted task. Experiments show that our proposal produced deeper GCNs and smaller errors compared to a competing scheme without explicit eigen-gap optimization.
High quality energy systems information is a crucial input to energy systems research, modeling, and decision-making. Unfortunately, precise information about energy systems is often of limited availability, incomplete, or only accessible for a substantial fee or through a non-disclosure agreement. Recently, remotely sensed data (e.g., satellite imagery, aerial photography) have emerged as a potentially rich source of energy systems information. However, the use of these data is frequently challenged by its sheer volume and complexity, precluding manual analysis. Recent breakthroughs in machine learning have enabled automated and rapid extraction of useful information from remotely sensed data, facilitating large-scale acquisition of critical energy system variables. Here we present a systematic review of the literature on this emerging topic, providing an in-depth survey and review of papers published within the past two decades. We first taxonomize the existing literature into ten major areas, spanning the energy value chain. Within each research area, we distill and critically discuss major features that are relevant to energy researchers, including, for example, key challenges regarding the accessibility and reliability of the methods. We then synthesize our findings to identify limitations and trends in the literature as a whole, and discuss opportunities for innovation.
3D dynamic point clouds provide a discrete representation of real-world objects or scenes in motion, which have been widely applied in immersive telepresence, autonomous driving, surveillance, \textit{etc}. However, point clouds acquired from sensors are usually perturbed by noise, which affects downstream tasks such as surface reconstruction and analysis. Although many efforts have been made for static point cloud denoising, few works address dynamic point cloud denoising. In this paper, we propose a novel gradient-based dynamic point cloud denoising method, exploiting the temporal correspondence for the estimation of gradient fields -- also a fundamental problem in dynamic point cloud processing and analysis. The gradient field is the gradient of the log-probability function of the noisy point cloud, based on which we perform gradient ascent so as to converge each point to the underlying clean surface. We estimate the gradient of each surface patch by exploiting the temporal correspondence, where the temporally corresponding patches are searched leveraging on rigid motion in classical mechanics. In particular, we treat each patch as a rigid object, which moves in the gradient field of an adjacent frame via force until reaching a balanced state, i.e., when the sum of gradients over the patch reaches 0. Since the gradient would be smaller when the point is closer to the underlying surface, the balanced patch would fit the underlying surface well, thus leading to the temporal correspondence. Finally, the position of each point in the patch is updated along the direction of the gradient averaged from corresponding patches in adjacent frames. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods.
Knowledge graphs (KGs) have become a valuable asset for many AI applications. Although some KGs contain plenty of facts, they are widely acknowledged as incomplete. To address this issue, many KG completion methods are proposed. Among them, open KG completion methods leverage the Web to find missing facts. However, noisy data collected from diverse sources may damage the completion accuracy. In this paper, we propose a new trustworthy method that exploits facts for a KG based on multi-sourced noisy data and existing facts in the KG. Specifically, we introduce a graph neural network with a holistic scoring function to judge the plausibility of facts with various value types. We design value alignment networks to resolve the heterogeneity between values and map them to entities even outside the KG. Furthermore, we present a truth inference model that incorporates data source qualities into the fact scoring function, and design a semi-supervised learning way to infer the truths from heterogeneous values. We conduct extensive experiments to compare our method with the state-of-the-arts. The results show that our method achieves superior accuracy not only in completing missing facts but also in discovering new facts.
Entity alignment is a crucial step in integrating knowledge graphs (KGs) from multiple sources. Previous attempts at entity alignment have explored different KG structures, such as neighborhood-based and path-based contexts, to learn entity embeddings, but they are limited in capturing the multi-context features. Moreover, most approaches directly utilize the embedding similarity to determine entity alignment without considering the global interaction among entities and relations. In this work, we propose an Informed Multi-context Entity Alignment (IMEA) model to address these issues. In particular, we introduce Transformer to flexibly capture the relation, path, and neighborhood contexts, and design holistic reasoning to estimate alignment probabilities based on both embedding similarity and the relation/entity functionality. The alignment evidence obtained from holistic reasoning is further injected back into the Transformer via the proposed soft label editing to inform embedding learning. Experimental results on several benchmark datasets demonstrate the superiority of our IMEA model compared with existing state-of-the-art entity alignment methods.
Training with the true labels of a dataset as opposed to randomized labels leads to faster optimization and better generalization. This difference is attributed to a notion of alignment between inputs and labels in natural datasets. We find that training neural networks with different architectures and optimizers on random or true labels enforces the same relationship between the hidden representations and the training labels, elucidating why neural network representations have been so successful for transfer. We first highlight why aligned features promote transfer and show in a classic synthetic transfer problem that alignment is the determining factor for positive and negative transfer to similar and dissimilar tasks. We then investigate a variety of neural network architectures and find that (a) alignment emerges across a variety of different architectures and optimizers, with more alignment arising from depth (b) alignment increases for layers closer to the output and (c) existing high-performance deep CNNs exhibit high levels of alignment.