This paper presents $\mu\text{KG}$, an open-source Python library for representation learning over knowledge graphs. $\mu\text{KG}$ supports joint representation learning over multi-source knowledge graphs (and also a single knowledge graph), multiple deep learning libraries (PyTorch and TensorFlow2), multiple embedding tasks (link prediction, entity alignment, entity typing, and multi-source link prediction), and multiple parallel computing modes (multi-process and multi-GPU computing). It currently implements 26 popular knowledge graph embedding models and supports 16 benchmark datasets. $\mu\text{KG}$ provides advanced implementations of embedding techniques with simplified pipelines of different tasks. It also comes with high-quality documentation for ease of use. $\mu\text{KG}$ is more comprehensive than existing knowledge graph embedding libraries. It is useful for a thorough comparison and analysis of various embedding models and tasks. We show that the jointly learned embeddings can greatly help knowledge-powered downstream tasks, such as multi-hop knowledge graph question answering. We will stay abreast of the latest developments in the related fields and incorporate them into $\mu\text{KG}$.
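To make the flavor of the implemented embedding models concrete, the classic TransE scoring function (one of the popular knowledge graph embedding models such a library covers) can be sketched in a few lines. This is an illustrative toy, with made-up names and vectors; it does not reflect $\mu\text{KG}$'s actual API.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 norm of h + r - t.

    Higher (less negative) scores indicate a more plausible triple.
    h, r, t are embedding vectors for head entity, relation, and tail entity.
    """
    return -np.linalg.norm(h + r - t)

# Toy embeddings: the triple (h, r, t_true) is constructed to hold exactly.
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_true = np.array([1.0, 1.0])    # h + r == t_true
t_false = np.array([3.0, -2.0])  # an implausible tail

assert transe_score(h, r, t_true) > transe_score(h, r, t_false)
```

In link prediction, such a score is computed for every candidate tail entity and the candidates are ranked by it; other models in the library differ mainly in how this scoring function is defined.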
Temporal sentence grounding (TSG) is an important yet challenging task in multimedia information retrieval. Although previous TSG methods have achieved decent performance, they tend to capture the selection biases of frequently appearing video-query pairs in the dataset rather than exhibit robust multimodal reasoning abilities, especially for rarely appearing pairs. In this paper, we study this issue of selection biases and accordingly propose a Debiasing-TSG (D-TSG) model to filter and remove the negative biases in both the vision and language modalities, enhancing the model's generalization ability. Specifically, we propose to alleviate the issue from two perspectives: 1) Feature distillation. We build a multi-modal debiasing branch to first capture the vision and language biases, and then apply a bias identification module to explicitly recognize the true negative biases and remove them from the benign multi-modal representations. 2) Contrastive sample generation. We construct two types of negative samples to force the model to accurately learn the aligned multi-modal semantics and perform complete semantic reasoning. We apply the proposed model to both commonly and rarely appearing TSG cases, and demonstrate its effectiveness by achieving state-of-the-art performance on three benchmark datasets (ActivityNet Caption, TACoS, and Charades-STA).
This paper addresses the problem of natural language video localization (NLVL). Almost all existing works follow the "only look once" framework that exploits a single model to directly capture the complex cross- and self-modal relations among video-query pairs and retrieve the relevant segment. However, we argue that these methods have overlooked two indispensable characteristics of an ideal localization method: 1) Frame-differentiable: considering the imbalance of positive/negative video frames, it is effective to highlight positive frames and weaken negative ones during the localization. 2) Boundary-precise: to predict the exact segment boundary, the model should capture more fine-grained differences between consecutive frames since their variations are often smooth. To this end, inspired by how humans perceive and localize a segment, we propose a two-step human-like framework called Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module. The SL module first refers to the query semantics and selects the best-matched frame from the video while filtering out irrelevant frames. Then, the BP module constructs an initial segment based on this frame, and dynamically updates it by exploring adjacent frames until no frame shares the same activity semantics. Experimental results on three challenging benchmarks show that our SLP is superior to the state-of-the-art methods and localizes more precise segment boundaries.
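The locate-then-peruse procedure described above can be sketched as a simple expansion loop. This is a minimal toy version that assumes precomputed per-frame matching scores and consecutive-frame semantic similarities as inputs; it is not the authors' implementation, and the threshold is an illustrative stand-in for the learned semantic test.

```python
def skim_locate_peruse(frame_scores, sim, threshold=0.5):
    """Toy two-step localization.

    frame_scores[i]: query-frame matching score (skimming/locating step).
    sim[i]: semantic similarity between frame i and frame i+1 (perusing step).
    Returns the (start, end) inclusive indices of the localized segment.
    """
    # Locating: pick the frame that best matches the query.
    anchor = max(range(len(frame_scores)), key=frame_scores.__getitem__)
    start = end = anchor
    # Bi-directional perusing: grow the segment while consecutive
    # frames still share the same activity semantics.
    while start > 0 and sim[start - 1] >= threshold:
        start -= 1
    while end < len(frame_scores) - 1 and sim[end] >= threshold:
        end += 1
    return start, end

# Frame 2 matches best; frames 1-3 share the activity, frames 0 and 4 do not.
print(skim_locate_peruse([0.1, 0.2, 0.9, 0.3, 0.1], [0.2, 0.8, 0.7, 0.1]))
```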
With the increasing attention to various 3D safety-critical applications, point cloud learning models have been shown to be vulnerable to adversarial attacks. Although existing 3D attack methods achieve high success rates, they operate in the data space with point-wise perturbations, which may neglect geometric characteristics. Instead, we propose point cloud attacks from a new perspective -- the graph spectral domain attack -- aiming to perturb graph transform coefficients in the spectral domain, which corresponds to varying certain geometric structures. Specifically, leveraging graph signal processing, we first adaptively transform the coordinates of points into the spectral domain via the graph Fourier transform (GFT) for compact representation. Then, we analyze the influence of different spectral bands on the geometric structure, based on which we propose to perturb the GFT coefficients via a learnable graph spectral filter. Considering that the low-frequency components mainly contribute to the rough shape of the 3D object, we further introduce a low-frequency constraint to limit perturbations to imperceptible high-frequency components. Finally, the adversarial point cloud is generated by transforming the perturbed spectral representation back to the data domain via the inverse GFT. Experimental results demonstrate the effectiveness of the proposed attack in terms of both imperceptibility and attack success rate.
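The GFT pipeline (graph construction, forward transform, constrained spectral perturbation, inverse transform) can be sketched as follows. This is a simplified illustration under stated assumptions: a k-NN graph with unit edge weights, the combinatorial Laplacian eigenbasis as the GFT, and random noise standing in for the paper's learnable graph spectral filter.

```python
import numpy as np

def gft_perturb(points, k=4, n_low=3, eps=0.05, seed=0):
    """Sketch of a spectral-domain point cloud perturbation.

    points: (N, 3) coordinates. Builds a k-NN graph, computes the graph
    Laplacian eigenbasis, transforms the coordinates via the GFT, perturbs
    only high-frequency coefficients (the first n_low low-frequency ones
    are kept intact, mimicking the low-frequency constraint), and maps
    back via the inverse GFT.
    """
    rng = np.random.default_rng(seed)
    n = len(points)
    # Symmetric k-NN adjacency from pairwise Euclidean distances.
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, 1:k + 1]  # skip self (distance 0)
    w = np.zeros((n, n))
    for i in range(n):
        w[i, idx[i]] = 1.0
    w = np.maximum(w, w.T)
    lap = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order, i.e. low to high frequency.
    _, u = np.linalg.eigh(lap)
    coeff = u.T @ points                   # forward GFT of coordinates
    noise = eps * rng.standard_normal(coeff.shape)
    noise[:n_low] = 0.0                    # low-frequency constraint
    return u @ (coeff + noise)             # inverse GFT
```

Because the eigenbasis is orthonormal, setting `eps=0` recovers the input exactly, and the perturbation energy is confined to the high-frequency bands by construction.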
Entity alignment is a basic and vital technique in knowledge graph (KG) integration. Over the years, research on entity alignment has rested on the assumption that KGs are static, which neglects the growing nature of real-world KGs. As KGs grow, previous alignment results need to be revisited while new entity alignments wait to be discovered. In this paper, we propose and dive into a realistic yet unexplored setting, referred to as continual entity alignment. To avoid retraining an entire model on the whole KGs whenever new entities and triples arrive, we present a continual alignment method for this task. It reconstructs an entity's representation based on entity adjacency, enabling it to generate embeddings for new entities quickly and inductively from their existing neighbors. It selects and replays partial pre-aligned entity pairs to train only parts of the KGs while extracting trustworthy alignments for knowledge augmentation. As growing KGs inevitably contain non-matchable entities, unlike previous works, the proposed method employs bidirectional nearest-neighbor matching to find new entity alignments and update old ones. Furthermore, we construct new datasets by simulating the growth of multilingual DBpedia. Extensive experiments demonstrate that our continual alignment method is more effective than baselines based on retraining or inductive learning.
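The bidirectional (mutual) nearest-neighbor matching step can be sketched as below. This is a minimal toy under the assumption of cosine similarity over entity embeddings; the function name is illustrative and the full method involves much more than this matching rule.

```python
import numpy as np

def bidirectional_nn_match(src_emb, tgt_emb):
    """Mutual nearest-neighbor matching between two embedding sets.

    A source entity i is aligned to target entity j only when j is the
    nearest target to i AND i is the nearest source to j; entities failing
    this mutual test stay unmatched, which naturally tolerates
    non-matchable entities. Similarity is cosine.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T
    s2t = sim.argmax(axis=1)  # nearest target for each source entity
    t2s = sim.argmax(axis=0)  # nearest source for each target entity
    return [(i, int(j)) for i, j in enumerate(s2t) if t2s[j] == i]
```

Entities left unmatched by the mutual test (e.g. a newly added entity with no counterpart yet) simply produce no pair, rather than being forced into a spurious alignment.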
Document-level relation extraction (RE) aims to identify the relations between entities throughout an entire document. This requires complex reasoning skills to synthesize various knowledge, such as coreference and commonsense. Large-scale knowledge graphs (KGs) contain a wealth of real-world facts and can provide valuable knowledge for document-level RE. In this paper, we propose an entity knowledge injection framework to enhance current document-level RE models. Specifically, we introduce coreference distillation to inject coreference knowledge, endowing an RE model with the more general capability of coreference reasoning. We also employ representation reconciliation to inject factual knowledge and aggregate KG representations and document representations into a unified space. Experiments on two benchmark datasets validate the generalization of our entity knowledge injection framework and its consistent improvements to several document-level RE models.
This paper is the first to provide a thorough system design overview, along with the fusion method selection criteria, of a real-world cooperative autonomous driving system, named Infrastructure-Augmented Autonomous Driving (IAAD). We present an in-depth introduction of the IAAD hardware and software on both the road-side and vehicle-side computing and communication platforms. We extensively characterize the IAAD system in the context of real-world deployment scenarios and observe that fluctuating network conditions along the road are currently the main technical roadblock for cooperative autonomous driving. To address this challenge, we propose new fusion methods, dubbed "inter-frame fusion" and "planning fusion", to complement the current state-of-the-art "intra-frame fusion". We demonstrate that each fusion method has its own benefits and constraints.
3D dynamic point clouds provide a discrete representation of real-world objects or scenes in motion, and have been widely applied in immersive telepresence, autonomous driving, surveillance, etc. However, point clouds acquired from sensors are usually perturbed by noise, which affects downstream tasks such as surface reconstruction and analysis. Although many efforts have been made for static point cloud denoising, dynamic point cloud denoising remains under-explored. In this paper, we propose a novel gradient-field-based dynamic point cloud denoising method, exploiting the temporal correspondence via the estimation of gradient fields -- a fundamental problem in dynamic point cloud processing and analysis. The gradient field is the gradient of the log-probability function of the noisy point cloud, based on which we perform gradient ascent so as to converge each point to the underlying clean surface. We estimate the gradient of each surface patch and exploit the temporal correspondence, where the temporally corresponding patches are searched by leveraging rigid motion from classical mechanics. In particular, we treat each patch as a rigid object that moves in the gradient field of an adjacent frame under force until reaching a balanced state, i.e., when the sum of gradients over the patch reaches 0. Since the gradient is smaller when a point is closer to the underlying surface, the balanced patch fits the underlying surface well, thus establishing the temporal correspondence. Finally, the position of each point in the patch is updated along the direction of the gradient averaged over corresponding patches in adjacent frames. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods under both synthetic noise and simulated real-world noise.
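The core gradient-ascent update (move each point along the gradient of the log-probability toward the clean surface) can be sketched with a toy analytic gradient field. Here the "clean surface" is assumed to be the plane z = 0 so the field can be written in closed form; the actual method estimates the field from the noisy cloud and averages it across temporally corresponding patches.

```python
import numpy as np

def denoise_step(points, grad_fn, step=0.1):
    """One gradient-ascent denoising step: move each point along the
    gradient of the log-probability (score) of the underlying surface."""
    return points + step * grad_fn(points)

# Toy gradient field: log-probability -z^2/2 peaks on the plane z = 0,
# so its gradient (0, 0, -z) points back toward the plane.
def plane_grad(points):
    g = np.zeros_like(points)
    g[:, 2] = -points[:, 2]
    return g

noisy = np.array([[0.0, 0.0, 0.5], [1.0, 1.0, -0.3]])
x = noisy
for _ in range(50):
    x = denoise_step(x, plane_grad, step=0.2)
# After repeated steps, the points converge toward the plane z = 0
# while their in-plane coordinates are untouched.
```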
In the past few years, there has been dramatic growth in e-manga (electronic Japanese-style comics). Faced with the booming demand for manga research and the large amount of unlabeled manga data, we raise a new task, called unsupervised manga character re-identification. However, the artistic expression and stylistic limitations of manga pose many challenges to the re-identification problem. Inspired by the idea that some content-related features may help clustering, we propose a Face-body and Spatial-temporal Associated Clustering method (FSAC). In the face-body combination module, a face-body graph is constructed to address problems such as exaggeration and deformation in artistic creation by exploiting the integrity of the image. In the spatial-temporal relationship correction module, we analyze the appearance features of characters and design a temporal-spatial-related triplet loss to fine-tune the clustering. Extensive experiments on a manga book dataset with 109 volumes validate the superiority of our method in unsupervised manga character re-identification.
Airports play an important role in both military and civilian domains. Synthetic aperture radar (SAR) based airport detection has received increasing attention in recent years. However, due to the high cost of SAR imaging and the annotation process, there is no publicly available SAR dataset for airport detection. As a result, deep learning methods have not been fully exploited for airport detection tasks. To provide a benchmark for airport detection research in SAR images, this paper introduces a large-scale SAR Airport Dataset (SAD). In order to adequately reflect the demands of real-world applications, it contains 624 SAR images from Sentinel-1B and covers 104 airfield instances with different scales, orientations, and shapes. Experiments with multiple deep learning approaches on this dataset prove its effectiveness. We hope it will support the development of state-of-the-art airport area detection algorithms and other relevant tasks.