Word and sentence embeddings are useful feature representations in natural language processing. However, intrinsic evaluation for embeddings lags far behind, and there has been no significant update since the past decade. Word and sentence similarity tasks have become the de facto evaluation method. It leads models to overfit to such evaluations, negatively impacting embedding models' development. This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations. Further, we propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks. Extensive experiments are conducted based on 60+ models and popular datasets to certify our judgments. Finally, the practical evaluation toolkit is released for future benchmarking purposes.
An unsupervised point cloud object retrieval and pose estimation method, called PCRP, is proposed in this work. It is assumed that there exists a gallery point cloud set that contains point cloud objects with given pose orientation information. PCRP attempts to register the unknown point cloud object with those in the gallery set so as to achieve content-based object retrieval and pose estimation jointly, where the point cloud registration task is built upon an enhanced version of the unsupervised R-PointHop method. Experiments on the ModelNet40 dataset demonstrate the superior performance of PCRP in comparison with traditional and learning based methods.
Videos are developing in the trends of Ultra High Definition (UHD), High Frame Rate (HFR), High Dynamic Range (HDR), Wide Color Gammut (WCG) and high fidelity, which provide users with more realistic visual experiences. However, the amount of video data increases exponentially and requires high efficiency video compression for storage and network transmission. Perceptually optimized video coding aims to exploit visual redundancies in videos so as to maximize compression efficiency. In this paper, we present a systematic survey on the recent advances and challenges on perceptually optimized video coding. Firstly, we present problem formulation and framework of perceptually optimized video coding, which includes visual perception modelling, visual quality assessment and perception guided coding optimization. Secondly, the recent advances on visual factors, key computational visual models and quality assessment models are presented. Thirdly, we do systematic review on perceptual video coding optimizations from four key aspects, which includes perceptually optimized bit allocation, rate-distortion optimization, transform and quantization, filtering and enhancement. In each part, problem formulation, working flow, recent advances, advantages and challenges are presented. Fourthly, perceptual coding performance of latest coding standards and tools are experimentally analyzed. Finally, challenging issues and future opportunities on perceptual video coding are identified.
Entity type prediction is an important problem in knowledge graph (KG) research. A new KG entity type prediction method, named CORE (COmplex space Regression and Embedding), is proposed in this work. The proposed CORE method leverages the expressive power of two complex space embedding models; namely, RotatE and ComplEx models. It embeds entities and types in two different complex spaces using either RotatE or ComplEx. Then, we derive a complex regression model to link these two spaces. Finally, a mechanism to optimize embedding and regression parameters jointly is introduced. Experiments show that CORE outperforms benchmarking methods on representative KG entity type inference datasets. Strengths and weaknesses of various entity type prediction methods are analyzed.
Knowledge base completion is formulated as a binary classification problem in this work, where an XGBoost binary classifier is trained for each relation using relevant links in knowledge graphs (KGs). The new method, named KGBoost, adopts a modularized design and attempts to find hard negative samples so as to train a powerful classifier for missing link prediction. We conduct experiments on multiple benchmark datasets, and demonstrate that KGBoost outperforms state-of-the-art methods across most datasets. Furthermore, as compared with models trained by end-to-end optimization, KGBoost works well under the low-dimensional setting so as to allow a smaller model size.
Visual odometry aims to track the incremental motion of an object using the information captured by visual sensors. In this work, we study the point cloud odometry problem, where only the point cloud scans obtained by the LiDAR (Light Detection And Ranging) are used to estimate object's motion trajectory. A lightweight point cloud odometry solution is proposed and named the green point cloud odometry (GPCO) method. GPCO is an unsupervised learning method that predicts object motion by matching features of consecutive point cloud scans. It consists of three steps. First, a geometry-aware point sampling scheme is used to select discriminant points from the large point cloud. Second, the view is partitioned into four regions surrounding the object, and the PointHop++ method is used to extract point features. Third, point correspondences are established to estimate object motion between two consecutive scans. Experiments on the KITTI dataset are conducted to demonstrate the effectiveness of the GPCO method. It is observed that GPCO outperforms benchmarking deep learning methods in accuracy while it has a significantly smaller model size and less training time.
An unsupervised, lightweight and high-performance single object tracker, called UHP-SOT, was proposed by Zhou et al. recently. As an extension, we present an enhanced version and name it UHP-SOT++ in this work. Built upon the foundation of the discriminative-correlation-filters-based (DCF-based) tracker, two new ingredients are introduced in UHP-SOT and UHP-SOT++: 1) background motion modeling and 2) object box trajectory modeling. The main difference between UHP-SOT and UHP-SOT++ is the fusion strategy of proposals from three models (i.e., DCF, background motion and object box trajectory models). An improved fusion strategy is adopted by UHP-SOT++ for more robust tracking performance against large-scale tracking datasets. Our second contribution lies in an extensive evaluation of the performance of state-of-the-art supervised and unsupervised methods by testing them on four SOT benchmark datasets - OTB2015, TC128, UAV123 and LaSOT. Experiments show that UHP-SOT++ outperforms all previous unsupervised methods and several deep-learning (DL) methods in tracking accuracy. Since UHP-SOT++ has extremely small model size, high tracking performance, and low computational complexity (operating at a rate of 20 FPS on an i5 CPU even without code optimization), it is an ideal solution in real-time object tracking on resource-limited platforms. Based on the experimental results, we compare pros and cons of supervised and unsupervised trackers and provide a new perspective to understand the performance gap between supervised and unsupervised methods, which is the third contribution of this work.
A novel method for detecting CNN-generated images, called Attentive PixelHop (or A-PixelHop), is proposed in this work. It has three advantages: 1) low computational complexity and a small model size, 2) high detection performance against a wide range of generative models, and 3) mathematical transparency. A-PixelHop is designed under the assumption that it is difficult to synthesize high-quality, high-frequency components in local regions. It contains four building modules: 1) selecting edge/texture blocks that contain significant high-frequency components, 2) applying multiple filter banks to them to obtain rich sets of spatial-spectral responses as features, 3) feeding features to multiple binary classifiers to obtain a set of soft decisions, 4) developing an effective ensemble scheme to fuse the soft decisions into the final decision. Experimental results show that A-PixelHop outperforms state-of-the-art methods in detecting CycleGAN-generated images. Furthermore, it can generalize well to unseen generative models and datasets.
A neural network targeting at unsupervised image anomaly localization, called the PEDENet, is proposed in this work. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network, and an auxiliary network called the location prediction (LP) network. The PE network takes local image patches as input and performs dimension reduction to get low-dimensional patch embeddings via a deep encoder structure. Being inspired by the Gaussian Mixture Model (GMM), the DE network takes those patch embeddings and then predicts the cluster membership of an embedded patch. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a Multi-layer Perception (MLP), which takes embeddings from two neighboring patches as input and predicts their relative location. The performance of the proposed PEDENet is evaluated extensively and benchmarked with that of state-of-the-art methods.