Graph-based recommendation models work well for top-N recommender systems due to their capability to capture the potential relationships between entities. However, most of the existing methods only construct a single global item graph shared by all the users and regrettably ignore the diverse tastes between different user groups. Inspired by the success of local models for recommendation, this paper provides the first attempt to investigate multiple local item graphs along with a global item graph for graph-based recommendation models. We argue that recommendation on global and local graphs outperforms that on a single global graph or multiple local graphs. Specifically, we propose a novel graph-based recommendation model named GLIMG (Global and Local IteM Graphs), which simultaneously captures both the global and local user tastes. By integrating the global and local graphs into an adapted semi-supervised learning model, users' preferences on items are propagated globally and locally. Extensive experimental results on real-world datasets show that our proposed method consistently outperforms the state-of-the art counterparts on the top-N recommendation task.
Accurate estimation of remaining useful life (RUL) of industrial equipment can enable advanced maintenance schedules, increase equipment availability and reduce operational costs. However, existing deep learning methods for RUL prediction are not completely successful due to the following two reasons. First, relying on a single objective function to estimate the RUL will limit the learned representations and thus affect the prediction accuracy. Second, while longer sequences are more informative for modelling the sensor dynamics of equipment, existing methods are less effective to deal with very long sequences, as they mainly focus on the latest information. To address these two problems, we develop a novel attention-based sequence to sequence with auxiliary task (ATS2S) model. In particular, our model jointly optimizes both reconstruction loss to empower our model with predictive capabilities (by predicting next input sequence given current input sequence) and RUL prediction loss to minimize the difference between the predicted RUL and actual RUL. Furthermore, to better handle longer sequence, we employ the attention mechanism to focus on all the important input information during training process. Finally, we propose a new dual-latent feature representation to integrate the encoder features and decoder hidden states, to capture rich semantic information in data. We conduct extensive experiments on four real datasets to evaluate the efficacy of the proposed method. Experimental results show that our proposed method can achieve superior performance over 13 state-of-the-art methods consistently.
This paper focuses on scalability and robustness of spectral clustering for extremely large-scale datasets with limited resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then utilized to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generation via multiple U-SEPC's, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to achieve the consensus clustering result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning ten-million-level nonlinearly-separable datasets on a PC with 64GB memory. Experiments on various large-scale datasets have demonstrated the scalability and robustness of our algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669.
Ensemble clustering has been a popular research topic in data mining and machine learning. Despite its significant progress in recent years, there are still two challenging issues in the current ensemble clustering research. First, most of the existing algorithms tend to investigate the ensemble information at the object-level, yet often lack the ability to explore the rich information at higher levels of granularity. Second, they mostly focus on the direct connections (e.g., direct intersection or pair-wise co-occurrence) in the multiple base clusterings, but generally neglect the multi-scale indirect relationship hidden in them. To address these two issues, this paper presents a novel ensemble clustering approach based on fast propagation of cluster-wise similarities via random walks. We first construct a cluster similarity graph with the base clusters treated as graph nodes and the cluster-wise Jaccard coefficient exploited to compute the initial edge weights. Upon the constructed graph, a transition probability matrix is defined, based on which the random walk process is conducted to propagate the graph structural information. Specifically, by investigating the propagating trajectories starting from different nodes, a new cluster-wise similarity matrix can be derived by considering the trajectory relationship. Then, the newly obtained cluster-wise similarity matrix is mapped from the cluster-level to the object-level to achieve an enhanced co-association (ECA) matrix, which is able to simultaneously capture the object-wise co-occurrence relationship as well as the multi-scale cluster-wise relationship in ensembles. Finally, two novel consensus functions are proposed to obtain the consensus clustering result. Extensive experiments on a variety of real-world datasets have demonstrated the effectiveness and efficiency of our approach.
The emergence of high-dimensional data in various areas has brought new challenges to the ensemble clustering research. To deal with the curse of dimensionality, considerable efforts in ensemble clustering have been made by incorporating various subspace-based techniques. Besides the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large number of diversified metrics, and furthermore, how to jointly exploit the multi-level diversity in the large number of metrics, subspaces, and clusters, in a unified framework. To tackle this problem, this paper proposes a novel multi-diversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel, which are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can thereby be constructed. Further, an entropy-based criterion is adopted to explore the cluster-wise diversity in ensembles, based on which the consensus function is therefore presented. Experimental results on twenty high-dimensional datasets have confirmed the superiority of our approach over the state-of-the-art.
Classification is one of the most popular and widely used supervised learning tasks, which categorizes objects into predefined classes based on known knowledge. Classification has been an important research topic in machine learning and data mining. Different classification methods have been proposed and applied to deal with various real-world problems. Unlike unsupervised learning such as clustering, a classifier is typically trained with labeled data before being used to make prediction, and usually achieves higher accuracy than unsupervised one. In this paper, we first define classification and then review several representative methods. After that, we study in details the application of classification to a critical problem in drug discovery, i.e., drug-target prediction, due to the challenges in predicting possible interactions between drugs and targets.