Credit scoring is a major application of machine learning in which financial institutions decide whether to approve or reject a loan application. For the sake of reliability, credit scoring models need to be both accurate and globally interpretable. Simple classifiers, e.g., Logistic Regression (LR), are white-box models, but they are not powerful enough to model complex nonlinear interactions among features. Automatic feature crossing is a promising way to find cross features that make simple classifiers more accurate without heavy handcrafted feature engineering. However, credit scoring usually draws on many different aspects of users, and the data usually contains hundreds of feature fields, which makes existing automatic feature crossing methods inefficient for credit scoring. In this work, we find that the local piece-wise interpretations of a specific feature in Deep Neural Networks (DNNs) are usually inconsistent across samples, which is caused by feature interactions in the hidden layers. Accordingly, we can design an automatic feature crossing method that finds feature interactions in a DNN and uses them as cross features in LR. We give a definition of interpretation inconsistency in DNNs, based on which we propose a novel feature crossing method for credit scoring called DNN2LR. Clearly, the final model generated by DNN2LR, i.e., an LR model empowered with cross features, is a white-box model. Extensive experiments have been conducted on both public datasets and business datasets from real-world credit scoring applications. Experimental results show that DNN2LR can outperform the DNN model as well as several feature crossing methods. Moreover, compared with the state-of-the-art feature crossing method, i.e., AutoCross, DNN2LR accelerates feature crossing by about 10 to 40 times on datasets with large numbers of feature fields.
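To illustrate why a cross feature can make LR more accurate, the following minimal sketch trains a plain logistic regression on an XOR-style toy target, once on the raw features and once with the product of the two features appended. The toy data, the helper names (train_lr, accuracy), and the hyperparameters are assumptions chosen for illustration only; the sketch does not reproduce DNN2LR or its way of selecting cross features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(rows, labels, epochs=2000, lr=0.5):
    """Batch gradient descent for logistic regression; returns (weights, bias)."""
    n_features = len(rows[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b -= lr * grad_b / len(rows)
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad_w)]
    return w, b

def accuracy(w, b, rows, labels):
    """Fraction of rows whose thresholded prediction matches the label."""
    correct = 0
    for x, y in zip(rows, labels):
        pred = 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0
        correct += int(pred == y)
    return correct / len(labels)

# Hypothetical toy data: the label depends only on the interaction of the two features.
raw = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]

# Append the cross feature x0 * x1 as a third column.
crossed = [x + [x[0] * x[1]] for x in raw]

w_raw, b_raw = train_lr(raw, labels)
w_cross, b_cross = train_lr(crossed, labels)

acc_raw = accuracy(w_raw, b_raw, raw, labels)
acc_cross = accuracy(w_cross, b_cross, crossed, labels)
print(acc_raw, acc_cross)
```

No linear decision boundary separates the raw XOR points, whereas the crossed representation is linearly separable, and the model stays a white box: the learned weight on the cross feature can be read off directly.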
For the sake of reliability, models in real-world applications need to be both powerful and globally interpretable. Simple linear classifiers, e.g., Logistic Regression (LR), are globally interpretable, but they are not powerful enough to model complex nonlinear interactions among features in tabular data. Meanwhile, Deep Neural Networks (DNNs) have shown great effectiveness in modeling tabular data, but they are not globally interpretable. Accordingly, a promising direction is a feature crossing method that finds feature interactions in a DNN and uses them as cross features in LR. The local piece-wise interpretations of a specific feature in a DNN are usually inconsistent across samples, which is caused by feature interactions in the hidden layers. Inspired by this, we give a definition of interpretation inconsistency in DNNs and accordingly propose a novel feature crossing method called DNN2LR. Extensive experiments have been conducted on five public datasets and two real-world datasets. The final model generated by DNN2LR, an LR model empowered with cross features, can outperform the complex DNN model as well as several state-of-the-art feature crossing methods. The experimental results strongly verify the effectiveness and efficiency of DNN2LR, especially on real-world datasets with large numbers of feature fields.
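The observation that local piece-wise interpretations of one feature differ across samples can be seen in a toy one-hidden-layer ReLU network. In the sketch below, the network weights, the two samples, and the helper name local_weights are all hypothetical, and the absolute gap computed at the end is only an illustrative measure, not the paper's definition of interpretation inconsistency.

```python
def local_weights(x, W, b, v):
    """Per-feature weights of the linear piece that a one-hidden-layer ReLU net
    f(x) = sum_k v[k] * relu(W[k] . x + b[k]) follows around the point x."""
    weights = [0.0] * len(x)
    for W_k, b_k, v_k in zip(W, b, v):
        pre_activation = b_k + sum(wi * xi for wi, xi in zip(W_k, x))
        if pre_activation > 0:  # only active units contribute to the local slope
            for j, w_kj in enumerate(W_k):
                weights[j] += v_k * w_kj
    return weights

# Hypothetical network: unit 0 looks at x1 alone, unit 1 mixes x1 and x2.
W = [[1.0, 0.0], [1.0, 1.0]]
b = [0.0, -1.5]
v = [1.0, 2.0]

sample_a = [1.0, 0.0]  # unit 1 is inactive here
sample_b = [1.0, 1.0]  # same x1, but x2 switches unit 1 on

w_a = local_weights(sample_a, W, b, v)
w_b = local_weights(sample_b, W, b, v)

# Gap between the two local weights of x1 (feature index 0).
gap_x1 = abs(w_a[0] - w_b[0])
print(w_a, w_b, gap_x1)
```

Both samples share the same value of x1, yet its local weight changes because x2 alters which hidden unit is active. That dependence on another feature is the kind of interaction a cross feature of x1 and x2 would expose to a linear model.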
In recent years, substantial progress has been made on Graph Convolutional Networks (GCNs). However, GCN computation usually requires a large amount of memory to hold the entire graph. Consequently, GCN is not flexible enough, especially for large-scale graphs in complex real-world applications. In contrast, methods based on Matrix Factorization (MF) naturally support constructing mini-batches and are thus friendlier to distributed computing than GCN. Accordingly, in this paper, we analyze the connections between GCN and MF, and simplify GCN as matrix factorization with unitization and co-training. Furthermore, guided by our analysis, we propose an alternative model to GCN named Unitized and Co-training Matrix Factorization (UCMF). Extensive experiments have been conducted on several real-world datasets. On the task of semi-supervised node classification, the experimental results show that UCMF achieves similar or superior performance compared with GCN. Meanwhile, distributed UCMF significantly outperforms distributed GCN methods, which shows that UCMF can greatly benefit large-scale and complex real-world applications. Moreover, we have also conducted experiments on a typical graph embedding task, i.e., community detection, where the proposed UCMF model outperforms several representative graph embedding models.
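The claim that MF-style methods naturally support mini-batches can be sketched as follows: a batch is just a handful of edges, and an update reads and writes only the embeddings of the nodes on those edges. The objective used here (raising log sigmoid of the dot product for linked nodes), the function name minibatch_step, and all sizes are assumptions for illustration; the sketch omits negative sampling, unitization, and co-training, so it is not UCMF.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_step(emb, edge_batch, lr=0.1):
    """One gradient-ascent step on log sigmoid(u_i . u_j) for each edge (i, j).

    Only the embedding rows of nodes that appear in edge_batch are touched,
    so a worker never needs the whole graph in memory.
    """
    for i, j in edge_batch:
        score = sum(a * c for a, c in zip(emb[i], emb[j]))
        coeff = lr * (1.0 - sigmoid(score))  # d/ds log sigmoid(s) = 1 - sigmoid(s)
        new_i = [a + coeff * c for a, c in zip(emb[i], emb[j])]
        new_j = [c + coeff * a for a, c in zip(emb[i], emb[j])]
        emb[i], emb[j] = new_i, new_j

random.seed(0)
num_nodes, dim = 4, 3  # hypothetical sizes
emb = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_nodes)]
before = [row[:] for row in emb]

# A mini-batch containing a single edge between nodes 0 and 1.
minibatch_step(emb, edge_batch=[(0, 1)])

touched = [n for n in range(num_nodes) if emb[n] != before[n]]
print(touched)
```

By contrast, a graph convolution layer aggregates over each node's neighborhood, which is why the whole graph is typically kept available during training.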
When dealing with continuous numeric features, we usually adopt feature discretization. In this work, to find the best way to conduct feature discretization, we present a theoretical analysis that focuses on the correctness and robustness of feature discretization. We then propose a novel discretization method called Local Linear Encoding (LLE). Experiments on two numeric datasets show that LLE can outperform conventional discretization methods with far fewer model parameters.
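For reference, conventional feature discretization can be sketched as equal-width binning followed by one-hot encoding. The function names, the value range, the bin count, and the sample values below are hypothetical; this shows only a conventional baseline and not the proposed LLE method.

```python
def equal_width_bin(value, low, high, num_bins):
    """Index of the equal-width bin that contains value, clamped to the valid range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / num_bins
    index = int((value - low) // width)
    return max(0, min(num_bins - 1, index))

def one_hot(index, size):
    """One-hot vector of length size with a 1 at position index."""
    return [1 if k == index else 0 for k in range(size)]

# Hypothetical numeric feature on the range [0, 100), split into 4 bins of width 25.
values = [18.0, 35.0, 52.0, 90.0]
encoded = [one_hot(equal_width_bin(v, 0.0, 100.0, 4), 4) for v in values]
print(encoded)
```

Each bin then receives its own weight in a downstream linear model, so the number of parameters grows with the number of bins.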