Abstract: This paper introduces OGBoost, a scikit-learn-compatible Python package for ordinal regression using gradient boosting. Ordinal variables (e.g., rating scales, quality assessments) lie between nominal and continuous data, necessitating specialized methods that reflect their inherent ordering. Built on a coordinate-descent approach to optimization and the latent-variable framework for ordinal regression, OGBoost jointly optimizes a latent continuous regression function (via functional gradient descent) and a threshold vector that converts the latent continuous value into discrete class probabilities (via classical gradient descent). In addition to the standard methods for scikit-learn classifiers, the GradientBoostingOrdinal class implements a "decision_function" that returns the (scalar) value of the latent function for each observation, which can be used as a high-resolution alternative to class labels for comparing and ranking observations. The class can optionally use cross-validation for early stopping rather than a single holdout validation set, a more robust approach for small and/or imbalanced datasets. Furthermore, users can select base learners with different underlying algorithms and/or hyperparameters for use throughout the boosting iterations, resulting in a "heterogeneous" ensemble approach that can serve as a more efficient alternative to hyperparameter tuning (e.g., via grid search). We illustrate the capabilities of OGBoost through examples using the wine quality dataset from the UCI repository. The package is available on PyPI and can be installed via "pip install ogboost".
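A minimal usage sketch follows. Only the class name GradientBoostingOrdinal, the decision_function method, scikit-learn compatibility, the wine quality dataset, and "pip install ogboost" are stated in the abstract; the import path, dataset URL, and use of default constructor arguments are assumptions made for illustration.

# Minimal sketch (assumptions noted above; not verbatim from the paper).
import pandas as pd
from sklearn.model_selection import train_test_split

from ogboost import GradientBoostingOrdinal  # assumed import path

# Red-wine subset of the UCI wine quality dataset (semicolon-separated CSV).
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine = pd.read_csv(url, sep=";")
X = wine.drop(columns="quality").values
y = wine["quality"].values  # ordinal labels (quality scores)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = GradientBoostingOrdinal()            # default hyperparameters (assumed)
clf.fit(X_train, y_train)

labels = clf.predict(X_test)               # discrete class labels
proba = clf.predict_proba(X_test)          # per-class probabilities
scores = clf.decision_function(X_test)     # latent scores for ranking observations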
Abstract: The computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big-data analytics, necessitate both the identification of opportunities to parallelize techniques such as MCMC sampling and the development of general strategies for mapping such parallel algorithms to modern CPUs, in order to push performance toward the compute-bound and/or memory-bound hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete nodes, concurrent sampling of conditionally independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedups, while a series of high-level source-code and runtime modifications provide a further performance boost by reducing parallelization overhead and increasing data locality on NUMA architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and that comes within striking distance of the memory-bandwidth-induced hardware limit. The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (and are applicable to many-core SIMD processors such as Intel Xeon Phi and GPUs), resulting in cost-effectiveness, energy efficiency, and higher speed on multi-core x86 processors.
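The sketch below is a conceptual NumPy illustration of the data-parallel structure described in the abstract, not the paper's multi-threaded, vectorized C/MKL implementation: for a Bayesian GLM, each observation's (child-node's) contribution to the log conditional posterior of the coefficient vector, and to its gradient, depends only on that observation, so all contributions can be evaluated in one SIMD-friendly pass. The choice of a logistic GLM and a Gaussian prior with scale tau is an assumption for illustration.

# Conceptual sketch (assumptions noted above).
import numpy as np

def logistic_log_posterior_and_grad(beta, X, y, tau=1.0):
    """Log conditional posterior of beta (up to a constant) and its gradient,
    under a N(0, tau^2 I) prior; X is (n, p), y is 0/1 of length n."""
    eta = X @ beta                                    # linear predictors, all rows at once
    log_lik_terms = y * eta - np.log1p(np.exp(eta))   # per-observation contributions
    log_post = log_lik_terms.sum() - 0.5 * beta @ beta / tau**2
    mu = 1.0 / (1.0 + np.exp(-eta))                   # fitted probabilities
    grad = X.T @ (y - mu) - beta / tau**2             # gradient in one fused pass
    return log_post, grad

# The per-observation contributions are independent given beta, so BLAS or
# vectorized kernels (or explicit SIMD in compiled code) apply directly.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
beta = rng.normal(size=20)
y = (rng.random(10_000) < 1.0 / (1.0 + np.exp(-(X @ beta)))).astype(float)
lp, g = logistic_log_posterior_and_grad(beta, X, y)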