



Abstract:Exploratory data analysis is crucial for developing and understanding classification models from high-dimensional datasets. We explore the utility of a new unsupervised tree ensemble called uncharted forest for visualizing class associations, sample-sample associations, class heterogeneity, and uninformative classes for provenance studies. The uncharted forest algorithm can be used to partition data using random selections of variables and metrics based on statistical spread. After each tree is grown, a tally of the samples that arrive at every terminal node is maintained. Those tallies are stored in single sample association matrix and a likelihood measure for each sample being partitioned with one another can be made. That matrix may be readily viewed as a heat map, and the probabilities can be quantified via new metrics that account for class or cluster membership. We display the advantages and limitations of using this technique by applying it to two classification datasets and three provenance study datasets. Two of the metrics presented in this paper are also compared with widely used metrics from two algorithms that have variance-based clustering mechanisms.




Abstract:The resolution and calibration of pure spectra of minority components in measurements of chemical mixtures without prior knowledge of the mixture is a challenging problem. In this work, a combination of band target entropy minimization (BTEM) and target partial least squares (T-PLS) was used to obtain estimates for single pure component spectra and to calibrate those estimates in a true, one-at-a-time fashion. This approach allows for minor components to be targeted and their relative amounts estimated in the presence of other varying components in spectral data. The use of T-PLS estimation is an improvement to the BTEM method because it overcomes the need to identify all of the pure components prior to estimation. Estimated amounts from this combination were found to be similar to those obtained from a standard method, multivariate curve resolution-alternating least squares (MCR-ALS), on a simple, three component mixture dataset. Studies from two experimental datasets demonstrate where the combination of BTEM and T-PLS could model the pure component spectra and obtain concentration profiles of minor components but MCR-ALS could not.




Abstract:Five simple soft sensor methodologies with two update conditions were compared on two experimentally-obtained datasets and one simulated dataset. The soft sensors investigated were moving window partial least squares regression (and a recursive variant), moving window random forest regression, the mean moving window of $y$, and a novel random forest partial least squares regression ensemble (RF-PLS), all of which can be used with small sample sizes so that they can be rapidly placed online. It was found that, on two of the datasets studied, small window sizes led to the lowest prediction errors for all of the moving window methods studied. On the majority of datasets studied, the RF-PLS calibration method offered the lowest one-step-ahead prediction errors compared to those of the other methods, and it demonstrated greater predictive stability at larger time delays than moving window PLS alone. It was found that both the random forest and RF-PLS methods most adequately modeled the datasets that did not feature purely monotonic increases in property values, but that both methods performed more poorly than moving window PLS models on one dataset with purely monotonic property values. Other data dependent findings are presented and discussed.