We propose a novel ranking model that combines the Bradley-Terry-Luce probability model with a nonnegative matrix factorization framework to uncover latent variables that influence the performance of top tennis players. We derive an efficient, provably convergent, and numerically stable majorization-minimization-based algorithm to maximize the likelihood of datasets under the proposed statistical model. The model is tested on datasets involving the outcomes of matches between 20 top male and female tennis players over 14 major tournaments for men (including the Grand Slams and the ATP Masters 1000) and 16 major tournaments for women over the past 10 years. Our model automatically infers that the surface of the court (e.g., clay or hard court) is a key determinant of the performance of male players, but less so for female players. Top players on various surfaces over this longitudinal period are also identified in an objective manner.
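As a concrete illustration of the majorization-minimization (MM) machinery that underlies this model, the sketch below implements the classical MM update for the plain Bradley-Terry model (Hunter's algorithm) on invented toy data; it is the unfactored baseline, not the paper's latent-variable BTL+NMF model. Each update maximizes a minorizing surrogate of the log-likelihood, which is the same mechanism behind the provable convergence claimed above.

```python
import numpy as np

def bradley_terry_mm(wins, n_iters=500, tol=1e-10):
    """Hunter's MM updates for plain Bradley-Terry skill estimation.
    wins[i, j] = number of times player i beat player j.
    This is the classical baseline, not the paper's BTL+NMF model."""
    games = wins + wins.T                      # matches played per pair
    w = np.ones(wins.shape[0])                 # initial skills
    for _ in range(n_iters):
        denom = games / (w[:, None] + w[None, :])
        w_new = wins.sum(axis=1) / denom.sum(axis=1)
        w_new /= w_new.sum()                   # fix the scale ambiguity
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# toy head-to-head record: player 0 dominates
wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]], dtype=float)
print(bradley_terry_mm(wins))
```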
Nonnegative matrix factorization (NMF) is a linear dimensionality reduction technique for analyzing nonnegative data. A key aspect of NMF is the choice of the objective function, which depends on the noise model (or statistics of the noise) assumed on the data. In many applications, the noise model is unknown and difficult to estimate. In this paper, we define a multi-objective NMF (MO-NMF) problem, where several objectives are combined within the same NMF model. We propose to use Lagrange duality to judiciously optimize for a set of weights to be used within the framework of the weighted-sum approach, that is, we minimize a single objective function which is a weighted sum of all the objective functions. We design a simple algorithm using multiplicative updates to minimize this weighted sum. We show how this can be used to find distributionally robust NMF (DR-NMF) solutions, that is, solutions that minimize the largest error among all objectives. We illustrate the effectiveness of this approach on synthetic, document, and audio datasets. The results show that DR-NMF is robust to our incognizance of the noise model of the NMF problem.
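To make the weighted-sum idea concrete, here is a minimal sketch of multiplicative updates for a fixed convex combination of the Frobenius and KL objectives, using the standard split-gradient heuristic. The weights `lam` are hard-coded assumptions for the demo, whereas the paper selects them via Lagrange duality.

```python
import numpy as np

def monmf_mu(V, r, lam=(0.5, 0.5), n_iters=300, eps=1e-9):
    """Multiplicative updates for lam[0]*Frobenius + lam[1]*KL.
    Each factor is scaled by (negative part)/(positive part) of the
    gradient of the weighted-sum objective (split-gradient heuristic)."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    l_f, l_k = lam
    for _ in range(n_iters):
        WH = W @ H + eps
        pos = l_f * (W.T @ WH) + l_k * W.sum(axis=0)[:, None]
        neg = l_f * (W.T @ V) + l_k * (W.T @ (V / WH))
        H *= neg / (pos + eps)
        WH = W @ H + eps
        pos = l_f * (WH @ H.T) + l_k * H.sum(axis=1)[None, :]
        neg = l_f * (V @ H.T) + l_k * ((V / WH) @ H.T)
        W *= neg / (pos + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((30, 40)))
W, H = monmf_mu(V, r=5)
```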
We design and analyze TS-Cascade, a Thompson sampling algorithm for the cascading bandit problem. In TS-Cascade, Bayesian estimates of the click probability are constructed using a univariate Gaussian; this leads to a more efficient exploration procedure vis-\`a-vis existing UCB-based approaches. We also incorporate the empirical variance of each item's click probability into the Bayesian updates. These two novel features allow us to prove an expected regret bound of the form $\tilde{O}(\sqrt{KLT})$, where $L$ and $K$ are the number of ground items and the number of items in the chosen list, respectively, and $T\ge L$ is the number of Thompson sampling update steps. This matches the state-of-the-art regret bounds for UCB-based algorithms. More importantly, it is the first theoretical guarantee on a Thompson sampling algorithm for any stochastic combinatorial bandit problem with partial feedback. Empirical experiments demonstrate the superiority of TS-Cascade compared to existing UCB-based procedures in terms of both the expected cumulative regret and the time complexity.
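A simplified simulation of the Gaussian Thompson sampling loop may help fix ideas. The exploration width below (the log inflation and the variance floor) is only a crude stand-in for the paper's empirical-variance term, and the click probabilities are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
L, K, T = 20, 4, 5000
true_p = rng.uniform(0.01, 0.2, size=L)   # hypothetical click probabilities

mu = np.zeros(L)    # empirical mean click rate of each ground item
n = np.zeros(L)     # number of times each item has been observed

for t in range(1, T + 1):
    # one shared standard Gaussian per round, scaled per item;
    # this scaling only approximates the paper's exploration term
    sigma = np.sqrt(np.maximum(mu * (1 - mu), 1e-3) / np.maximum(n, 1.0))
    theta = mu + rng.standard_normal() * np.sqrt(np.log(t + 1.0)) * sigma
    chosen = np.argsort(-theta)[:K]       # list the K largest samples
    for item in chosen:                   # cascade feedback: scan until click
        click = rng.random() < true_p[item]
        n[item] += 1
        mu[item] += (click - mu[item]) / n[item]
        if click:
            break                         # later positions are unobserved
```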
Motivated by real-world machine learning applications, we analyze approximations to the non-asymptotic fundamental limits of statistical classification. In the binary version of this problem, given two training sequences generated according to two {\em unknown} distributions $P_1$ and $P_2$, one is tasked to classify a test sequence which is known to be generated according to either $P_1$ or $P_2$. This problem can be thought of as an analogue of the binary hypothesis testing problem, but in the present setting the generating distributions are unknown. Due to finite-sample considerations, we consider the second-order asymptotic (or dispersion-type) tradeoff between type-I and type-II error probabilities for tests which ensure that (i) the type-I error probability for {\em all} pairs of distributions decays exponentially fast and (ii) the type-II error probability for a {\em particular} pair of distributions is non-vanishing. We generalize our results to the classification of multiple hypotheses with a rejection option.
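For intuition, the toy classifier below compares the empirical distribution (type) of the test sequence with those of the two training sequences and includes a rejection option. It is a simple stand-in for illustration, not the asymptotically optimal test analyzed in the paper, and the threshold is an arbitrary assumption.

```python
import numpy as np

def empirical(seq, alphabet_size):
    """Empirical distribution (type) of a discrete sequence."""
    return np.bincount(seq, minlength=alphabet_size) / len(seq)

def classify(train1, train2, test, alphabet_size, reject_threshold=0.5):
    """Toy type-based classifier with a rejection option: assign the
    test sequence to the class whose training type is closest in total
    variation; reject if neither type is close enough."""
    q = empirical(test, alphabet_size)
    d1 = 0.5 * np.abs(empirical(train1, alphabet_size) - q).sum()
    d2 = 0.5 * np.abs(empirical(train2, alphabet_size) - q).sum()
    if min(d1, d2) > reject_threshold:
        return "reject"
    return 1 if d1 <= d2 else 2

rng = np.random.default_rng(0)
P1, P2 = np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.3, 0.5])
train1 = rng.choice(3, size=500, p=P1)
train2 = rng.choice(3, size=500, p=P2)
test = rng.choice(3, size=200, p=P1)
print(classify(train1, train2, test, alphabet_size=3))  # expect 1
```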
We consider the problem of aggregating pairwise comparisons to obtain a consensus ranking order over a collection of objects. We use the popular Bradley-Terry-Luce (BTL) model, which allows us to probabilistically describe pairwise comparisons between objects. In particular, we employ the Bayesian BTL model, which allows us to incorporate meaningful prior assumptions and to cope with situations where the number of objects is large and the number of comparisons between some objects is small or even zero. For the conventional Bayesian BTL model, we derive information-theoretic lower bounds on the Bayes risk of estimators for norm-based distortion functions. We compare the information-theoretic lower bound with the Bayesian Cram\'{e}r-Rao lower bound we derive for the case when the Bayes risk is the mean squared error. We illustrate the utility of the bounds through simulations by comparing them with the error performance of an expectation-maximization based inference algorithm proposed for the Bayesian BTL model. We draw parallels between pairwise comparisons in the BTL model and inter-player games represented as edges in a comparison graph, and we analyze the effect of various graph structures on the lower bounds. We also extend the information-theoretic and Bayesian Cram\'{e}r-Rao lower bounds to the more general Bayesian BTL model which takes into account home-field advantage.
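The generative side of the Bayesian BTL model is easy to simulate. The sketch below draws skills from one possible prior, generates games only along the edges of a sparse comparison graph, and includes a home-field-advantage parameter `theta` in the spirit of the generalized model; the prior, the graph, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, n_games, theta = 8, 30, 1.2   # theta > 1: home-field advantage

# one possible prior: i.i.d. exponential skills (the BTL model only
# needs nonnegative skill parameters lambda_i)
skills = rng.exponential(1.0, size=n_players)

# comparison graph: a cycle, so most pairs are never compared directly
edges = [(i, (i + 1) % n_players) for i in range(n_players)]

wins = np.zeros((n_players, n_players))
for i, j in edges:                        # player i plays at home
    p_home = theta * skills[i] / (theta * skills[i] + skills[j])
    w = rng.binomial(n_games, p_home)
    wins[i, j] += w
    wins[j, i] += n_games - w
```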
We revisit the stochastic limited-memory BFGS (L-BFGS) algorithm. By proposing a new framework for the convergence analysis, we prove improved convergence rates and computational complexities of the stochastic L-BFGS algorithms compared to previous works. In addition, we propose several practical acceleration strategies to speed up the empirical performance of such algorithms. We also provide theoretical analyses for most of the strategies. Experiments on large-scale logistic and ridge regression problems demonstrate that our proposed strategies yield significant improvements vis-\`a-vis competing state-of-the-art algorithms.
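For reference, the core of any L-BFGS variant is the two-loop recursion that applies the inverse-Hessian approximation to a gradient; in the stochastic setting, the gradient and the curvature pairs $(s_k, y_k)$ are computed from mini-batches. The sketch below is the standard recursion, not the paper's specific accelerated variants.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: returns H_k^{-1} grad, where
    H_k is the BFGS Hessian approximation built from the stored pairs
    s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k (newest last)."""
    q = grad.astype(float).copy()
    cache = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        cache.append((s, y, rho, a))
        q -= a * y
    if y_list:                            # gamma_k * I initial scaling
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for s, y, rho, a in reversed(cache):
        b = rho * (y @ q)
        q += (a - b) * s
    return q                              # search direction is -q

# toy check on a quadratic f(x) = 0.5 * x^T A x
A = np.diag([1.0, 10.0])
xs = [np.array([1.0, 1.0]), np.array([0.8, 0.5]), np.array([0.6, 0.2])]
gs = [A @ x for x in xs]
s_list = [xs[i + 1] - xs[i] for i in range(2)]
y_list = [gs[i + 1] - gs[i] for i in range(2)]
print(lbfgs_direction(gs[-1], s_list, y_list))
```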
The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we would often like to find the correct target clustering of the samples according to which component distribution they were generated from. For a clustering problem, practitioners often choose to use the simple k-means algorithm, which attempts to find a clustering that minimizes the sum of squared distances between each point and its cluster center. In this paper, we provide sufficient conditions for the closeness of any optimal clustering to the correct target clustering, assuming that the data samples are generated from a mixture of log-concave distributions. Moreover, we show that under similar or even weaker conditions on the mixture model, any optimal clustering for the samples with reduced dimensionality is also close to the correct target clustering. These results provide intuition for the informativeness of k-means (with and without dimensionality reduction) as an algorithm for learning mixture models. We verify the correctness of our theorems using numerical experiments and demonstrate that using datasets with reduced dimensionality yields significant speed-ups in the time required to perform clustering.
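The experiment suggested by these results is easy to reproduce in miniature: draw from a mixture of spherical Gaussians (one family of log-concave distributions), run k-means on the raw data and on a low-dimensional PCA projection, and compare both clusterings with the target. The dimensions and separations below are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
k, d, n = 3, 100, 3000

# mixture of spherical Gaussians with well-separated means
means = rng.normal(scale=8.0, size=(k, d))
target = rng.integers(k, size=n)
X = means[target] + rng.normal(size=(n, d))

km_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
X_red = PCA(n_components=k).fit_transform(X)   # dimensionality reduction
km_red = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_red)

print(adjusted_rand_score(target, km_full.labels_))  # agreement, full data
print(adjusted_rand_score(target, km_red.labels_))   # agreement, reduced
```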
The multiplicative update (MU) algorithm has been extensively used to estimate the basis and coefficient matrices in nonnegative matrix factorization (NMF) problems under a wide range of divergences and regularizers. However, theoretical convergence guarantees have only been derived for a few special divergences without regularization. In this work, we provide a conceptually simple, self-contained, and unified proof of the convergence of the MU algorithm applied to NMF with a wide range of divergences and regularizers. Our main result shows that the sequence of iterates (i.e., pairs of basis and coefficient matrices) produced by the MU algorithm converges to the set of stationary points of the non-convex NMF optimization problem. Our proof strategy has the potential to open up new avenues for analyzing similar problems in machine learning and signal processing.
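As one instance of the divergence/regularizer family covered by such an analysis, here are the multiplicative updates for the Frobenius objective with an $\ell_1$ penalty on the coefficients; the specific penalty weight and iteration count are illustrative.

```python
import numpy as np

def mu_nmf_l1(V, r, lam=0.1, n_iters=500, eps=1e-9):
    """MU for 0.5*||V - WH||_F^2 + lam*||H||_1 with W, H >= 0.
    The l1 term simply adds lam to the denominator of the H update."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```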
We propose a geometric assumption on nonnegative data matrices such that, under this assumption, we are able to provide upper bounds (both deterministic and probabilistic) on the relative error of nonnegative matrix factorization (NMF). The algorithm we propose first uses the geometric assumption to obtain an exact clustering of the columns of the data matrix; subsequently, it employs several rank-one NMFs to obtain the final decomposition. When applied to data matrices generated from our statistical model, we observe that our proposed algorithm produces factor matrices with relative errors comparable to those of classical NMF algorithms, but at much faster speeds. On face image and hyperspectral imaging datasets, we demonstrate that our algorithm provides an excellent initialization for other NMF algorithms at a low computational cost. Finally, we show on face and text datasets that combinations of our algorithm with several classical NMF algorithms outperform other algorithms in terms of clustering performance.
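A minimal sketch of the cluster-then-rank-one strategy: group the (normalized) columns, then fit each group with its leading singular pair, which can be taken entrywise nonnegative for a nonnegative block. k-means is used below as a stand-in for the clustering step; the paper's step instead exploits the geometric assumption to cluster exactly.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_rank1_nmf(V, r):
    """Cluster the columns of nonnegative V into r groups, then fit a
    rank-one NMF per group via the leading singular pair (nonnegative
    up to sign for a nonnegative block, by Perron-Frobenius)."""
    m, n = V.shape
    norms = np.linalg.norm(V, axis=0) + 1e-12    # cluster directions
    labels = KMeans(n_clusters=r, n_init=10,
                    random_state=0).fit((V / norms).T).labels_
    W = np.zeros((m, r))
    H = np.zeros((r, n))
    for c in range(r):
        cols = np.where(labels == c)[0]
        U, S, Vt = np.linalg.svd(V[:, cols], full_matrices=False)
        W[:, c] = np.abs(U[:, 0])                # fix the sign convention
        H[c, cols] = S[0] * np.abs(Vt[0])
    return W, H

V = np.abs(np.random.default_rng(2).random((50, 80)))
W, H = cluster_rank1_nmf(V, r=4)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative error
```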
We propose a unified and systematic framework for performing online nonnegative matrix factorization in the presence of outliers. Our framework is particularly suited to large-scale data. We propose two solvers based on projected gradient descent and the alternating direction method of multipliers. We prove that the sequence of objective values converges almost surely by appealing to the quasi-martingale convergence theorem. We also show that the sequence of learned dictionaries converges almost surely to the set of stationary points of the expected loss function. In addition, we extend our basic problem formulation to various settings with different constraints and regularizers, and we adapt the solvers and analyses to each setting. We perform extensive experiments on both synthetic and real datasets, which demonstrate the computational efficiency and efficacy of our algorithms on tasks such as (parts-based) basis learning, image denoising, shadow removal, and foreground-background separation.
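One way to picture the framework: each streamed sample v is fit as v ~ Wh + e with nonnegative coefficients h and a sparse outlier vector e, after which the dictionary W takes a small projected-gradient step. The sketch below uses plain projected gradient throughout, with step sizes and penalties chosen arbitrarily; the paper's solvers (including the ADMM variant) are more refined and come with the almost-sure convergence guarantees stated above.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def online_nmf_outliers(stream, r, lam=0.5, step=0.05, inner=30):
    """Online NMF with per-sample sparse outliers: each sample v is fit
    as v ~ W h + e with h >= 0 and e sparse (l1-penalized), by a few
    alternating projected-gradient / soft-thresholding steps; W then
    takes one projected-gradient step and is renormalized."""
    m = stream[0].shape[0]
    W = np.abs(np.random.default_rng(0).random((m, r)))
    for v in stream:
        h, e = np.zeros(r), np.zeros(m)
        for _ in range(inner):
            resid = v - W @ h - e
            h = np.maximum(h + step * (W.T @ resid), 0.0)
            e = soft_threshold(v - W @ h, lam)   # exact prox for the l1 term
        resid = v - W @ h - e
        W = np.maximum(W + step * np.outer(resid, h), 0.0)
        W /= np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1.0)
    return W

rng = np.random.default_rng(3)
stream = [np.abs(rng.random(40)) for _ in range(500)]
W = online_nmf_outliers(stream, r=5)
```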