We demonstrate the first possibility of a sub-linear memory sketch for solving the approximate near-neighbor search problem. In particular, we develop an online sketching algorithm that can compress $N$ vectors into a tiny sketch consisting of small arrays of counters whose size scales as $O(N^{b}\log^2{N})$, where $b < 1$ depending on the stability of the near-neighbor search. This sketch is sufficient to identify the top-$v$ near-neighbors with high probability. To the best of our knowledge, this is the first near-neighbor search algorithm that breaks the linear memory ($O(N)$) barrier. We achieve sub-linear memory by combining advances in locality sensitive hashing (LSH) based estimation, especially the recently-published ACE algorithm, with compressed sensing and heavy hitter techniques. We provide strong theoretical guarantees; in particular, our analysis sheds new light on the memory-accuracy tradeoff in the near-neighbor search setting and the role of sparsity in compressed sensing, which could be of independent interest. We rigorously evaluate our framework, which we call RACE (Repeated ACE) data structures on a friend recommendation task on the Google plus graph with more than 100,000 high-dimensional vectors. RACE provides compression that is orders of magnitude better than the random projection based alternative, which is unsurprising given the theoretical advantage. We anticipate that RACE will enable both new theoretical perspectives on near-neighbor search and new methodologies for applications like high-speed data mining, internet-of-things (IoT), and beyond.

Extreme Multilabel Text Classification (XMTC) is a text classification problem in which, (i) the output space is extremely large, (ii) each data point may have multiple positive labels, and (iii) the data follows a strongly imbalanced distribution. With applications in recommendation systems and automatic tagging of web-scale documents, the research on XMTC has been focused on improving prediction accuracy and dealing with imbalanced data. However, the robustness of deep learning based XMTC models against adversarial examples has been largely underexplored. In this paper, we investigate the behaviour of XMTC models under adversarial attacks. To this end, first, we define adversarial attacks in multilabel text classification problems. We categorize attacking multilabel text classifiers as (a) positive-targeted, where the target positive label should fall out of top-k predicted labels, and (b) negative-targeted, where the target negative label should be among the top-k predicted labels. Then, by experiments on APLC-XLNet and AttentionXML, we show that XMTC models are highly vulnerable to positive-targeted attacks but more robust to negative-targeted ones. Furthermore, our experiments show that the success rate of positive-targeted adversarial attacks has an imbalanced distribution. More precisely, tail classes are highly vulnerable to adversarial attacks for which an attacker can generate adversarial samples with high similarity to the actual data-points. To overcome this problem, we explore the effect of rebalanced loss functions in XMTC where not only do they increase accuracy on tail classes, but they also improve the robustness of these classes against adversarial attacks. The code for our experiments is available at https://github.com/xmc-aalto/adv-xmtc

Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and model development, downstream improvements from applying CDCR have not been shown yet. The reason lies in the fact that every CDCR system released to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns on their generalizability --- a must-have for downstream applications where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To approach this issue, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Whilst being inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora whereas the neural system is hit-and-miss. Via model introspection, we find that the importance of event actions, event time, etc. for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to move beyond corpus-tailored CDCR systems in the future -- the most important being that evaluation on multiple CDCR corpora is strongly necessary. To facilitate future research, we release our dataset, annotation guidelines, and model implementation to the public.

We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes a context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. This model extends the blocking bandits model (Basu et al., NeurIPS19) to a contextual setting, and captures important scenarios such as recommendation systems or ad placement with diverse users, and processing diverse pool of jobs. This contextual setting, however, invalidates greedy solution techniques that are effective for its non-contextual counterpart. Assuming knowledge of the mean reward for each arm-context pair, we design a randomized LP-based algorithm which is $\alpha$-optimal in (large enough) $T$ time steps, where $\alpha = \tfrac{d_{\max}}{2d_{\max}-1}\left(1- \epsilon\right)$ for any $\epsilon >0$, and $d_{max}$ is the maximum delay of the arms. In the bandit setting, we show that a UCB based variant of the above online policy guarantees $\mathcal{O}\left(\log T\right)$ regret w.r.t. the $\alpha$-optimal strategy in $T$ time steps, which matches the $\Omega(\log(T))$ regret lower bound in this setting. Due to the time correlation caused by the blocking of arms, existing techniques for upper bounding regret fail. As a first, in the presence of such temporal correlations, we combine ideas from coupling of non-stationary Markov chains and opportunistic sub-sampling with suboptimality charging techniques from combinatorial bandits to prove our regret upper bounds.

Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, etc. However, existing learning methods for contextual bandit problems have one of two drawbacks: they either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute, but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments using text classification and document retrieval confirm the above by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.

We view the Information Bottleneck Principle (IBP: Tishby et al., 1999; Schwartz-Ziv and Tishby, 2017) and Predictive Information Bottleneck Principle (PIBP: Still et al., 2007; Alemi, 2019) as special cases of a family of general information bottleneck objectives (IBOs). Each IBO corresponds to a particular constrained optimization problem where the constraints apply to: (a) the mutual information between the training data and the learned model parameters or extracted representation of the data, and (b) the mutual information between the learned model parameters or extracted representation of the data and the test data (if any). The heuristics behind the IBP and PIBP are shown to yield different constraints in the corresponding constrained optimization problem formulations. We show how other heuristics lead to a new IBO, different from both the IBP and PIBP, and use the techniques from (Alemi, 2019) to derive and optimize a variational upper bound on the new IBO. We then apply the theory of general IBOs to resolve the seeming contradiction between, on the one hand, the recommendations of IBP and PIBP to maximize the mutual information between the model parameters and test data, and on the other, recent information-theoretic results (see Xu and Raginsky, 2017) suggesting that this mutual information should be minimized. The key insight is that the heuristics (and thus the constraints in the constrained optimization problems) of IBP and PIBP are not applicable to the scenario analyzed by (Xu and Raginsky, 2017) because the latter makes the additional assumption that the parameters of the trained model have been selected to minimize the empirical loss function. Aided by this insight, we formulate a new IBO that accounts for this property of the parameters of the trained model, and derive and optimize a variational bound on this IBO.

A Machine-Critical Application is a system that is fundamentally necessary to the success of specific and sensitive operations such as search and recovery, rescue, military, and emergency management actions. Recent advances in Machine Learning, Natural Language Processing, voice recognition, and speech processing technologies have naturally allowed the development and deployment of speech-based conversational interfaces to interact with various machine-critical applications. While these conversational interfaces have allowed users to give voice commands to carry out strategic and critical activities, their robustness to adversarial attacks remains uncertain and unclear. Indeed, Adversarial Artificial Intelligence (AI) which refers to a set of techniques that attempt to fool machine learning models with deceptive data, is a growing threat in the AI and machine learning research community, in particular for machine-critical applications. The most common reason of adversarial attacks is to cause a malfunction in a machine learning model. An adversarial attack might entail presenting a model with inaccurate or fabricated samples as it's training data, or introducing maliciously designed data to deceive an already trained model. While focusing on speech recognition for machine-critical applications, in this paper, we first review existing speech recognition techniques, then, we investigate the effectiveness of adversarial attacks and defenses against these systems, before outlining research challenges, defense recommendations, and future work. This paper is expected to serve researchers and practitioners as a reference to help them in understanding the challenges, position themselves and, ultimately, help them to improve existing models of speech recognition for mission-critical applications. Keywords: Mission-Critical Applications, Adversarial AI, Speech Recognition Systems.

Contextual multi-armed bandits have been studied for decades and adapted to various applications such as online advertising and personalized recommendation. To solve the exploitation-exploration tradeoff in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, linear contextual bandits have adopted ridge regression to estimate the reward function and combine it with TS or UCB strategies for exploration. However, this line of works explicitly assumes the reward is based on a linear function of arm vectors, which may not be true in real-world datasets. To overcome this challenge, a series of neural-based bandit algorithms have been proposed, where a neural network is assigned to learn the underlying reward function and TS or UCB are adapted for exploration. In this paper, we propose "EE-Net", a neural-based bandit approach with a novel exploration strategy. In addition to utilizing a neural network (Exploitation network) to learn the reward function, EE-Net adopts another neural network (Exploration network) to adaptively learn potential gains compared to currently estimated reward. Then, a decision-maker is constructed to combine the outputs from the Exploitation and Exploration networks. We prove that EE-Net achieves $\mathcal{O}(\sqrt{T\log T})$ regret, which is tighter than existing state-of-the-art neural bandit algorithms ($\mathcal{O}(\sqrt{T}\log T)$ for both UCB-based and TS-based). Through extensive experiments on four real-world datasets, we show that EE-Net outperforms existing linear and neural bandit approaches.

We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated text). We observe that natural text on the target side enjoys scaling, which manifests as successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets which were originally translated from target language to source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from source language to target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.

<<

417

418

419

420

421

422

423

424

425

426

427

428

429

>>