Lin Xiao

Pairwise Instance Relation Augmentation for Long-tailed Multi-label Text Classification

Nov 19, 2022
Lin Xiao, Pengyu Xu, Liping Jing, Xiangliang Zhang

Multi-label text classification (MLTC) is one of the key tasks in natural language processing. It aims to assign multiple target labels to one document. Due to the uneven popularity of labels, the number of documents per label follows a long-tailed distribution in most cases. It is much more challenging to learn classifiers for data-scarce tail labels than for data-rich head labels. The main reason is that head labels usually have sufficient information, e.g., a large intra-class diversity, while tail labels do not. In response, we propose a Pairwise Instance Relation Augmentation Network (PIRAN) that augments tail-label documents to balance tail labels and head labels. PIRAN consists of a relation collector and an instance generator. The former extracts pairwise relations between documents of head labels. Taking these relations as perturbations, the latter generates new document instances in a high-level feature space around the limited given tail-label instances. Meanwhile, two regularizers (diversity and consistency) are designed to constrain the generation process. The consistency regularizer encourages the variance of tail labels to be close to that of head labels, further balancing the whole dataset. The diversity regularizer ensures that the generated instances are diverse, avoiding redundant instances. Extensive experimental results on three benchmark datasets demonstrate that PIRAN consistently outperforms the SOTA methods and dramatically improves the performance on tail labels.
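
As a rough illustration of the augmentation idea described above (collect pairwise relations from head-label documents, then use them as perturbations around scarce tail-label instances), here is a minimal numpy sketch. The sampling scheme, scale factor, and function names are assumptions for illustration; the actual PIRAN architecture, regularizers, and training objective are not reproduced.

```python
import numpy as np

def collect_pairwise_relations(head_feats, num_relations=100, rng=None):
    """Sample difference vectors between pairs of head-label document features.

    head_feats: (n_head, d) array of document features for one head label.
    Returns a (num_relations, d) array of pairwise relation (perturbation) vectors.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    i = rng.integers(0, len(head_feats), size=num_relations)
    j = rng.integers(0, len(head_feats), size=num_relations)
    return head_feats[i] - head_feats[j]

def augment_tail_instances(tail_feats, relations, per_instance=5, scale=0.5, rng=None):
    """Generate synthetic tail-label features by perturbing the few real ones
    with relation vectors borrowed from head labels."""
    if rng is None:
        rng = np.random.default_rng(1)
    synthetic = []
    for x in tail_feats:
        idx = rng.integers(0, len(relations), size=per_instance)
        synthetic.append(x + scale * relations[idx])
    return np.vstack(synthetic)

# toy usage: 200 head-label documents, 5 tail-label documents, 64-dim features
head = np.random.randn(200, 64)
tail = np.random.randn(5, 64)
new_tail = augment_tail_instances(tail, collect_pairwise_relations(head))
print(new_tail.shape)  # (25, 64)
```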

Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies

Oct 04, 2022
Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, Lin Xiao

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as approximate versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.
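
Schematically, the compatible function approximation view mentioned in the abstract can be written as follows (notation is mine and details may differ from the paper): with a log-linear policy $\pi_\theta(a \mid s) \propto \exp(\theta^\top \phi(s,a))$, each Q-NPG iteration solves a regression in feature space and then takes a step,

$$w_k \in \arg\min_{w} \; \mathbb{E}_{(s,a) \sim d_k}\!\Big[\big(w^\top \phi(s,a) - Q^{\pi_{\theta_k}}(s,a)\big)^2\Big], \qquad \theta_{k+1} = \theta_k + \eta_k\, w_k,$$

with a non-adaptive, geometrically increasing step size such as $\eta_k = \eta_0\, \gamma^{-k}$, where $d_k$ is a state-action distribution induced by $\pi_{\theta_k}$ and $\gamma$ is the discount factor.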

Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games

Oct 04, 2022
Shicong Cen, Yuejie Chi, Simon S. Du, Lin Xiao

Multi-Agent Reinforcement Learning (MARL) -- where multiple agents learn to interact in a shared dynamic environment -- permeates a wide range of critical applications. While there has been substantial progress in understanding the global convergence of policy optimization methods in single-agent RL, the design and analysis of efficient policy optimization algorithms in the MARL setting present significant challenges, which, unfortunately, remain inadequately addressed by existing theory. In this paper, we focus on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games, and study equilibrium-finding algorithms in both the infinite-horizon discounted setting and the finite-horizon episodic setting. We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method and the value is updated on a slower timescale. We show that, in the full-information tabular setting, the proposed method achieves a finite-time last-iterate linear convergence to the quantal response equilibrium of the regularized problem, which translates to a sublinear last-iterate convergence to the Nash equilibrium by controlling the amount of regularization. Our convergence results improve upon the best known iteration complexities and lead to a better understanding of policy optimization in competitive Markov games.
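
For intuition about the optimistic multiplicative weights update at the core of the method, here is a toy, single-state sketch on a zero-sum matrix game. It omits the entropy regularization and the slower-timescale value update described in the abstract, and applies to a matrix game rather than a Markov game, so it is only an illustration of the OMWU idea, not the paper's algorithm.

```python
import numpy as np

def omwu_matrix_game(A, eta=0.05, iters=5000):
    """Optimistic multiplicative weights updates on the zero-sum matrix game
    min_x max_y x^T A y. Both players update symmetrically, using an
    'optimistic' gradient that extrapolates from the previous one."""
    m, n = A.shape
    x, y = np.array([0.9] + [0.1 / (m - 1)] * (m - 1)), np.ones(n) / n
    gx_prev, gy_prev = A @ y, A.T @ x
    for _ in range(iters):
        gx, gy = A @ y, A.T @ x                      # current payoff gradients
        x = x * np.exp(-eta * (2 * gx - gx_prev))    # minimizer: optimistic step
        y = y * np.exp(+eta * (2 * gy - gy_prev))    # maximizer: optimistic step
        x, y = x / x.sum(), y / y.sum()
        gx_prev, gy_prev = gx, gy
    return x, y

# matching pennies: the unique Nash equilibrium is uniform play for both players
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y = omwu_matrix_game(A)
print(np.round(x, 3), np.round(y, 3))
```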

Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

Jun 14, 2022
Aaron Defazio, Baoyu Zhou, Lin Xiao

The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum in the denominator is increasing, the method can only decrease step sizes over time, and requires a learning-rate scaling hyper-parameter to be carefully tuned. To overcome this restriction, we introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator, one that can both increase and decrease. We show that it obeys a similar convergence rate to AdaGrad and demonstrate its non-monotone adaptation capability with experiments.
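
To make the restriction concrete, here is a minimal diagonal AdaGrad step in numpy: the accumulator only grows, so the effective per-coordinate step size lr / sqrt(accum) can only shrink. GradaGrad replaces this accumulator with one that can also decrease; its exact form is not reproduced here.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal AdaGrad update. `accum` only ever grows, so the
    per-coordinate effective step size lr / sqrt(accum) only ever shrinks."""
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself
w, accum = np.ones(3), np.zeros(3)
for _ in range(100):
    w, accum = adagrad_step(w, w.copy(), accum)
print(w)  # slowly shrinking toward 0 as the step sizes decay
```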

BiT: Robustly Binarized Multi-distilled Transformer

May 25, 2022
Zechun Liu, Barlas Oguz, Aasish Pappu, Lin Xiao, Scott Yih, Meng Li, Raghuraman Krishnamoorthi, Yashar Mehdad

Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarization of the weights and activations of the network can significantly alleviate these issues; however, it is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches allow, for the first time, fully binarized transformer models at a practical level of accuracy, approaching a full-precision BERT baseline on the GLUE language understanding benchmark to within 5.9%.
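
For background on why binarization is an optimization challenge, the sketch below shows a generic sign binarizer with a learnable scale and a straight-through estimator, which is the standard trick for passing gradients through the non-differentiable sign function. This is only a generic illustration; BiT's elastic binary activation function has additional learned parameters and is not reproduced here.

```python
import torch

def binarize_ste(x, alpha):
    """Sign-binarize x to {-1, +1} with a straight-through estimator (identity
    gradient), then scale by a learnable alpha."""
    xb = x + (torch.sign(x) - x).detach()   # forward: sign(x); backward: identity
    return alpha * xb

x = torch.randn(4, requires_grad=True)
alpha = torch.tensor(0.8, requires_grad=True)
out = binarize_ste(x, alpha)
out.sum().backward()
print(x.grad, alpha.grad)   # x receives the straight-through gradient; alpha learns a scale
```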

On Continual Model Refinement in Out-of-Distribution Data Streams

May 04, 2022
Bill Yuchen Lin, Sida Wang, Xi Victoria Lin, Robin Jia, Lin Xiao, Xiang Ren, Wen-tau Yih

Real-world natural language processing (NLP) models need to be continually updated to fix the prediction errors in out-of-distribution (OOD) data streams while overcoming catastrophic forgetting. However, existing continual learning (CL) problem setups cannot cover such a realistic and complex scenario. In response to this, we propose a new CL problem formulation dubbed continual model refinement (CMR). Compared to prior CL settings, CMR is more practical and introduces unique challenges (boundary-agnostic and non-stationary distribution shift, diverse mixtures of multiple OOD data clusters, error-centric streams, etc.). We extend several existing CL approaches to the CMR setting and evaluate them extensively. For benchmarking and analysis, we propose a general sampling algorithm to obtain dynamic OOD data streams with controllable non-stationarity, as well as a suite of metrics measuring various aspects of online performance. Our experiments and detailed analysis reveal the promise and challenges of the CMR problem, supporting that studying CMR in dynamic OOD streams can benefit the longevity of deployed NLP models in production.

* Accepted to ACL 2022; Project website: https://cmr-nlp.github.io/ 
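
As a hypothetical illustration of what a dynamic OOD stream with controllable non-stationarity might look like, the toy sampler below draws examples from several data clusters whose mixture weights drift over time. The drift mechanism, parameter names, and cluster representation are assumptions for illustration, not the paper's sampling algorithm.

```python
import numpy as np

def sample_dynamic_stream(clusters, steps=1000, drift=0.05, rng=None):
    """Toy stream sampler: at each step draw one example from one of several OOD
    clusters, where the mixture weights perform a slow random walk so the
    stream's distribution is non-stationary. `drift` controls how fast it shifts."""
    if rng is None:
        rng = np.random.default_rng(0)
    logits = np.zeros(len(clusters))
    stream = []
    for _ in range(steps):
        logits += drift * rng.standard_normal(len(clusters))
        probs = np.exp(logits) / np.exp(logits).sum()
        k = rng.choice(len(clusters), p=probs)
        stream.append(clusters[k][rng.integers(len(clusters[k]))])
    return stream

# three toy "clusters" of examples
clusters = [[f"c{c}_ex{i}" for i in range(20)] for c in range(3)]
print(sample_dynamic_stream(clusters, steps=5))
```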

FedShuffle: Recipes for Better Use of Local Work in Federated Learning

Apr 27, 2022
Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, Michael Rabbat

The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). In this work, we propose a general recipe, FedShuffle, that better utilizes the local updates in FL, especially in the heterogeneous regime. Unlike many prior works, FedShuffle does not assume any uniformity in the number of updates per device. Our FedShuffle recipe comprises four simple-yet-powerful ingredients: 1) local shuffling of the data, 2) adjustment of the local learning rates, 3) update weighting, and 4) momentum variance reduction (Cutkosky and Orabona, 2019). We present a comprehensive theoretical analysis of FedShuffle and show that both theoretically and empirically, our approach does not suffer from the objective function mismatch that is present in FL methods which assume homogeneous updates in heterogeneous FL setups, e.g., FedAvg (McMahan et al., 2017). In addition, by combining the ingredients above, FedShuffle improves upon FedNova (Wang et al., 2020), which was previously proposed to solve this mismatch. We also show that FedShuffle with momentum variance reduction can improve upon non-local methods under a Hessian similarity assumption. Finally, through experiments on synthetic and real-world datasets, we illustrate how each of the four ingredients used in FedShuffle helps improve the use of local updates in FL.

* 18 pages, 2 figures, 3 tables, 30 pages of supplementary materials 
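
The sketch below shows where the four ingredients named in the abstract slot into a federated round. The concrete learning-rate and weighting rules here are simple placeholders, and the momentum variance reduction step is omitted, so this is a schematic under stated assumptions rather than the paper's FedShuffle recipe.

```python
import random
import numpy as np

def fedshuffle_style_round(global_w, clients, base_lr=0.1):
    """One schematic federated round for a toy linear model."""
    updates, sizes = [], []
    for data in clients:                     # data: list of (x, y) samples on one device
        w = global_w.copy()
        random.shuffle(data)                 # (1) local shuffling of the data
        lr = base_lr / len(data)             # (2) local learning rate adjusted to the number of updates (placeholder rule)
        for x, y in data:                    # one local epoch of SGD
            w -= lr * (w @ x - y) * x
        updates.append(w - global_w)
        sizes.append(len(data))
    weights = np.array(sizes, dtype=float) / sum(sizes)        # (3) update weighting (placeholder: by dataset size)
    avg_update = sum(wt * u for wt, u in zip(weights, updates))
    return global_w + avg_update             # (4) momentum variance reduction omitted here

# toy usage: 3 clients holding different amounts of data for a 2-d linear model
rng = np.random.default_rng(1)
clients = [[(rng.standard_normal(2), rng.standard_normal()) for _ in range(n)] for n in (5, 20, 50)]
w = np.zeros(2)
for _ in range(10):
    w = fedshuffle_style_round(w, clients)
print(w)
```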

Federated Learning with Partial Model Personalization

Apr 08, 2022
Krishna Pillutla, Kshitiz Malik, Abdelrahman Mohamed, Michael Rabbat, Maziar Sanjabi, Lin Xiao

We consider two federated learning algorithms for training partially personalized models, where the shared and personal parameters are updated either simultaneously or alternately on the devices. Both algorithms have been proposed in the literature, but their convergence properties are not fully understood, especially for the alternating variant. We provide convergence analyses of both algorithms in the general nonconvex setting with partial participation and delineate the regime where one dominates the other. Our experiments on real-world image, text, and speech datasets demonstrate that (a) partial personalization can obtain most of the benefits of full model personalization with a small fraction of personal parameters, and, (b) the alternating update algorithm often outperforms the simultaneous update algorithm.
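
A minimal sketch of the two local update schemes, assuming a toy model f(x) = (u + v) . x with shared parameters u (averaged by the server) and personal parameters v (kept on-device). The model, learning rates, and training loop are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def local_round(shared, personal, data, lr=0.05, alternating=True):
    """One client's local pass, updating shared (u) and personal (v) parameters
    either alternately or simultaneously."""
    u, v = shared.copy(), personal.copy()
    for x, y in data:
        err = (u + v) @ x - y
        if alternating:
            v -= lr * err * x           # first update the personal part ...
            err = (u + v) @ x - y       # ... then recompute the error
            u -= lr * err * x           # and update the shared part
        else:
            g = err * x                 # simultaneous: one gradient for both
            u -= lr * g
            v -= lr * g
    return u, v

# server averages only the shared parameters; personal ones never leave the device
rng = np.random.default_rng(0)
clients = [[(rng.standard_normal(3), rng.standard_normal()) for _ in range(30)] for _ in range(4)]
shared, personals = np.zeros(3), [np.zeros(3) for _ in clients]
for _ in range(20):
    results = [local_round(shared, p, d) for p, d in zip(personals, clients)]
    shared = np.mean([u for u, _ in results], axis=0)
    personals = [v for _, v in results]
print(shared)
```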

On the Convergence Rates of Policy Gradient Methods

Jan 19, 2022
Lin Xiao

We consider infinite-horizon discounted Markov decision problems with finite state and action spaces. We show that with direct parametrization in the policy space, the weighted value function, although non-convex in general, is both quasi-convex and quasi-concave. While quasi-convexity helps explain the convergence of policy gradient methods to global optima, quasi-concavity hints at their convergence guarantees using arbitrarily large step sizes that are not dictated by the Lipschitz constant characterizing the smoothness of the value function. In particular, we show that when using geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. In addition, we develop a theory of weak gradient-mapping dominance and use it to prove a sharper sublinear convergence rate for the projected policy gradient method. Finally, we analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model.
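
For reference, one common way to state the policy mirror descent update with geometrically increasing step sizes (notation is schematic and may differ from the paper):

$$\pi_{k+1}(\cdot \mid s) \;\in\; \arg\max_{p \,\in\, \Delta(\mathcal{A})} \Big\{ \eta_k \big\langle Q^{\pi_k}(s, \cdot),\, p \big\rangle \;-\; D\big(p,\, \pi_k(\cdot \mid s)\big) \Big\} \quad \text{for every state } s, \qquad \eta_k = \eta_0\, \gamma^{-k},$$

where $D$ is a Bregman divergence (the KL divergence recovers the natural policy gradient method, and the squared Euclidean distance recovers projected Q-descent) and $\gamma \in (0,1)$ is the discount factor, so the step sizes grow geometrically.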

Importance Estimation from Multiple Perspectives for Keyphrase Extraction

Nov 11, 2021
Mingyang Song, Liping Jing, Lin Xiao

Keyphrase extraction is a fundamental task in natural language processing, which usually contains two main parts: candidate keyphrase extraction and keyphrase importance estimation. From the perspective of how humans understand documents, we typically measure the importance of a phrase according to its syntactic accuracy, information saliency, and concept consistency simultaneously. However, most existing keyphrase extraction approaches focus on only some of these aspects, which leads to biased results. In this paper, we propose a new approach that estimates the importance of a keyphrase from multiple perspectives (called KIEMP) and further improves the performance of keyphrase extraction. Specifically, KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept (i.e., topic) consistency between the phrase and the whole document. These three modules are seamlessly joined via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms existing state-of-the-art keyphrase extraction approaches in most cases.

* 11 pages, 2 figures, Accepted by EMNLP 2021 (main conference) 
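
To make the three-perspective layout concrete, here is a schematic multi-task head arrangement suggested by the abstract: a chunking head, a ranking head, and a matching head scored against a document vector, trained jointly. The module shapes, placeholder losses, and class name are illustrative assumptions; this is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeViewScorer(nn.Module):
    """Schematic three-head scorer over phrase and document vectors."""
    def __init__(self, dim=128):
        super().__init__()
        self.chunk_head = nn.Linear(dim, 2)          # syntactic accuracy: is the span a well-formed phrase?
        self.rank_head = nn.Linear(dim, 1)           # information saliency: score used for ranking
        self.match_head = nn.Bilinear(dim, dim, 1)   # concept consistency: phrase-document topic match

    def forward(self, phrase_vec, doc_vec):
        return (self.chunk_head(phrase_vec),
                self.rank_head(phrase_vec).squeeze(-1),
                self.match_head(phrase_vec, doc_vec).squeeze(-1))

model = ThreeViewScorer()
phrase, doc = torch.randn(4, 128), torch.randn(4, 128)
chunk_logits, saliency, match = model(phrase, doc)
# multi-task training sums one (placeholder) loss per perspective
chunk_labels = torch.randint(0, 2, (4,))
loss = (F.cross_entropy(chunk_logits, chunk_labels)
        + F.mse_loss(saliency, torch.rand(4))        # placeholder ranking loss
        + F.mse_loss(match, torch.rand(4)))          # placeholder matching loss
loss.backward()
```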