Martin Takáč

Byzantine-Tolerant Methods for Distributed Variational Inequalities

Nov 08, 2023
Nazarii Tupitsa, Abdulla Jasem Almansoori, Yanlin Wu, Martin Takáč, Karthik Nandakumar, Samuel Horváth, Eduard Gorbunov

Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from standard minimization and therefore require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequalities, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.

* NeurIPS 2023; 69 pages, 12 figures 
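
Below is a minimal sketch, not taken from the paper, of how robust aggregation can be plugged into an extragradient-type update for a distributed variational inequality. The coordinate-wise median aggregator, the step size, and the toy bilinear problem are illustrative assumptions, not the specific methods analyzed in the work.

```python
import numpy as np

def robust_aggregate(updates):
    # Coordinate-wise median as a simple illustrative robust aggregator;
    # the paper studies provably robust aggregation rules, which may differ.
    return np.median(np.stack(updates), axis=0)

def distributed_extragradient_step(z, worker_operators, lr=0.3):
    """One extragradient step where each worker reports its operator value F_i(z).

    Byzantine workers may return arbitrary vectors, so the server aggregates
    robustly instead of averaging.
    """
    # Extrapolation step using robustly aggregated operator estimates.
    g = robust_aggregate([F(z) for F in worker_operators])
    z_half = z - lr * g
    # Update step evaluated at the extrapolated point.
    g_half = robust_aggregate([F(z_half) for F in worker_operators])
    return z - lr * g_half

# Toy example: the bilinear saddle-point problem min_u max_v u*v,
# written as the VI with operator F(u, v) = (v, -u).
honest = lambda z: np.array([z[1], -z[0]])
byzantine = lambda z: np.array([1e3, -1e3])   # an attacker sending garbage
workers = [honest] * 9 + [byzantine]

z = np.array([1.0, 1.0])
for _ in range(200):
    z = distributed_extragradient_step(z, workers)
print(z)  # should approach the solution (0, 0) despite the attacker
```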

Stochastic Gradient Descent with Preconditioned Polyak Step-size

Oct 03, 2023
Farshed Abdukhakimov, Chulu Xiang, Dmitry Kamzolov, Martin Takáč

Stochastic Gradient Descent (SGD) is one of the many iterative optimization methods widely used for solving machine learning problems. These methods display valuable properties and attract researchers and industrial machine learning engineers with their simplicity. However, one of the weaknesses of this type of method is the need to tune the learning rate (step-size) for every loss function and dataset combination in order to solve an optimization problem efficiently within a given time budget. Stochastic Gradient Descent with Polyak Step-size (SPS) offers an update rule that alleviates the need to fine-tune the learning rate of an optimizer. In this paper, we propose an extension of SPS that employs preconditioning techniques, such as Hutchinson's method, Adam, and AdaGrad, to improve its performance on badly scaled and/or ill-conditioned datasets.
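
As an illustration only, here is a rough sketch of how a Polyak step-size can be combined with a diagonal preconditioner. The AdaGrad-style accumulator, the assumption that the optimal value is known (zero here), and the deterministic toy problem are simplifying assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def preconditioned_polyak(grad_fn, loss_fn, x0, n_steps=1000, f_star=0.0, eps=1e-8):
    """Sketch of gradient descent with a preconditioned Polyak step-size.

    The step-size is gamma_k = (f(x_k) - f_star) / ||g_k||^2_{D_k^{-1}}, and the
    update is x_{k+1} = x_k - gamma_k * D_k^{-1} g_k, where D_k is a diagonal
    preconditioner (here an AdaGrad-style accumulator, purely for illustration).
    In the stochastic setting the same step would use a minibatch loss/gradient.
    """
    x = x0.copy()
    acc = np.zeros_like(x)                    # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(x)
        acc += g ** 2
        d = np.sqrt(acc) + eps                # diagonal of D_k
        precond_g = g / d                     # D_k^{-1} g_k
        gamma = (loss_fn(x) - f_star) / (g @ precond_g + eps)
        x -= gamma * precond_g
    return x

# Toy badly scaled quadratic: f(x) = 0.5*(1000*x1^2 + x2^2), minimum 0 at the origin.
loss = lambda x: 0.5 * (1000 * x[0] ** 2 + x[1] ** 2)
grad = lambda x: np.array([1000 * x[0], x[1]])
print(preconditioned_polyak(grad, loss, np.array([1.0, 1.0])))  # near (0, 0)
```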

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Apr 08, 2023
Dávid Šuba, Marek Šuppa, Jozef Kubík, Endre Hamerlik, Martin Takáč

Named Entity Recognition (NER) is a fundamental NLP task with a wide range of practical applications. The performance of state-of-the-art NER methods depends on high-quality, manually annotated datasets, which still do not exist for some languages. In this work, we aim to remedy this situation in Slovak by introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models and comparing it to the existing silver-standard Slovak NER dataset. We also conduct few-shot experiments and show that training on the silver-standard dataset yields better results. To enable future work based on Slovak NER, we release the dataset, code, and trained models publicly under permissive licensing terms at https://github.com/NaiveNeuron/WikiGoldSK.

* BSNLP 2023 Workshop at EACL 2023 

In Quest of Ground Truth: Learning Confident Models and Estimating Uncertainty in the Presence of Annotator Noise

Jan 02, 2023
Asma Ahmed Hashmi, Artem Agafonov, Aigerim Zhumabayeva, Mohammad Yaqub, Martin Takáč

The performance of Deep Learning (DL) models depends on the quality of labels. In some areas, the involvement of human annotators may introduce noise into the data. When these corrupted labels are blindly regarded as the ground truth (GT), DL models suffer a drop in performance. This paper presents a method that aims to learn a confident model in the presence of noisy labels, while jointly estimating the uncertainty of multiple annotators. We robustly estimate the predictions given only the noisy labels by adding an entropy- or information-based regularizer to the classifier network. We conduct our experiments on noisy versions of the MNIST, CIFAR-10, and FMNIST datasets. Our empirical results demonstrate the robustness of our method, which outperforms or performs comparably to other state-of-the-art (SOTA) methods. In addition, we evaluate the proposed method on a curated dataset, where the noise type and level of various annotators depend on the input image style. We show that our approach performs well and is adept at learning annotators' confusion. Moreover, we demonstrate that our model is more confident in predicting GT than other baselines. Finally, we assess our approach on a segmentation problem and showcase its effectiveness with experiments.
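
A minimal PyTorch sketch of attaching an entropy-based regularizer to a standard cross-entropy loss on noisy annotator labels. The sign and the weight `lam` are assumptions chosen for illustration, not the paper's exact objective or its annotator-confusion model.

```python
import torch
import torch.nn.functional as F

def noisy_label_loss(logits, noisy_targets, lam=0.1):
    """Cross-entropy on noisy annotator labels plus an entropy regularizer.

    Penalizing the entropy of the predictive distribution is one simple way to
    push the classifier towards confident predictions despite label noise; the
    exact regularizer and weight used in the paper may differ.
    """
    ce = F.cross_entropy(logits, noisy_targets)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1).mean()
    return ce + lam * entropy

# Usage with a hypothetical classifier `model` and a batch (x, y_noisy):
#   loss = noisy_label_loss(model(x), y_noisy)
#   loss.backward()
```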

Partial Disentanglement with Partially-Federated GANs (PaDPaF)

Dec 07, 2022
Abdulla Jasem Almansoori, Samuel Horváth, Martin Takáč

Federated learning has become a popular machine learning paradigm with many potential real-life applications, including recommendation systems, the Internet of Things (IoT), healthcare, and self-driving cars. Though most current applications focus on classification-based tasks, learning personalized generative models remains largely unexplored, and their benefits in the heterogeneous setting still need to be better understood. This work proposes a novel architecture combining global client-agnostic and local client-specific generative models. We show that, using standard techniques for training federated models, our proposed model achieves privacy and personalization by implicitly disentangling the globally consistent representation (i.e., content) from the client-dependent variations (i.e., style). Using such a decomposition, personalized models can generate locally unseen labels while preserving the given style of the client, and can predict the labels for all clients with high accuracy by training a simple linear classifier on the global content features. Furthermore, disentanglement enables other essential applications, such as data anonymization, by sharing only content. Extensive experimental evaluation corroborates our findings, and we also provide partial theoretical justification for the proposed approach.

* 20 pages, 17 figures 
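
Purely for intuition, the sketch below shows one way a generator could be split into a shared "content" branch (aggregated in federated training) and a client-private "style" branch. Module names, sizes, and the modulation scheme are assumptions and not the actual PaDPaF architecture.

```python
import torch
import torch.nn as nn

class PartiallyFederatedGenerator(nn.Module):
    """Illustrative generator with a shared 'content' branch (federated across
    clients) and a private 'style' branch kept local to each client. The actual
    PaDPaF architecture may differ."""

    def __init__(self, content_dim=64, style_dim=16, out_dim=784):
        super().__init__()
        self.content_net = nn.Sequential(nn.Linear(content_dim, 256), nn.ReLU())  # shared
        self.style_net = nn.Sequential(nn.Linear(style_dim, 256), nn.ReLU())      # local
        self.decoder = nn.Linear(256, out_dim)                                    # shared

    def forward(self, z_content, z_style):
        # Client-specific style modulates the globally consistent content features.
        h = self.content_net(z_content) * (1.0 + self.style_net(z_style))
        return torch.tanh(self.decoder(h))

# Only `content_net` and `decoder` parameters would be sent to the server for
# aggregation; `style_net` stays on the client.
gen = PartiallyFederatedGenerator()
x = gen(torch.randn(4, 64), torch.randn(4, 16))   # batch of 4 synthetic samples
```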

Gradient Descent and the Power Method: Exploiting their connection to find the leftmost eigen-pair and escape saddle points

Nov 02, 2022
Rachael Tappenden, Martin Takáč

This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly nonconvex) quadratic function is equivalent to running the Power Method (PM) on the gradients. The connection between GD with a fixed step size and the PM, both with and without fixed momentum, is thus established. Consequently, valuable eigen-information is available via GD. Recent examples show that GD with a fixed step size, applied to locally quadratic nonconvex functions, can take exponential time to escape saddle points (Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos: "Gradient descent can take exponential time to escape saddle points"; S. Paternain, A. Mokhtari, and A. Ribeiro: "A newton-based method for nonconvex optimization with fast evasion of saddle points"). Here, those examples are revisited and it is shown that eigenvalue information was missing, so that the examples may not provide a complete picture of the potential practical behaviour of GD. Thus, ongoing investigation of the behaviour of GD on nonconvex functions, possibly with an \emph{adaptive} or \emph{variable} step size, is warranted. It is shown that, in the special case of a quadratic in $R^2$, if an eigenvalue is known, then GD with a fixed step size will converge in two iterations, and a complete eigen-decomposition is available. By considering the dynamics of the gradients and iterates, new step size strategies are proposed to improve the practical performance of GD. Several numerical examples are presented, which demonstrate the advantages of exploiting the GD--PM connection.
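
The core observation lends itself to a short numerical check: for f(x) = ½xᵀAx − bᵀx with a fixed step size α, the gradients satisfy g_{k+1} = (I − αA)g_k, so they are power-method iterates for I − αA, and a Rayleigh quotient of successive gradients recovers an extreme eigenvalue of A. The sketch below is illustrative and not taken from the paper.

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x - b^T x with an indefinite A (a saddle at the origin).
A = np.diag([-1.0, 2.0, 5.0])
b = np.zeros(3)
alpha = 0.1                      # fixed GD step size

x = np.array([1.0, 1.0, 1.0])
g_prev = A @ x - b
for _ in range(100):
    x = x - alpha * g_prev       # gradient descent step
    g = A @ x - b
    g_prev, g_old = g, g_prev    # keep the last two gradients

# Since g_{k+1} = (I - alpha*A) g_k, the gradients are power-method iterates for
# I - alpha*A; their Rayleigh quotient estimates its dominant eigenvalue, which
# here corresponds to the leftmost eigenvalue of A.
mu = (g_prev @ g_old) / (g_old @ g_old)
lambda_left = (1.0 - mu) / alpha
print(lambda_left)               # approaches -1.0, the leftmost eigenvalue of A
```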

FLECS-CGD: A Federated Learning Second-Order Framework via Compression and Sketching with Compressed Gradient Differences

Oct 18, 2022
Artem Agafonov, Brahim Erraji, Martin Takáč

In the recent paper FLECS (Agafonov et al., FLECS: A Federated Learning Second-Order Framework via Compression and Sketching), the second-order framework FLECS was proposed for the federated learning problem. This method uses compression of sketched Hessians to keep communication costs low. However, the main bottleneck of FLECS is that gradients are communicated without compression. In this paper, we propose a modification of FLECS with compressed gradient differences, which we call FLECS-CGD (FLECS with Compressed Gradient Differences), and make it applicable to stochastic optimization. Convergence guarantees are provided in the strongly convex and nonconvex cases. Experiments show the practical benefit of the proposed approach.
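
A minimal sketch of the compressed-gradient-difference idea in isolation: each worker transmits a compressed difference against a reference vector that the server mirrors. The Top-k compressor and the class layout are illustrative assumptions, not the FLECS-CGD implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries (an illustrative sparsifying compressor)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class WorkerCGD:
    """Worker-side state for compressed gradient differences: instead of sending
    the full gradient, the worker sends a compressed difference against a
    reference vector h that the server maintains identically."""

    def __init__(self, dim, k):
        self.h = np.zeros(dim)   # reference, kept in sync with the server
        self.k = k

    def message(self, grad):
        delta = top_k(grad - self.h, self.k)  # compressed difference (cheap to send)
        self.h = self.h + delta               # update the local reference
        return delta

# The server keeps the same reference h_i per worker, so after applying the
# received delta it holds an estimate of worker i's gradient.
w = WorkerCGD(dim=6, k=2)
print(w.message(np.array([0.1, -3.0, 0.2, 5.0, 0.0, 0.1])))  # only 2 entries survive
```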

SP2: A Second Order Stochastic Polyak Method

Jul 17, 2022
Shuang Li, William J. Swartworth, Martin Takáč, Deanna Needell, Robert M. Gower

Recently the "SP" (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equation by using local linearizations of the model. We take a step further and develop a method for solving the interpolation equations that uses the local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed-up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory on sums-of-quadratics.

Suppressing Poisoning Attacks on Federated Learning for Medical Imaging

Jul 15, 2022
Naif Alkhunaizi, Dmitry Kamzolov, Martin Takáč, Karthik Nandakumar

Collaboration among multiple data-owning entities (e.g., hospitals) can accelerate the training process and yield better machine learning models due to the availability and diversity of data. However, privacy concerns make it challenging to exchange data while preserving confidentiality. Federated Learning (FL) is a promising solution that enables collaborative training through the exchange of model parameters instead of raw data. However, most existing FL solutions work under the assumption that participating clients are \emph{honest} and thus can fail against poisoning attacks from malicious parties, whose goal is to deteriorate the global model's performance. In this work, we propose a robust aggregation rule called Distance-based Outlier Suppression (DOS) that is resilient to Byzantine failures. The proposed method computes the distance between local parameter updates of different clients and obtains an outlier score for each client using Copula-based Outlier Detection (COPOD). The resulting outlier scores are converted into normalized weights using a softmax function, and a weighted average of the local parameters is used for updating the global model. DOS aggregation can effectively suppress parameter updates from malicious clients without the need for any hyperparameter selection, even when the data distributions are heterogeneous. Evaluation on two medical imaging datasets (CheXpert and HAM10000) demonstrates the higher robustness of the DOS method against a variety of poisoning attacks in comparison to other state-of-the-art methods. The code can be found at https://github.com/Naiftt/SPAFD.
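
A simplified sketch of the aggregation pipeline described above. The mean pairwise distance is used as a stand-in for the COPOD outlier score, and the temperature parameter is an artifact of this simplification (the paper's DOS rule is hyperparameter-free).

```python
import numpy as np

def dos_aggregate(updates, temperature=1.0):
    """Sketch of Distance-based Outlier Suppression (DOS).

    Computes pairwise distances between client updates, turns them into an
    outlier score per client, and averages the updates with softmax weights on
    the negated scores. The paper derives the scores with COPOD; the mean
    pairwise distance used here is a simplified stand-in.
    """
    U = np.stack(updates)                                  # (n_clients, dim)
    dists = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    scores = dists.mean(axis=1)                            # simplified outlier score
    weights = np.exp(-scores / temperature)
    weights /= weights.sum()                               # softmax over negated scores
    return weights @ U                                     # weighted average update

# Example: 9 benign updates near the true gradient, 1 poisoned update.
rng = np.random.default_rng(0)
benign = [np.ones(5) + 0.01 * rng.standard_normal(5) for _ in range(9)]
poisoned = [-50.0 * np.ones(5)]
print(dos_aggregate(benign + poisoned))   # close to the all-ones benign update
```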

On Scaled Methods for Saddle Point Problems

Jun 16, 2022
Aleksandr Beznosikov, Aibek Alanov, Dmitry Kovalev, Martin Takáč, Alexander Gasnikov

Methods with adaptive scaling of different features play a key role in solving saddle point problems, primarily due to Adam's popularity for solving adversarial machine learning problems, including GAN training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RMSProp scalings, and the newer AdaHessian and OASIS, which are based on the Hutchinson approximation. We use Extra Gradient and its improved version with negative momentum as the base methods. Experimental studies on GANs show good applicability not only of Adam but also of the other, less popular methods.

* 51 pages, 2 algorithms with 4 options for each, 12 figures, 5 tables, 2 theorems 
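
Below is an illustrative sketch, not the paper's algorithm, of an Extra Gradient step with an RMSProp-style diagonal scaling. The particular scaling, the decaying step size, and the toy strongly-convex-strongly-concave problem are simplifying assumptions.

```python
import numpy as np

def scaled_extragradient(F, z0, lr=0.5, beta=0.99, n_steps=2000, eps=1e-8):
    """Sketch of the Extra Gradient method with an RMSProp-style diagonal scaling.

    F(z) stacks (grad_x f, -grad_y f) of a saddle-point problem. The scaling and
    the decaying step size are illustrative stand-ins for the Adam, RMSProp,
    AdaHessian, and OASIS preconditioners analyzed in the paper.
    """
    z = z0.copy()
    v = F(z0) ** 2                           # initialize the second-moment estimate
    for t in range(n_steps):
        step = lr / np.sqrt(t + 1.0)         # decaying step so the iterates settle
        g = F(z)
        v = beta * v + (1 - beta) * g ** 2   # RMSProp-style second-moment estimate
        d = np.sqrt(v) + eps                 # diagonal preconditioner D_t
        z_half = z - step * g / d            # extrapolation step
        z = z - step * F(z_half) / d         # update at the extrapolated point
    return z

# Toy strongly-convex-strongly-concave saddle f(x, y) = 0.5*x^2 + x*y - 0.5*y^2.
F = lambda z: np.array([z[0] + z[1], -z[0] + z[1]])
print(scaled_extragradient(F, np.array([1.0, 0.5])))  # should settle near (0, 0)
```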