Valentin Malykh

CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search

May 19, 2023
Nikita Sorokin, Dmitry Abulkhanov, Sergey Nikolenko, Valentin Malykh

We consider the clone detection and information retrieval problems for source code, well-known tasks important for any programming language. Although it is also an important and interesting problem to find code snippets that operate identically but are written in different programming languages, to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, cross-consistency training (CCT), which we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and on the AdvTest code search benchmark with 47.18% MRR; it also shows the best results across all programming languages on the newly created multilingual clone detection benchmark XCD.
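The MAP and MRR figures quoted above are standard ranking metrics for retrieval-style evaluation. As a rough illustration (not the authors' evaluation code), they can be computed from per-query rankings and relevance sets like this:

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: mean over queries of 1 / rank of the first relevant hit."""
    total = 0.0
    for qid, ranking in ranked_results.items():
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

def mean_average_precision(ranked_results, relevant):
    """MAP: mean over queries of average precision, i.e. precision
    accumulated at each rank where a relevant document appears."""
    total = 0.0
    for qid, ranking in ranked_results.items():
        hits, ap = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant[qid]:
                hits += 1
                ap += hits / rank
        total += ap / max(len(relevant[qid]), 1)
    return total / len(ranked_results)
```

Both take a dict of ranked document lists per query and a dict of relevant-document sets per query; the benchmark-specific details (candidate pools, tie handling) are omitted.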

Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets

May 19, 2023
Ivan Sedykh, Dmitry Abulkhanov, Nikita Sorokin, Sergey Nikolenko, Valentin Malykh

Code search is an important task that has seen many developments in recent years. However, previous attempts have mostly considered the problem of searching for code by a text query. We argue that using a code snippet (and possibly an associated traceback) as a query and looking for answers with bugfixing instructions and code samples is a natural use case that is not covered by existing approaches. Moreover, existing datasets use comments extracted from code rather than full-text descriptions as text, making them unsuitable for this use case. We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; it turns out that in this setting, existing architectures fall short of the simplest BM25 baseline even after fine-tuning. We present a new single-encoder model, SnippeR, that outperforms several strong baselines on the SearchBySnippet dataset with a result of 0.451 Recall@10, and we propose the SearchBySnippet dataset and SnippeR as a new important benchmark for code search evaluation.
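The BM25 baseline mentioned above is a classic lexical ranking function. A minimal self-contained sketch of Okapi BM25 scoring (with the usual k1 and b parameters; production systems use an inverted index rather than this linear scan) looks like:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

For a snippet-as-query setting, the query tokens would come from the code snippet (and possibly its traceback) rather than from natural-language text.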

DetIE: Multilingual Open Information Extraction Inspired by Object Detection

Jun 24, 2022
Michael Vasilkovsky, Anton Alekseev, Valentin Malykh, Ilya Shenbin, Elena Tutubalina, Dmitriy Salikhov, Mikhail Stepnov, Andrey Chertok, Sergey Nikolenko

State-of-the-art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively in an autoregressive or predicate-based manner in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision. We use an order-agnostic loss based on bipartite matching that forces unique predictions, and a Transformer-based encoder-only architecture for sequence labeling. The proposed approach is faster and shows superior or similar performance in comparison with state-of-the-art models on standard benchmarks in terms of both quality metrics and inference time. Our model sets a new state-of-the-art performance of 67.7% F1 on CaRB evaluated as OIE2016 while being 3.35x faster at inference than the previous state of the art. We also evaluate the multilingual version of our model in the zero-shot setting for two languages and introduce a strategy for generating synthetic multilingual data to fine-tune the model for each specific language. In this setting, we show a 15% performance improvement on multilingual Re-OIE2016, reaching 75% F1 for both Portuguese and Spanish. Code and models are available at https://github.com/sberbank-ai/DetIE.
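The core of an order-agnostic bipartite-matching loss is finding the cheapest one-to-one assignment of predictions to gold triplets, so the loss does not depend on output order. This toy sketch brute-forces the assignment over permutations (object-detection-style losses typically solve it with the Hungarian algorithm; the equal-size assumption and the cost matrix here are simplifications for illustration):

```python
from itertools import permutations

def order_agnostic_cost(pred_costs):
    """Minimal total cost over all one-to-one assignments of predictions
    to gold triplets, where pred_costs[i][j] is the cost of matching
    prediction i to gold item j. Brute force over permutations; real
    implementations use the Hungarian algorithm for efficiency."""
    n = len(pred_costs)
    return min(
        sum(pred_costs[i][p[i]] for i in range(n))
        for p in permutations(range(n))
    )
```

Because the minimum is taken over assignments, permuting the prediction rows leaves the cost unchanged, which is exactly the property that lets a single-pass model emit all triplets at once without duplicate penalties.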

* Accepted to the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22) 

Template-based Approach to Zero-shot Intent Recognition

Jun 22, 2022
Dmitry Lamanov, Pavel Burnyshev, Ekaterina Artemova, Valentin Malykh, Andrey Bout, Irina Piontkovskaya

The recent advances in transfer learning techniques and pre-training of large contextualized encoders foster innovation in real-life applications, including dialog assistants. Practical needs of intent recognition require effective data usage and the ability to constantly update supported intents, adopting new ones and abandoning outdated ones. In particular, the generalized zero-shot paradigm, in which the model is trained on seen intents and tested on both seen and unseen intents, is taking on new importance. In this paper, we explore the generalized zero-shot setup for intent recognition. Following best practices for zero-shot text classification, we treat the task with a sentence-pair modeling approach. We outperform the previous state-of-the-art F1-measure by up to 16% for unseen intents, using intent labels and user utterances and without accessing external sources (such as knowledge bases). Further enhancement includes lexicalization of intent labels, which improves performance by up to 7%. By using task transfer from other sentence-pair tasks, such as natural language inference, we gain additional improvements.
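The sentence-pair framing above can be sketched as follows: each intent label is verbalized into a hypothesis sentence and paired with the utterance, and the best-scoring pair wins, which is what makes unseen intents classifiable. The template wording, function names, and the token-overlap stand-in scorer below are all illustrative assumptions; a real system would score pairs with a fine-tuned NLI/entailment model:

```python
def verbalize(intent_label):
    """Turn an intent label like "book_flight" into a hypothesis sentence.
    The exact template wording here is an assumption, not the paper's."""
    return f"The user wants to {intent_label.replace('_', ' ')}"

def classify(utterance, intent_labels, score_pair):
    """Score the (utterance, hypothesis) pair for every candidate intent,
    seen or unseen, and return the best-scoring label."""
    return max(intent_labels, key=lambda lab: score_pair(utterance, verbalize(lab)))

def overlap(premise, hypothesis):
    """Toy stand-in scorer: count of shared lowercase tokens."""
    a, b = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(a & b)
```

Note that lexicalizing the label (replacing underscores, expanding abbreviations) is itself part of the recipe: the richer the hypothesis text, the more signal the pair model gets.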

* accepted to INLG 2022 

WikiMulti: a Corpus for Cross-Lingual Summarization

Apr 23, 2022
Pavel Tikhonov, Valentin Malykh

Cross-lingual summarization (CLS) is the task of producing a summary in one particular language for a source document in a different language. We introduce WikiMulti, a new dataset for cross-lingual summarization based on Wikipedia articles in 15 languages. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We make our dataset publicly available at https://github.com/tikhonovpavel/wikimulti.
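Abstractive summarization baselines of this kind are commonly scored with ROUGE-style n-gram overlap against a reference summary. As a minimal sketch (the paper's exact evaluation setup and tokenization are not specified here):

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 between a candidate and a reference summary,
    computed over whitespace-token n-grams."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c = ngrams(candidate.split(), n)
    r = ngrams(reference.split(), n)
    overlap = sum((c & r).values())  # clipped n-gram overlap
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

In the cross-lingual setting the candidate and reference are in the target language, while the source document is in another language entirely.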

Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models

Feb 15, 2022
Alena Fenogenova, Maria Tikhonova, Vladislav Mikhailov, Tatiana Shavrina, Anton Emelyanov, Denis Shevelev, Alexandr Kukushkin, Valentin Malykh, Ekaterina Artemova

In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks. This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user-experience, and methodological improvements, including fixes of the benchmark vulnerabilities unresolved in the previous version: novel and improved tests for understanding the meaning of a word in context (RUSSE) along with reading comprehension and common sense reasoning (DaNetQA, RuCoS, MuSeRC). Together with the release of the updated datasets, we improve the benchmark toolkit based on the jiant framework for consistent training and evaluation of NLP models of various architectures, which now supports the most recent models for Russian. Finally, we provide the integration of Russian SuperGLUE with MOROCCO (MOdel ResOurCe COmparison), a framework for industrial evaluation of open-source models, in which models are evaluated according to the weighted average metric over all tasks, inference speed, and the amount of RAM occupied. Russian SuperGLUE is publicly available at https://russiansuperglue.com/.

* Computational Linguistics and Intellectual Technologies Papers from the Annual International Conference "Dialogue" (2021) Issue 20 

A Single Example Can Improve Zero-Shot Data Generation

Aug 16, 2021
Pavel Burnyshev, Valentin Malykh, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, and out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather them. The generator should be trained to generate utterances that belong to a given intent. We explore two approaches to generating task-oriented utterances. In the zero-shot approach, the model is trained to generate utterances from seen intents and is further used to generate utterances for intents unseen during training. In the one-shot approach, the model is presented with a single utterance from a test intent. We perform a thorough automatic and human evaluation of the datasets generated using the two proposed approaches. Our results reveal that the attributes of the generated data are close to those of the original test sets collected via crowd-sourcing.

* To appear in INLG2021 proceedings 

MOROCCO: Model Resource Comparison Framework

Apr 29, 2021
Valentin Malykh, Alexander Kukushkin, Ekaterina Artemova, Vladislav Mikhailov, Maria Tikhonova, Tatiana Shavrina

The new generation of pre-trained NLP models pushes the SOTA to new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework to compare language models compatible with the jiant environment, which supports over 50 NLU tasks, including the SuperGLUE benchmark and multiple probing suites. We demonstrate its applicability for two GLUE-like suites in different languages.
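Measuring inference time and memory footprint alongside quality is the key idea above. A minimal stand-in sketch (MOROCCO itself tracks process-level resources; this version only measures wall-clock time and peak Python heap allocations for a given predict function, which is a simplification):

```python
import time
import tracemalloc

def profile_model(predict, inputs):
    """Run predict over inputs, returning (outputs, seconds per item,
    peak traced Python memory in bytes). A toy stand-in for a resource
    comparison harness, not MOROCCO's actual measurement code."""
    tracemalloc.start()
    start = time.perf_counter()
    outputs = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return outputs, elapsed / len(inputs), peak
```

With per-task quality, speed, and memory numbers in hand, a combined leaderboard score can then weight the three axes, which is the kind of aggregation the Russian SuperGLUE integration above describes.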

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Nov 02, 2020
Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human-level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer the first steps toward further expanding or assessing state-of-the-art models independently of language.

* to appear in EMNLP 2020 