Miltiadis Allamanis

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Sep 27, 2019

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt

Abstract: Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that the CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further, and we will host a competition and leaderboard to track progress on the challenge. We are also keen on extending the CodeSearchNet Challenge to more queries and programming languages in the future.

Program Synthesis and Semantic Parsing with Learned Code Idioms

Jul 23, 2019

Richard Shin, Miltiadis Allamanis, Marc Brockschmidt, Oleksandr Polozov

Abstract: Program synthesis of general-purpose source code from natural language specifications is challenging due to the need to reason about high-level patterns in the target program and low-level implementation details at the same time. In this work, we present PATOIS, a system that allows a neural program synthesizer to explicitly interleave high-level and low-level reasoning at every generation step. It accomplishes this by automatically mining common code idioms from a given corpus, incorporating them into the underlying language for neural synthesis, and training a tree-based neural synthesizer to use these idioms during code generation. We evaluate PATOIS on two complex semantic parsing datasets and show that using learned code idioms improves the synthesizer's accuracy.

The Adverse Effects of Code Duplication in Machine Learning Models of Code

Jan 11, 2019

Miltiadis Allamanis

Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017), who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication on machine learning models, showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers. We present an "errata" for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research.
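
As a rough illustration of the de-duplication step the abstract argues for, the following is a minimal sketch, assuming near-duplicates are flagged by Jaccard similarity over sets of identifier-like tokens. The tokenizer, the 0.8 threshold, and all function names are illustrative assumptions; this is not the tooling released with the paper.

```python
import re
from itertools import combinations

def tokens(code: str) -> frozenset:
    """Crude tokenizer (assumption): the set of identifier-like tokens."""
    return frozenset(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate_pairs(snippets, threshold=0.8):
    """Return index pairs whose token-set similarity is at least the threshold."""
    sets = [tokens(s) for s in snippets]
    return [(i, j) for i, j in combinations(range(len(sets)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]

def deduplicate(snippets, threshold=0.8):
    """Greedily keep a snippet only if it is not a near-duplicate of a kept one."""
    kept = []
    for snippet in snippets:
        current = tokens(snippet)
        if all(jaccard(current, tokens(k)) < threshold for k in kept):
            kept.append(snippet)
    return kept
```

Running `deduplicate` over the full corpus before splitting it into training and test sets keeps a near-copy of a training function from reappearing in the test set. The pairwise comparison here is quadratic in the number of snippets, so it only suits small corpora.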

Structured Neural Summarization

Nov 05, 2018

Patrick Fernandes, Miltiadis Allamanis, Marc Brockschmidt

Abstract: Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models and pure graph models on a range of summarization tasks.

Learning to Represent Edits

Oct 31, 2018

Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt

Abstract: We introduce the problem of learning distributed representations of edits. By combining a "neural editor" with an "edit encoder", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.

Constrained Graph Variational Autoencoders for Molecule Design

May 23, 2018

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt

Abstract: Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on the use of graphs to represent chemical molecules, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is more successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.

* 8 pages, 5 figures

Generative Code Modeling with Graphs

May 22, 2018

Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov

Abstract: Generative models for source code are an interesting structured prediction problem, requiring reasoning about both hard syntactic and semantic constraints and natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. The generative procedure interleaves grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.

A Survey of Machine Learning for Big Code and Naturalness

May 05, 2018

Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton

Abstract: Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.

* Website accompanying this survey paper can be found at https://ml4code.github.io

Learning to Represent Programs with Graphs

May 04, 2018

Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi

Abstract: Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code's known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures. In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.

* Published in ICLR 2018. arXiv admin note: text overlap with arXiv:1705.07867
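
To make the idea of a program graph concrete, here is a toy sketch that turns Python source into a graph with two kinds of edges: syntactic "child" edges from the abstract syntax tree, and "next_use" edges linking successive occurrences of the same variable name, which is one simple way to expose the long-range dependencies the abstract mentions. The use of Python's `ast` module, the edge names, and the function name are assumptions made for illustration; this is not the paper's graph construction.

```python
import ast
from collections import defaultdict

def build_program_graph(source: str):
    """Return (nodes, edges) for a toy program graph.

    nodes: list of AST node type names, indexed by node id.
    edges: list of (src_id, dst_id, edge_type) triples.
    """
    tree = ast.parse(source)
    nodes, edges, ids = [], [], {}

    # Give every distinct AST object one integer id. Some objects
    # (for example Load contexts) can be shared between parents,
    # so ids are assigned only on first sight.
    for node in ast.walk(tree):
        if id(node) not in ids:
            ids[id(node)] = len(nodes)
            nodes.append(type(node).__name__)

    # Syntactic structure: parent -> child edges.
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((ids[id(node)], ids[id(child)], "child"))

    # Variable usage: link each occurrence of a name to its next
    # occurrence in source order (hypothetical "next_use" edge type).
    occurrences = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            occurrences[node.id].append(node)
    for name_nodes in occurrences.values():
        name_nodes.sort(key=lambda n: (n.lineno, n.col_offset))
        for prev, nxt in zip(name_nodes, name_nodes[1:]):
            edges.append((ids[id(prev)], ids[id(nxt)], "next_use"))

    return nodes, edges
```

A graph neural network would then pass messages along both edge types, so information about a variable can travel directly between distant uses instead of through every token in between.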

Learning Continuous Semantic Representations of Symbolic Expressions

Jun 10, 2017

Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, Charles Sutton

Abstract: Combining abstract, symbolic reasoning with continuous neural reasoning is a grand challenge of representation learning. As a step in this direction, we propose a new architecture, called neural equivalence networks, for the problem of learning continuous semantic representations of algebraic and logical expressions. These networks are trained to represent semantic equivalence, even of expressions that are syntactically very different. The challenge is that semantic representations must be computed in a syntax-directed manner, because semantics is compositional, but at the same time, small changes in syntax can lead to very large changes in semantics, which can be difficult for continuous neural architectures. We perform an exhaustive evaluation on the task of checking equivalence on a highly diverse class of symbolic algebraic and boolean expression types, showing that our model significantly outperforms existing architectures.

* Accepted to ICML 2017
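
The notion of semantic equivalence between syntactically different expressions can be illustrated for the boolean case with an exhaustive truth-table check. The sketch below is not the neural model from the abstract; it only shows how ground-truth equivalence could be decided for small expressions. Expressions are assumed to be written in Python syntax (`and`, `or`, `not`), and the function names are illustrative.

```python
import ast
from itertools import product

def variables(expr: str):
    """Sorted variable names appearing in a Python-syntax boolean expression."""
    tree = ast.parse(expr, mode="eval")
    return sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})

def truth_table(expr: str, names):
    """Evaluate expr under every True/False assignment to names.

    Uses eval, so it should only be run on trusted expressions.
    """
    code = compile(expr, "<expr>", "eval")
    return tuple(
        bool(eval(code, {"__builtins__": {}}, dict(zip(names, values))))
        for values in product([False, True], repeat=len(names))
    )

def equivalent(e1: str, e2: str) -> bool:
    """True if both expressions agree on every assignment of their variables."""
    names = sorted(set(variables(e1)) | set(variables(e2)))
    return truth_table(e1, names) == truth_table(e2, names)
```

For example, `equivalent("not (a and b)", "(not a) or (not b)")` holds by De Morgan's law even though the two strings share little syntax, while `"a and b"` and `"a or b"` differ by one token yet are not equivalent. The check enumerates all assignments, so its cost grows exponentially with the number of variables.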
