Saurabh Pujar

Automated Code generation for Information Technology Tasks in YAML through Large Language Models

May 05, 2023
Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade, Matthew Jones, Alessandro Morari, Ruchir Puri

The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general-purpose programming languages. Domain-specific languages, such as those used for IT automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on the generation of Ansible-YAML, a widely used markup language for IT automation. We present Ansible Wisdom, a natural-language-to-Ansible-YAML code generation tool aimed at improving IT automation productivity. Ansible Wisdom is a transformer-based model, extended by training on a new dataset containing Ansible-YAML. We also develop two novel performance metrics for YAML and Ansible that capture the specific characteristics of this domain. Results show that Ansible Wisdom can accurately generate Ansible scripts from natural-language prompts, with performance comparable to or better than existing state-of-the-art code generation models.
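
The abstract does not spell out the domain-specific metrics, but the motivation behind them is easy to illustrate: two YAML snippets can be semantically identical while differing in key order or quoting, so plain string matching under-counts correct generations. Below is a minimal, hypothetical sketch of a YAML-aware equivalence check (using PyYAML); it is not the paper's metric, only an illustration of the problem the metrics address.

```python
# Illustrative sketch only: a YAML-aware exact-match check for generated
# Ansible tasks. This is NOT the paper's metric; it shows why plain string
# comparison is too strict for YAML, where key order and quoting can differ
# while the semantics stay the same. Requires PyYAML.
import yaml

def yaml_equivalent(predicted: str, reference: str) -> bool:
    """Return True if two YAML snippets parse to the same data structure."""
    try:
        pred = yaml.safe_load(predicted)
        ref = yaml.safe_load(reference)
    except yaml.YAMLError:
        return False  # unparsable output counts as a miss
    return pred == ref

# Example: the same Ansible task written with different key order and quoting.
reference = """
- name: install nginx
  ansible.builtin.package:
    name: nginx
    state: present
"""
predicted = """
- ansible.builtin.package:
    state: "present"
    name: "nginx"
  name: install nginx
"""
print(yaml_equivalent(predicted, reference))  # True: semantically the same task
```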

Contrastive Learning for Source Code with Structural and Functional Properties

Oct 08, 2021
Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty

Pre-trained transformer models have recently shown promise for understanding source code. Most existing work attempts to understand code from its textual features and limited structural knowledge. However, program functionality sometimes cannot be fully revealed by the code sequence, even with structure information. Programs can contain very different tokens and structures while sharing the same functionality, and conversely, changing only one or a few code tokens can introduce unexpected or malicious program behaviors while preserving the syntax and most tokens. In this work, we present BOOST, a novel self-supervised model whose pre-training focuses on the characteristics of source code. We first employ automated, structure-guided code transformation algorithms that generate (i) functionally equivalent code that looks drastically different from the original, and (ii) textually and syntactically very similar code that is functionally distinct from the original. We train our model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes functionally distinct code further apart. To encode structure information, we introduce a new node-type masked language model objective that helps the model learn about structural context. We pre-train BOOST with a much smaller dataset than the state-of-the-art models, yet our small models still match or outperform these large models on code understanding and generation tasks.
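
The contrastive objective described above can be pictured with a standard InfoNCE-style loss over code embeddings: the functionally equivalent transform is the positive, the look-alike but functionally different transform supplies negatives. The sketch below is a generic illustration of that idea, not the authors' implementation; the encoder, temperature, and batching scheme are assumptions.

```python
# Minimal sketch of a contrastive objective in the spirit described above:
# embeddings of functionally equivalent code are pulled together, embeddings
# of textually similar but functionally different code are pushed apart.
# Illustrative only; hyperparameters and batching are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negatives: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (batch, dim); negatives: (batch, n_neg, dim)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Similarity of each anchor to its functionally equivalent positive.
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)       # (batch, 1)
    # Similarity to functionally distinct negatives.
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives)       # (batch, n_neg)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)        # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings from a hypothetical code encoder.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 4, 256))
```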

Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

May 25, 2021
Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler

Advancements in deep learning and machine learning algorithms have enabled breakthrough progress in computer vision, speech recognition, natural language processing, and beyond. In addition, over the last several decades, software has been built into the fabric of every aspect of our society. Together, these two trends have generated new interest in the fast-emerging research area of AI for Code. As software development becomes ubiquitous across all industries and the code infrastructure of enterprise legacy applications ages, it is more critical than ever to increase software development productivity and modernize legacy applications. Over the last decade, datasets like ImageNet, with their large scale and diversity, have played a pivotal role in algorithmic advancements from computer vision to language and speech understanding. In this paper, we present Project CodeNet, a first-of-its-kind, very large-scale, diverse, and high-quality dataset to accelerate algorithmic advancements in AI for Code. It consists of 14M code samples and about 500M lines of code in 55 different programming languages. Project CodeNet is unique not only in its scale but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance (both runtime and memory) improvement techniques. CodeNet also provides sample input and output test sets for over 7M code samples, which can be critical for determining code equivalence in different languages. As a usability feature, we provide several preprocessing tools in Project CodeNet to transform source code into representations that can be readily used as inputs to machine learning models.
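
As a quick illustration of how one might explore a dataset organized by problem and language, here is a small, hypothetical sketch that tallies samples per language in a local copy. The directory layout assumed here (data/<problem>/<language>/files) is an assumption for illustration, not a documented guarantee; consult the Project CodeNet release for its actual structure and the bundled preprocessing tools.

```python
# Illustrative sketch only: count code samples per language in a local copy
# of the dataset, assuming a data/<problem>/<language>/<files> layout.
from collections import Counter
from pathlib import Path

def count_samples_by_language(root: str) -> Counter:
    counts: Counter = Counter()
    for problem_dir in Path(root, "data").iterdir():
        if not problem_dir.is_dir():
            continue
        for language_dir in problem_dir.iterdir():
            if language_dir.is_dir():
                counts[language_dir.name] += sum(1 for f in language_dir.iterdir() if f.is_file())
    return counts

# counts = count_samples_by_language("Project_CodeNet")
# print(counts.most_common(10))
```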

* 11 pages including references, 10 pages of appendix 

D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Feb 16, 2021
Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, Zhong Su

Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
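
The differential-labeling idea is essentially a set difference over analyzer reports from the before-fix and after-fix versions. The sketch below illustrates that logic only; how issues are fingerprinted across versions (here: file, bug type, and function name) is an assumption for illustration, and the actual D2A pipeline is more involved.

```python
# Minimal sketch of differential labeling: run the same static analyzer before
# and after a bug-fixing commit, and treat issues that disappear as likely
# true positives. The Issue fingerprint used here is a simplifying assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    file: str
    bug_type: str
    function: str

def label_issues(before: set[Issue], after: set[Issue]) -> dict[Issue, int]:
    """Label 1 (likely real bug, fixed by the commit) if the issue vanishes, else 0."""
    return {issue: int(issue not in after) for issue in before}

before = {Issue("parser.c", "BUFFER_OVERRUN", "read_header"),
          Issue("util.c", "NULL_DEREFERENCE", "copy_name")}
after = {Issue("util.c", "NULL_DEREFERENCE", "copy_name")}  # still reported after the fix
print(label_issues(before, after))
```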

* Accepted to the 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '21) 

Exploring Software Naturalness through Neural Language Models

Jun 24, 2020
Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, Giacomo Domeniconi

The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis by using a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of the AST. Our results show that transformer-based language models achieve high accuracy on the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our approach obtains vulnerability identification results comparable to graph-based approaches that rely heavily on compilers for feature extraction.
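
To make the sequence-labeling probe concrete, the toy sketch below pairs each code token with the type of its innermost enclosing AST node. It uses Python's own ast and tokenize modules purely for illustration; the paper's probe, label set, and target languages are its own, and the "innermost node" heuristic here is an assumption.

```python
# Illustrative sketch only: derive per-token AST node-type labels of the kind
# a sequence-labeling probe needs, using Python code as a toy example.
import ast
import io
import tokenize

def ast_node_type_labels(source: str) -> list[tuple[str, str]]:
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if hasattr(n, "lineno")]
    labels = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER,
                        tokenize.INDENT, tokenize.DEDENT):
            continue
        line, col = tok.start
        # Nodes whose source span covers this token; the last one found by the
        # breadth-first walk is (roughly) the innermost.
        covering = [n for n in nodes
                    if n.lineno <= line <= n.end_lineno
                    and (n.lineno < line or n.col_offset <= col)
                    and (n.end_lineno > line or col < n.end_col_offset)]
        label = type(covering[-1]).__name__ if covering else "Module"
        labels.append((tok.string, label))
    return labels

for token, label in ast_node_type_labels("x = a + 1"):
    print(f"{token!r:6} -> {label}")   # e.g. 'x' -> Name, '=' -> Assign, '+' -> BinOp
```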

The TechQA Dataset

Nov 08, 2019
Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Mike McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avirup Sil, Rosario Uceda-Sosa, Todd Ward, Rong Zhang

We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size (600 training, 310 dev, and 490 evaluation question/answer pairs), thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TechQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote, a technical document that addresses a specific technical issue. We also release a collection of the 801,998 Technotes that were publicly available as of April 4, 2019 as a companion resource that might be used for pretraining, to learn representations of the IT domain language.
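
For readers who want a mental model of the data, the sketch below shows one plausible record type for TechQA-style examples as the abstract describes them: a forum question whose accepted answer lies inside a published Technote. Every field name here is invented for illustration; consult the released dataset for its actual schema.

```python
# Hypothetical sketch only: a minimal record type for TechQA-style examples.
# Field names are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TechQAExample:
    question_id: str
    question_title: str
    question_text: str
    technote_id: str              # Technote document containing the accepted answer
    answer_start: Optional[int]   # character offsets of the answer span in the
    answer_end: Optional[int]     # Technote, or None if no span is given

def has_answer(example: TechQAExample) -> bool:
    """True when an answer span inside the Technote is provided."""
    return example.answer_start is not None and example.answer_end is not None
```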

* Long version of conference paper to be submitted 