Alert button
Picture for Zeyu Gao

Zeyu Gao

Alert button

How Far Have We Gone in Vulnerability Detection Using Large Language Models

Nov 21, 2023
Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, Chao Zhang

As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of Large Language Models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.

Viaarxiv icon

kTrans: Knowledge-Aware Transformer for Binary Code Embedding

Aug 24, 2023
Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao, Chao Zhang

Figure 1 for kTrans: Knowledge-Aware Transformer for Binary Code Embedding
Figure 2 for kTrans: Knowledge-Aware Transformer for Binary Code Embedding
Figure 3 for kTrans: Knowledge-Aware Transformer for Binary Code Embedding
Figure 4 for kTrans: Knowledge-Aware Transformer for Binary Code Embedding

Binary Code Embedding (BCE) has important applications in various reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery and data-flow analysis. Recent studies have shown that the Transformer model can comprehend the semantics of binary code to support downstream tasks. However, existing models overlooked the prior knowledge of assembly language. In this paper, we propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding. By feeding explicit knowledge as additional inputs to the Transformer, and fusing implicit knowledge with a novel pre-training task, kTrans provides a new perspective to incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR). Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at: https://github.com/Learner0x5a/kTrans-release

Viaarxiv icon

Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection

Jun 27, 2023
Jinye Qu, Zeyu Gao, Tielin Zhang, Yanfeng Lu, Huajin Tang, Hong Qiao

Figure 1 for Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection
Figure 2 for Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection
Figure 3 for Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection
Figure 4 for Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection

Spiking Neural Networks (SNNs) have garnered widespread interest for their energy efficiency and brain-inspired event-driven properties. While recent methods like Spiking-YOLO have expanded the SNNs to more challenging object detection tasks, they often suffer from high latency and low detection accuracy, making them difficult to deploy on latency sensitive mobile platforms. Furthermore, the conversion method from Artificial Neural Networks (ANNs) to SNNs is hard to maintain the complete structure of the ANNs, resulting in poor feature representation and high conversion errors. To address these challenges, we propose two methods: timesteps compression and spike-time-dependent integrated (STDI) coding. The former reduces the timesteps required in ANN-SNN conversion by compressing information, while the latter sets a time-varying threshold to expand the information holding capacity. We also present a SNN-based ultra-low latency and high accurate object detection model (SUHD) that achieves state-of-the-art performance on nontrivial datasets like PASCAL VOC and MS COCO, with about remarkable 750x fewer timesteps and 30% mean average precision (mAP) improvement, compared to the Spiking-YOLO on MS COCO datasets. To the best of our knowledge, SUHD is the deepest spike-based object detection model to date that achieves ultra low timesteps to complete the lossless conversion.

* 14 pages, 10 figures 
Viaarxiv icon

Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

Oct 08, 2022
Zeyu Gao, Yao Mu, Ruoyan Shen, Chen Chen, Yangang Ren, Jianyu Chen, Shengbo Eben Li, Ping Luo, Yanfeng Lu

Figure 1 for Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model
Figure 2 for Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model
Figure 3 for Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model
Figure 4 for Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high dimensional observations into compact latent space. However, the latent states embedded by the world model proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sampling efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy is hard to cope with the corner cases during the driving process. To solve the above challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask via the filtered features, and is trained with a multi-source data sampler, which aggregates common data and multiple corner case data in a single batch, to balance the data distribution. Extensive experiments on CARLA show that our method outperforms the state-of-the-art approaches in terms of sample efficiency and robustness to input permutations.

* 11 pages, 7 figures, 1 table, submitted to Deep RL Workshop 2022 
Viaarxiv icon

Meta Mask Correction for Nuclei Segmentation in Histopathological Image

Nov 24, 2021
Jiangbo Shi, Chang Jia, Zeyu Gao, Tieliang Gong, Chunbao Wang, Chen Li

Figure 1 for Meta Mask Correction for Nuclei Segmentation in Histopathological Image
Figure 2 for Meta Mask Correction for Nuclei Segmentation in Histopathological Image
Figure 3 for Meta Mask Correction for Nuclei Segmentation in Histopathological Image
Figure 4 for Meta Mask Correction for Nuclei Segmentation in Histopathological Image

Nuclei segmentation is a fundamental task in digital pathology analysis and can be automated by deep learning-based methods. However, the development of such an automated method requires a large amount of data with precisely annotated masks which is hard to obtain. Training with weakly labeled data is a popular solution for reducing the workload of annotation. In this paper, we propose a novel meta-learning-based nuclei segmentation method which follows the label correction paradigm to leverage data with noisy masks. Specifically, we design a fully conventional meta-model that can correct noisy masks using a small amount of clean meta-data. Then the corrected masks can be used to supervise the training of the segmentation model. Meanwhile, a bi-level optimization method is adopted to alternately update the parameters of the main segmentation model and the meta-model in an end-to-end way. Extensive experimental results on two nuclear segmentation datasets show that our method achieves the state-of-the-art result. It even achieves comparable performance with the model training on supervised data in some noisy settings.

* Accepted by BIBM 2021 
Viaarxiv icon

PIMIP: An Open Source Platform for Pathology Information Management and Integration

Nov 09, 2021
Jialun Wu, Anyu Mao, Xinrui Bao, Haichuan Zhang, Zeyu Gao, Chunbao Wang, Tieliang Gong, Chen Li

Figure 1 for PIMIP: An Open Source Platform for Pathology Information Management and Integration
Figure 2 for PIMIP: An Open Source Platform for Pathology Information Management and Integration
Figure 3 for PIMIP: An Open Source Platform for Pathology Information Management and Integration
Figure 4 for PIMIP: An Open Source Platform for Pathology Information Management and Integration

Digital pathology plays a crucial role in the development of artificial intelligence in the medical field. The digital pathology platform can make the pathological resources digital and networked, and realize the permanent storage of visual data and the synchronous browsing processing without the limitation of time and space. It has been widely used in various fields of pathology. However, there is still a lack of an open and universal digital pathology platform to assist doctors in the management and analysis of digital pathological sections, as well as the management and structured description of relevant patient information. Most platforms cannot integrate image viewing, annotation and analysis, and text information management. To solve the above problems, we propose a comprehensive and extensible platform PIMIP. Our PIMIP has developed the image annotation functions based on the visualization of digital pathological sections. Our annotation functions support multi-user collaborative annotation and multi-device annotation, and realize the automation of some annotation tasks. In the annotation task, we invited a professional pathologist for guidance. We introduce a machine learning module for image analysis. The data we collected included public data from local hospitals and clinical examples. Our platform is more clinical and suitable for clinical use. In addition to image data, we also structured the management and display of text information. So our platform is comprehensive. The platform framework is built in a modular way to support users to add machine learning modules independently, which makes our platform extensible.

* BIBM 2021 accepted, including 8 pages, 8 figures 
Viaarxiv icon

BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network

Oct 26, 2021
Jialun Wu, Yang Liu, Zeyu Gao, Tieliang Gong, Chunbao Wang, Chen Li

Figure 1 for BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network
Figure 2 for BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network
Figure 3 for BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network
Figure 4 for BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network

Constructing large-scaled medical knowledge graphs can significantly boost healthcare applications for medical surveillance, bring much attention from recent research. An essential step in constructing large-scale MKG is extracting information from medical reports. Recently, information extraction techniques have been proposed and show promising performance in biomedical information extraction. However, these methods only consider limited types of entity and relation due to the noisy biomedical text data with complex entity correlations. Thus, they fail to provide enough information for constructing MKGs and restrict the downstream applications. To address this issue, we propose Biomedical Information Extraction, a hybrid neural network to extract relations from biomedical text and unstructured medical reports. Our model utilizes a multi-head attention enhanced graph convolutional network to capture the complex relations and context information while resisting the noise from the data. We evaluate our model on two major biomedical relationship extraction tasks, chemical-disease relation and chemical-protein interaction, and a cross-hospital pan-cancer pathology report corpus. The results show that our method achieves superior performance than baselines. Furthermore, we evaluate the applicability of our method under a transfer learning setting and show that BioIE achieves promising performance in processing medical text from different formats and writing styles.

* BIBM 2021 accepted, including 9 pages, 1 figure 
Viaarxiv icon

A Personalized Diagnostic Generation Framework Based on Multi-source Heterogeneous Data

Oct 26, 2021
Jialun Wu, Zeyu Gao, Haichuan Zhang, Ruonan Zhang, Tieliang Gong, Chunbao Wang, Chen Li

Figure 1 for A Personalized Diagnostic Generation Framework Based on Multi-source Heterogeneous Data
Figure 2 for A Personalized Diagnostic Generation Framework Based on Multi-source Heterogeneous Data
Figure 3 for A Personalized Diagnostic Generation Framework Based on Multi-source Heterogeneous Data
Figure 4 for A Personalized Diagnostic Generation Framework Based on Multi-source Heterogeneous Data

Personalized diagnoses have not been possible due to sear amount of data pathologists have to bear during the day-to-day routine. This lead to the current generalized standards that are being continuously updated as new findings are reported. It is noticeable that these effective standards are developed based on a multi-source heterogeneous data, including whole-slide images and pathology and clinical reports. In this study, we propose a framework that combines pathological images and medical reports to generate a personalized diagnosis result for individual patient. We use nuclei-level image feature similarity and content-based deep learning method to search for a personalized group of population with similar pathological characteristics, extract structured prognostic information from descriptive pathology reports of the similar patient population, and assign importance of different prognostic factors to generate a personalized pathological diagnosis result. We use multi-source heterogeneous data from TCGA (The Cancer Genome Atlas) database. The result demonstrate that our framework matches the performance of pathologists in the diagnosis of renal cell carcinoma. This framework is designed to be generic, thus could be applied for other types of cancer. The weights could provide insights to the known prognostic factors and further guide more precise clinical treatment protocols.

* BIBM 2021 accepted, including 9 pages, 3 figures 
Viaarxiv icon

W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image

Oct 26, 2021
Anyu Mao, Jialun Wu, Xinrui Bao, Zeyu Gao, Tieliang Gong, Chen Li

Figure 1 for W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image
Figure 2 for W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image
Figure 3 for W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image
Figure 4 for W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image

Pathological diagnosis is the gold standard for cancer diagnosis, but it is labor-intensive, in which tasks such as cell detection, classification, and counting are particularly prominent. A common solution for automating these tasks is using nucleus segmentation technology. However, it is hard to train a robust nucleus segmentation model, due to several challenging problems, the nucleus adhesion, stacking, and excessive fusion with the background. Recently, some researchers proposed a series of automatic nucleus segmentation methods based on point annotation, which can significant improve the model performance. Nevertheless, the point annotation needs to be marked by experienced pathologists. In order to take advantage of segmentation methods based on point annotation, further alleviate the manual workload, and make cancer diagnosis more efficient and accurate, it is necessary to develop an automatic nucleus detection algorithm, which can automatically and efficiently locate the position of the nucleus in the pathological image and extract valuable information for pathologists. In this paper, we propose a W-shaped network for automatic nucleus detection. Different from the traditional U-Net based method, mapping the original pathology image to the target mask directly, our proposed method split the detection task into two sub-tasks. The first sub-task maps the original pathology image to the binary mask, then the binary mask is mapped to the density mask in the second sub-task. After the task is split, the task's difficulty is significantly reduced, and the network's overall performance is improved.

* BIBM 2021 accepted,including 8 pages, 3 figures 
Viaarxiv icon