Approach to Predict Software Vulnerability Based on Multiple-Level N-gram Feature Extraction and Heterogeneous Ensemble Learning

Bibliographic Details
Published in International Journal of Software Engineering and Knowledge Engineering, Vol. 32, no. 10, pp. 1559-1582
Main Authors Zhang, Bing; Gao, Yuan; Wu, Jingyi; Wang, Ning; Wang, Qian; Ren, Jiadong
Format Journal Article
Language English
Published Singapore World Scientific Publishing Company 01.10.2022
World Scientific Publishing Co. Pte., Ltd
ISSN 0218-1940
1793-6403
DOI 10.1142/S0218194022500620

Summary: Software vulnerabilities are one of the root causes of computer security problems. Traditional static and dynamic analysis methods based on software source code suffer from high false positive rates, high false negative rates, and insufficient capture of semantic information. Applying machine learning, natural language processing, and related technologies to software vulnerability prediction can effectively mitigate these issues. This paper proposes a vulnerability prediction method based on multiple-level N-gram feature extraction and heterogeneous ensemble learning. First, the code is converted to an intermediate representation and a multiple-level N-gram feature generation model is constructed; two kinds of N-gram semantic features, with different window sizes and different granularity at the word and character levels, are extracted to retain the semantic and structural information of the code. Second, TF–IDF is used to construct the vector space model that serves as the input to the prediction model. Because a single classifier is prone to overfitting and poor generalization, the paper benchmarks five classical machine learning algorithms (NB, SVM, DT, LR, RF) and then combines the four with better performance (SVM, DT, LR, RF) as base classifiers in a stacking heterogeneous ensemble to build the vulnerability prediction model. Finally, the proposed method is verified on buffer overflow and resource management vulnerability datasets, where the lowest false positive rate and false negative rate reach 1.58% and 4.06%, respectively.
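The feature-extraction step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the default window sizes, the `w:`/`c:` prefixes, the sample token strings, and the plain unsmoothed TF–IDF weighting (term frequency times log(N/df)) are all assumptions made here for illustration, and the stacking classifier stage is not shown.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return word-level n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character-level n-grams of the raw text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(code, word_sizes=(1, 2), char_sizes=(3,)):
    """Count word- and char-level n-grams for several window sizes.

    The window sizes are hypothetical defaults, not values from the paper.
    Prefixes keep the two granularities in separate feature namespaces.
    """
    tokens = code.split()
    feats = []
    for n in word_sizes:
        feats += ["w:" + g for g in word_ngrams(tokens, n)]
    for n in char_sizes:
        feats += ["c:" + g for g in char_ngrams(code, n)]
    return Counter(feats)


def tfidf(docs):
    """Build a vocabulary and one TF-IDF vector per document.

    Uses the plain variant tf * log(N / df), so a term that occurs in
    every document receives weight 0.
    """
    counts = [extract_features(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document contributes at most 1 per term
    vocab = sorted(df)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([
            (c[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Made-up stand-ins for intermediate-representation token sequences.
samples = ["call strcpy buf src", "call strncpy buf src len"]
vocab, vectors = tfidf(samples)
```

The resulting vectors could then be passed to any set of base classifiers; in this sketch, n-grams shared by every sample (such as `w:call`) get zero weight, while n-grams unique to one sample (such as `w:strcpy`) carry the discriminating signal.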