Approach to Predict Software Vulnerability Based on Multiple-Level N-gram Feature Extraction and Heterogeneous Ensemble Learning

Bibliographic Details
Published in	International Journal of Software Engineering and Knowledge Engineering, Vol. 32, no. 10, pp. 1559-1582
Main Authors	Zhang, Bing; Gao, Yuan; Wu, Jingyi; Wang, Ning; Wang, Qian; Ren, Jiadong
Format	Journal Article
Language	English
Published	Singapore: World Scientific Publishing Company (World Scientific Publishing Co. Pte., Ltd), 01.10.2022
Subjects	Algorithms; Classifiers; Cybersecurity; Ensemble learning; Feature extraction; Machine learning; Natural language processing; Prediction models; Resource management; Semantics; Software; Software reliability; Source code; Support vector machines; Vector spaces; Multiple-level N-gram feature extraction; vulnerability prediction; heterogeneous ensemble learning
ISSN	0218-1940 1793-6403
DOI	10.1142/S0218194022500620

Summary:	Software vulnerabilities are a root cause of computer security problems. Traditional static and dynamic analysis methods that work on source code suffer from high false positive rates, high false negative rates, and insufficient capture of semantic information. Applying machine learning, natural language processing, and related techniques to software vulnerability prediction can mitigate these issues. This paper proposes a vulnerability prediction method based on multiple-level N-gram feature extraction and heterogeneous ensemble learning. First, the code is converted to an intermediate representation and a multiple-level N-gram feature generation model is constructed; it extracts two kinds of N-gram semantic features, at word level and at char level, with different window sizes and granularities, to retain the semantic and structural information of the code. Second, TF-IDF is used to construct the vector space model that serves as input to the prediction model. Because a single classifier is prone to overfitting and poor generalization, the paper benchmarks five classical machine learning algorithms (NB, SVM, DT, LR, RF) and combines the four better-performing ones (SVM, DT, LR, RF) as base classifiers in a stacking heterogeneous ensemble to build the vulnerability prediction model. Finally, the method is verified on buffer overflow and resource management vulnerability datasets, where the lowest false positive rate and false negative rate reach 1.58% and 4.06%, respectively.
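
The feature-extraction steps named in the summary (word-level and char-level N-grams, then TF-IDF weighting) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the whitespace tokenizer (standing in for the paper's code intermediate representation), the window sizes, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word-level n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous char-level n-grams of the raw text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def multilevel_features(code, word_sizes=(1, 2), char_sizes=(3,)):
    """Combine word-level and char-level n-grams into one feature list.

    Assumption: whitespace splitting stands in for the paper's
    intermediate representation; the window sizes are illustrative.
    """
    tokens = code.split()
    feats = []
    for n in word_sizes:
        feats += ["w:" + g for g in word_ngrams(tokens, n)]
    for n in char_sizes:
        feats += ["c:" + g for g in char_ngrams(code, n)]
    return feats


def tfidf_vectors(docs):
    """Build a TF-IDF vector space model over the multi-level features.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used;
    the paper's exact weighting is not stated in this record.
    """
    counts = [Counter(multilevel_features(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([(c[t] / total) * idf[t] if total else 0.0 for t in vocab])
    return vocab, vectors


# Usage with two hypothetical token sequences (not from the paper's datasets).
samples = ["call strcpy buf src", "call strncpy buf src n"]
vocab, vectors = tfidf_vectors(samples)
```

Vectors of this kind would then serve as input to the classifiers. For the stacking step, scikit-learn's `StackingClassifier` with SVM, DT, LR, and RF base estimators would be one plausible way to assemble the ensemble, though the record does not say which tooling the authors used.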