Using Multi-features and Ensemble Learning Method for Imbalanced Malware Classification

Bibliographic Details
Published in 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 965-973
Main Authors Yunan Zhang, Qingjia Huang, Xinjian Ma, Zeming Yang, Jianguo Jiang
Format Conference Proceeding
Language English
Published IEEE 01.08.2016
ISSN 2324-9013
DOI 10.1109/TrustCom.2016.0163

Summary: The ever-growing malware threat in cyberspace calls for techniques that are more effective than widely deployed signature-based detection systems. To counter large volumes of malware variants, machine learning techniques have been applied to automated malware classification. Although these efforts have achieved some success, their accuracy and efficiency remain inadequate, especially for multi-class classification and imbalanced training data. Against this backdrop, the goal of this paper is to build a malware classification system that improves on this situation. Our system is based on multiple categories of static features and an ensemble learning method. Compared with some traditional systems, it has the following advantages. First, with multiple categories of features, our system can classify malware into their corresponding families effectively and efficiently while, to a certain extent, avoiding the influence of evasion. Our method does not need any unpacking process and extracts features directly from the bytes file and the disassembled asm file. Second, the system employs two efficient ensemble learning models, namely XGBoost and ExtraTreeClassifer, and combines them with a stacking method to construct the final classifier. Finally, we evaluated our system on the dataset provided by Microsoft and hosted on Kaggle for a malware classification competition; the results show that our method classifies malware into their families effectively and efficiently, with an accuracy of 0.9972 on the training set and a logloss of 0.00395 on the testing set. Our work not only offers insights into how to use multiple features for classification, but also sheds light on how to develop scalable techniques for automated malware classification in practice.
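
The summary reports a logloss score without defining it. As a rough illustration only, the sketch below computes the standard multi-class logarithmic loss, i.e. the mean negative log-probability assigned to the true class. The function name multiclass_log_loss, the clipping constant eps, and the sample predictions are assumptions made for this example; they are not taken from the paper, and the paper's own evaluation code may differ.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to each sample's true class.

    y_true: list of integer class indices (hypothetical malware family ids).
    probs:  list of per-sample probability rows, one value per class.
    eps:    clipping bound (an assumption) that keeps log() finite.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip each probability away from 0 and 1, then renormalise the row.
        clipped = [min(max(p, eps), 1.0 - eps) for p in row]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)


# Hypothetical predictions for three samples over three families.
y_true = [0, 2, 1]
probs = [
    [0.98, 0.01, 0.01],
    [0.05, 0.05, 0.90],
    [0.10, 0.80, 0.10],
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

Lower values are better: the loss approaches zero only when the classifier gives nearly all probability mass to the correct family for almost every sample, so confident wrong predictions are penalised heavily.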