Using Multi-features and Ensemble Learning Method for Imbalanced Malware Classification
| Published in | 2016 IEEE Trustcom/BigDataSE/ISPA, pp. 965-973 |
|---|---|
| Main Authors | , , , , | 
| Format | Conference Proceeding | 
| Language | English | 
| Published | IEEE, 01.08.2016 |
| ISSN | 2324-9013 | 
| DOI | 10.1109/TrustCom.2016.0163 | 
| Summary: | The ever-growing malware threat in cyberspace calls for techniques that are more effective than the widely deployed signature-based detection systems. To counter large volumes of malware variants, machine learning techniques have been applied for automated malware classification. Although these efforts have achieved some success, their accuracy and efficiency remain inadequate to meet demand, especially when these machine learning techniques are used for multi-class classification with imbalanced training data. Against this backdrop, the goal of this paper is to build a malware classification system that improves on this situation. Our system is based on multiple categories of static features and an ensemble learning method. Compared to some traditional systems, it has the following advantages. Firstly, with multiple categories of features, our system can classify malware into the corresponding families effectively and efficiently while, to a certain extent, avoiding the influence of evasion. Our method does not need any unpacking process and extracts features directly from the bytes file and the disassembled asm file. Secondly, the system employs two efficient ensemble learning models, namely XGBoost and ExtraTreeClassifier, and combines them with a stacking method to construct the final classifier. Finally, we evaluated our system on the dataset provided by Microsoft and hosted on Kaggle for a malware classification competition. The final results show that our method can classify malware into families effectively and efficiently, with an accuracy of 0.9972 on the training set and a logloss of 0.00395 on the testing set. Our work not only offers insights into how to use multiple features for classification, but also sheds light on how to develop scalable techniques for automated malware classification in practice. |
|---|---|
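
The summary mentions extracting features directly from the bytes file and reporting a logloss score. As a rough illustration only, and not the authors' implementation, the sketch below builds one simple static feature (a byte-value histogram) and computes a multi-class logloss. The hex-dump line layout (`<address> <hex byte> ...`, with `??` for unreadable bytes), the function names, and the sample data are all assumptions made for this example.

```python
import math
from collections import Counter


def byte_histogram(bytes_text):
    """Count byte values in hex-dump text, skipping the address on each line.

    Assumed (hypothetical) line layout: "<address> <hex byte> <hex byte> ...",
    where "??" marks a byte that could not be read.
    """
    counts = Counter()
    for line in bytes_text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:  # tokens[0] is the address column
            counts[token.upper()] += 1
    # Fixed-length vector: 256 byte values plus one slot for "??".
    keys = [f"{value:02X}" for value in range(256)] + ["??"]
    return [counts.get(key, 0) for key in keys]


def multiclass_logloss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to each sample's true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)


# Made-up two-line sample and made-up class probabilities for two files.
sample = "00401000 56 8D 44 24\n00401004 ?? ?? 56 00"
features = byte_histogram(sample)
loss = multiclass_logloss([0, 2], [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
```

A vector like `features` could be concatenated with other feature categories (for example, opcode counts from the asm file) before being passed to an ensemble classifier. The logloss function penalizes confident wrong predictions heavily, which is why it is a common metric for multi-class problems with probabilistic outputs.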