Effects of feature selection and normalization on network intrusion detection

Bibliographic Details
Published in: Data Science and Management, Vol. 8, no. 1, pp. 23-39
Main Authors: Umar, Mubarak Albarka; Chen, Zhanfang; Shuaib, Khaled; Liu, Yan
Format: Journal Article
Language: English
Published: Elsevier B.V.; KeAi Communications Co. Ltd, 01.03.2025
ISSN: 2666-7649
DOI: 10.1016/j.dsm.2024.08.001

Summary: The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of artificial intelligence (AI) techniques, such as machine learning (ML) and deep learning (DL), to build more efficient and reliable intrusion detection systems (IDSs). However, the advent of larger IDS datasets has negatively impacted the performance and computational complexity of AI-based IDSs. Many researchers have used data preprocessing techniques such as feature selection and normalization to overcome these issues. While most of these researchers reported the success of such preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques used but also on the dataset and the ML/DL algorithm, factors that most existing studies place little emphasis on. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on IDS models built using three IDS datasets (NSL-KDD, UNSW-NB15, and CSE–CIC–IDS2018) and various AI algorithms. A wrapper-based approach, which tends to give superior performance, and the min-max method were used for feature selection and normalization, respectively. Numerous IDS models were implemented using the full and feature-selected copies of the datasets, with and without normalization. The models were evaluated using popular IDS evaluation metrics, and intra- and inter-model comparisons were performed, including comparisons with state-of-the-art works. Random forest (RF) models performed best on the NSL-KDD and UNSW-NB15 datasets, with accuracies of 99.86% and 96.01%, respectively, whereas an artificial neural network (ANN) achieved the best accuracy, 95.43%, on the CSE–CIC–IDS2018 dataset. The RF models also achieved excellent performance compared with recent works.
The results show that normalization and feature selection positively affect IDS modeling. Furthermore, while feature selection benefits simpler algorithms (such as RF), normalization is more useful for complex algorithms like ANNs and deep neural networks (DNNs), and algorithms such as Naive Bayes are unsuitable for IDS modeling. The study also found that the UNSW-NB15 and CSE–CIC–IDS2018 datasets are more complex and more suitable for building and evaluating modern-day IDSs than the NSL-KDD dataset. Our findings suggest that prioritizing robust algorithms like RF, alongside complex models such as ANN and DNN, can significantly enhance IDS performance. These insights provide valuable guidance for managers seeking to develop more effective security measures by focusing on high detection rates and low false alert rates.
• Normalization and feature selection effectively improve IDS performance and computation time across AI algorithms.
• The CSE–CIC–IDS2018 and UNSW-NB15 datasets offer a more challenging and realistic benchmark for modern IDS research.
• Feature selection aids simpler algorithms like RF, while normalization benefits complex models like ANNs and DNNs.
• Managers can enhance IDS by prioritizing robust algorithms and balancing detection rates with low false alerts.