GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outlier Detection

Bibliographic Details
Published in Transactions of Nanjing University of Aeronautics & Astronautics, Vol. 36, no. 6, pp. 1026-1038
Main Authors Li, Kexin; Li, Jing; Liu, Shuji; Li, Zhao; Bo, Jue; Liu, Biqi
Format Journal Article
Language Chinese, English
Published Nanjing: Nanjing University of Aeronautics and Astronautics, 01.12.2019
Author Affiliations College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, P.R. China; State Grid Liaoning Electric Power Supply Co., Ltd., Shenyang 110004, P.R. China
ISSN 1005-1120
DOI 10.16356/j.1005-1120.2019.06.015

Abstract In the data age, data quality has become a widespread concern, and outlier detection, a branch of data mining, bears directly on it. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. As iForest keeps generating isolation trees, however, the differences between trees gradually shrink or vanish altogether, which wastes memory and reduces the efficiency of outlier detection. Moreover, some of the constructed isolation trees cannot detect outliers at all. This paper proposes GA-iForest, an improved iForest-based method. It optimizes the forest by selecting the better isolation trees according to their detection accuracy and their difference from one another, thereby discarding duplicate, similar, and poorly performing trees and improving the accuracy and stability of outlier detection. The experimental environment is built on Ubuntu and the Spark platform, and the outlier datasets provided by ODDS are used for testing. The proposed method is evaluated by accuracy, recall, ROC curves, AUC, and execution time. Experimental results show that, compared with the original iForest method, the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40%.
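
The abstract does not give the genetic encoding or the fitness function, so the following Python sketch is only one plausible reading of the tree-selection step, not the authors' implementation. It assumes a bit-mask chromosome (one bit per isolation tree), a hypothetical fitness equal to precision-at-k on labelled data plus a weighted diversity term, and an elitist GA with tournament selection. The names `fitness`, `tree_distances`, `select_trees`, and the weight 0.2 are illustrative choices; the tree construction and the normalising term `c(n)` follow the standard iForest definition. Labels are assumed to contain at least one outlier.

```python
import math
import random

EULER = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree with random axis-aligned cuts."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo, hi = min(p[dim] for p in points), max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x, depth=0):
    """Depth at which x lands, plus the correction for an unsplit leaf."""
    if "dim" not in tree:
        return depth + c(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["cut"] else tree["right"]
    return path_length(child, x, depth + 1)

def tree_distances(depths):
    """Diversity proxy (assumption): mean absolute gap between two trees'
    path lengths over all points, scaled to [0, 1]."""
    n_trees, n = len(depths), len(depths[0])
    raw = [[sum(abs(x - y) for x, y in zip(depths[a], depths[b])) / n
            for b in range(n_trees)] for a in range(n_trees)]
    scale = max(max(row) for row in raw) or 1.0
    return [[v / scale for v in row] for row in raw]

def fitness(mask, depths, labels, pair_dist, weight=0.2):
    """Hypothetical fitness: precision@k of the sub-forest plus weighted diversity."""
    chosen = [t for t, bit in enumerate(mask) if bit]
    if not chosen:
        return -1.0  # an empty forest is never acceptable
    n, k = len(labels), sum(labels)
    mean_depth = [sum(depths[t][i] for t in chosen) / len(chosen) for i in range(n)]
    top = sorted(range(n), key=lambda i: mean_depth[i])[:k]  # shortest paths first
    precision = sum(labels[i] for i in top) / k
    pairs = [(a, b) for a in chosen for b in chosen if a < b]
    diversity = sum(pair_dist[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0
    return precision + weight * diversity

def select_trees(depths, labels, pair_dist, generations=15, pop_size=12, seed=0):
    """Elitist GA over bit masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    n_trees = len(depths)
    score = lambda m: fitness(m, depths, labels, pair_dist)
    # Seed the population with the full forest so the result never scores below it.
    pop = [[1] * n_trees] + [[rng.randint(0, 1) for _ in range(n_trees)]
                             for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = [pop[0][:]]                              # elitism: keep the best mask
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=score)    # tournament selection
            p2 = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n_trees)            # single-point crossover
            child = p1[:cut] + p2[cut:]
            nxt.append([bit ^ (rng.random() < 0.05) for bit in child])  # mutation
        pop = nxt
    best = max(pop, key=score)
    return best, score(best)

# Toy data: a Gaussian cluster plus a few far-away points labelled as outliers.
rng = random.Random(42)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(120)]
outliers = [(rng.uniform(6, 8), rng.uniform(6, 8)) for _ in range(6)]
data, labels = inliers + outliers, [0] * 120 + [1] * 6
forest = [build_tree(rng.sample(data, 64), 0, 6, rng) for _ in range(20)]
depths = [[path_length(t, x) for x in data] for t in forest]
pair_dist = tree_distances(depths)
best_mask, best_fit = select_trees(depths, labels, pair_dist)
print(sum(best_mask), "of", len(forest), "trees kept; fitness", round(best_fit, 3))
```

Because the full forest is placed in the initial population and the best mask is carried over every generation, the selected sub-forest can never score worse than the unpruned forest under this particular fitness. How many trees are dropped depends on the data and on the diversity weight, so the sketch makes no claim about reproducing the 20%-40% reduction reported in the abstract.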
Classification Code TP301.6
Copyright Copyright Nanjing University of Aeronautics and Astronautics 2019
Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
Funding State Grid Liaoning Electric Power Supply Co., Ltd., under the project "Key Technology and Application Research of the Self-Service Grid Big Data Governance" (No. SGLNXT00YJJS1800110). The authors thank the reviewers for their support and valuable comments.
Keywords genetic algorithm; isolation tree; isolated forest; outlier detection; feature selection
Subject Terms Accuracy; Algorithms; Data analysis; Data mining; Genetic algorithms; Outliers (statistics); Stability; Trees
URI https://www.proquest.com/docview/2383097104
https://d.wanfangdata.com.cn/periodical/njhkhtdxxb-e201906016