Local Outlier Detection Method Based on Improved K-means (基于改进K-means的局部离群点检测方法)

Bibliographic Details
Published in 工程科学与技术 (Advanced Engineering Sciences), Vol. 56; no. 4; pp. 66–77
Main Authors 周玉 (ZHOU Yu), 夏浩 (XIA Hao), 岳学震 (YUE Xuezhen), 王培崇 (WANG Peichong)
Format Journal Article
Language Chinese
Published School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China, 01.07.2024
Subjects
Online Access Get full text
ISSN 2096-3246
DOI 10.12454/j.jsuese.202201398


Abstract TP301.6; Outlier detection is the task of identifying anomalous data that differ significantly from normal data in their attribute features. Most clustering-based outlier detection methods detect outliers in a dataset from a global perspective and perform poorly on local outliers. To address this, this paper improves the K-means clustering algorithm by introducing the method of clustering by fast search and find of density peaks, and proposes a local outlier detection method named KLOD (local outlier detection based on improved K-means and least-squares methods) to detect local outliers precisely. First, the density-peaks method is used to compute the local density and relative distance of each data point, and their product gives the γ value. Second, the γ values are sorted in descending order, and the elbow rule is used to select the k data points with the largest γ values as the initial cluster centers of the K-means algorithm. The dataset is then clustered into k clusters with K-means, and the objective function value of each data point in every dimension is computed and sorted in ascending order. Next, the degree of dispersion of each dimension is determined, an appropriate fitting function and fitting points are chosen, and the ascending objective function values of each dimension of each cluster are fitted by least squares and differentiated to obtain the rate of change. Finally, combining information entropy, the objective function value of each data point in each dimension is weighted by the corresponding rate of change to obtain the final anomaly score, and the top-n data points with the highest anomaly scores are regarded as outliers. Simulation experiments on artificial datasets and UCI datasets compare the accuracy of KLOD, LOF, and KNN; the results show that KLOD achieves higher accuracy than KNN and LOF. The proposed KLOD method effectively improves the clustering quality of the K-means algorithm and offers good accuracy and performance for local outlier detection.
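The seeding step summarized above can be illustrated with a minimal NumPy sketch: it computes each point's local density ρ, relative distance δ, and γ = ρ·δ under the usual density-peaks definitions, then takes the k largest-γ points as K-means seeds. The Euclidean metric, the cutoff-count density estimate, and the function names are our own assumptions for illustration, not code from the paper.

```python
import numpy as np

def gamma_scores(X, dc):
    """Per-point density-peaks quantities: local density rho, relative
    distance delta, and gamma = rho * delta (as described in the abstract)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    rho = (d < dc).sum(axis=1) - 1                              # neighbours within cutoff dc, excluding self
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]                      # points with higher density
        delta[i] = d[i, higher].min() if higher.size else d[i].max()
    return rho * delta

def initial_centers(X, gamma, k):
    """Seed K-means with the k points having the largest gamma values."""
    return X[np.argsort(gamma)[::-1][:k]]
```

The seeds could then be handed to, e.g., sklearn.cluster.KMeans(n_clusters=k, init=initial_centers(X, gamma_scores(X, dc), k), n_init=1); the cutoff distance dc and the cluster count k (read off an elbow curve of KMeans inertia_ values) are left to the user, as in the paper.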
Abstract_FL Objective Outliers are data points generated by various special causes. They deviate from normal data points, are often regarded as noise, occupy a small proportion of the dataset, and are considered points of research value. The task of outlier detection is to identify these points and, through analysis of data attribute features, uncover the abnormal information they may carry; this process reveals unusual patterns or behaviors within the dataset that can provide insight into unique phenomena or anomalies. Most clustering-based outlier detection methods detect outliers from a global perspective and perform weakly on local outliers. Hence, an improved K-means clustering algorithm is proposed by introducing the method of clustering by fast search and find of density peaks, and a local outlier detection method named KLOD (local outlier detection based on improved K-means and least squares methods) is developed to achieve precise detection of local outliers.

Methods The K-means clustering algorithm performs hard clustering: after clustering, each data point is unambiguously assigned to one cluster. This property makes it suitable for outlier detection, as outliers significantly affect the clustering process. However, selecting the initial cluster centers and determining the number of clusters is crucial, since both directly impact clustering effectiveness. To select accurate cluster centers, clustering by fast search and find of density peaks is used to compute the local density and relative distance of each data point, and a decision graph is constructed from these two metrics. The challenge lies in accurately determining the cutoff distance dc, which makes it difficult to identify the number of cluster centers precisely from a decision graph obtained with a single dc value. To address the challenge of determining the number of clusters and obtain the best clustering effectiveness, the elbow method is employed to determine the optimal number of clusters for an unknown dataset. As the data are clustered into different numbers of clusters, the cost function value changes accordingly: the number of clusters is plotted on the x-axis and the cost function value on the y-axis, and the changes are recorded as a line graph. When increasing the number of cluster centers no longer yields a significant decrease in the cost function value, the position of the "elbow" indicates the optimal number of clusters. After the initial cluster centers and the number of clusters k are determined, the dataset is clustered with the K-means algorithm to obtain k clusters and their corresponding centers. The objective function value of each data point in each dimension within each cluster is then computed, and the values for each dimension in each cluster are sorted in ascending order. The ascending objective function values are fitted with the least squares approach to obtain a curve, and the derivative of the fitted curve gives the slope, which describes the rate of change of the objective function values within each cluster. Because each dimension's degree of dispersion and information content can differ, different weights are assigned to each dimension: information entropy measures the degree of dispersion, and higher weight is given to dimensions with higher outlier degrees to represent their impact on the overall dataset. By incorporating information entropy, each data point's objective function value in each dimension is weighted by the corresponding rate of change. This yields the final anomaly score, and the top-n data points with the highest anomaly scores are considered outliers.

Results and Discussions The experimental results indicate that on the artificial dataset, KLOD, KNN, and LOF all detect sparse local outliers effectively. However, the LOF algorithm struggles to detect outliers within outlier clusters, and the KNN method cannot detect local outliers within densely distributed clusters when normal data points lie far apart. In contrast, the KLOD method analyzes each cluster individually, which addresses the issue of uneven cluster densities, and analyzes each dimension of the data points within each cluster separately, achieving accurate detection. On the UCI datasets, the KLOD method achieves the best detection accuracy on 10 datasets and matches KNN and LOF on 2 datasets. Compared with the KNN and LOF algorithms, KLOD thus demonstrates high accuracy in outlier detection. The fast-search density-peaks method is applied to compute the local density and relative distance of each data point, and the γ value of each point is determined from these two metrics to improve the K-means clustering algorithm. However, the size of γ is influenced by the cutoff distance dc, making it difficult to choose k initial cluster centers intuitively; hence the elbow method is used, and the k data points with the largest γ values are selected as the initial cluster centers for K-means. Least squares fitting is applied to the ascending objective function values of each dimension, which highlights the degree of outlierness of outliers and incorporates more outlier information into the final anomaly score.

Conclusions Experiments on artificial and UCI real datasets demonstrate that the KLOD method can detect local outliers of moderate outlier degree and, compared with the KNN and LOF methods, significantly improves detection accuracy. However, owing to limitations of the K-means algorithm itself, its clustering performance is poor on datasets containing arbitrarily shaped clusters, which affects detection performance. Future work can therefore focus on improving the performance of outlier detection methods on datasets with arbitrarily shaped clusters.
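The scoring stage described in the Methods and Results paragraphs might look roughly like the sketch below, under our own reading of the abstract: the per-dimension "objective function value" is taken as the squared per-dimension deviation from the assigned cluster centre, the least-squares fit is a low-degree polynomial over the ascending profile, and the dimension weights come from a simple histogram entropy. The paper's exact fitting function, fitting points, and entropy weighting may differ; names and parameters here are illustrative.

```python
import numpy as np

def klod_scores(X, centers, labels, poly_degree=2, bins=10):
    """Sketch of KLOD-style scoring: entropy-weighted objective values
    multiplied by the least-squares rate of change of their sorted profile."""
    n, m = X.shape
    obj = (X - centers[labels]) ** 2          # assumed per-dimension objective value
    scores = np.zeros(n)

    # histogram entropy of each dimension as a (hypothetical) dispersion weight
    w = np.empty(m)
    for j in range(m):
        p, _ = np.histogram(X[:, j], bins=bins)
        p = p[p > 0] / p.sum()
        w[j] = -(p * np.log(p)).sum()
    w = w / w.sum() if w.sum() > 0 else np.full(m, 1.0 / m)   # normalise; the paper's scheme may differ

    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        xs = np.arange(len(members))
        deg = min(poly_degree, max(len(members) - 1, 0))
        for j in range(m):
            order = np.argsort(obj[members, j])
            vals = obj[members, j][order]                      # ascending objective values
            coef = np.polyfit(xs, vals, deg)                   # least-squares polynomial fit
            rate = np.abs(np.polyval(np.polyder(coef), xs))    # rate of change at each rank
            scores[members[order]] += w[j] * vals * rate
    return scores
```

With centers and labels taken from a fitted model, e.g. km = sklearn.cluster.KMeans(n_clusters=k, init=seeds, n_init=1).fit(X) followed by klod_scores(X, km.cluster_centers_, km.labels_), the top-n highest-scoring points would be reported as outliers.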
Author 夏浩
岳学震
周玉
王培崇
AuthorAffiliation School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
AuthorAffiliation_xml – name: School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
Author_FL WANG Peichong
ZHOU Yu
YUE Xuezhen
XIA Hao
Author_FL_xml – sequence: 1
  fullname: ZHOU Yu
– sequence: 2
  fullname: XIA Hao
– sequence: 3
  fullname: YUE Xuezhen
– sequence: 4
  fullname: WANG Peichong
Author_xml – sequence: 1
  fullname: 周玉
– sequence: 2
  fullname: 夏浩
– sequence: 3
  fullname: 岳学震
– sequence: 4
  fullname: 王培崇
ClassificationCodes TP301.6
ContentType Journal Article
Copyright Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
Copyright_xml – notice: Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
DBID 2B.
4A8
92I
93N
PSX
TCJ
DOI 10.12454/j.jsuese.202201398
DatabaseName Wanfang Data Journals - Hong Kong
WANFANG Data Centre
Wanfang Data Journals
万方数据期刊 - 香港版
China Online Journals (COJ)
China Online Journals (COJ)
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
DocumentTitle_FL Local Outlier Detection Method Based on Improved K-means
EndPage 77
ExternalDocumentID scdxxb_gckx202404008
GrantInformation_xml – fundername: National Natural Science Foundation of China; National Natural Science Foundation of China; Training Program for Young Backbone Teachers in Universities of Henan Province; Science and Technology Research Project of Colleges and Universities in Hebei Province
  funderid: (国家自然科学基金); (国家自然科学基金); (河南省高等学校青年骨干教师培养计划); (河北省高等学校科学技术研究项目)
GroupedDBID -0C
-SC
-S~
2B.
2RA
4A8
5VR
92I
92M
93N
9D9
9DC
AFUIB
ALMA_UNASSIGNED_HOLDINGS
CAJEC
CQIGP
GROUPED_DOAJ
PB1
PB9
PSX
Q--
R-C
RT3
T8S
TCJ
U1F
U5C
ISSN 2096-3246
IngestDate Thu May 29 03:53:57 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 4
Keywords K-means
least squares method
最小二乘法 (least squares method)
peak density
目标函数值 (objective function value)
离群点检测 (outlier detection)
密度峰值 (density peak)
K均值聚类 (K-means clustering)
outlier detection
objective function
Language Chinese
LinkModel OpenURL
PageCount 12
ParticipantIDs wanfang_journals_scdxxb_gckx202404008
PublicationCentury 2000
PublicationDate 2024-07-01
PublicationDateYYYYMMDD 2024-07-01
PublicationDate_xml – month: 07
  year: 2024
  text: 2024-07-01
  day: 01
PublicationDecade 2020
PublicationTitle 工程科学与技术
PublicationTitle_FL Advanced Engineering Sciences
PublicationYear 2024
Publisher School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
Publisher_xml – name: School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
SSID ssib036435565
ssib050593459
ssib041261190
ssib030194745
ssib051371919
ssj0003313526
ssib027967859
SourceID wanfang
SourceType Aggregation Database
StartPage 66
Title 基于改进K-means的局部离群点检测方法 (Local Outlier Detection Method Based on Improved K-means)
URI https://d.wanfangdata.com.cn/periodical/scdxxb-gckx202404008
Volume 56
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ (Directory of Open Access Journals)
  issn: 2096-3246
  databaseCode: DOA
  dateStart: 20220101
  customDbUrl:
  isFulltext: true
  dateEnd: 99991231
  titleUrlDefault: https://www.doaj.org/
  omitProxy: true
  ssIdentifier: ssj0003313526
  providerName: Directory of Open Access Journals
linkProvider Directory of Open Access Journals