Local Outlier Detection Method Based on Improved K-means (基于改进K-means的局部离群点检测方法)

Bibliographic Details
Published in 工程科学与技术 (Advanced Engineering Sciences), Vol. 56; no. 4; pp. 66–77
Main Authors 周玉 (ZHOU Yu), 夏浩 (XIA Hao), 岳学震 (YUE Xuezhen), 王培崇 (WANG Peichong)
Format Journal Article
Language Chinese
Published School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China, 01.07.2024
Subjects
Online Access Get full text
ISSN 2096-3246
DOI 10.12454/j.jsuese.202201398


Abstract TP301.6; Outlier detection is the task of identifying anomalous data that differ significantly from normal data in their attribute features. Most clustering-based outlier detection methods detect outliers in a dataset from a global perspective and perform poorly on local outliers. To address this, this paper improves the K-means clustering algorithm by introducing the method of clustering by fast search and find of density peaks, and proposes a local outlier detection method named KLOD (local outlier detection based on improved K-means and least-squares methods) to detect local outliers precisely. First, the density-peaks method is used to compute the local density and relative distance of each data point, and their product gives the γ value. Second, the γ values are sorted in descending order, and the elbow rule is used to select the k data points with the largest γ values as the initial cluster centers of the K-means algorithm. The dataset is then clustered into k clusters with K-means, and the objective function value of each data point in every dimension is computed and sorted in ascending order. Next, the degree of dispersion of each dimension is determined, an appropriate fitting function and fitting points are chosen, and the ascending objective function values of each dimension of each cluster are fitted by least squares and differentiated to obtain the rate of change. Finally, combining information entropy, the objective function value of each data point in each dimension is weighted by the corresponding rate of change to obtain the final anomaly score, and the top-n data points with the highest anomaly scores are regarded as outliers. Simulation experiments on artificial datasets and UCI datasets compare the accuracy of KLOD, LOF, and KNN; the results show that KLOD achieves higher accuracy than KNN and LOF. The proposed KLOD method effectively improves the clustering quality of the K-means algorithm and offers good accuracy and performance for local outlier detection.
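The seeding step summarized above can be illustrated with a minimal NumPy sketch: it computes each point's local density ρ, relative distance δ, and γ = ρ·δ under the usual density-peaks definitions, then takes the k largest-γ points as K-means seeds. The Euclidean metric, the cutoff-count density estimate, and the function names are our own assumptions for illustration, not code from the paper.

```python
import numpy as np

def gamma_scores(X, dc):
    """Per-point density-peaks quantities: local density rho, relative
    distance delta, and gamma = rho * delta (as described in the abstract)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    rho = (d < dc).sum(axis=1) - 1                              # neighbours within cutoff dc, excluding self
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]                      # points with higher density
        delta[i] = d[i, higher].min() if higher.size else d[i].max()
    return rho * delta

def initial_centers(X, gamma, k):
    """Seed K-means with the k points having the largest gamma values."""
    return X[np.argsort(gamma)[::-1][:k]]
```

The seeds could then be handed to, e.g., sklearn.cluster.KMeans(n_clusters=k, init=initial_centers(X, gamma_scores(X, dc), k), n_init=1); the cutoff distance dc and the cluster count k (read off an elbow curve of KMeans inertia_ values) are left to the user, as in the paper.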
Abstract_FL Objective Outliers are data points generated by various special causes. They deviate from normal data points, are often regarded as noise, occupy a small proportion of the dataset, and are considered points of research value. The task of outlier detection is to identify these points and, through analysis of data attribute features, uncover the abnormal information they may carry; this process reveals unusual patterns or behaviors within the dataset that can provide insight into unique phenomena or anomalies. Most clustering-based outlier detection methods detect outliers from a global perspective and perform weakly on local outliers. Hence, an improved K-means clustering algorithm is proposed by introducing the method of clustering by fast search and find of density peaks, and a local outlier detection method named KLOD (local outlier detection based on improved K-means and least squares methods) is developed to achieve precise detection of local outliers.

Methods The K-means clustering algorithm performs hard clustering: after clustering, each data point is unambiguously assigned to one cluster. This property makes it suitable for outlier detection, as outliers significantly affect the clustering process. However, selecting the initial cluster centers and determining the number of clusters is crucial, since both directly impact clustering effectiveness. To select accurate cluster centers, clustering by fast search and find of density peaks is used to compute the local density and relative distance of each data point, and a decision graph is constructed from these two metrics. The challenge lies in accurately determining the cutoff distance dc, which makes it difficult to identify the number of cluster centers precisely from a decision graph obtained with a single dc value. To address the challenge of determining the number of clusters and obtain the best clustering effectiveness, the elbow method is employed to determine the optimal number of clusters for an unknown dataset. As the data are clustered into different numbers of clusters, the cost function value changes accordingly: the number of clusters is plotted on the x-axis and the cost function value on the y-axis, and the changes are recorded as a line graph. When increasing the number of cluster centers no longer yields a significant decrease in the cost function value, the position of the "elbow" indicates the optimal number of clusters. After the initial cluster centers and the number of clusters k are determined, the dataset is clustered with the K-means algorithm to obtain k clusters and their corresponding centers. The objective function value of each data point in each dimension within each cluster is then computed, and the values for each dimension in each cluster are sorted in ascending order. The ascending objective function values are fitted with the least squares approach to obtain a curve, and the derivative of the fitted curve gives the slope, which describes the rate of change of the objective function values within each cluster. Because each dimension's degree of dispersion and information content can differ, different weights are assigned to each dimension: information entropy measures the degree of dispersion, and higher weight is given to dimensions with higher outlier degrees to represent their impact on the overall dataset. By incorporating information entropy, each data point's objective function value in each dimension is weighted by the corresponding rate of change. This yields the final anomaly score, and the top-n data points with the highest anomaly scores are considered outliers.

Results and Discussions The experimental results indicate that on the artificial dataset, KLOD, KNN, and LOF all detect sparse local outliers effectively. However, the LOF algorithm struggles to detect outliers within outlier clusters, and the KNN method cannot detect local outliers within densely distributed clusters when normal data points lie far apart. In contrast, the KLOD method analyzes each cluster individually, which addresses the issue of uneven cluster densities, and analyzes each dimension of the data points within each cluster separately, achieving accurate detection. On the UCI datasets, the KLOD method achieves the best detection accuracy on 10 datasets and matches KNN and LOF on 2 datasets. Compared with the KNN and LOF algorithms, KLOD thus demonstrates high accuracy in outlier detection. The fast-search density-peaks method is applied to compute the local density and relative distance of each data point, and the γ value of each point is determined from these two metrics to improve the K-means clustering algorithm. However, the size of γ is influenced by the cutoff distance dc, making it difficult to choose k initial cluster centers intuitively; hence the elbow method is used, and the k data points with the largest γ values are selected as the initial cluster centers for K-means. Least squares fitting is applied to the ascending objective function values of each dimension, which highlights the degree of outlierness of outliers and incorporates more outlier information into the final anomaly score.

Conclusions Experiments on artificial and UCI real datasets demonstrate that the KLOD method can detect local outliers of moderate outlier degree and, compared with the KNN and LOF methods, significantly improves detection accuracy. However, owing to limitations of the K-means algorithm itself, its clustering performance is poor on datasets containing arbitrarily shaped clusters, which affects detection performance. Future work can therefore focus on improving the performance of outlier detection methods on datasets with arbitrarily shaped clusters.
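The scoring stage described in the Methods and Results paragraphs might look roughly like the sketch below, under our own reading of the abstract: the per-dimension "objective function value" is taken as the squared per-dimension deviation from the assigned cluster centre, the least-squares fit is a low-degree polynomial over the ascending profile, and the dimension weights come from a simple histogram entropy. The paper's exact fitting function, fitting points, and entropy weighting may differ; names and parameters here are illustrative.

```python
import numpy as np

def klod_scores(X, centers, labels, poly_degree=2, bins=10):
    """Sketch of KLOD-style scoring: entropy-weighted objective values
    multiplied by the least-squares rate of change of their sorted profile."""
    n, m = X.shape
    obj = (X - centers[labels]) ** 2          # assumed per-dimension objective value
    scores = np.zeros(n)

    # histogram entropy of each dimension as a (hypothetical) dispersion weight
    w = np.empty(m)
    for j in range(m):
        p, _ = np.histogram(X[:, j], bins=bins)
        p = p[p > 0] / p.sum()
        w[j] = -(p * np.log(p)).sum()
    w = w / w.sum() if w.sum() > 0 else np.full(m, 1.0 / m)   # normalise; the paper's scheme may differ

    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        xs = np.arange(len(members))
        deg = min(poly_degree, max(len(members) - 1, 0))
        for j in range(m):
            order = np.argsort(obj[members, j])
            vals = obj[members, j][order]                      # ascending objective values
            coef = np.polyfit(xs, vals, deg)                   # least-squares polynomial fit
            rate = np.abs(np.polyval(np.polyder(coef), xs))    # rate of change at each rank
            scores[members[order]] += w[j] * vals * rate
    return scores
```

With centers and labels taken from a fitted model, e.g. km = sklearn.cluster.KMeans(n_clusters=k, init=seeds, n_init=1).fit(X) followed by klod_scores(X, km.cluster_centers_, km.labels_), the top-n highest-scoring points would be reported as outliers.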
Author 夏浩
岳学震
周玉
王培崇
AuthorAffiliation School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
AuthorAffiliation_xml – name: School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
Author_FL WANG Peichong
ZHOU Yu
YUE Xuezhen
XIA Hao
Author_FL_xml – sequence: 1
  fullname: ZHOU Yu
– sequence: 2
  fullname: XIA Hao
– sequence: 3
  fullname: YUE Xuezhen
– sequence: 4
  fullname: WANG Peichong
Author_xml – sequence: 1
  fullname: 周玉
– sequence: 2
  fullname: 夏浩
– sequence: 3
  fullname: 岳学震
– sequence: 4
  fullname: 王培崇
ClassificationCodes TP301.6
ContentType Journal Article
Copyright Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
Copyright_xml – notice: Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
DBID 2B.
4A8
92I
93N
PSX
TCJ
DOI 10.12454/j.jsuese.202201398
DatabaseName Wanfang Data Journals - Hong Kong
WANFANG Data Centre
Wanfang Data Journals
万方数据期刊 - 香港版
China Online Journals (COJ)
China Online Journals (COJ)
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
DocumentTitle_FL Local Outlier Detection Method Based on Improved K-means
EndPage 77
ExternalDocumentID scdxxb_gckx202404008
GrantInformation_xml – fundername: National Natural Science Foundation of China; National Natural Science Foundation of China; Training Program for Young Backbone Teachers in Universities of Henan Province; Science and Technology Research Project of Colleges and Universities in Hebei Province
  funderid: (国家自然科学基金); (国家自然科学基金); (河南省高等学校青年骨干教师培养计划); (河北省高等学校科学技术研究项目)
GroupedDBID -0C
-SC
-S~
2B.
2RA
4A8
5VR
92I
92M
93N
9D9
9DC
AFUIB
ALMA_UNASSIGNED_HOLDINGS
CAJEC
CQIGP
GROUPED_DOAJ
PB1
PB9
PSX
Q--
R-C
RT3
T8S
TCJ
U1F
U5C
ISSN 2096-3246
IngestDate Thu May 29 03:53:57 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 4
Keywords K-means
least squares method
最小二乘法 (least squares method)
peak density
目标函数值 (objective function value)
离群点检测 (outlier detection)
密度峰值 (density peak)
K均值聚类 (K-means clustering)
outlier detection
objective function
Language Chinese
LinkModel OpenURL
PageCount 12
ParticipantIDs wanfang_journals_scdxxb_gckx202404008
PublicationCentury 2000
PublicationDate 2024-07-01
PublicationDateYYYYMMDD 2024-07-01
PublicationDate_xml – month: 07
  year: 2024
  text: 2024-07-01
  day: 01
PublicationDecade 2020
PublicationTitle 工程科学与技术
PublicationTitle_FL Advanced Engineering Sciences
PublicationYear 2024
Publisher School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
Publisher_xml – name: School of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China; School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
SSID ssib036435565
ssib050593459
ssib041261190
ssib030194745
ssib051371919
ssj0003313526
ssib027967859
SourceID wanfang
SourceType Aggregation Database
StartPage 66
Title 基于改进K-means的局部离群点检测方法 (Local Outlier Detection Method Based on Improved K-means)
URI https://d.wanfangdata.com.cn/periodical/scdxxb-gckx202404008
Volume 56
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ (Directory of Open Access Journals)
  issn: 2096-3246
  databaseCode: DOA
  dateStart: 20220101
  customDbUrl:
  isFulltext: true
  dateEnd: 99991231
  titleUrlDefault: https://www.doaj.org/
  omitProxy: true
  ssIdentifier: ssj0003313526
  providerName: Directory of Open Access Journals
linkProvider Directory of Open Access Journals