Scalable Evidential K-Nearest Neighbor Classification on Big Data

Bibliographic Details
Published in: IEEE Transactions on Big Data, Vol. 10, No. 3, pp. 226-237
Main Authors: Gong, Chaoyu; Demmel, Jim; You, Yang
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2024
ISSN: 2332-7790, 2372-2096
DOI: 10.1109/TBDATA.2023.3327220


More Information
Summary: The K-Nearest Neighbor (K-NN) algorithm has garnered widespread use in real-world scenarios due to its exceptional interpretability, which other classification algorithms may lack. The evidential K-NN (EK-NN) algorithm builds upon the same nearest neighbor search procedure as K-NN and provides more informative classification outcomes. However, EK-NN is not practical for Big Data because it is computationally complex. First, searching for the K nearest neighbors of the test samples among n training samples requires O(n^2) operations. Additionally, estimating parameters involves complicated matrix calculations whose scale grows with the dataset. To address these issues, we propose two scalable EK-NN classifiers, Global Exact EK-NN and Local Approximate EK-NN, under the distributed Spark framework. Along with the Local Approximate EK-NN, a new distributed gradient descent algorithm is developed to learn parameters. Data parallelism is used to reduce the negative impacts caused by differences in data distribution. Experimental results show that our algorithms achieve state-of-the-art scaling efficiency and accuracy on large datasets with more than 10 million samples.
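As a rough illustration of the evidential K-NN idea the abstract refers to, the following is a minimal single-machine sketch in the style of Denoeux's EK-NN rule (neighbor masses combined with Dempster's rule over singletons and the full frame). It is not the authors' distributed Spark implementation; the function name ek_nn_predict and the alpha, gamma, and beta parameter values are illustrative assumptions.

import numpy as np

def ek_nn_predict(X_train, y_train, x, k=5, alpha=0.95, gamma=1.0, beta=2.0):
    """Illustrative evidential K-NN (Denoeux-style) for one test sample.

    Each of the k nearest neighbors contributes a simple mass function
    m({class}) = alpha * exp(-gamma * d**beta), m(Omega) = 1 - m({class}),
    and the masses are combined with Dempster's rule restricted to
    singleton classes plus the whole frame Omega. The brute-force
    distance search is O(n) per query, i.e. O(n^2) for n queries,
    which is the bottleneck the paper addresses.
    """
    classes = np.unique(y_train)
    d = np.linalg.norm(X_train - x, axis=1)   # brute-force distances
    nn = np.argsort(d)[:k]                    # indices of k nearest neighbors

    m_single = np.zeros(len(classes))   # mass on each singleton class
    m_omega = 1.0                       # mass on the whole frame (ignorance)
    for i in nn:
        mi = alpha * np.exp(-gamma * d[i] ** beta)
        q = np.where(classes == y_train[i])[0][0]
        new_single = np.zeros_like(m_single)
        # Dempster combination with a (singleton {q}, Omega) mass function.
        new_single[q] = m_single[q] * mi + m_omega * mi + m_single[q] * (1 - mi)
        for c in range(len(classes)):
            if c != q:
                new_single[c] = m_single[c] * (1 - mi)
        new_omega = m_omega * (1 - mi)
        total = new_single.sum() + new_omega   # 1 - conflict; normalize
        m_single, m_omega = new_single / total, new_omega / total
    return classes[np.argmax(m_single)], m_single, m_omega

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
label, masses, ignorance = ek_nn_predict(X, y, np.array([2.5, 2.5]), k=5)
print(label, masses, ignorance)

The remaining mass on Omega acts as a measure of ignorance, which is what makes EK-NN output more informative than a plain K-NN vote; the paper's contribution is making this procedure scale via Spark and a distributed gradient descent for the parameters.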