Enabling Smart Data: Noise filtering in Big Data classification

Bibliographic Details
Published in	Information sciences Vol. 479; pp. 135 - 152
Main Authors	García-Gil, Diego; Luengo, Julián; García, Salvador; Herrera, Francisco
Format	Journal Article
Language	English
Published	Elsevier Inc 01.04.2019
Subjects	Big Data; Class noise; Classification; Label noise; Smart Data
ISSN	0020-0255 1872-6291
DOI	10.1016/j.ins.2018.12.002

More Information
Summary:
• We have tackled the problem of noise in Big Data classification.
• Two Big Data preprocessing approaches to remove noisy examples are proposed: HME-BD and HTE-BD.
• HME-BD is a homogeneous ensemble filter, while HTE-BD is a heterogeneous ensemble filter.
• The proposed algorithms constitute a first suitable noise filtering approach in Big Data domains.

In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same rule. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulty coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high-quality, clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.