Learning algorithms may perform worse with increasing training set size: Algorithm–data incompatibility

Bibliographic Details
Published in: Computational Statistics & Data Analysis, Vol. 74, pp. 181–197
Main Authors: Yousef, Waleed A.; Kundu, Subrata
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.06.2014
ISSN: 0167-9473, 1872-7352
DOI: 10.1016/j.csda.2013.05.021

More Information
Summary: In machine learning problems, a learning algorithm tries to learn the input–output relationship of a system from a training dataset, where the observed outputs are usually corrupted by random noise. From experience, simulations, and special-case theories, most practitioners believe that increasing the size of the training set improves the performance of the learning algorithm. It is shown that this belief does not hold for every pair of learning algorithm and data distribution. In particular, it is proven that for certain distributions and learning algorithms, increasing the training set size can worsen performance, and letting the training set grow without bound can yield the worst performance, even when there is no model misspecification for the input–output relationship. Simulation results and analysis of real datasets are provided to support the mathematical argument.
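The central claim, that for some algorithm–distribution pairs the expected performance degrades as the training set grows, can be illustrated numerically. The Python sketch below is not the authors' construction; it reproduces a separate, well-documented instance of the same phenomenon: minimum-norm least squares on a correctly specified linear model, whose average test error rises as the training size n approaches the fixed input dimension p. All parameter values (p = 50, unit-scale Gaussian inputs and noise) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50                                  # fixed input dimension (assumed for illustration)
w = rng.normal(size=p) / np.sqrt(p)     # true coefficients; the linear model is correctly specified
sigma = 1.0                             # standard deviation of the additive label noise

def avg_test_mse(n_train, n_test=2000, reps=100):
    """Average test MSE of the minimum-norm least-squares fit at a given training size."""
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n_train, p))
        y = X @ w + sigma * rng.normal(size=n_train)
        w_hat = np.linalg.pinv(X) @ y   # minimum-norm solution; well defined even when n_train < p
        X_test = rng.normal(size=(n_test, p))
        y_test = X_test @ w + sigma * rng.normal(size=n_test)
        errs.append(np.mean((X_test @ w_hat - y_test) ** 2))
    return float(np.mean(errs))

for n in [10, 25, 40, 48, 50, 60, 100, 200]:
    print(f"n_train = {n:3d}   test MSE ~ {avg_test_mse(n):.2f}")
```

For n below p the fit interpolates its training data, and its variance grows as n approaches p, so the printed test error increases with the training size throughout that range before falling again once n is well above p. This mirrors, in a simple setting, the paper's point that a larger training set is not a universal guarantee of better performance.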