A new strategy to prevent over-fitting in partial least squares models based on model population analysis

Bibliographic Details
Published in	Analytica chimica acta Vol. 880; pp. 32 - 41
Main Authors	Deng, Bai-Chuan, Yun, Yong-Huan, Liang, Yi-Zeng, Cao, Dong-Sheng, Xu, Qing-Song, Yi, Lun-Zhao, Huang, Xin
Format	Journal Article
Language	English
Published	Netherlands, Elsevier B.V., 23.06.2015
Subjects	Algorithms; analytical chemistry; Cross-validation; Glycine max - chemistry; Glycine max - metabolism; least squares; Least-Squares Analysis; Model population analysis; Model selection; Model stability; Models, Chemical; Over-fitting; Partial least squares; prediction; Software; Spectrophotometry, Ultraviolet
Online Access	Get full text
ISSN	0003-2670, 1873-4324
DOI	10.1016/j.aca.2015.04.045

More Information
Summary:	•A new strategy for preventing over-fitting in partial least squares models.
•A new criterion combines model prediction ability and model stability.
•Model stability is sensitive to over-fitting.
•The new criterion shows a clear maximum when selecting the number of partial least squares components.
Partial least squares (PLS) is one of the most widely used methods for chemical modeling. However, like many other methods with tunable parameters, it has a strong tendency to over-fit. Thus, a crucial step in PLS model building is selecting the optimal number of latent variables (nLVs). Cross-validation (CV) is the most popular method for PLS model selection because it selects a model from the perspective of prediction ability. However, CV may not yield a clear minimum of the prediction error, which makes model selection difficult. To solve this problem, we propose a new strategy for PLS model selection that combines the cross-validated coefficient of determination (Q²cv) and model stability (S). S is defined as the stability of the PLS regression vectors, which is obtained using model population analysis (MPA). The results show that, when a clear maximum of Q²cv is not obtained, S provides additional information about over-fitting and helps in finding the optimal nLVs. Compared with other regression-vector-based indicators such as the Euclidean 2-norm (B2), the Durbin-Watson statistic (DW) and the jaggedness (J), S is more sensitive to over-fitting. The model selected by our method has both good prediction ability and good stability.
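The summary names two quantities tracked against the number of latent variables: a cross-validated Q² and a stability measure computed from a population of sub-models. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The record does not give the formula for S or for the combined criterion, so several pieces here are assumptions: the hand-written PLS1 routine `pls1_coefs`, the 5-fold split in `q2_cv`, the 80% Monte Carlo subsampling in `stability`, and in particular the stand-in stability score (the mean over variables of |mean(b_j)| / std(b_j) across sub-models). All function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coefs(X, y, max_lv):
    """NIPALS PLS1 on mean-centered data.
    Returns an array (max_lv, p): row a-1 is the regression vector for a components."""
    E = X - X.mean(axis=0)
    f = y - y.mean()
    p = E.shape[1]
    W = np.zeros((p, max_lv))   # weight vectors
    P = np.zeros((p, max_lv))   # X loadings
    q = np.zeros(max_lv)        # y loadings
    coefs = []
    for a in range(max_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        P[:, a] = E.T @ t / tt
        q[a] = f @ t / tt
        W[:, a] = w
        E = E - np.outer(t, P[:, a])   # deflate X
        f = f - q[a] * t               # deflate y
        Wa, Pa = W[:, :a + 1], P[:, :a + 1]
        coefs.append(Wa @ np.linalg.solve(Pa.T @ Wa, q[:a + 1]))
    return np.array(coefs)

def q2_cv(X, y, max_lv, k=5, seed=0):
    """Cross-validated Q2 for 1..max_lv components (k-fold split is an assumption)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    press = np.zeros(max_lv)
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        B = pls1_coefs(X[train], y[train], max_lv)
        pred = (X[test] - X[train].mean(axis=0)) @ B.T + y[train].mean()
        press += ((y[test][:, None] - pred) ** 2).sum(axis=0)
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

def stability(X, y, max_lv, n_models=100, ratio=0.8, seed=0):
    """Assumed stand-in for S: spread of regression vectors over random sub-models."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(ratio * n)
    B = np.array([
        pls1_coefs(X[idx], y[idx], max_lv)
        for idx in (rng.choice(n, m, replace=False) for _ in range(n_models))
    ])                                                     # (n_models, max_lv, p)
    s = np.abs(B.mean(axis=0)) / B.std(axis=0, ddof=1)     # per-variable stability
    return s.mean(axis=1)                                  # one score per nLV

# Synthetic data with three underlying factors (illustrative only).
rng = np.random.default_rng(1)
T = rng.normal(size=(60, 3))
X = T @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(60, 20))
y = T @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=60)

q2 = q2_cv(X, y, max_lv=8)
S = stability(X, y, max_lv=8)
for a, (q2_a, s_a) in enumerate(zip(q2, S), start=1):
    print(f"nLVs={a}  Q2cv={q2_a:.3f}  S={s_a:.2f}")
```

The sketch prints the two curves side by side rather than merging them, since the record does not state how the published criterion combines Q²cv and S. Reading the table for the point where Q²cv has flattened while the stability score starts to fall is one informal way to use it.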