Estimation of predictive performance for test data in applicability domains using y‐randomization

Bibliographic Details
Published in: Journal of Chemometrics, Vol. 33, no. 9
Main Author: Kaneko, Hiromasa
Format: Journal Article
Language: English
Published: Chichester, Wiley Subscription Services, Inc, 01.09.2019
ISSN: 0886-9383, 1099-128X
DOI: 10.1002/cem.3171

Summary: A new measure of predictive performance for objective variables in regression analysis is proposed, enabling the y‐errors of new samples or test samples to be estimated in the applicability domains (ADs) of regression models. The proposed measure, based on y‐randomization, considers chance correlations and is calculated using only training data. This chance correlation‐excluded mean absolute error (MAECCE) can estimate the y‐errors of new samples while accounting for the influence that chance correlations in the given dataset have on the regression models. Experiments using numerical simulation, quantitative structure‐activity relationship, and quantitative structure‐property relationship datasets confirm that MAECCE can estimate the distribution of y‐errors of new samples in ADs for various training datasets, descriptor sets, and regression analysis methods, enabling chance correlations to be eliminated from data analysis. Python and MATLAB codes for the proposed algorithm are available at https://github.com/hkaneko1985/maecce.
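
The y‐randomization step that the summary builds on can be illustrated with a minimal sketch. This is not the author's MAECCE formula, which this record does not state (see the linked repository for that). It only shows the general idea: refit the model on shuffled y values and compare the resulting training MAE with the MAE on the real y values. The helper names, the ordinary least squares model, the permutation count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two 1-D arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def fit_predict_ols(X, y):
    """Fit least squares with an intercept; return training-set predictions."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef


def y_randomization_mae(X, y, n_permutations=30, seed=0):
    """Return (training MAE, mean training MAE over models refit on shuffled y)."""
    rng = np.random.default_rng(seed)
    mae_train = mean_absolute_error(y, fit_predict_ols(X, y))
    mae_shuffled = []
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)  # shuffling breaks any real X-y relationship
        mae_shuffled.append(
            mean_absolute_error(y_perm, fit_predict_ols(X, y_perm))
        )
    return mae_train, float(np.mean(mae_shuffled))


# Hypothetical synthetic data: 50 samples, 3 descriptors, small noise
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + data_rng.normal(scale=0.1, size=50)

mae_train, mae_yrand = y_randomization_mae(X, y)
# Baseline: MAE obtained by always predicting the mean of y
mae_mean = mean_absolute_error(y, np.full_like(y, y.mean()))
print(f"train: {mae_train:.3f}  y-randomized: {mae_yrand:.3f}  mean: {mae_mean:.3f}")
```

In a sketch like this, a y‐randomized MAE that falls noticeably below the mean-prediction baseline would suggest that the model can fit noise, that is, that chance correlation is present in the descriptors. How that gap is converted into an error estimate for new samples is specific to the paper and is not reproduced here.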