Ordered quantile normalization: a semiparametric transformation built for the cross-validation era


Bibliographic Details
Published in Journal of Applied Statistics, Vol. 47, no. 13-15, pp. 2312-2327
Main Authors Peterson, Ryan A.; Cavanaugh, Joseph E.
Format Journal Article
Language English
Published England Taylor & Francis 17.11.2020
Taylor & Francis Ltd
ISSN 0266-4763
1360-0532
DOI 10.1080/02664763.2019.1630372

More Information
Summary: Normalization transformations have recently experienced a resurgence in popularity in the era of machine learning, particularly in data preprocessing. However, the classical methods that can be adapted to cross-validation are not always effective. We introduce Ordered Quantile (ORQ) normalization, a one-to-one transformation that is designed to consistently and effectively transform a vector of arbitrary distribution into a vector that follows a normal (Gaussian) distribution. In the absence of ties, ORQ normalization is guaranteed to produce normally distributed transformed data. Once trained, an ORQ transformation can be readily and effectively applied to new data. We compare the effectiveness of the ORQ technique with other popular normalization methods in a simulation study where the true data generating distributions are known. We find that ORQ normalization is the only method that works consistently and effectively, regardless of the underlying distribution. We also explore the use of repeated cross-validation to identify the best normalizing transformation when the true underlying distribution is unknown. We apply our technique and other normalization methods via the bestNormalize R package on a car pricing data set. We built bestNormalize to evaluate the normalization efficacy of many candidate transformations; the package is freely available via the Comprehensive R Archive Network.
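
The summary describes a rank-based map to normal scores that is trained once and then applied to new data. The sketch below illustrates that idea in Python; it is not the bestNormalize API. Several details are assumptions that this record does not state: the score formula Phi^-1((rank - 1/2) / n), linear interpolation between neighboring training values for new data, and clamping of values outside the training range (a simplification chosen here for brevity). The names fit_orq and apply_orq are hypothetical.

```python
from bisect import bisect_left
from statistics import NormalDist


def fit_orq(train):
    """Learn a rank-based normalizing map from training values.

    Assumption: values are distinct (no ties), matching the case in which
    the summary says the transformed data are normally distributed.
    Returns the sorted training values and their normal scores.
    """
    xs = sorted(train)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one training value")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n stays strictly inside (0, 1), so inv_cdf is finite.
    zs = [std_normal.inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]
    return xs, zs


def apply_orq(model, values):
    """Apply a fitted map to training or new values.

    Values between two training points are linearly interpolated.
    Values outside the training range are clamped to the extreme scores,
    which is a simplification made for this sketch.
    """
    xs, zs = model
    out = []
    for v in values:
        if v <= xs[0]:
            out.append(zs[0])
        elif v >= xs[-1]:
            out.append(zs[-1])
        else:
            j = bisect_left(xs, v)  # guarantees xs[j - 1] < v <= xs[j]
            x0, x1 = xs[j - 1], xs[j]
            w = (v - x0) / (x1 - x0)
            out.append(zs[j - 1] + w * (zs[j] - zs[j - 1]))
    return out


# Usage: fit on skewed training data, then transform unseen values.
train = [3.1, 0.2, 7.5, 1.4, 12.0]
model = fit_orq(train)
z_train = apply_orq(model, train)
z_new = apply_orq(model, [2.25, 50.0])
```

Because the fitted map depends only on the training values, it can be learned inside each cross-validation fold and applied to that fold's held-out data without leaking information from it.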