M-LDQ feature embedding and regression modeling for distribution-valued data

Bibliographic Details
Published in	Information sciences Vol. 609; pp. 121-152
Main Authors	Zhao, Qing; Wang, Huiwen; Lu, Shan
Format	Journal Article
Language	English
Published	Elsevier Inc 01.09.2022
Subjects	Distribution-valued data; Linear regression model; Logarithmic transformation of the derivative of the quantile function (LDQ); Partial least squares; Symbolic data analysis
ISSN	0020-0255
DOI	10.1016/j.ins.2022.07.064

Summary:	With the improving capacity to collect massive amounts of data, distribution-valued data are increasingly used in many applications, where they are presented in a clustered, summarized, or aggregated form to provide detailed information, as opposed to single-valued data. Most of the existing models for distribution-valued data are subject to limitations attributed to the inherent constraints caused by the special expressions of probability distributions. This makes the practical usage of distribution-valued data highly challenging. This paper introduces a novel feature embedding method to characterize a probability distribution, and on this basis, an effective linear regression model that does not contain additional constraints is proposed. Unlike previous models with nonnegative constraints on coefficients, our model is capable of addressing negative coefficients. The detailed parameter estimation procedure applying partial least squares for this model is presented to guarantee more stable results, especially in the presence of a relatively small sample size or multicollinearity among variables. Overall, the proposed method fundamentally facilitates distribution-valued data regression analysis. Extensive simulation experiments and empirical PM2.5 concentration modeling not only verify the effectiveness of our regression method for distribution-valued data but also demonstrate the advantages of the proposed method compared with existing approaches.
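
The summary does not spell out the embedding itself, but the subject terms name it as the logarithmic transformation of the derivative of the quantile function (LDQ). As a rough illustration only, and not the authors' implementation, the sketch below estimates that transform from a raw sample: it evaluates an interpolated empirical quantile function on a uniform probability grid, takes finite differences as a stand-in for the derivative, and applies the logarithm. The function names, the grid size, and the `eps` floor are all assumptions made for this example. Whatever the "M-" prefix in the title adds is not described in this record and is not covered here.

```python
import math


def empirical_quantile(sorted_xs, p):
    """Linearly interpolated empirical quantile at probability p in [0, 1]."""
    n = len(sorted_xs)
    pos = p * (n - 1)
    lo = min(int(pos), n - 1)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1.0 - frac) + sorted_xs[hi] * frac


def ldq_embedding(sample, grid_size=10, eps=1e-12):
    """Hypothetical helper: log of the finite-difference derivative of the
    empirical quantile function on a uniform grid of `grid_size` cells.

    `eps` (an assumption of this sketch) floors the derivative so that ties
    in the sample do not produce log(0).
    """
    xs = sorted(sample)
    step = 1.0 / grid_size
    probs = [k / grid_size for k in range(grid_size + 1)]
    q = [empirical_quantile(xs, p) for p in probs]
    return [
        math.log(max((q[k + 1] - q[k]) / step, eps))
        for k in range(grid_size)
    ]


# One distribution-valued observation, here an evenly spread sample on [0, 1].
uniform_sample = [i / 100 for i in range(101)]
features = ldq_embedding(uniform_sample)

# Rescaling the sample shifts every coordinate by the log of the scale factor.
halved = ldq_embedding([0.5 * x for x in uniform_sample])
```

The resulting coordinates are ordinary real numbers that may be negative (as in `halved`), so a vector of them can be passed to a standard regression routine without sign or monotonicity constraints, which is consistent with the unconstrained model the summary describes. Note that a derivative is unchanged by shifting the sample, so this vector alone carries no location information.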