Alzheimer-type dementia prediction by sparse logistic regression using claim data
| Published in | Computer methods and programs in biomedicine Vol. 196; p. 105582 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | Ireland, Elsevier B.V, 01.11.2020 |
| ISSN | 0169-2607 1872-7565 |
| DOI | 10.1016/j.cmpb.2020.105582 |
| Summary: | • This study developed an Alzheimer-type dementia prediction model based on health insurance claim data and long-term care claim data for elderly people in Japan. • Feature selection was a critical issue in utilizing claim data, which include a large amount of information. • Sparse logistic regression models with L0 regularization (SLR-L0) and L1 regularization (SLR-L1) were used for feature selection. • SLR-L0 was more effective than SLR-L1 at selecting influential features.
This study aimed to predict the risk of Alzheimer-type dementia in persons aged over 75 years who were not receiving long-term care services, using regularly collected claim data. A refined dataset of 48,123 persons was prepared from health insurance and long-term care insurance claim data in a large city in the metropolitan area of Japan. The features comprise the age and sex of the subjects, 502 diseases based on ICD-10 diagnosis codes, and 107 prescription drugs based on therapeutic classes. The most important challenge in this work was feature selection from a large number of features. We adopted sparse logistic regression models with L0 regularization (SLR-L0) and L1 regularization (SLR-L1) as machine-learning classification models. These regularizations enable feature selection by estimating a sparse solution, one with few non-zero coefficients, during model optimization. Predictions were made by integrating 100 predictors trained on bootstrap samples. The areas under the ROC curve (AUCs) were 0.663 for SLR-L0 and 0.660 for SLR-L1. Although these performances were similar, the average numbers of selected features, out of a total of 611, were 13 for SLR-L0 and 253 for SLR-L1. The results indicate that SLR-L1 tended to include less useful features, whereas SLR-L0 narrowed the model down to influential features. SLR-L0 might be more useful than SLR-L1 for practical use or for discussing risk factors with medical experts. |
|---|---|
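
The summary describes a pipeline of sparse logistic regression predictors trained on bootstrap samples, combined into one prediction and scored by AUC. The following is a minimal sketch of that idea for the L1 variant only, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the synthetic data from the hypothetical helper `make_synthetic`, the proximal gradient optimizer, the hyperparameters `lam`, `lr` and `n_iter`, combining predictors by averaging probabilities, and the use of 5 predictors instead of 100 to keep the run short. The L0 variant needs a different optimizer and is not shown.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrinks w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.04, lr=0.5, n_iter=150):
    """L1-penalised logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n  # the intercept is not penalised
        # Gradient step, then soft-thresholding, which produces exact zeros.
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_bootstrap_ensemble(X, y, n_models=5, seed=0):
    """Train one sparse model per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_l1_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models


def ensemble_proba(models, x):
    # Assumption: predictors are combined by averaging their probabilities.
    return sum(predict_proba(m, x) for m in models) / len(models)


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, d, rng):
    # Hypothetical binary features; only the first two affect the outcome.
    X = [[float(rng.random() < 0.3) for _ in range(d)] for _ in range(n)]
    y = [int(rng.random() < sigmoid(-1.5 + 2.0 * x[0] + 2.0 * x[1])) for x in X]
    return X, y


rng = random.Random(42)
X_train, y_train = make_synthetic(300, 15, rng)
X_test, y_test = make_synthetic(200, 15, rng)
models = fit_bootstrap_ensemble(X_train, y_train)
test_auc = auc([ensemble_proba(models, x) for x in X_test], y_test)
avg_selected = sum(sum(wj != 0.0 for wj in w) for w, _ in models) / len(models)
print(f"AUC = {test_auc:.3f}, mean selected features = {avg_selected:.1f} of 15")
```

Counting the non-zero coefficients per predictor, as `avg_selected` does, mirrors the "average number of selected features" reported in the summary. Raising `lam` makes the soft-thresholding step zero out more coefficients.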