Machine learning-driven risk assessment of coronary heart disease: Analysis of NHANES data from 1999 to 2018

Bibliographic Details
Published in Zhong nan da xue xue bao. Journal of Central South University. Yi xue ban Vol. 49; no. 8; p. 1175
Main Authors Lu, Jin, Hu, Haochang, Xiu, Jiaming, Yang, Yanfang, Zhu, Qifeng, Dai, Hanyi, Liu, Xianbao, Wang, Jian'an
Format Journal Article
Language Chinese
English
Published China 28.08.2024
ISSN 1672-7347
DOI 10.11817/j.issn.1672-7347.2024.240394


Abstract Objectives: The high incidence of coronary artery heart disease (CHD) poses a significant burden and challenge to public health systems globally. Effective prevention and early diagnosis of CHD have become key strategies to alleviate this burden. This study aims to explore the application of advanced machine learning techniques to enhance the accuracy of early screening and risk assessment for CHD.
Methods: A total of 49 490 study subjects from the National Health and Nutrition Examination Survey (NHANES) database spanning 1999 to 2018 were included. The dataset was randomly divided into training (70%) and testing (30%) sets. The dependent (outcome) variable was whether the subjects had been informed of a CHD diagnosis, which divided them into a CHD group and a non-CHD group. After a review of the literature on risk factors associated with CHD, 68 independent variables were included. The variable characteristics of the study subjects were analyzed, and differences between the CHD and non-CHD groups were compared. Two machine learning algorithms, random forest (randomForest_4.7-1.1) and XGBoost (xgboost_1.7.7.1), were used for variable selection. The top 10 variables identified by these 2 algorithms were analyzed together, and those recognized by both were selected. A generalized linear model was used to analyze the relationships between the variables and CHD, and classical logistic regression was used to construct the CHD risk prediction model. The model's ability to distinguish between CHD and non-CHD individuals was assessed using the area under the receiver operating characteristic curve (AUC); calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test to evaluate the consistency between predicted values and actual CHD proportions; and decision curve analysis was applied to evaluate the clinical benefit of the model's risk prediction. Finally, a nomogram was constructed to visually present the risk scoring of the final model.
Results: The mean age of the overall population was (49.53±18.31) years, with males comprising 51.8%. Compared to the non-CHD group, the CHD group was older [(69.05±11.32) years vs (48.67±18.07) years, P<0.001], had a higher proportion of females (67.1% vs 47.4%, P<0.001), and exhibited statistically significant differences in classical cardiovascular risk factors such as body mass index, systolic blood pressure, diastolic blood pressure, and smoking (all P<0.001). Additionally, there were statistically significant differences in non-classical cardiovascular factors, such as energy intake, vitamin E, vitamin K, calcium, phosphorus, magnesium, zinc, copper, sodium, potassium, and selenium (all P<0.05). Six key variables most associated with CHD occurrence were ultimately identified. The CHD risk prediction model constructed was as follows: logit(p) = -7.783 + 0.074×age + 0.003×creatinine - 0.003×platelets + 0.257×glycated hemoglobin + 0.003×uric acid + 0.101×coefficient of variation of red cell distribution width. The model demonstrated excellent discriminative ability in predicting CHD, with an accuracy of 0.712 and an AUC of 0.841. Calibration curves indicated good consistency between predicted probabilities and actual values in both the training and testing sets, indicating model stability and reliability. Decision curve analysis suggested that the model provided net benefits across a range of threshold probabilities, supporting its potential application in clinical decision-making.
Conclusions: This study identified potential risk factors for CHD using machine learning techniques and developed a concise and practical clinical prediction model. Further prospective clinical cohort studies are needed to validate its potential for clinical application, enabling effective cardiovascular disease prevention and intervention strategies in real-world healthcare settings.
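The risk equation reported in the abstract can be turned into a predicted probability with the inverse logit, p = 1 / (1 + exp(-logit(p))). The following is a minimal Python sketch of that arithmetic, not the authors' implementation. The names `COEFFICIENTS`, `chd_logit`, and `chd_probability` are hypothetical, and the abstract does not state the measurement units of the laboratory variables, so inputs are assumed to be in whatever units the authors used. The example patient values are invented purely to show the call.

```python
import math

# Coefficients copied from the logit(p) equation in the abstract.
# Key names are illustrative; units are NOT given in the abstract (assumption:
# inputs use the same units as the authors' NHANES-derived variables).
INTERCEPT = -7.783
COEFFICIENTS = {
    "age": 0.074,
    "creatinine": 0.003,
    "platelets": -0.003,
    "glycated_hemoglobin": 0.257,
    "uric_acid": 0.003,
    "rdw_cv": 0.101,  # coefficient of variation of red cell distribution width
}


def chd_logit(values):
    """Return the linear predictor logit(p) for one subject."""
    return INTERCEPT + sum(
        coef * values[name] for name, coef in COEFFICIENTS.items()
    )


def chd_probability(values):
    """Convert the linear predictor to a probability via the inverse logit."""
    return 1.0 / (1.0 + math.exp(-chd_logit(values)))


# Hypothetical subject, for illustration only (not data from the study).
example = {
    "age": 65,
    "creatinine": 80,
    "platelets": 240,
    "glycated_hemoglobin": 6.0,
    "uric_acid": 330,
    "rdw_cv": 13.5,
}
print(round(chd_probability(example), 3))
```

Because the published coefficients are rounded to three decimals and the units are assumed, probabilities from this sketch should be treated as illustrative rather than as a reproduction of the paper's nomogram.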
Author Yang, Yanfang
Xiu, Jiaming
Zhu, Qifeng
Wang, Jian'an
Hu, Haochang
Dai, Hanyi
Liu, Xianbao
Lu, Jin
Author_xml – sequence: 1
  givenname: Jin
  surname: Lu
  fullname: Lu, Jin
  email: 12318327@zju.edu.cn
  organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 2
  givenname: Haochang
  surname: Hu
  fullname: Hu, Haochang
  organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 3
  givenname: Jiaming
  surname: Xiu
  fullname: Xiu, Jiaming
  organization: Department of Cardiology, Longyan First Affiliated Hospital of Fujian Medical University, Longyan Fujian 364000
– sequence: 4
  givenname: Yanfang
  surname: Yang
  fullname: Yang, Yanfang
  organization: Department of Cardiology, Provincial Clinical Medical College of Fujian Medical University, Fujian Provincial Hospital, Fuzhou 350001
– sequence: 5
  givenname: Qifeng
  surname: Zhu
  fullname: Zhu, Qifeng
  organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
– sequence: 6
  givenname: Hanyi
  surname: Dai
  fullname: Dai, Hanyi
  organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 7
  givenname: Xianbao
  surname: Liu
  fullname: Liu, Xianbao
  organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
– sequence: 8
  givenname: Jian'an
  surname: Wang
  fullname: Wang, Jian'an
  email: wangjianan111@zju.edu.cn
  organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
BackLink https://www.ncbi.nlm.nih.gov/pubmed/39788507$$D View this record in MEDLINE/PubMed
ContentType Journal Article
DBID CGR
CUY
CVF
ECM
EIF
NPM
7X8
DOI 10.11817/j.issn.1672-7347.2024.240394
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE - Academic
MEDLINE
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: EIF
  name: MEDLINE
  url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search
  sourceTypes: Index Database
DeliveryMethod fulltext_linktorsrc
DocumentTitleAlternate 机器学习驱动的冠心病风险评估:1999至2018年NHANES数据分析
ExternalDocumentID 39788507
Genre Journal Article
GeographicLocations United States
GeographicLocations_xml – name: United States
GroupedDBID ALMA_UNASSIGNED_HOLDINGS
CGR
CUY
CVF
ECM
EIF
NPM
RPM
7X8
ISSN 1672-7347
IngestDate Fri Jul 11 05:55:51 EDT 2025
Mon Jan 13 02:22:04 EST 2025
IsPeerReviewed false
IsScholarly true
Issue 8
Keywords risk assessment
risk factors
National Health and Nutrition Examination Survey
machine learning
coronary artery heart disease
Language Chinese
English
LinkModel OpenURL
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
PMID 39788507
PQID 3153916985
PQPubID 23479
ParticipantIDs proquest_miscellaneous_3153916985
pubmed_primary_39788507
PublicationCentury 2000
PublicationDate 2024-Aug-28
20240828
PublicationDateYYYYMMDD 2024-08-28
PublicationDate_xml – month: 08
  year: 2024
  text: 2024-Aug-28
  day: 28
PublicationDecade 2020
PublicationPlace China
PublicationPlace_xml – name: China
PublicationTitle Zhong nan da xue xue bao. Journal of Central South University. Yi xue ban
PublicationTitleAlternate Zhong Nan Da Xue Xue Bao Yi Xue Ban
PublicationYear 2024
SSID ssj0002511111
Score 2.300478
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage 1175
SubjectTerms Aged
Algorithms
Coronary Artery Disease - diagnosis
Coronary Artery Disease - epidemiology
Coronary Artery Disease - etiology
Coronary Disease - epidemiology
Coronary Disease - etiology
Female
Humans
Machine Learning
Male
Middle Aged
Nutrition Surveys
Risk Assessment - methods
Risk Factors
United States - epidemiology
Title Machine learning-driven risk assessment of coronary heart disease: Analysis of NHANES data from 1999 to 2018
URI https://www.ncbi.nlm.nih.gov/pubmed/39788507
https://www.proquest.com/docview/3153916985
Volume 49
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAQN
  databaseName: PubMed Central
  issn: 1672-7347
  databaseCode: RPM
  dateStart: 20210101
  customDbUrl:
  isFulltext: true
  dateEnd: 99991231
  titleUrlDefault: https://www.ncbi.nlm.nih.gov/pmc/
  omitProxy: true
  ssIdentifier: ssj0002511111
  providerName: National Library of Medicine
linkProvider National Library of Medicine