Machine learning-driven risk assessment of coronary heart disease: Analysis of NHANES data from 1999 to 2018

Bibliographic Details
Published in	Zhong nan da xue xue bao. Journal of Central South University. Yi xue ban Vol. 49; no. 8; p. 1175
Main Authors	Lu, Jin, Hu, Haochang, Xiu, Jiaming, Yang, Yanfang, Zhu, Qifeng, Dai, Hanyi, Liu, Xianbao, Wang, Jian'an
Format	Journal Article
Language	Chinese English
Published	China 28.08.2024
Subjects	Aged Algorithms Coronary Artery Disease - diagnosis Coronary Artery Disease - epidemiology Coronary Artery Disease - etiology Coronary Disease - epidemiology Coronary Disease - etiology Female Humans Machine Learning Male Middle Aged Nutrition Surveys Risk Assessment - methods Risk Factors United States - epidemiology United States risk assessment risk factors National Health and Nutrition Examination Survey machine learning coronary artery heart disease
ISSN	1672-7347
DOI	10.11817/j.issn.1672-7347.2024.240394

Abstract	OBJECTIVES: The high incidence of coronary artery heart disease (CHD) places a significant burden on public health systems worldwide, and effective prevention and early diagnosis have become key strategies for reducing it. This study explores the use of machine learning techniques to improve the accuracy of early screening and risk assessment for CHD.
METHODS: A total of 49 490 subjects from the National Health and Nutrition Examination Survey (NHANES) database, 1999 to 2018, were included. The dataset was randomly divided into a training set (70%) and a testing set (30%). The dependent (outcome) variable was whether subjects had been informed of a CHD diagnosis, which divided them into a CHD group and a non-CHD group. Based on a review of the literature on CHD risk factors, 68 independent variables were included. Variable characteristics were analyzed and compared between the CHD and non-CHD groups. Two machine learning algorithms, random forest (randomForest_4.7-1.1) and XGBoost (xgboost_1.7.7.1), were used for variable selection. The top 10 variables identified by each algorithm were compared, and those ranked by both were selected. A generalized linear model was used to analyze the relationships between the variables and CHD, and classical logistic regression was used to construct the CHD risk prediction model. Discrimination between CHD and non-CHD individuals was assessed with the area under the receiver operating characteristic curve (AUC); calibration was assessed with the Hosmer-Lemeshow goodness-of-fit test, which evaluates the agreement between predicted values and actual CHD proportions; and decision curve analysis was used to evaluate the clinical benefit of the model's risk predictions. Finally, a nomogram was constructed to present the risk scoring of the final model visually.
RESULTS: The mean age of the overall population was (49.53±18.31) years, with males comprising 51.8%. Compared with the non-CHD group, the CHD group was older [(69.05±11.32) years vs (48.67±18.07) years, P<0.001], had a higher proportion of females (67.1% vs 47.4%, P<0.001), and differed significantly in classical cardiovascular risk factors such as body mass index, systolic blood pressure, diastolic blood pressure, and smoking (all P<0.001). There were also statistically significant differences in non-classical cardiovascular factors, such as energy intake, vitamin E, vitamin K, calcium, phosphorus, magnesium, zinc, copper, sodium, potassium, and selenium (all P<0.05). Six key variables most associated with CHD occurrence were ultimately identified. The resulting CHD risk prediction model was: logit(p) = -7.783 + 0.074×age + 0.003×creatinine - 0.003×platelets + 0.257×glycated hemoglobin + 0.003×uric acid + 0.101×coefficient of variation of red cell distribution width. The model demonstrated excellent discriminative ability in predicting CHD, with an accuracy of 0.712 and an AUC of 0.841. Calibration curves indicated good agreement between predicted probabilities and actual values in both the training and testing sets, demonstrating model stability and reliability. Decision curve analysis suggested that the model provided net benefit across a range of threshold probabilities, supporting its potential application in clinical decision-making.
CONCLUSIONS: This study identified potential risk factors for CHD using machine learning techniques and developed a concise and practical clinical prediction model. Further prospective clinical cohort studies are needed to validate its potential for clinical application and to support effective cardiovascular disease prevention and intervention strategies in real-world healthcare settings.
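The logistic model reported in the abstract can be turned into a predicted probability with the inverse logit, p = 1 / (1 + exp(-logit(p))). The following is a minimal sketch of that arithmetic only, not the authors' code. The coefficients are copied from the abstract; the dictionary keys, the function names, and the example input values are hypothetical. The abstract does not state measurement units, so inputs would need to be in the units of the authors' NHANES-derived data for the result to be meaningful.

```python
import math

# Intercept and coefficients exactly as reported in the abstract.
INTERCEPT = -7.783
COEFFICIENTS = {
    "age": 0.074,
    "creatinine": 0.003,
    "platelets": -0.003,
    "glycated_hemoglobin": 0.257,
    "uric_acid": 0.003,
    "rdw_cv": 0.101,  # coefficient of variation of red cell distribution width
}


def chd_logit(values):
    """Return the linear predictor logit(p) for one subject.

    `values` maps each key in COEFFICIENTS to a number. The key names are
    illustrative assumptions; the units are NOT given in the abstract.
    """
    missing = set(COEFFICIENTS) - set(values)
    if missing:
        raise ValueError(f"missing predictors: {sorted(missing)}")
    return INTERCEPT + sum(COEFFICIENTS[k] * values[k] for k in COEFFICIENTS)


def chd_probability(values):
    """Convert the linear predictor into a probability with the inverse logit."""
    return 1.0 / (1.0 + math.exp(-chd_logit(values)))


# Hypothetical subject, for illustration only (not data from the study).
example = {
    "age": 60,
    "creatinine": 80,
    "platelets": 250,
    "glycated_hemoglobin": 5.5,
    "uric_acid": 330,
    "rdw_cv": 13,
}

if __name__ == "__main__":
    print(f"logit(p) = {chd_logit(example):.4f}")
    print(f"p        = {chd_probability(example):.4f}")
```

Because every coefficient except that of platelets is positive, the predicted probability rises with age, creatinine, glycated hemoglobin, uric acid, and red cell distribution width, and falls as the platelet count increases.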
Author_xml	– sequence: 1 givenname: Jin surname: Lu fullname: Lu, Jin email: 12318327@zju.edu.cn organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 2 givenname: Haochang surname: Hu fullname: Hu, Haochang organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 3 givenname: Jiaming surname: Xiu fullname: Xiu, Jiaming organization: Department of Cardiology, Longyan First Affiliated Hospital of Fujian Medical University, Longyan Fujian 364000
– sequence: 4 givenname: Yanfang surname: Yang fullname: Yang, Yanfang organization: Department of Cardiology, Provincial Clinical Medical College of Fujian Medical University, Fujian Provincial Hospital, Fuzhou 350001
– sequence: 5 givenname: Qifeng surname: Zhu fullname: Zhu, Qifeng organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
– sequence: 6 givenname: Hanyi surname: Dai fullname: Dai, Hanyi organization: State Key Laboratory of Transvascular Implantation Devices, Hangzhou 310009
– sequence: 7 givenname: Xianbao surname: Liu fullname: Liu, Xianbao organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
– sequence: 8 givenname: Jian'an surname: Wang fullname: Wang, Jian'an email: wangjianan111@zju.edu.cn organization: Binjiang Institute of Zhejiang University, Hangzhou 310053, China
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/39788507 (MEDLINE/PubMed record)
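DocumentTitleAlternate	Machine learning-driven risk assessment of coronary heart disease: analysis of NHANES data from 1999 to 2018 (original Chinese title: 机器学习驱动的冠心病风险评估：1999至2018年NHANES数据分析)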
IsPeerReviewed	false
IsScholarly	true
PMID	39788507