Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions

Bibliographic Details
Published in Curēus (Palo Alto, CA) Vol. 16; no. 3; p. e55991
Main Authors Abbas, Ali, Rehman, Mahad S, Rehman, Syed S
Format Journal Article
Language English
Published United States: Springer Nature B.V / Cureus, 11.03.2024
ISSN 2168-8184
DOI 10.7759/cureus.55991


Abstract
Introduction: Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions.
Methods: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).
Results: A total of 163 questions were queried by each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5).
Conclusion: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.
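The methods above reduce to a simple pipeline: score each model's answer against the NBME key, compute per-model accuracy, and compare the four models with a one-way ANOVA. Below is a minimal sketch of that analysis in Python, assuming hypothetical per-question correctness vectors (the study's actual response data are not reproduced here) and SciPy's stats.f_oneway for the ANOVA.

```python
# Minimal sketch of the scoring and ANOVA step described in the abstract.
# Assumption: correctness is encoded 1 = correct, 0 = incorrect, one entry
# per sample question. The study used 163 questions per model; the short
# vectors below are illustrative placeholders, not the study's data.
from scipy import stats

results = {
    "GPT-4":   [1, 1, 1, 1, 1, 1, 1, 1],
    "GPT-3.5": [1, 1, 0, 1, 1, 1, 0, 1],
    "Claude":  [1, 1, 1, 0, 1, 1, 1, 1],
    "Bard":    [1, 0, 1, 1, 0, 1, 1, 1],
}

# Per-model accuracy, matching the form of the reported scores
# (e.g., GPT-3.5 scored 134/163 = 82.2% in the study).
for model, scores in results.items():
    pct = 100 * sum(scores) / len(scores)
    print(f"{model}: {sum(scores)}/{len(scores)} ({pct:.1f}%)")

# One-way ANOVA across the four models' correctness vectors.
f_stat, p_value = stats.f_oneway(*results.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```

A one-way ANOVA on binary (correct/incorrect) outcomes is the test the authors report; with pass/fail data, a chi-squared test of proportions would be a common alternative, but this sketch follows the paper's stated method.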
Author Rehman, Syed S
Rehman, Mahad S
Abbas, Ali
AuthorAffiliation 1 Medical School, University of Texas Southwestern Medical School, Dallas, USA
2 Nephrology, Baptist Hospitals of Southeast Texas, Beaumont, USA
Copyright Copyright © 2024, Abbas et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Discipline Medicine
EISSN 2168-8184
ExternalDocumentID PMC11007479
Keywords claude
gpt-4
artificial intelligence in medicine
artificial intelligence (ai)
google's bard
artificial intelligence and education
nbme subject exam
large language model
chatgpt
united states medical licensing examination (usmle)
License Copyright © 2024, Abbas et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PMID 38606229
SubjectTerms Ambulatory care
Artificial intelligence
Cancer
Gynecology
Healthcare Technology
Language
Large language models
Medical Education
Medical examiners
Medical screening
Medical students
Medicine
Multiple choice
Neurology
Obstetrics
Other
Pediatrics
Performance evaluation
Psychiatry
Surgery
Variance analysis
URI https://www.ncbi.nlm.nih.gov/pubmed/38606229
https://www.proquest.com/docview/3049780483
https://www.proquest.com/docview/3038436237
https://pubmed.ncbi.nlm.nih.gov/PMC11007479