Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions
Published in | Curēus (Palo Alto, CA), Vol. 16, No. 3, p. e55991 |
Main Authors | Abbas, Ali; Rehman, Mahad S; Rehman, Syed S |
Format | Journal Article |
Language | English |
Published | United States: Springer Nature B.V (Cureus), 11.03.2024 |
ISSN | 2168-8184 |
DOI | 10.7759/cureus.55991 |
Abstract | Introduction: Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions.
Methods: The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).
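The grading procedure described in Methods is straightforward to automate. The Python sketch below illustrates one way to do it; `ask_model` is a hypothetical, caller-supplied wrapper around each vendor's chat interface, and the field names on the question records are illustrative assumptions, not taken from the paper.

```python
# Sketch of the Methods grading loop: pose each NBME sample question to
# every model and tally answers matching the NBME key, per subject exam.
# `ask_model` is a hypothetical wrapper around each vendor's chat API;
# it is assumed to return the option letter the model chose (e.g. "C").
from collections import defaultdict

MODELS = ["GPT-4", "GPT-3.5", "Claude", "Bard"]

def score_models(questions, ask_model):
    # questions: iterable of dicts with keys 'subject', 'stem',
    # 'choices' (option letter -> text), and 'answer' (the NBME key).
    correct = {m: defaultdict(int) for m in MODELS}  # model -> subject -> hits
    totals = defaultdict(int)                        # subject -> question count
    for q in questions:
        totals[q["subject"]] += 1
        for model in MODELS:
            reply = ask_model(model, q["stem"], q["choices"])
            if reply.strip().upper() == q["answer"].upper():
                correct[model][q["subject"]] += 1
    return correct, totals
```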
Results: A total of 163 questions were posed to each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8, 15.3, and 24.5 percentage points, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20, 91.3%), while the family medicine exam had the lowest (3.75/5, 75%).
Conclusion: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine. |
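The one-way ANOVA named in Methods compares mean per-exam accuracy across the four models. A minimal sketch with SciPy's `f_oneway` follows; the eight per-exam values per model are placeholders chosen to match the reported totals, since the paper's full per-exam breakdown is not reproduced in this record.

```python
# One-way ANOVA over per-subject-exam accuracy for the four models, as in
# the study's statistical analysis. The eight values per model (one per
# NBME subject exam) are placeholders, NOT the paper's actual data; only
# the totals (100%, 82.2%, 84.7%, 75.5%) are reported in the abstract.
from scipy.stats import f_oneway

gpt4   = [1.00] * 8                                         # 163/163 overall
gpt35  = [0.85, 0.80, 0.78, 0.90, 0.75, 0.80, 0.88, 0.82]   # mean 82.2%
claude = [0.88, 0.82, 0.85, 0.86, 0.80, 0.84, 0.87, 0.86]   # mean 84.7%
bard   = [0.78, 0.72, 0.75, 0.80, 0.70, 0.74, 0.79, 0.76]   # mean 75.5%

f_stat, p_value = f_oneway(gpt4, gpt35, claude, bard)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")  # small p -> group means differ
```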
Author | Abbas, Ali; Rehman, Mahad S; Rehman, Syed S |
AuthorAffiliation | 1 Medical School, University of Texas Southwestern Medical School, Dallas, USA; 2 Nephrology, Baptist Hospitals of Southeast Texas, Beaumont, USA |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/38606229 (View this record in MEDLINE/PubMed) |
CitedBy_id | crossref_primary_10_1038_s41598_024_79335_w crossref_primary_10_1177_10711813241261689 crossref_primary_10_1002_ca_24244 crossref_primary_10_2196_67244 crossref_primary_10_7759_cureus_77292 crossref_primary_10_3389_frai_2024_1514896 crossref_primary_10_1007_s00066_024_02342_3 crossref_primary_10_5005_jp_journals_11002_0095 crossref_primary_10_1186_s12909_024_06048_z |
ContentType | Journal Article |
Copyright | Copyright © 2024, Abbas et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
DOI | 10.7759/cureus.55991 |
Discipline | Medicine |
EISSN | 2168-8184 |
ExternalDocumentID | PMC11007479 38606229 10_7759_cureus_55991 |
Genre | Journal Article |
ISSN | 2168-8184 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
Keywords | claude; gpt-4; artificial intelligence in medicine; artificial intelligence (ai); google's bard; artificial intelligence and education; nbme subject exam; large language model; chatgpt; united states medical licensing examination (usmle) |
Language | English |
License | Copyright © 2024, Abbas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License CC-BY 4.0., which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
OpenAccessLink | http://journals.scholarsportal.info/openUrl.xqy?doi=10.7759/cureus.55991 |
PMID | 38606229 |
PQID | 3049780483 |
PQPubID | 2045583 |
PublicationCentury | 2000 |
PublicationDate | 2024-03-11 |
PublicationDateYYYYMMDD | 2024-03-11 |
PublicationDecade | 2020 |
PublicationPlace | United States |
PublicationTitle | Curēus (Palo Alto, CA) |
PublicationTitleAlternate | Cureus |
PublicationYear | 2024 |
Publisher | Springer Nature B.V Cureus |
SourceID | pubmedcentral proquest pubmed crossref |
SourceType | Open Access Repository Aggregation Database Index Database Enrichment Source |
StartPage | e55991 |
SubjectTerms | Ambulatory care; Artificial intelligence; Cancer; Gynecology; Healthcare Technology; Language; Large language models; Medical Education; Medical examiners; Medical screening; Medical students; Medicine; Multiple choice; Neurology; Obstetrics; Other; Pediatrics; Performance evaluation; Psychiatry; Surgery; Variance analysis |
Title | Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions |
URI | https://www.ncbi.nlm.nih.gov/pubmed/38606229 https://www.proquest.com/docview/3049780483 https://www.proquest.com/docview/3038436237 https://pubmed.ncbi.nlm.nih.gov/PMC11007479 |
Volume | 16 |