A systematic review of large language model (LLM) evaluations in clinical medicine

Bibliographic Details
Published in: BMC Medical Informatics and Decision Making, Vol. 25, No. 1, Article 117
Main Authors: Shool, Sina; Adimi, Sara; Saboori Amleshi, Reza; Bitaraf, Ehsan; Golpira, Reza; Tara, Mahmood
Format: Journal Article
Language: English
Published: London: BioMed Central, 07.03.2025
ISSN: 1472-6947
DOI: 10.1186/s12911-025-02954-4

Summary:
Background: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment.
Objective: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends.
Methods: A comprehensive review of the literature was conducted across the PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts.
Results: The results reveal a growing interest in leveraging LLM tools in clinical settings, with 761 studies meeting the inclusion criteria. While general-domain LLMs, particularly ChatGPT and GPT-4, dominated evaluations (93.55%), medical-domain LLMs accounted for only 6.45%. Accuracy emerged as the most commonly assessed parameter (21.78%). Despite these advancements, the evidence base highlights certain limitations and biases across the included studies, emphasizing the need for careful interpretation and robust evaluation frameworks.
Conclusions: The exponential growth in LLM research underscores their transformative potential in healthcare. However, addressing challenges such as ethical risks, evaluation variability, and underrepresentation of critical specialties will be essential. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.