Performance of Large Language Models in Nursing Examinations: Comparative Analysis of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark in China
| Published in | Nursing Open, Vol. 12, No. 10, pp. e70317 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | United States: John Wiley & Sons, Inc., 01.10.2025 |
| Subjects | |
| ISSN | 2054-1058 |
| DOI | 10.1002/nop2.70317 |
ABSTRACT
Background
While large language models (LLMs) have been widely utilised in nursing education, their performance in Chinese nursing examinations remains unexplored, particularly in the context of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark.
Purpose
This study assessed the performance of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark on the 2022 China National Nursing Professional Qualification Exam (CNNPQE) at both the Junior and Intermediate levels. It also investigated whether the accuracy of these language models' responses correlated with the exam's difficulty or subject matter.
Methods
We entered 800 questions from the 2022 CNNPQE‐Junior and CNNPQE‐Intermediate exams into ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark and recorded each model's accuracy in answering them correctly. We then analysed whether these accuracy rates were associated with the exams' difficulty levels or subjects.
Results
The accuracy of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark on the CNNPQE‐Junior was 49.3% (197/400), 68.5% (274/400) and 61.0% (244/400), respectively, whereas on the CNNPQE‐Intermediate it was 56.4% (225/399), 70.7% (282/399) and 57.6% (230/399). When the results were stratified by exam grade (Junior vs. Intermediate), the differences in accuracy among the three models were statistically significant (M² = 95.531, degrees of freedom (df) = 4, p < 0.001). ChatGPT‐4's accuracy in elementary knowledge, relevant professional knowledge, professional knowledge and professional practice ability was 74.5%, 63.5%, 79.0% and 62.3%, respectively, making it the most accurate of the three models across the subjects of the CNNPQE. The Cochran–Mantel–Haenszel (CMH) test also showed a statistically significant difference in the accuracy rates of the three LLMs when stratified by subject (M² = 97.435, df = 4, p < 0.001).
Conclusions
ChatGPT‐4 and iFLYTEK Spark performed well on Chinese nursing examinations and demonstrated potential as valuable tools in nursing education.
Bibliography
This work was supported by the Sichuan University College Students' Innovation and Entrepreneurship Training Program (C2024128641) and the Sichuan University Graduate Education and Teaching Reform Research Project (GSSCU2023104). Peifang Li and Menglin Jiang contributed equally to this work.
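As a rough illustration of the headline comparison in the Results, the Python sketch below (not taken from the paper) re-tallies the per-model accuracy figures quoted in the abstract and runs a chi-square test of homogeneity within each exam level. This is only a simplified stand-in for the stratified Cochran–Mantel–Haenszel analysis the authors report, and the per-subject breakdown is omitted.

```python
# Illustrative sketch only -- not the authors' analysis code. It re-tallies the
# accuracy figures quoted in the abstract and runs a per-level chi-square test
# of homogeneity as a simplified stand-in for the Cochran-Mantel-Haenszel test
# the study actually reports.
from scipy.stats import chi2_contingency

# (correct answers, total questions) per model and exam level, from the abstract
results = {
    "CNNPQE-Junior": {
        "ChatGPT-3.5": (197, 400),
        "ChatGPT-4": (274, 400),
        "iFLYTEK Spark": (244, 400),
    },
    "CNNPQE-Intermediate": {
        "ChatGPT-3.5": (225, 399),
        "ChatGPT-4": (282, 399),
        "iFLYTEK Spark": (230, 399),
    },
}

for level, models in results.items():
    for name, (correct, total) in models.items():
        print(f"{level:20s} {name:14s} accuracy = {correct / total:.1%}")
    # 3 x 2 contingency table of (correct, incorrect) counts for this level
    table = [[correct, total - correct] for correct, total in models.values()]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{level}: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3g}")
```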