Performance of Large Language Models in Nursing Examinations: Comparative Analysis of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark in China

Bibliographic Details
Published in: Nursing Open, Vol. 12, No. 10, e70317
Main Authors: Li, Peifang; Jiang, Menglin; Chen, Jiali; Ning, Ning
Format: Journal Article
Language: English
Published: United States, John Wiley & Sons, Inc., 01.10.2025
ISSN: 2054-1058
DOI: 10.1002/nop2.70317

Summary: ABSTRACT

Background: While large language models (LLMs) have been widely utilised in nursing education, their performance on Chinese nursing examinations remains unexplored, particularly for ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark.

Purpose: This study assessed the performance of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark on the 2022 China National Nursing Professional Qualification Exam (CNNPQE) at both the Junior and Intermediate levels. It also investigated whether the accuracy of these models' responses correlated with the exam's difficulty or subject matter.

Methods: We input 800 questions from the 2022 CNNPQE-Junior and CNNPQE-Intermediate exams into ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark to determine their accuracy rates in answering the questions correctly. We then analysed the correlation between these accuracy rates and the exams' difficulty levels or subjects.

Results: The accuracy of ChatGPT-3.5, ChatGPT-4 and iFLYTEK Spark on the CNNPQE-Junior was 49.3% (197/400), 68.5% (274/400) and 61.0% (244/400), respectively, versus 56.4% (225/399), 70.7% (282/399) and 57.6% (230/399) on the CNNPQE-Intermediate. Across exam grades, the differences in accuracy rates among the three models were statistically significant (M2 = 95.531, degrees of freedom (df) = 4, p < 0.001). The accuracy of ChatGPT-4 in elementary knowledge, relevant professional knowledge, professional knowledge and professional practice ability was 74.5%, 63.5%, 79.0% and 62.3%, respectively, leading the three models in accuracy across subjects in the CNNPQE. The Cochran–Mantel–Haenszel (CMH) test showed that, across subjects, the accuracy rates of the three LLMs differed significantly (M2 = 97.435, df = 4, p < 0.001).

Conclusions: ChatGPT-4 and iFLYTEK Spark performed well on Chinese nursing examinations and demonstrated potential as valuable tools in nursing education.
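As a quick sanity check (not part of the article), the accuracy percentages reported in the Results can be recomputed from the raw correct/total counts given in the abstract. A minimal Python sketch, using only the counts stated above:

```python
# Recompute the reported accuracy rates from the raw counts in the abstract.
# (model, exam level) -> (correct answers, total questions)
results = {
    ("ChatGPT-3.5",   "Junior"):       (197, 400),
    ("ChatGPT-4",     "Junior"):       (274, 400),
    ("iFLYTEK Spark", "Junior"):       (244, 400),
    ("ChatGPT-3.5",   "Intermediate"): (225, 399),
    ("ChatGPT-4",     "Intermediate"): (282, 399),
    ("iFLYTEK Spark", "Intermediate"): (230, 399),
}

for (model, level), (correct, total) in results.items():
    pct = 100 * correct / total
    # Note: 197/400 is exactly 49.25%; the article rounds this up to 49.3%.
    print(f"{model:13s} {level:12s} {correct}/{total} = {pct:.2f}%")
```

Each recomputed value matches the abstract's figure to within rounding (e.g. 282/399 ≈ 70.68%, reported as 70.7%).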
Bibliography: This work was supported by the Sichuan University College Students' Innovation and Entrepreneurship Training Program (C2024128641) and the Sichuan University Graduate Education and Teaching Reform Research Project (GSSCU2023104).
Peifang Li and Menglin Jiang contributed equally to this work.