Performance of Large Language Models in Nursing Examinations: Comparative Analysis of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark in China
| Published in | Nursing Open, Vol. 12, No. 10, pp. e70317 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | United States: John Wiley & Sons, Inc., 01.10.2025 |
| Subjects | |
| ISSN | 2054-1058 |
| DOI | 10.1002/nop2.70317 |
ABSTRACT
Background
While large language models (LLMs) have been widely utilised in nursing education, their performance in Chinese nursing examinations remains unexplored, particularly in the context of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark.
Purpose
This study assessed the performance of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark on the 2022 China National Nursing Professional Qualification Exam (CNNPQE) at both the Junior and Intermediate levels. It also investigated whether the accuracy of these language models' responses correlated with the exam's difficulty or subject matter.
Methods
We entered 800 questions from the 2022 CNNPQE‐Junior and CNNPQE‐Intermediate exams into ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark and recorded each model's accuracy in answering them correctly. We then analysed whether these accuracy rates were associated with the exams' difficulty levels or subjects.
Results
The accuracy of ChatGPT‐3.5, ChatGPT‐4 and iFLYTEK Spark on the CNNPQE‐Junior was 49.3% (197/400), 68.5% (274/400) and 61.0% (244/400), respectively, whereas on the CNNPQE‐Intermediate it was 56.4% (225/399), 70.7% (282/399) and 57.6% (230/399). When the results were stratified by exam grade (Junior vs. Intermediate), the differences in accuracy among the three models were statistically significant (M² = 95.531, degrees of freedom (df) = 4, p < 0.001). ChatGPT‐4's accuracy in elementary knowledge, relevant professional knowledge, professional knowledge and professional practice ability was 74.5%, 63.5%, 79.0% and 62.3%, respectively, making it the most accurate of the three models across the subjects of the CNNPQE. The Cochran–Mantel–Haenszel (CMH) test also showed a statistically significant difference in the accuracy rates of the three LLMs when stratified by subject (M² = 97.435, df = 4, p < 0.001).
Conclusions
ChatGPT‐4 and iFLYTEK Spark performed well on Chinese nursing examinations and demonstrated potential as valuable tools in nursing education.
Bibliography
This work was supported by the Sichuan University College Students' Innovation and Entrepreneurship Training Program (C2024128641) and the Sichuan University Graduate Education and Teaching Reform Research Project (GSSCU2023104). Peifang Li and Menglin Jiang contributed equally to this work.
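As a rough illustration of the headline comparison in the Results, the Python sketch below (not taken from the paper) re-tallies the per-model accuracy figures quoted in the abstract and runs a chi-square test of homogeneity within each exam level. This is only a simplified stand-in for the stratified Cochran–Mantel–Haenszel analysis the authors report, and the per-subject breakdown is omitted.

```python
# Illustrative sketch only -- not the authors' analysis code. It re-tallies the
# accuracy figures quoted in the abstract and runs a per-level chi-square test
# of homogeneity as a simplified stand-in for the Cochran-Mantel-Haenszel test
# the study actually reports.
from scipy.stats import chi2_contingency

# (correct answers, total questions) per model and exam level, from the abstract
results = {
    "CNNPQE-Junior": {
        "ChatGPT-3.5": (197, 400),
        "ChatGPT-4": (274, 400),
        "iFLYTEK Spark": (244, 400),
    },
    "CNNPQE-Intermediate": {
        "ChatGPT-3.5": (225, 399),
        "ChatGPT-4": (282, 399),
        "iFLYTEK Spark": (230, 399),
    },
}

for level, models in results.items():
    for name, (correct, total) in models.items():
        print(f"{level:20s} {name:14s} accuracy = {correct / total:.1%}")
    # 3 x 2 contingency table of (correct, incorrect) counts for this level
    table = [[correct, total - correct] for correct, total in models.values()]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{level}: chi2 = {chi2:.2f}, df = {dof}, p = {p:.3g}")
```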