Large language model comparisons between English and Chinese query performance for cardiovascular prevention

Bibliographic Details
Published in: Communications Medicine, Vol. 5, No. 1, pp. 177-8
Main Authors: Ji, Hongwei; Wang, Xiaofei; Sia, Ching-Hui; Yap, Jonathan; Lim, Soo Teik; Djohan, Andie Hartanto; Chang, Yaowei; Zhang, Ning; Guo, Mengqi; Li, Fuhai; Lim, Zhi Wei; Wang, Ya Xing; Sheng, Bin; Wong, Tien Yin; Cheng, Susan; Yeo, Khung Keong; Tham, Yih-Chung
Format: Journal Article
Language: English
Published: London: Nature Publishing Group UK, 16.05.2025 (Springer Nature B.V.; Nature Portfolio)
ISSN: 2730-664X
DOI: 10.1038/s43856-025-00802-0

Summary:
Background: Large language models (LLMs) offer promise in addressing layperson queries related to cardiovascular disease (CVD) prevention. However, the accuracy and consistency of the information provided by current general-purpose LLMs remain unclear.

Methods: We evaluated the capabilities of BARD (Google's conversational LLM chatbot), ChatGPT-3.5 and ChatGPT-4.0 (OpenAI's conversational models for generating human-like text), and ERNIE (Baidu's knowledge-enhanced language model for context understanding) in addressing CVD prevention queries in English and Chinese. Seventy-five CVD prevention questions were posed to each LLM. The primary outcome was the accuracy of responses, rated as appropriate, borderline, or inappropriate.

Results: For English prompts, the chatbots' "appropriate" ratings were as follows: BARD 88.0%, ChatGPT-3.5 92.0%, and ChatGPT-4.0 97.3%. All models demonstrated temporal improvement in initially suboptimal responses, with BARD and ChatGPT-3.5 each improving by 67% (6/9 and 4/6, respectively) and ChatGPT-4.0 achieving a 100% (2/2) improvement rate. Both BARD and ChatGPT-4.0 outperformed ChatGPT-3.5 in recognizing the correctness of their own responses. For Chinese prompts, the "appropriate" ratings were: ERNIE 84.0%, ChatGPT-3.5 88.0%, and ChatGPT-4.0 85.3%. However, ERNIE outperformed ChatGPT-3.5 and ChatGPT-4.0 in temporal improvement and in self-awareness of correctness.

Conclusions: For CVD prevention queries in English, ChatGPT-4.0 outperforms the other LLMs in generating appropriate responses, temporal improvement, and self-awareness. The LLMs' performance drops slightly for Chinese queries, reflecting potential language bias in these models. Given the growing availability and accessibility of LLM chatbots, regular and rigorous evaluations are essential to assess the quality and limitations of the medical information they provide across widely spoken languages.

Plain Language Summary: Recently, there has been an increase in the use of large language model (LLM) chatbots by patients seeking medical information. However, the accuracy of the information LLMs provide across different languages remains unclear. This study evaluated the performance of popular LLM chatbots (BARD, ChatGPT-3.5, ChatGPT-4.0, and ERNIE) in answering cardiovascular disease prevention questions in both English and Chinese. We tested each model with 75 questions, focusing on the accuracy of its responses and its ability to improve over time. The results showed that ChatGPT-4.0 provided the most accurate answers in English and demonstrated the best improvement over time. In Chinese, ERNIE improved its responses over time better than the other models. This research highlights the need for ongoing evaluations to ensure that LLMs spread reliable health information across diverse languages.

Ji et al. assess the responses of four major large language models to cardiovascular disease prevention queries in both English and Chinese. The chatbots exhibit significant disparities in performance across models and languages, with ChatGPT-4.0 outperforming the others in English.
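As an illustrative aside (not the authors' analysis code), the sketch below shows how the two headline metrics above, the "appropriate" response rate and the temporal improvement rate, could be computed from per-question ratings. The data, function names, and choice of Python here are all assumptions made for illustration.

```python
# Illustrative sketch only -- not the study's actual code. Ratings use
# the paper's three-level scale; the hypothetical data below is chosen
# to reproduce BARD's reported English-prompt figures (88.0%, 67%).

def appropriate_rate(ratings):
    """Fraction of responses rated 'appropriate' (the primary outcome)."""
    return ratings.count("appropriate") / len(ratings)

def improvement_rate(initial, followup):
    """Among responses initially rated suboptimal ('borderline' or
    'inappropriate'), the fraction rated 'appropriate' when the same
    question is re-posed later (the temporal-improvement metric)."""
    suboptimal = [i for i, r in enumerate(initial) if r != "appropriate"]
    improved = sum(followup[i] == "appropriate" for i in suboptimal)
    return improved / len(suboptimal) if suboptimal else None

# 75 questions: 66 appropriate, 6 borderline, 3 inappropriate (hypothetical).
round1 = ["appropriate"] * 66 + ["borderline"] * 6 + ["inappropriate"] * 3
round2 = list(round1)
for i in range(66, 72):  # 6 of the 9 initially suboptimal answers improve
    round2[i] = "appropriate"

print(f"appropriate rate: {appropriate_rate(round1):.1%}")          # 88.0%
print(f"improvement rate: {improvement_rate(round1, round2):.0%}")  # 67%
```

The same tabulation would apply per model and per language, e.g. to ERNIE's Chinese-prompt figures.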