Can AI Guide Us? Comparing ChatGPT-4 and Deepseek in the Management of Postprostatectomy Incontinence
| Published in | Continence (Amsterdam) Vol. 15; p. 101971 |
|---|---|
| Main Authors | , , , , , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V., 2025 |
| ISSN | 2772-9737 |
| DOI | 10.1016/j.cont.2025.101971 |
Summary: The recent advances in Artificial Intelligence (AI) have led to the emergence of increasingly sophisticated language models, such as ChatGPT-4 and Deepseek, which are gaining popularity among healthcare professionals as potential tools for clinical decision support. We aimed to compare the accuracy and clinical relevance of recommendations provided by ChatGPT-4 and Deepseek regarding the assessment and management of postprostatectomy urinary incontinence (PPUI).
A total of 20 questions were prepared by urologists with expertise in PPUI. The questions had uncontroversial answers based on the Incontinence after Prostate Treatment: AUA/SUFU Guideline. Ten were conceptual questions and ten were based on clinical cases, designed to evaluate the models’ ability to apply knowledge and critical thinking. All questions were submitted in English, anonymously (without IP identification), and separately to ChatGPT-4o and Deepseek. Each model was prompted to be specific and to limit its answers to 200 words for greater objectivity, and was not prompted to incorporate any specific guideline. Each question was entered as a separate, independent prompt using the “New Chat” function. AI-generated answers were independently analyzed by the experts who provided the questions. The accuracy of each response was graded as (A) correct (1 point), (B) partially correct (0.5 points), or (C) incorrect (0 points).
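For readers who want the scoring arithmetic spelled out, the short Python sketch below illustrates the point scheme described above (1, 0.5, or 0 points per answer, averaged over the question set). The function name and the example grade mix are assumptions for illustration only, not data from the study.

```python
# Minimal sketch of the grading scheme described above (assumed, not from the study):
# each answer is graded A (correct, 1 point), B (partially correct, 0.5 points),
# or C (incorrect, 0 points); accuracy is total points divided by number of questions.

GRADE_POINTS = {"A": 1.0, "B": 0.5, "C": 0.0}

def accuracy_percent(grades):
    """Return accuracy as a percentage for a list of letter grades."""
    points = sum(GRADE_POINTS[g] for g in grades)
    return 100.0 * points / len(grades)

# Hypothetical grade mix consistent with the 6.5/10 (65%) clinical-case score
# reported for Deepseek; the actual distribution of grades is not given.
example_grades = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
print(accuracy_percent(example_grades))  # 65.0
```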
ChatGPT had a global accuracy of 95% (19 of 20 points), with 90% accuracy in conceptual questions (9 correct answers) and 100% in clinical cases. Deepseek reached a global accuracy of 72.5% (14.5 of 20 points), with 80% accuracy in conceptual questions (8 correct answers) and 65% in clinical cases (6.5 points). Deepseek gave more partial answers and incorrect interpretations in questions addressing treatment options, complications, and special clinical situations. The Table shows examples of performance differences between the two AI models across various domains.
Both AI tools demonstrated potential to support clinical reasoning in the context of PPUI. However, ChatGPT outperformed Deepseek in both accuracy and consistency, especially in complex clinical scenarios. Despite promising results, careful human validation remains essential before incorporating AI-generated recommendations into clinical practice.
Funding: none. Clinical trial: no. Subjects: none.