

Annals of Pharmacy Education, Safety, and Public Health Advocacy

2022 Volume 2

Artificial Intelligence in Health Professional Licensing: Performance of ChatGPT-3.5 and GPT-4; Systematic Review and Meta-Analysis


Foster AK, Price NJ, Brown VL, Reed ST
  1. Department of Pharmacy Practice and Patient Safety, School of Pharmacy, University of Bath, Bath, United Kingdom.
Abstract

ChatGPT, a recently launched AI chatbot, has shown notable performance in medical exams. However, there has been no comprehensive evaluation of the ChatGPT models (ChatGPT-3.5 and GPT-4) across multiple national health licensing exams. This study sought to systematically assess how ChatGPT performs in licensing examinations for medicine, pharmacy, dentistry, and nursing via a meta-analysis. Following the PRISMA guidelines, relevant full-text studies were retrieved from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals, covering the period from ChatGPT’s release up to February 27, 2024. Studies were included if they investigated ChatGPT-3.5 or GPT-4 performance in national licensing exams in medicine, pharmacy, dentistry, or nursing, used multiple-choice questions, and provided data suitable for effect size calculation. Data extraction, coding, and quality assessment were independently performed by two reviewers. The quality of included studies was assessed using the JBI Critical Appraisal Tools. A random-effects model was applied to compute pooled effect sizes with 95% confidence intervals (CIs).

A total of 23 studies met the inclusion criteria, covering four types of national licensing exams. Study distribution included medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). Accuracy rates ranged from 36–77% for ChatGPT-3.5 and 64.4–100% for GPT-4. The overall pooled accuracy was 70.1% (95% CI, 65–74.8%), statistically significant (p < 0.001). Subgroup analyses showed that GPT-4 consistently outperformed ChatGPT-3.5. Across disciplines, the models demonstrated the highest performance in pharmacy, followed by medicine, dentistry, and nursing. Limitations included the narrow range of question types (absence of open-ended and scenario-based items) and considerable heterogeneity among studies. This analysis provides insight into ChatGPT’s accuracy across four national health licensing exams and offers both practical guidance and theoretical support for future research. Further work should investigate the use of ChatGPT in healthcare education with more diverse question types and advanced AI versions.
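The abstract states that a random-effects model was used to compute the pooled accuracy and its 95% CI, but does not name the estimator. As an illustrative sketch only (assuming the common DerSimonian–Laird moment estimator and illustrative study values, not data from the included studies; `pool_random_effects` is a hypothetical helper), such a pooling can be computed as:

```python
import math

def pool_random_effects(props, ns):
    """DerSimonian-Laird random-effects pooling of proportions.

    props: per-study accuracy as proportions strictly between 0 and 1
    ns:    per-study number of exam questions
    Returns (pooled estimate, 95% CI lower bound, 95% CI upper bound).
    """
    # Per-study sampling variance of a proportion, and inverse-variance weights
    variances = [p * (1 - p) / n for p, n in zip(props, ns)]
    w = [1 / v for v in variances]
    # Fixed-effect pooled estimate and Cochran's Q heterogeneity statistic
    fixed = sum(wi * pi for wi, pi in zip(w, props)) / sum(w)
    q = sum(wi * (pi - fixed) ** 2 for wi, pi in zip(w, props))
    # Between-study variance tau^2 (moment estimator, truncated at zero)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(props) - 1)) / c)
    # Random-effects weights add tau^2 to each study's variance
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * pi for wi, pi in zip(w_re, props)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Illustrative accuracies and question counts (not the reviewed studies' data)
pooled, lo, hi = pool_random_effects([0.65, 0.72, 0.80], [180, 250, 120])
print(f"pooled accuracy = {pooled:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Note that studies reporting 100% accuracy (as some GPT-4 results did) have zero sampling variance on this scale and would need a transformation (e.g., Freeman–Tukey double arcsine) or a continuity correction before pooling.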


How to cite this article
Vancouver
Foster AK, Price NJ, Brown VL, Reed ST. Artificial Intelligence in Health Professional Licensing: Performance of ChatGPT-3.5 and GPT-4; Systematic Review and Meta-Analysis. Ann Pharm Educ Saf Public Health Advocacy. 2022;2:176-90. https://doi.org/10.51847/bMVKZquCSZ
APA
Foster, A. K., Price, N. J., Brown, V. L., & Reed, S. T. (2022). Artificial Intelligence in Health Professional Licensing: Performance of ChatGPT-3.5 and GPT-4; Systematic Review and Meta-Analysis. Annals of Pharmacy Education, Safety, and Public Health Advocacy, 2, 176-190. https://doi.org/10.51847/bMVKZquCSZ
