Arkangel AI, OpenEvidence, ChatGPT, Medisearch: Are they objectively up to medical standards? A real-life assessment of LLMs in healthcare

Presenter: Jose Zea, BS · Session: Large Language Models in the Clinic · Time: 4/20/2026, 2:00 PM–5:00 PM

Authors

Natalia Castano-Villegas 1, Maria Camila Villa 2, Katherine Monsalve 3, Isabella Llano 2, Laura Velásquez 2, Jose Zea 2
1 Arkangel AI, Bogotá, Colombia; 2 Arkangel IA, Bogotá, Colombia; 3 Arcángel IA, Bogotá, Colombia

Abstract

Background: Large language models (LLMs) are increasingly used in healthcare, but standardized benchmarks fail to capture their validity and safety in real-world scenarios. Evaluating their quality is critical for safe integration into practice.

Methods: Four fictitious clinical vignettes, each containing four questions, were developed by independent specialists and tested in four conversational agents: ArkangelAI, OpenEvidence, ChatGPT, and Medisearch. Responses were evaluated by four external clinicians on an eight-criterion Likert scale (1–2 = dissatisfaction, 3 = neutral, 4–5 = satisfaction, 6 = not applicable). The criteria covered correctness, agreement with consensus, absence of bias, adherence to the standard of care, up-to-date information, patient safety, real sources in references, and context-awareness. Response times were summarized as medians with interquartile ranges (IQR), results were reported as frequencies, and hypothesis tests were applied (α = 0.05).

Results: In total, 128 question–answer pairs were evaluated. ArkangelAI-Deep had the highest satisfaction (92.9%), followed by OpenEvidence (83.6%), ChatGPT-Deep (80.5%), and Medisearch (71.1%). Most dissatisfaction concerned the real-sources-in-references criterion, with rates of 75% for ChatGPT-Personalized and 97% for ChatGPT-Regular; conversely, ArkangelAI-Deep, ChatGPT-Deep, and OpenEvidence achieved 100% satisfaction on it. All agents performed well in correctness and agreement with consensus. ChatGPT scored lowest on non-biased answers. The safest for patients was ChatGPT-Personalized, followed by ArkangelAI-Deep. Medisearch had the fastest response time (18 s), while ChatGPT-Deep (13 min) and ArkangelAI-Deep (7.4 min) were slowest, showing a trade-off between depth and usability.

Conclusions: ArkangelAI-Deep and OpenEvidence consistently outperformed the other agents, while Medisearch and ChatGPT-Regular showed significant limitations. These results underscore the need for standardized frameworks to ensure the safe use of LLMs in healthcare.
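For illustration only, the Python sketch below shows one way the tabulation described in Methods could be implemented: binning 1–6 Likert ratings into the abstract's satisfaction categories, reporting frequencies per agent, and summarizing response times as medians with IQR. The data values, column names, and pandas-based approach are assumptions for this sketch; the authors' actual analysis code is not published in the abstract.

    # Hypothetical sketch of the Methods tabulation; all data are made up.
    import pandas as pd

    def likert_bin(score: int) -> str:
        """Map a 1-6 Likert score to the abstract's categories."""
        if score in (1, 2):
            return "dissatisfaction"
        if score == 3:
            return "neutral"
        if score in (4, 5):
            return "satisfaction"
        return "not applicable"  # score of 6

    # One row per (model, criterion) rating; values are illustrative.
    ratings = pd.DataFrame({
        "model": ["ArkangelAI-Deep", "OpenEvidence", "ChatGPT-Deep", "Medisearch"],
        "criterion": ["correctness"] * 4,
        "score": [5, 4, 4, 2],
    })
    ratings["category"] = ratings["score"].map(likert_bin)

    # Satisfaction frequencies per model, excluding "not applicable".
    applicable = ratings[ratings["category"] != "not applicable"]
    freq = (applicable.groupby("model")["category"]
            .value_counts(normalize=True)
            .mul(100).round(1))
    print(freq)

    # Median response time and IQR per model (seconds; invented values).
    times = pd.DataFrame({
        "model": ["Medisearch"] * 3 + ["ArkangelAI-Deep"] * 3,
        "seconds": [15, 18, 22, 410, 444, 470],
    })
    print(times.groupby("model")["seconds"].quantile([0.25, 0.5, 0.75]))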

Disclosure

N. Castano-Villegas, None. M. Villa, None. K. Monsalve, None. I. Llano, None. L. Velásquez, None. J. Zea, None.

Control: 764 · Presentation ID: 2553 · Meeting: 21436