Automated cohort extraction from real-world oncology data using adaptive LLM-based agentic systems for clinical trial feasibility and patient selection

Presenter: Brandon Theodorou, PhD
Session: Agentic AI in Cancer
Time: 4/19/2026 2:00:00 PM → 4/19/2026 5:00:00 PM

Authors

Brandon Theodorou¹, Thomas Schmitt², Zifeng Wang¹, Venugopal Thati¹, Angela Watkins², Kimberly Banks², Jimeng Sun¹, Amar Das²

¹Keiji AI, Seattle, WA; ²Guardant Health, Palo Alto, CA

Abstract

Clinical feasibility analysis and cohort identification are essential to precision oncology, with applications including feasibility assessment for clinical trials and other study designs, cohort extraction for digital twin modeling and virtual trial simulation, and real-world evidence generation for regulatory submission and comparative effectiveness research. However, real-world oncology datasets pose significant challenges due to heterogeneous, nonstandard data formats and complex, interconnected inclusion criteria. These technical barriers are compounded by practical hurdles such as the technical expertise required, data access, and code execution. Traditional approaches rely on slow manual query construction that does not scale, and even recent state-of-the-art large language models (LLMs) struggle with multi-step reasoning and adaptation to diverse data structures. To address these issues, we developed an adaptive LLM-based agentic platform that autonomously learns data structures, generates and executes code, and iteratively refines analyses to extract patient cohorts meeting complex criteria. Unlike conventional LLMs, which generate static code without executing it or adapting to the dataset, our platform’s agentic architecture dynamically explores data schemas, validates intermediate outputs, and self-corrects errors. It also accepts expert guidance and produces fully auditable, editable, and exportable outputs at each step of the process. The system accepts natural language queries specifying complex criteria such as specific diagnoses, genomic profiles, treatment histories, and temporal relationships, then autonomously navigates real-world data (RWD) of any shape, size, and format to identify qualifying patients.
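The explore, generate, execute, validate, and self-correct cycle described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the toy records, the column names, the `Step` audit entry, and the `propose_code` stub (which stands in for an LLM call and deliberately guesses a wrong column name on its first attempt) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Toy stand-in for real-world data; every field name here is hypothetical.
PATIENTS = [
    {"patient_id": "P1", "dx": "NSCLC", "egfr_mutation": True},
    {"patient_id": "P2", "dx": "NSCLC", "egfr_mutation": False},
    {"patient_id": "P3", "dx": "CRC", "egfr_mutation": True},
]


@dataclass
class Step:
    """One audit-trail entry: what was run and what happened."""
    attempt: int
    code: str
    ok: bool
    detail: str


def explore_schema(rows):
    """Learn column names and value types from the data itself."""
    return {name: type(value).__name__ for name, value in rows[0].items()}


def propose_code(query, schema, feedback):
    """Hypothetical placeholder for an LLM call.

    The first attempt guesses a column name that does not exist; once
    execution feedback is available, it switches to the real column.
    """
    column = "diagnosis" if feedback is None else "dx"
    return (
        f"[r['patient_id'] for r in rows "
        f"if r['{column}'] == 'NSCLC' and r['egfr_mutation']]"
    )


def run_agent(query, rows, max_attempts=3):
    """Generate code, execute it, validate the result, and retry on failure."""
    schema = explore_schema(rows)
    known_ids = {r["patient_id"] for r in rows}
    audit, feedback = [], None
    for attempt in range(1, max_attempts + 1):
        code = propose_code(query, schema, feedback)
        try:
            # eval keeps the sketch short; real generated code needs a sandbox.
            cohort = eval(code, {"__builtins__": {}, "rows": rows})
            # Validate the intermediate output before accepting it.
            if not set(cohort) <= known_ids:
                raise ValueError("cohort contains unknown patient IDs")
        except Exception as exc:
            # Feed the error and the observed schema back into the next attempt.
            feedback = f"{type(exc).__name__}: {exc}; columns are {sorted(schema)}"
            audit.append(Step(attempt, code, False, feedback))
            continue
        audit.append(Step(attempt, code, True, f"{len(cohort)} patient(s)"))
        return cohort, audit
    return [], audit
```

Calling `run_agent("NSCLC patients with an EGFR mutation", PATIENTS)` returns the cohort together with an audit list that records the failed first attempt and the corrected second one, which is the kind of step-by-step trace the abstract refers to as auditable output.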
We evaluated performance by replicating 15 historical feasibility analyses from the GuardantINFORM™ database, which integrates genomic and epigenomic RWD from >550K patients with de-identified administrative claims data across multiple oncology indications and data tables. Our validation study found that the platform extracted exact cohorts for 12 requests and delivered clinically acceptable approximations for 2 others. The remaining analysis failed due to a clinical misunderstanding but was correctable post hoc with improved guidance. In contrast, state-of-the-art LLMs without tuned agentic capabilities failed to adapt to dataset-specific structures and had low task completion rates, highlighting the critical importance of task-based design, iterative execution, and self-correction. This performance democratizes access to sophisticated data analysis, addressing a critical bottleneck in translating RWD into actionable clinical insights and establishing a foundation for autonomous, adaptive AI systems that can accelerate oncology research.
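The abstract does not state how an extracted cohort was compared with its historical reference, and clinical acceptability is presumably a matter of expert judgment. As one possible quantitative starting point, the following Python sketch compares two sets of patient identifiers; the function name, the returned fields, and the use of Jaccard overlap are assumptions, not the authors' method.

```python
def compare_cohorts(reference: set[str], extracted: set[str]) -> dict:
    """Summarize agreement between a reference cohort and an extracted one."""
    missed = reference - extracted   # in the reference but not extracted
    extra = extracted - reference    # extracted but not in the reference
    union = reference | extracted
    # Jaccard overlap: shared patients divided by all patients seen in either set.
    jaccard = len(reference & extracted) / len(union) if union else 1.0
    return {
        "exact": not missed and not extra,
        "jaccard": jaccard,
        "missed": sorted(missed),
        "extra": sorted(extra),
    }
```

An exact replication yields `exact=True` with an empty `missed` and `extra`; anything else leaves the two lists of discordant patients for a clinical reviewer to inspect.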

Disclosure

B. Theodorou, Guardant Health. T. Schmitt, Guardant Health Employment. Z. Wang, Guardant Health. V. Thati, Guardant Health. A. Watkins, Guardant Health Employment. K. Banks, Guardant Health Employment. J. Sun, Guardant Health. A. Das, Guardant Health Employment.

Control: 7134 · Presentation Id: 2499 · Meeting 21436