Computer Assisted Verbal Autopsy: Comparing Large Language Models to Physicians for Assigning Causes to 6939 Deaths in Sierra Leone from 2019-2022
2025 · Open Access · DOI: https://doi.org/10.21203/rs.3.rs-7578570/v1 · OA: W4414489233
Abstract

Background: Verbal autopsies (VAs) collect information on deaths that occur outside healthcare facilities in low- and middle-income countries to estimate causes of death (CODs) for epidemiological and planning studies. Physician coding of VAs, which draws on the narrative of the death and reported symptoms, is the current best practice. Large language models (LLMs) such as GPT-4 make it possible to assign CODs from the narrative portion of VAs. However, there are few, if any, robust comparisons of LLMs with physician coding.

Methods: We analyzed 6,939 VA records from a random sample of deaths in Sierra Leone (2019–2022) to compare four models against physician-assigned CODs: two LLMs (GPT-3.5 and GPT-4) and two symptom-algorithm models (InterVA-5 and InSilicoVA). The GPT models used the free-text narratives, whereas InterVA-5 and InSilicoVA relied on the structured questionnaires. CODs were grouped into 19, 10, and 7 categories for adult, child, and neonatal deaths, respectively. We used cause-specific mortality fraction (CSMF) accuracy to assess population-level agreement and partial chance-corrected concordance (PCCC) to assess individual-level agreement, taking physician coding as the reference standard. We stratified analyses by age group because CODs differ among neonates, children, and adults.

Results: GPT-4 outperformed all other models overall (PCCC = 0.61), followed by GPT-3.5 (0.56), InSilicoVA (0.44), and InterVA-5 (0.44). GPT-4 achieved the highest performance for adult (0.64) and neonatal (0.58) deaths, while GPT-3.5 performed best for child deaths (0.54). Across ages, performance increased from 1 month to 14 years and declined from 15 to 69 years. GPT-4, GPT-3.5, and InSilicoVA achieved the highest PCCC for 17, 9, and 4 of the 30 CODs, respectively. At the population level, all models achieved comparable CSMF accuracies (0.74–0.79).

Conclusion: GPT models and InSilicoVA each achieved the best individual-level performance for specific CODs, and the GPT models improved on InterVA-5 and InSilicoVA overall. This study provides foundational evidence for integrating LLM and algorithmic models with physician coding to improve the quality of VA data.
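The abstract does not describe the prompt or API configuration used for the GPT models. Purely as an illustration of submitting a VA narrative for COD assignment, a minimal sketch via the OpenAI Python SDK might look like the following; the prompt wording, the `assign_cod` helper, and the cause labels are assumptions, not the study's protocol.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical cause labels: the study grouped adult CODs into 19 categories,
# but the exact labels are not given in the abstract.
ADULT_CAUSES = ["Malaria", "Tuberculosis", "Maternal", "Injury", "..."]

def assign_cod(narrative: str) -> str:
    """Ask the model to pick one cause of death from a fixed list,
    given the free-text VA narrative (illustrative prompt only)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Assign a single cause of death from this list: "
                        + ", ".join(ADULT_CAUSES)},
            {"role": "user", "content": narrative},
        ],
        temperature=0,  # deterministic output for reproducible coding
    )
    return response.choices[0].message.content.strip()
```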
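For context, CSMF accuracy and PCCC are the standard VA performance metrics introduced by Murray et al. (2011). A minimal sketch of both computations is below, assuming those standard definitions; the variable names and the toy data are illustrative, and the paper's exact implementation is not specified in the abstract.

```python
from collections import Counter

def csmf_accuracy(true_causes, pred_causes):
    """Population-level agreement between predicted and reference
    cause-specific mortality fractions:
    1 - sum_j |CSMF_pred_j - CSMF_true_j| / (2 * (1 - min_j CSMF_true_j)).
    The minimum is taken over causes observed in the reference set."""
    n = len(true_causes)
    causes = set(true_causes) | set(pred_causes)
    true_count = Counter(true_causes)
    pred_count = Counter(pred_causes)
    abs_err = sum(abs(pred_count[c] / n - true_count[c] / n) for c in causes)
    min_true = min(true_count[c] / n for c in set(true_causes))
    return 1 - abs_err / (2 * (1 - min_true))

def pccc(true_causes, ranked_preds, k=1, n_causes=19):
    """Partial chance-corrected concordance for top-k predictions:
    (C - k/N) / (1 - k/N), where C is the fraction of deaths whose
    reference cause appears among the top k predicted causes and N is
    the number of cause categories (e.g., 19 for adult deaths here)."""
    hits = sum(t in ranked[:k] for t, ranked in zip(true_causes, ranked_preds))
    c = hits / len(true_causes)
    chance = k / n_causes
    return (c - chance) / (1 - chance)

# Toy example (not study data):
truth = ["sepsis", "malaria", "malaria", "injury"]
preds = [["sepsis"], ["malaria"], ["injury"], ["injury"]]
print(csmf_accuracy(truth, [p[0] for p in preds]))  # population level
print(pccc(truth, preds, k=1, n_causes=19))         # individual level
```

Under these definitions, a CSMF accuracy of 0.74–0.79 across all four models is consistent with the paper's point that population-level cause fractions can agree even when individual-level assignments (PCCC) diverge.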