P1232 A Novel Inflammatory Bowel Disease Registry Powered by Artificial Intelligence and Natural Language Processing
· 2025
· Open Access
· DOI: https://doi.org/10.1093/ecco-jcc/jjae190.1406
· OA: W4406688164
Background Accurate data registries may help clinicians and researchers gain insights into inflammatory bowel disease (IBD) and provide opportunities to improve overall patient care. However, most data registries are limited by the amount of time needed to collect and record patient-level data. Machine learning and natural language processing (NLP) can facilitate data collection, storage, and retrieval, reducing or even eliminating the need for human data entry. The aim of this study was to describe and validate a novel IBD repository (IBD Data Lake) that leverages machine learning and NLP techniques as a tool to curate and retrieve pertinent, real-time clinical data in the IBD patient population.

Methods The IBD Data Lake was created by medical professionals, translational researchers, and data strategists at the IBD Centre of British Columbia. Structured and unstructured data were extracted from patients’ electronic medical records, transferred to a secure cloud infrastructure, and curated into a searchable database. A customized user interface was created to search the IBD Data Lake. An advanced NLP service (Comprehend Medical™) was employed to extract clinical information from unstructured text and from medical documents in PDF format. Manual chart review was used as the gold standard to validate all information from the IBD Data Lake.

Results A list of 208 patients (104 IBD patients matched to 104 non-IBD patients) from the IBD Data Lake was generated between July 1, 2018 and July 31, 2023. After a thorough chart review, the IBD cohort comprised 101 IBD patients and the non-IBD cohort included 102 non-IBD patients.
The IBD Data Lake’s performance metrics for identifying IBD patients were as follows: sensitivity 98.1%, specificity 97.1%, positive predictive value 97.1%, and negative predictive value 98.1%. The machine learning and NLP components of the IBD Data Lake demonstrated high performance in extracting key IBD clinical characteristics from unstructured data: for distinguishing ulcerative colitis from Crohn’s disease, sensitivity was 100% and specificity 98.2%; for smoking status, sensitivity was 100% and specificity 96.9%; and for extraintestinal manifestations, sensitivity was 92% and specificity 100%.

Conclusion A novel IBD Data Lake that integrates machine learning and NLP techniques has been validated. IBD patients were identified with high accuracy, and the machine learning/NLP components of the IBD Data Lake allow comprehensive and timely extraction and organization of unstructured data. This will ultimately lay the groundwork for recruiting specific IBD cohorts of interest to address the gaps that remain in our knowledge. It has the potential to drive innovation in the field of IBD and gastroenterology.
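
As a worked check of the patient-identification metrics in the Results, the following Python sketch recomputes them from a 2×2 confusion matrix. The abstract does not report the matrix itself, so the counts here are an assumption: of the 104 patients listed as IBD, 101 were confirmed by chart review (3 false positives), and of the 104 listed as non-IBD, 102 were confirmed (2 false negatives). The class and variable names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """2x2 confusion matrix with chart review as the gold standard."""
    tp: int  # listed as IBD, confirmed IBD
    fp: int  # listed as IBD, not IBD on chart review
    tn: int  # listed as non-IBD, confirmed non-IBD
    fn: int  # listed as non-IBD, IBD on chart review

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# ASSUMPTION: counts inferred from the abstract's cohort sizes
# (104 + 104 listed; 101 and 102 confirmed), not reported directly.
ibd_lake = ConfusionMatrix(tp=101, fp=3, tn=102, fn=2)


def as_percent(value: float) -> float:
    """Convert a proportion to a percentage rounded to one decimal."""
    return round(100 * value, 1)


metrics = {
    "sensitivity": as_percent(ibd_lake.sensitivity()),
    "specificity": as_percent(ibd_lake.specificity()),
    "ppv": as_percent(ibd_lake.ppv()),
    "npv": as_percent(ibd_lake.npv()),
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

Under these assumed counts, the computed values match the reported sensitivity (98.1%), specificity (97.1%), positive predictive value (97.1%), and negative predictive value (98.1%).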