Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2404.06680
Retrieving information from EHR systems is essential for answering specific questions about patient journeys and improving the delivery of clinical care. Despite this, most EHR systems still rely on keyword-based search. With the advent of generative large language models (LLMs), information retrieval can support better search and summarization capabilities. Such retrievers can also feed retrieval-augmented generation (RAG) pipelines to answer arbitrary queries. However, retrieving information from the real-world clinical data held in EHR systems to serve downstream use cases is challenging, because query-document support pairs are difficult to create. We provide a blueprint for creating such datasets affordably using large language models. Our method yields a retriever that is 30-50 F-1 points better than proprietary counterparts such as Ada and Mistral on oncology data elements. We further compare our model, called Onco-Retriever, against a fine-tuned PubMedBERT model. We conduct an extensive manual evaluation on real-world EHR data, along with a latency analysis of the different models, and provide a path forward for healthcare organizations to build domain-specific retrievers.
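To make the "F-1 points" comparison concrete, here is a minimal sketch of how F1 could be computed when a retriever is treated as a binary relevance classifier over candidate records. The note identifiers and the `f1_score` helper are hypothetical illustrations, not the paper's evaluation code, and the abstract does not specify the exact protocol.

```python
def f1_score(predicted: set[str], relevant: set[str]) -> float:
    """Harmonic mean of precision and recall for a set of retrieved ids."""
    true_positives = len(predicted & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: records the retriever flagged vs. records annotated as relevant.
retrieved = {"note_1", "note_2", "note_3", "note_4"}
gold = {"note_2", "note_3", "note_5"}

score = f1_score(retrieved, gold)
print(f"F1 = {100 * score:.1f} points")  # prints "F1 = 57.1 points"
```

On this scale, a gap of 30-50 points corresponds to a difference of 0.30-0.50 in the raw F1 value.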
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2404.06680, https://arxiv.org/pdf/2404.06680
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4394776991
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4394776991 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2404.06680 (Digital Object Identifier)
- Title: Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-04-10 (full publication date if available)
- Authors: Shashi Kant Gupta, Aditya Basu, Bradley Taylor, Anai N. Kothari, Hrituraj Singh (list of authors in order)
- Landing page: https://arxiv.org/abs/2404.06680 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2404.06680 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2404.06680 (direct OA link when available)
- Concepts: Computer science, Information retrieval, Automatic summarization, Blueprint, Data science, Engineering, Mechanical engineering (top concepts (fields/topics) attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
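
The payload stores the abstract as `abstract_inverted_index.*` rows, where each word maps to the positions at which it occurs. As a minimal sketch, assuming the rows have been loaded into a dictionary from word to a list of integer positions, the plain text can be rebuilt by sorting on position. The `rebuild_abstract` helper is a hypothetical name, and `sample` holds only the first seven positions from this record.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild running text from a word -> positions inverted index."""
    word_at: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Emit words in ascending position order, separated by single spaces.
    return " ".join(word_at[position] for position in sorted(word_at))


# Subset of this record's index (positions 0-6 only).
sample = {
    "Retrieving": [0],
    "information": [1],
    "from": [2],
    "EHR": [3],
    "systems": [4],
    "is": [5],
    "essential": [6],
}

text = rebuild_abstract(sample)
print(text)  # prints "Retrieving information from EHR systems is essential"
```
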
| id | https://openalex.org/W4394776991 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2404.06680 |
| ids.doi | https://doi.org/10.48550/arxiv.2404.06680 |
| ids.openalex | https://openalex.org/W4394776991 |
| fwci | |
| type | preprint |
| title | Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11710 |
| topics[0].field.id | https://openalex.org/fields/13 |
| topics[0].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[0].score | 0.9961000084877014 |
| topics[0].domain.id | https://openalex.org/domains/1 |
| topics[0].domain.display_name | Life Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1312 |
| topics[0].subfield.display_name | Molecular Biology |
| topics[0].display_name | Biomedical Text Mining and Ontologies |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9957000017166138 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| topics[2].id | https://openalex.org/T10181 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9779000282287598 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Natural Language Processing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7473637461662292 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C23123220 |
| concepts[1].level | 1 |
| concepts[1].score | 0.5860162377357483 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[1].display_name | Information retrieval |
| concepts[2].id | https://openalex.org/C170858558 |
| concepts[2].level | 2 |
| concepts[2].score | 0.4666549861431122 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q1394144 |
| concepts[2].display_name | Automatic summarization |
| concepts[3].id | https://openalex.org/C155911762 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4419670104980469 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q422321 |
| concepts[3].display_name | Blueprint |
| concepts[4].id | https://openalex.org/C2522767166 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3431469798088074 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q2374463 |
| concepts[4].display_name | Data science |
| concepts[5].id | https://openalex.org/C127413603 |
| concepts[5].level | 0 |
| concepts[5].score | 0.0 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[5].display_name | Engineering |
| concepts[6].id | https://openalex.org/C78519656 |
| concepts[6].level | 1 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q101333 |
| concepts[6].display_name | Mechanical engineering |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7473637461662292 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/information-retrieval |
| keywords[1].score | 0.5860162377357483 |
| keywords[1].display_name | Information retrieval |
| keywords[2].id | https://openalex.org/keywords/automatic-summarization |
| keywords[2].score | 0.4666549861431122 |
| keywords[2].display_name | Automatic summarization |
| keywords[3].id | https://openalex.org/keywords/blueprint |
| keywords[3].score | 0.4419670104980469 |
| keywords[3].display_name | Blueprint |
| keywords[4].id | https://openalex.org/keywords/data-science |
| keywords[4].score | 0.3431469798088074 |
| keywords[4].display_name | Data science |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2404.06680 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2404.06680 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2404.06680 |
| locations[1].id | doi:10.48550/arxiv.2404.06680 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2404.06680 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5052263686 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-6587-5607 |
| authorships[0].author.display_name | Shashi Kant Gupta |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Gupta, Shashi Kant |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100606669 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-9760-0912 |
| authorships[1].author.display_name | Aditya Basu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Basu, Aditya |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5103502112 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Bradley Taylor |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Taylor, Bradley |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5077843932 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-6544-8832 |
| authorships[3].author.display_name | Anai N. Kothari |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Kothari, Anai |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5044442880 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-3705-120X |
| authorships[4].author.display_name | Hrituraj Singh |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Singh, Hrituraj |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2404.06680 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Onco-Retriever: Generative Classifier for Retrieval of EHR Records in Oncology |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11710 |
| primary_topic.field.id | https://openalex.org/fields/13 |
| primary_topic.field.display_name | Biochemistry, Genetics and Molecular Biology |
| primary_topic.score | 0.9961000084877014 |
| primary_topic.domain.id | https://openalex.org/domains/1 |
| primary_topic.domain.display_name | Life Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1312 |
| primary_topic.subfield.display_name | Molecular Biology |
| primary_topic.display_name | Biomedical Text Mining and Ontologies |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2366403280, https://openalex.org/W1495108544, https://openalex.org/W4304620183, https://openalex.org/W2091301346, https://openalex.org/W3148229873, https://openalex.org/W4389760904, https://openalex.org/W2150160875, https://openalex.org/W2351187795 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2404.06680 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2404.06680 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2404.06680 |
| primary_location.id | pmh:oai:arXiv.org:2404.06680 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2404.06680 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2404.06680 |
| publication_date | 2024-04-10 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 100, 118, 171 |
| abstract_inverted_index.We | 98, 138, 151 |
| abstract_inverted_index.an | 107, 153 |
| abstract_inverted_index.as | 130, 149 |
| abstract_inverted_index.in | 79, 93, 106, 117 |
| abstract_inverted_index.is | 5, 87, 121 |
| abstract_inverted_index.of | 18, 35, 67, 165 |
| abstract_inverted_index.on | 29, 157 |
| abstract_inverted_index.to | 45, 60, 81, 90, 177 |
| abstract_inverted_index.Ada | 131 |
| abstract_inverted_index.EHR | 3, 25, 71, 77, 159 |
| abstract_inverted_index.F-1 | 123 |
| abstract_inverted_index.Our | 114 |
| abstract_inverted_index.and | 14, 48, 132, 169 |
| abstract_inverted_index.any | 62 |
| abstract_inverted_index.can | 43, 53 |
| abstract_inverted_index.due | 89 |
| abstract_inverted_index.for | 7, 102, 134, 174 |
| abstract_inverted_index.our | 141 |
| abstract_inverted_index.the | 16, 33, 65, 91, 166 |
| abstract_inverted_index.use | 85 |
| abstract_inverted_index.Such | 51 |
| abstract_inverted_index.With | 32 |
| abstract_inverted_index.also | 54 |
| abstract_inverted_index.data | 74, 136, 160 |
| abstract_inverted_index.feed | 55 |
| abstract_inverted_index.from | 2, 70 |
| abstract_inverted_index.lead | 44 |
| abstract_inverted_index.most | 24 |
| abstract_inverted_index.path | 172 |
| abstract_inverted_index.rely | 28 |
| abstract_inverted_index.such | 104, 129 |
| abstract_inverted_index.task | 66 |
| abstract_inverted_index.than | 126 |
| abstract_inverted_index.that | 120 |
| abstract_inverted_index.this | 22 |
| abstract_inverted_index.with | 162 |
| abstract_inverted_index.(RAG) | 58 |
| abstract_inverted_index.30-50 | 122 |
| abstract_inverted_index.about | 11 |
| abstract_inverted_index.along | 161 |
| abstract_inverted_index.build | 178 |
| abstract_inverted_index.care. | 20 |
| abstract_inverted_index.cases | 86 |
| abstract_inverted_index.fact, | 23 |
| abstract_inverted_index.large | 37, 111 |
| abstract_inverted_index.model | 148 |
| abstract_inverted_index.order | 80 |
| abstract_inverted_index.solve | 82 |
| abstract_inverted_index.still | 27 |
| abstract_inverted_index.using | 110 |
| abstract_inverted_index.well. | 150 |
| abstract_inverted_index.advent | 34 |
| abstract_inverted_index.answer | 61 |
| abstract_inverted_index.better | 46, 125 |
| abstract_inverted_index.called | 143 |
| abstract_inverted_index.manner | 109 |
| abstract_inverted_index.manual | 155 |
| abstract_inverted_index.method | 115 |
| abstract_inverted_index.model, | 142 |
| abstract_inverted_index.models | 39, 168 |
| abstract_inverted_index.pairs. | 97 |
| abstract_inverted_index.points | 124 |
| abstract_inverted_index.query. | 63 |
| abstract_inverted_index.search | 47 |
| abstract_inverted_index.within | 76 |
| abstract_inverted_index.(LLMs), | 40 |
| abstract_inverted_index.Despite | 21 |
| abstract_inverted_index.Mistral | 133 |
| abstract_inverted_index.against | 145 |
| abstract_inverted_index.compare | 140 |
| abstract_inverted_index.conduct | 152 |
| abstract_inverted_index.forward | 173 |
| abstract_inverted_index.further | 139 |
| abstract_inverted_index.latency | 163 |
| abstract_inverted_index.models. | 113 |
| abstract_inverted_index.patient | 12 |
| abstract_inverted_index.provide | 99, 170 |
| abstract_inverted_index.results | 116 |
| abstract_inverted_index.several | 83 |
| abstract_inverted_index.support | 96 |
| abstract_inverted_index.systems | 4, 26, 78 |
| abstract_inverted_index.However, | 64 |
| abstract_inverted_index.analysis | 164 |
| abstract_inverted_index.clinical | 19, 73 |
| abstract_inverted_index.creating | 94, 103 |
| abstract_inverted_index.datasets | 105 |
| abstract_inverted_index.delivery | 17 |
| abstract_inverted_index.journeys | 13 |
| abstract_inverted_index.language | 38, 112 |
| abstract_inverted_index.oncology | 135 |
| abstract_inverted_index.specific | 9 |
| abstract_inverted_index.answering | 8 |
| abstract_inverted_index.blueprint | 101 |
| abstract_inverted_index.contained | 75 |
| abstract_inverted_index.different | 167 |
| abstract_inverted_index.elements. | 137 |
| abstract_inverted_index.essential | 6 |
| abstract_inverted_index.extensive | 154 |
| abstract_inverted_index.improving | 15 |
| abstract_inverted_index.pipelines | 59 |
| abstract_inverted_index.propriety | 127 |
| abstract_inverted_index.questions | 10 |
| abstract_inverted_index.retriever | 119 |
| abstract_inverted_index.searches. | 31 |
| abstract_inverted_index.PubMedBERT | 147 |
| abstract_inverted_index.Retrieving | 0 |
| abstract_inverted_index.affordable | 108 |
| abstract_inverted_index.difficulty | 92 |
| abstract_inverted_index.downstream | 84 |
| abstract_inverted_index.evaluation | 156 |
| abstract_inverted_index.fine-tuned | 146 |
| abstract_inverted_index.generation | 57 |
| abstract_inverted_index.generative | 36 |
| abstract_inverted_index.healthcare | 175 |
| abstract_inverted_index.real-world | 72, 158 |
| abstract_inverted_index.retrievers | 52 |
| abstract_inverted_index.retrieving | 41, 68 |
| abstract_inverted_index.challenging | 88 |
| abstract_inverted_index.information | 1, 42, 69 |
| abstract_inverted_index.retrievers. | 180 |
| abstract_inverted_index.counterparts | 128 |
| abstract_inverted_index.capabilities. | 50 |
| abstract_inverted_index.keyword-based | 30 |
| abstract_inverted_index.organizations | 176 |
| abstract_inverted_index.summarization | 49 |
| abstract_inverted_index.query-document | 95 |
| abstract_inverted_index.Onco-Retriever, | 144 |
| abstract_inverted_index.domain-specific | 179 |
| abstract_inverted_index.Retrieval-augmented | 56 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.4300000071525574 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile |