Incorporating Large Language Model-Derived Information into Hypothesis Testing for Genomics
Jordan Bryan, Hui Niu, Didong Li
2025 · Open Access · DOI: https://doi.org/10.1101/2025.04.30.651464
We propose strategies for incorporating the information in large language models (LLMs) into statistical hypothesis tests in genomics studies. Using gene embeddings derived from text inputs to OpenAI’s GPT-3.5 model, we show that biological signals in a variety of genomics datasets reside near the principal subspace spanned by the embeddings. We then use a frequentist and Bayesian (FAB) framework to propose several hypothesis tests that are either optimal or approximately optimal with respect to prior information based on the gene embedding subspace. In four real-world genomics examples, the FAB tests guided by the LLM-derived information achieve more power than classical counterparts.
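To make the idea of a test that is "optimal with respect to prior information" concrete, here is a minimal sketch of a FAB p-value in the simplest scalar setting. It is not the paper's procedure. It assumes a single statistic z ~ N(theta, 1), a null hypothesis theta = 0, and a N(prior_mean, prior_var) prior on theta, and it uses the standard scalar normal-mean form of the FAB p-value. The names `fab_p_value`, `prior_mean`, and `prior_var` are hypothetical; in the setting described above, the prior would be informed by the gene embedding subspace.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def classical_p_value(z: float) -> float:
    """Two-sided p-value for H0: theta = 0 when z ~ N(theta, 1)."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def fab_p_value(z: float, prior_mean: float, prior_var: float) -> float:
    """FAB p-value for H0: theta = 0 (scalar sketch, assumed form).

    Assumes z ~ N(theta, 1) and a N(prior_mean, prior_var) prior on theta.
    The p-value is uniform under H0 for any fixed prior, so the test keeps
    its frequentist level even if the prior is wrong.
    """
    if prior_var <= 0:
        raise ValueError("prior_var must be positive")
    shift = 2.0 * prior_mean / prior_var  # how strongly the prior tilts the test
    return 1.0 - abs(_STD_NORMAL.cdf(z + shift) - _STD_NORMAL.cdf(-z))


if __name__ == "__main__":
    z = 2.0
    print("classical:", classical_p_value(z))
    print("FAB, prior agrees with data:", fab_p_value(z, prior_mean=1.0, prior_var=2.0))
    print("FAB, prior disagrees:", fab_p_value(z, prior_mean=-1.0, prior_var=2.0))
```

With a prior mean of zero the FAB p-value reduces to the classical two-sided p-value. A prior that points in the same direction as the data gives a smaller p-value, and a prior that points the wrong way gives a larger one, which is the trade-off behind the power gains reported above.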
Metadata
- Type: preprint
- Language: en
- Landing Page: https://doi.org/10.1101/2025.04.30.651464
- PDF URL: https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf
- OA Status: green
- References: 35
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4410236389
All OpenAlex metadata
- OpenAlex ID: https://openalex.org/W4410236389 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.1101/2025.04.30.651464 (Digital Object Identifier)
- Title: Incorporating Large Language Model-Derived Information into Hypothesis Testing for Genomics (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-07 (full publication date if available)
- Authors: Jordan Bryan, Hui Niu, Didong Li (list of authors in order)
- Landing page: https://doi.org/10.1101/2025.04.30.651464 (publisher landing page)
- PDF URL: https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf (direct OA link when available)
- Concepts: Genomics, Computational biology, Computer science, Data science, Biology, Genome, Genetics, Gene (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- References (count): 35 (number of works referenced by this work)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4410236389 |
| doi | https://doi.org/10.1101/2025.04.30.651464 |
| ids.doi | https://doi.org/10.1101/2025.04.30.651464 |
| ids.pmid | https://pubmed.ncbi.nlm.nih.gov/40654778 |
| ids.openalex | https://openalex.org/W4410236389 |
| fwci | 0.0 |
| type | preprint |
| title | Incorporating Large Language Model-Derived Information into Hypothesis Testing for Genomics |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10885 |
| topics[0].field.id | https://openalex.org/fields/13 |
| topics[0].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[0].score | 0.9991999864578247 |
| topics[0].domain.id | https://openalex.org/domains/1 |
| topics[0].domain.display_name | Life Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1312 |
| topics[0].subfield.display_name | Molecular Biology |
| topics[0].display_name | Gene expression and cancer classification |
| topics[1].id | https://openalex.org/T12254 |
| topics[1].field.id | https://openalex.org/fields/13 |
| topics[1].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[1].score | 0.9983999729156494 |
| topics[1].domain.id | https://openalex.org/domains/1 |
| topics[1].domain.display_name | Life Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1312 |
| topics[1].subfield.display_name | Molecular Biology |
| topics[1].display_name | Machine Learning in Bioinformatics |
| topics[2].id | https://openalex.org/T10015 |
| topics[2].field.id | https://openalex.org/fields/13 |
| topics[2].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[2].score | 0.996399998664856 |
| topics[2].domain.id | https://openalex.org/domains/1 |
| topics[2].domain.display_name | Life Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1312 |
| topics[2].subfield.display_name | Molecular Biology |
| topics[2].display_name | Genomics and Phylogenetic Studies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C189206191 |
| concepts[0].level | 4 |
| concepts[0].score | 0.48880431056022644 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q222046 |
| concepts[0].display_name | Genomics |
| concepts[1].id | https://openalex.org/C70721500 |
| concepts[1].level | 1 |
| concepts[1].score | 0.46069759130477905 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q177005 |
| concepts[1].display_name | Computational biology |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.44360536336898804 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C2522767166 |
| concepts[3].level | 1 |
| concepts[3].score | 0.4357111155986786 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q2374463 |
| concepts[3].display_name | Data science |
| concepts[4].id | https://openalex.org/C86803240 |
| concepts[4].level | 0 |
| concepts[4].score | 0.2835124731063843 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[4].display_name | Biology |
| concepts[5].id | https://openalex.org/C141231307 |
| concepts[5].level | 3 |
| concepts[5].score | 0.21812790632247925 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q7020 |
| concepts[5].display_name | Genome |
| concepts[6].id | https://openalex.org/C54355233 |
| concepts[6].level | 1 |
| concepts[6].score | 0.17129850387573242 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q7162 |
| concepts[6].display_name | Genetics |
| concepts[7].id | https://openalex.org/C104317684 |
| concepts[7].level | 2 |
| concepts[7].score | 0.1287693977355957 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q7187 |
| concepts[7].display_name | Gene |
| keywords[0].id | https://openalex.org/keywords/genomics |
| keywords[0].score | 0.48880431056022644 |
| keywords[0].display_name | Genomics |
| keywords[1].id | https://openalex.org/keywords/computational-biology |
| keywords[1].score | 0.46069759130477905 |
| keywords[1].display_name | Computational biology |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.44360536336898804 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/data-science |
| keywords[3].score | 0.4357111155986786 |
| keywords[3].display_name | Data science |
| keywords[4].id | https://openalex.org/keywords/biology |
| keywords[4].score | 0.2835124731063843 |
| keywords[4].display_name | Biology |
| keywords[5].id | https://openalex.org/keywords/genome |
| keywords[5].score | 0.21812790632247925 |
| keywords[5].display_name | Genome |
| keywords[6].id | https://openalex.org/keywords/genetics |
| keywords[6].score | 0.17129850387573242 |
| keywords[6].display_name | Genetics |
| keywords[7].id | https://openalex.org/keywords/gene |
| keywords[7].score | 0.1287693977355957 |
| keywords[7].display_name | Gene |
| language | en |
| locations[0].id | doi:10.1101/2025.04.30.651464 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306402567 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | False |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| locations[0].source.host_organization | https://openalex.org/I2750212522 |
| locations[0].source.host_organization_name | Cold Spring Harbor Laboratory |
| locations[0].source.host_organization_lineage | https://openalex.org/I2750212522 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.1101/2025.04.30.651464 |
| locations[1].id | pmid:40654778 |
| locations[1].is_oa | False |
| locations[1].source.id | https://openalex.org/S4306525036 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | False |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | PubMed |
| locations[1].source.host_organization | https://openalex.org/I1299303238 |
| locations[1].source.host_organization_name | National Institutes of Health |
| locations[1].source.host_organization_lineage | https://openalex.org/I1299303238 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | publishedVersion |
| locations[1].raw_type | |
| locations[1].license_id | |
| locations[1].is_accepted | True |
| locations[1].is_published | True |
| locations[1].raw_source_name | bioRxiv : the preprint server for biology |
| locations[1].landing_page_url | https://pubmed.ncbi.nlm.nih.gov/40654778 |
| locations[2].id | pmh:oai:pubmedcentral.nih.gov:12248099 |
| locations[2].is_oa | True |
| locations[2].source.id | https://openalex.org/S2764455111 |
| locations[2].source.issn | |
| locations[2].source.type | repository |
| locations[2].source.is_oa | False |
| locations[2].source.issn_l | |
| locations[2].source.is_core | False |
| locations[2].source.is_in_doaj | False |
| locations[2].source.display_name | PubMed Central |
| locations[2].source.host_organization | https://openalex.org/I1299303238 |
| locations[2].source.host_organization_name | National Institutes of Health |
| locations[2].source.host_organization_lineage | https://openalex.org/I1299303238 |
| locations[2].license | cc-by |
| locations[2].pdf_url | |
| locations[2].version | submittedVersion |
| locations[2].raw_type | Text |
| locations[2].license_id | https://openalex.org/licenses/cc-by |
| locations[2].is_accepted | False |
| locations[2].is_published | False |
| locations[2].raw_source_name | bioRxiv |
| locations[2].landing_page_url | https://www.ncbi.nlm.nih.gov/pmc/articles/12248099 |
| indexed_in | crossref, pubmed |
| authorships[0].author.id | https://openalex.org/A5011557283 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-4984-0516 |
| authorships[0].author.display_name | Jordan Bryan |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Jordan G. Bryan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100635197 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4611-1614 |
| authorships[1].author.display_name | Hui Niu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hongqian Niu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5090506415 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-9146-705X |
| authorships[2].author.display_name | Didong Li |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Didong Li |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Incorporating Large Language Model-Derived Information into Hypothesis Testing for Genomics |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-23T23:15:26.331081 |
| primary_topic.id | https://openalex.org/T10885 |
| primary_topic.field.id | https://openalex.org/fields/13 |
| primary_topic.field.display_name | Biochemistry, Genetics and Molecular Biology |
| primary_topic.score | 0.9991999864578247 |
| primary_topic.domain.id | https://openalex.org/domains/1 |
| primary_topic.domain.display_name | Life Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1312 |
| primary_topic.subfield.display_name | Molecular Biology |
| primary_topic.display_name | Gene expression and cancer classification |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2053972265 |
| cited_by_count | 0 |
| locations_count | 3 |
| best_oa_location.id | doi:10.1101/2025.04.30.651464 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306402567 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | False |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| best_oa_location.source.host_organization | https://openalex.org/I2750212522 |
| best_oa_location.source.host_organization_name | Cold Spring Harbor Laboratory |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I2750212522 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.1101/2025.04.30.651464 |
| primary_location.id | doi:10.1101/2025.04.30.651464 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306402567 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | False |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | bioRxiv (Cold Spring Harbor Laboratory) |
| primary_location.source.host_organization | https://openalex.org/I2750212522 |
| primary_location.source.host_organization_name | Cold Spring Harbor Laboratory |
| primary_location.source.host_organization_lineage | https://openalex.org/I2750212522 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.04.30.651464.full.pdf |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.1101/2025.04.30.651464 |
| publication_date | 2025-05-07 |
| publication_year | 2025 |
| referenced_works | https://openalex.org/W2562947047, https://openalex.org/W2070163108, https://openalex.org/W1597211359, https://openalex.org/W3180924175, https://openalex.org/W4387819650, https://openalex.org/W3003119869, https://openalex.org/W2081990599, https://openalex.org/W4367602258, https://openalex.org/W6863429002, https://openalex.org/W6863413777, https://openalex.org/W3046375318, https://openalex.org/W2194775991, https://openalex.org/W3010561558, https://openalex.org/W6766464881, https://openalex.org/W3170620680, https://openalex.org/W4366723564, https://openalex.org/W4404824622, https://openalex.org/W2119454933, https://openalex.org/W2896457183, https://openalex.org/W2911489562, https://openalex.org/W4404132818, https://openalex.org/W2158280795, https://openalex.org/W4320035412, https://openalex.org/W2735132087, https://openalex.org/W4388218765, https://openalex.org/W4281716291, https://openalex.org/W3048090444, https://openalex.org/W4401443086, https://openalex.org/W2894573839, https://openalex.org/W4378838672, https://openalex.org/W2584033231, https://openalex.org/W2948545594, https://openalex.org/W4221153690, https://openalex.org/W4400687657, https://openalex.org/W2963190765 |
| referenced_works_count | 35 |
| abstract_inverted_index.a | 37, 54 |
| abstract_inverted_index.In | 83 |
| abstract_inverted_index.We | 1, 51 |
| abstract_inverted_index.by | 48, 92 |
| abstract_inverted_index.in | 8, 17, 36 |
| abstract_inverted_index.of | 39 |
| abstract_inverted_index.on | 78 |
| abstract_inverted_index.or | 69 |
| abstract_inverted_index.to | 27, 60, 74 |
| abstract_inverted_index.we | 31 |
| abstract_inverted_index.FAB | 89 |
| abstract_inverted_index.and | 56 |
| abstract_inverted_index.are | 66 |
| abstract_inverted_index.for | 4 |
| abstract_inverted_index.the | 6, 44, 49, 79, 88, 93 |
| abstract_inverted_index.use | 53 |
| abstract_inverted_index.four | 84 |
| abstract_inverted_index.from | 24 |
| abstract_inverted_index.gene | 21, 80 |
| abstract_inverted_index.into | 13 |
| abstract_inverted_index.more | 97 |
| abstract_inverted_index.near | 43 |
| abstract_inverted_index.show | 32 |
| abstract_inverted_index.text | 25 |
| abstract_inverted_index.than | 99 |
| abstract_inverted_index.that | 33, 65 |
| abstract_inverted_index.then | 52 |
| abstract_inverted_index.with | 72 |
| abstract_inverted_index.(FAB) | 58 |
| abstract_inverted_index.Using | 20 |
| abstract_inverted_index.based | 77 |
| abstract_inverted_index.large | 9 |
| abstract_inverted_index.power | 98 |
| abstract_inverted_index.prior | 75 |
| abstract_inverted_index.tests | 16, 64, 90 |
| abstract_inverted_index.(LLMs) | 12 |
| abstract_inverted_index.either | 67 |
| abstract_inverted_index.guided | 91 |
| abstract_inverted_index.inputs | 26 |
| abstract_inverted_index.model, | 30 |
| abstract_inverted_index.models | 11 |
| abstract_inverted_index.reside | 42 |
| abstract_inverted_index.GPT-3.5 | 29 |
| abstract_inverted_index.achieve | 96 |
| abstract_inverted_index.derived | 23 |
| abstract_inverted_index.optimal | 68, 71 |
| abstract_inverted_index.propose | 2, 61 |
| abstract_inverted_index.respect | 73 |
| abstract_inverted_index.several | 62 |
| abstract_inverted_index.signals | 35 |
| abstract_inverted_index.spanned | 47 |
| abstract_inverted_index.variety | 38 |
| abstract_inverted_index.Abstract | 0 |
| abstract_inverted_index.Bayesian | 57 |
| abstract_inverted_index.datasets | 41 |
| abstract_inverted_index.genomics | 18, 40, 86 |
| abstract_inverted_index.language | 10 |
| abstract_inverted_index.studies. | 19 |
| abstract_inverted_index.subspace | 46 |
| abstract_inverted_index.classical | 100 |
| abstract_inverted_index.embedding | 81 |
| abstract_inverted_index.examples, | 87 |
| abstract_inverted_index.framework | 59 |
| abstract_inverted_index.principal | 45 |
| abstract_inverted_index.subspace. | 82 |
| abstract_inverted_index.OpenAI’s | 28 |
| abstract_inverted_index.biological | 34 |
| abstract_inverted_index.embeddings | 22 |
| abstract_inverted_index.hypothesis | 15, 63 |
| abstract_inverted_index.real-world | 85 |
| abstract_inverted_index.strategies | 3 |
| abstract_inverted_index.LLM-derived | 94 |
| abstract_inverted_index.embeddings. | 50 |
| abstract_inverted_index.frequentist | 55 |
| abstract_inverted_index.information | 7, 76, 95 |
| abstract_inverted_index.statistical | 14 |
| abstract_inverted_index.approximately | 70 |
| abstract_inverted_index.counterparts. | 101 |
| abstract_inverted_index.incorporating | 5 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile.value | 0.15531705 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | False |
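The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each word to the position(s) where it occurs, not as running text. A minimal sketch of how the text can be rebuilt from that structure follows. The function name `rebuild_abstract` and the `sample` dictionary are illustrative; the sample holds only the words listed at positions 0 to 5 in the table, with their later positions omitted.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild running text from a word -> positions mapping."""
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word  # each position holds exactly one word
    # Join the words in position order; gaps in the index are skipped.
    return " ".join(word_at[i] for i in sorted(word_at))


# Entries for positions 0-5, copied from the payload rows above.
sample = {
    "Abstract": [0],
    "We": [1],
    "propose": [2],
    "strategies": [3],
    "for": [4],
    "incorporating": [5],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample))
```

Note that position 0 holds the heading word "Abstract", so a consumer that wants only the body text would drop the first token after rebuilding.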