Long-range gene expression prediction with token alignment of large language model
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2410.01858
Gene expression is a cellular process that plays a fundamental role in human phenotypical variations and diseases. Despite advances in deep learning models for gene expression prediction, recent benchmarks have revealed their inability to learn distal regulatory grammar. Here, we address this challenge by leveraging a pretrained large language model to enhance gene expression prediction. We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens, allowing for symbolic reasoning over genomic sequence features via the frozen language model. This cross-modal adaptation learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts, enabling in-context learning that is not possible with existing models. Trained on lymphoblastoid cells, GTA was evaluated on cells from the Geuvadis consortium and outperforms state-of-the-art models such as Enformer, achieving a Spearman correlation of 0.65, a 10% improvement. Additionally, GTA offers improved interpretation of long-range interactions through the identification of the most meaningful sections of the input genetic context. GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model, in a paradigm shift from conventional gene expression models trained only on sequence data.
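
The Spearman correlation quoted in the abstract is a rank-based metric: it is the Pearson correlation computed on the ranks of the predicted and observed values. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function names `average_ranks` and `spearman` and the toy inputs are illustrative assumptions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: any monotonic relationship gives a perfect rank correlation.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 100.0]
rho = spearman(observed, predicted)
```

Because only ranks matter, the metric rewards a model for ordering samples correctly even when the predicted magnitudes are off.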
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2410.01858, https://arxiv.org/pdf/2410.01858
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403853826
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403853826 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2410.01858 (Digital Object Identifier)
- Title: Long-range gene expression prediction with token alignment of large language model (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2024 (Year of publication)
- Publication date: 2024-10-02 (Full publication date if available)
- Authors: E Honig, Huixin Zhan, Ying Wu, Zhe Zhang (List of authors in order)
- Landing page: https://arxiv.org/abs/2410.01858 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2410.01858 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2410.01858 (Direct OA link when available)
- Concepts: Security token, Range (aeronautics), Computer science, Expression (computer science), Computational biology, Artificial intelligence, Natural language processing, Biology, Computer network, Programming language, Engineering, Aerospace engineering (Top concepts (fields/topics) attached by OpenAlex)
- Cited by: 0 (Total citation count in OpenAlex)
- Related works (count): 10 (Other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403853826 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2410.01858 |
| ids.doi | https://doi.org/10.48550/arxiv.2410.01858 |
| ids.openalex | https://openalex.org/W4403853826 |
| fwci | |
| type | preprint |
| title | Long-range gene expression prediction with token alignment of large language model |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10885 |
| topics[0].field.id | https://openalex.org/fields/13 |
| topics[0].field.display_name | Biochemistry, Genetics and Molecular Biology |
| topics[0].score | 0.8841000199317932 |
| topics[0].domain.id | https://openalex.org/domains/1 |
| topics[0].domain.display_name | Life Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1312 |
| topics[0].subfield.display_name | Molecular Biology |
| topics[0].display_name | Gene expression and cancer classification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C48145219 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7039057612419128 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1335365 |
| concepts[0].display_name | Security token |
| concepts[1].id | https://openalex.org/C204323151 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5977835655212402 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q905424 |
| concepts[1].display_name | Range (aeronautics) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.5674582123756409 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C90559484 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5633938908576965 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q778379 |
| concepts[3].display_name | Expression (computer science) |
| concepts[4].id | https://openalex.org/C70721500 |
| concepts[4].level | 1 |
| concepts[4].score | 0.33911705017089844 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q177005 |
| concepts[4].display_name | Computational biology |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3322131037712097 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C204321447 |
| concepts[6].level | 1 |
| concepts[6].score | 0.3266996741294861 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[6].display_name | Natural language processing |
| concepts[7].id | https://openalex.org/C86803240 |
| concepts[7].level | 0 |
| concepts[7].score | 0.28160834312438965 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[7].display_name | Biology |
| concepts[8].id | https://openalex.org/C31258907 |
| concepts[8].level | 1 |
| concepts[8].score | 0.22880509495735168 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1301371 |
| concepts[8].display_name | Computer network |
| concepts[9].id | https://openalex.org/C199360897 |
| concepts[9].level | 1 |
| concepts[9].score | 0.22351789474487305 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[9].display_name | Programming language |
| concepts[10].id | https://openalex.org/C127413603 |
| concepts[10].level | 0 |
| concepts[10].score | 0.1111757755279541 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[10].display_name | Engineering |
| concepts[11].id | https://openalex.org/C146978453 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q3798668 |
| concepts[11].display_name | Aerospace engineering |
| keywords[0].id | https://openalex.org/keywords/security-token |
| keywords[0].score | 0.7039057612419128 |
| keywords[0].display_name | Security token |
| keywords[1].id | https://openalex.org/keywords/range |
| keywords[1].score | 0.5977835655212402 |
| keywords[1].display_name | Range (aeronautics) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.5674582123756409 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/expression |
| keywords[3].score | 0.5633938908576965 |
| keywords[3].display_name | Expression (computer science) |
| keywords[4].id | https://openalex.org/keywords/computational-biology |
| keywords[4].score | 0.33911705017089844 |
| keywords[4].display_name | Computational biology |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.3322131037712097 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/natural-language-processing |
| keywords[6].score | 0.3266996741294861 |
| keywords[6].display_name | Natural language processing |
| keywords[7].id | https://openalex.org/keywords/biology |
| keywords[7].score | 0.28160834312438965 |
| keywords[7].display_name | Biology |
| keywords[8].id | https://openalex.org/keywords/computer-network |
| keywords[8].score | 0.22880509495735168 |
| keywords[8].display_name | Computer network |
| keywords[9].id | https://openalex.org/keywords/programming-language |
| keywords[9].score | 0.22351789474487305 |
| keywords[9].display_name | Programming language |
| keywords[10].id | https://openalex.org/keywords/engineering |
| keywords[10].score | 0.1111757755279541 |
| keywords[10].display_name | Engineering |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2410.01858 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2410.01858 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2410.01858 |
| locations[1].id | doi:10.48550/arxiv.2410.01858 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2410.01858 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5012169627 |
| authorships[0].author.orcid | https://orcid.org/0009-0001-4591-3546 |
| authorships[0].author.display_name | E Honig |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Honig, Edouardo |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5075369864 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8926-1941 |
| authorships[1].author.display_name | Huixin Zhan |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhan, Huixin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101780958 |
| authorships[2].author.orcid | https://orcid.org/0009-0001-6768-5118 |
| authorships[2].author.display_name | Ying Wu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wu, Ying Nian |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100443000 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6748-3356 |
| authorships[3].author.display_name | Zhe Zhang |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Zhang, Zijun Frank |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2410.01858 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-10-29T00:00:00 |
| display_name | Long-range gene expression prediction with token alignment of large language model |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10885 |
| primary_topic.field.id | https://openalex.org/fields/13 |
| primary_topic.field.display_name | Biochemistry, Genetics and Molecular Biology |
| primary_topic.score | 0.8841000199317932 |
| primary_topic.domain.id | https://openalex.org/domains/1 |
| primary_topic.domain.display_name | Life Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1312 |
| primary_topic.subfield.display_name | Molecular Biology |
| primary_topic.display_name | Gene expression and cancer classification |
| related_works | https://openalex.org/W4388335561, https://openalex.org/W2970530566, https://openalex.org/W4288261899, https://openalex.org/W4307309205, https://openalex.org/W2967478618, https://openalex.org/W4385009901, https://openalex.org/W4385572700, https://openalex.org/W2997152889, https://openalex.org/W4387768015, https://openalex.org/W4285141722 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2410.01858 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2410.01858 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2410.01858 |
| primary_location.id | pmh:oai:arXiv.org:2410.01858 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2410.01858 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2410.01858 |
| publication_date | 2024-10-02 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 3, 8, 45, 133, 138, 164, 176, 181 |
| abstract_inverted_index.We | 55 |
| abstract_inverted_index.as | 100, 130 |
| abstract_inverted_index.by | 43, 174 |
| abstract_inverted_index.in | 11, 180 |
| abstract_inverted_index.is | 2, 106 |
| abstract_inverted_index.of | 19, 75, 136, 146, 152, 157 |
| abstract_inverted_index.on | 113, 119, 191 |
| abstract_inverted_index.to | 33, 50, 94, 170 |
| abstract_inverted_index.us | 93 |
| abstract_inverted_index.we | 39 |
| abstract_inverted_index.GTA | 116, 142, 162 |
| abstract_inverted_index.and | 15, 91, 125, 166 |
| abstract_inverted_index.for | 23, 72 |
| abstract_inverted_index.not | 107 |
| abstract_inverted_index.the | 80, 88, 122, 150, 153, 158 |
| abstract_inverted_index.via | 79 |
| abstract_inverted_index.was | 117 |
| abstract_inverted_index.10\% | 139 |
| abstract_inverted_index.Gene | 0 |
| abstract_inverted_index.This | 84 |
| abstract_inverted_index.deep | 20 |
| abstract_inverted_index.from | 121, 184 |
| abstract_inverted_index.gene | 24, 52, 171, 186 |
| abstract_inverted_index.have | 29 |
| abstract_inverted_index.most | 154 |
| abstract_inverted_index.only | 190 |
| abstract_inverted_index.role | 10 |
| abstract_inverted_index.such | 129 |
| abstract_inverted_index.that | 6, 105 |
| abstract_inverted_index.this | 41 |
| abstract_inverted_index.with | 67, 109 |
| abstract_inverted_index.0.65, | 137 |
| abstract_inverted_index.Here, | 38 |
| abstract_inverted_index.Token | 59 |
| abstract_inverted_index.cells | 120 |
| abstract_inverted_index.data. | 193 |
| abstract_inverted_index.human | 12, 98 |
| abstract_inverted_index.input | 159 |
| abstract_inverted_index.large | 47 |
| abstract_inverted_index.learn | 34 |
| abstract_inverted_index.model | 49 |
| abstract_inverted_index.novel | 167 |
| abstract_inverted_index.plays | 7 |
| abstract_inverted_index.shift | 183 |
| abstract_inverted_index.their | 31 |
| abstract_inverted_index.which | 62 |
| abstract_inverted_index.(GTA), | 61 |
| abstract_inverted_index.aligns | 63 |
| abstract_inverted_index.allows | 92 |
| abstract_inverted_index.cells, | 115 |
| abstract_inverted_index.distal | 35 |
| abstract_inverted_index.frozen | 81 |
| abstract_inverted_index.learns | 87 |
| abstract_inverted_index.model, | 179 |
| abstract_inverted_index.model. | 83 |
| abstract_inverted_index.models | 22, 128, 188 |
| abstract_inverted_index.offers | 143 |
| abstract_inverted_index.recent | 27 |
| abstract_inverted_index.Despite | 17 |
| abstract_inverted_index.Genetic | 57 |
| abstract_inverted_index.Trained | 112 |
| abstract_inverted_index.address | 40 |
| abstract_inverted_index.enhance | 51 |
| abstract_inverted_index.further | 95 |
| abstract_inverted_index.genetic | 64, 160 |
| abstract_inverted_index.genomic | 76 |
| abstract_inverted_index.grammar | 90 |
| abstract_inverted_index.models. | 111 |
| abstract_inverted_index.natural | 68 |
| abstract_inverted_index.process | 5 |
| abstract_inverted_index.through | 149 |
| abstract_inverted_index.tokens, | 70 |
| abstract_inverted_index.trained | 189 |
| abstract_inverted_index.Geuvadis | 123 |
| abstract_inverted_index.Spearman | 134 |
| abstract_inverted_index.advances | 18 |
| abstract_inverted_index.allowing | 71 |
| abstract_inverted_index.approach | 169 |
| abstract_inverted_index.cellular | 4 |
| abstract_inverted_index.context. | 161 |
| abstract_inverted_index.enabling | 102 |
| abstract_inverted_index.existing | 110 |
| abstract_inverted_index.features | 66, 78 |
| abstract_inverted_index.grammar. | 37 |
| abstract_inverted_index.improved | 144 |
| abstract_inverted_index.language | 48, 69, 82, 178 |
| abstract_inverted_index.learning | 21, 104 |
| abstract_inverted_index.paradigm | 182 |
| abstract_inverted_index.possible | 108 |
| abstract_inverted_index.powerful | 165 |
| abstract_inverted_index.prompts, | 101 |
| abstract_inverted_index.revealed | 30 |
| abstract_inverted_index.sections | 156 |
| abstract_inverted_index.sequence | 58, 65, 77, 192 |
| abstract_inverted_index.symbolic | 73 |
| abstract_inverted_index.Alignment | 60 |
| abstract_inverted_index.Enformer, | 131 |
| abstract_inverted_index.achieving | 132 |
| abstract_inverted_index.challenge | 42 |
| abstract_inverted_index.diseases. | 16 |
| abstract_inverted_index.evaluated | 118 |
| abstract_inverted_index.inability | 32 |
| abstract_inverted_index.introduce | 56 |
| abstract_inverted_index.reasoning | 74 |
| abstract_inverted_index.utilizing | 175 |
| abstract_inverted_index.adaptation | 86 |
| abstract_inverted_index.benchmarks | 28 |
| abstract_inverted_index.consortium | 124 |
| abstract_inverted_index.expression | 1, 25, 53, 172, 187 |
| abstract_inverted_index.in-context | 103 |
| abstract_inverted_index.leveraging | 44 |
| abstract_inverted_index.long-range | 147 |
| abstract_inverted_index.meaningful | 155 |
| abstract_inverted_index.prediction | 173 |
| abstract_inverted_index.pretrained | 46, 177 |
| abstract_inverted_index.regulatory | 36, 89 |
| abstract_inverted_index.represents | 163 |
| abstract_inverted_index.variations | 14 |
| abstract_inverted_index.annotations | 99 |
| abstract_inverted_index.correlation | 135 |
| abstract_inverted_index.cross-modal | 85, 168 |
| abstract_inverted_index.fundamental | 9 |
| abstract_inverted_index.incorporate | 96 |
| abstract_inverted_index.outperforms | 126 |
| abstract_inverted_index.prediction, | 26 |
| abstract_inverted_index.prediction. | 54 |
| abstract_inverted_index.conventional | 185 |
| abstract_inverted_index.improvement. | 140 |
| abstract_inverted_index.interactions | 148 |
| abstract_inverted_index.phenotypical | 13 |
| abstract_inverted_index.Additionally, | 141 |
| abstract_inverted_index.gene-specific | 97 |
| abstract_inverted_index.identification | 151 |
| abstract_inverted_index.interpretation | 145 |
| abstract_inverted_index.lymphoblastoid | 114 |
| abstract_inverted_index.state-of-the-art | 127 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |
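
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of how such a mapping can be turned back into running text is shown here; the function name `rebuild_abstract` is an assumption for illustration, and `sample` copies only the first eleven positions from the rows above.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {token: [positions]} inverted index."""
    positions = {}
    for token, indexes in inverted_index.items():
        for index in indexes:
            positions[index] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(positions[i] for i in sorted(positions))


# Subset of the payload rows, limited to word positions 0 through 10.
sample = {
    "Gene": [0],
    "expression": [1],
    "is": [2],
    "a": [3, 8],
    "cellular": [4],
    "process": [5],
    "that": [6],
    "plays": [7],
    "fundamental": [9],
    "role": [10],
}

text = rebuild_abstract(sample)
print(text)
```

Applied to the full set of rows, the same loop should reproduce the abstract, since punctuation is stored as part of each token (for example `diseases.` and `0.65,`).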