Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model
2025 · Open Access · DOI: https://doi.org/10.64898/2025.12.03.692003
Genomes encode the instructions for life, yet their full interpretation requires models capable of capturing long-range context and functional meaning at scale. Existing genome language models (gLMs) are limited by short context windows, high computational cost, and poor interpretability. We present GenSyntax, a product-contextualized large language model (LLM) trained on 49,250 annotated prokaryotic genomes. GenSyntax replaces nucleotide tokenization with gene product descriptors, transforming genomes into “genetic paragraphs” that preserve functional semantics. Using a two-stage training strategy, GenSyntax achieves leading performance in plasmid host identification, gene function prediction, genome assembly, and gene essentiality assessment compared with the other LLMs. It also enables phenotype prediction and minimal genome design, establishing a scalable and interpretable framework for genome-scale decoding and synthetic biology.
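To make the "genetic paragraph" idea concrete, here is a minimal sketch of turning an ordered set of gene annotations into a single text string of product descriptors. The abstract does not specify the actual input format, so everything here is an assumption for illustration: the `Gene` record, its field names, the strand prefix, and the `"; "` separator are all hypothetical, and the toy coordinates and strands are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Gene:
    """Hypothetical annotation record; field names are illustrative only."""
    start: int    # start coordinate on the replicon
    strand: str   # "+" or "-"
    product: str  # free-text product descriptor from the annotation


def genetic_paragraph(genes: Iterable[Gene], sep: str = "; ") -> str:
    """Join product descriptors in genomic order into one text string.

    The separator and the strand prefix are assumptions made for this
    sketch, not the format used by GenSyntax.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    return sep.join(f"{g.strand}{g.product.strip()}" for g in ordered)


# Toy input: coordinates and strands are invented for the example.
genes = [
    Gene(start=1200, strand="-", product="DNA polymerase III subunit beta"),
    Gene(start=100, strand="+", product="chromosomal replication initiator protein DnaA"),
    Gene(start=2400, strand="+", product="DNA replication and repair protein RecF"),
]
paragraph = genetic_paragraph(genes)
print(paragraph)
```

The point of the sketch is only that each gene contributes one descriptor rather than thousands of nucleotide tokens, which is how a product-level representation could fit a whole genome into a bounded context.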
- Type: article
- Landing Page: https://doi.org/10.64898/2025.12.03.692003
- OA Status: gold
- References: 26
- OpenAlex ID: https://openalex.org/W4417117234
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417117234 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.64898/2025.12.03.692003 (Digital Object Identifier)
- Title: Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model (work title)
- Type: article (OpenAlex work type)
- Publication year: 2025 (year of publication)
- Publication date: 2025-12-05 (full publication date if available)
- Authors: Shiwen Ni, Shuaimin Li, Shijian Wang, Xinyu Bi, Y. Li, Chai Phei Gan, iarui Jin, Yuan Lü, Min Yang, Teng Wang (list of authors in order)
- Landing page: https://doi.org/10.64898/2025.12.03.692003 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://doi.org/10.64898/2025.12.03.692003 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
- References (count): 26 (number of works referenced by this work)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4417117234 |
| doi | https://doi.org/10.64898/2025.12.03.692003 |
| ids.doi | https://doi.org/10.64898/2025.12.03.692003 |
| ids.openalex | https://openalex.org/W4417117234 |
| fwci | |
| type | article |
| title | Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | doi:10.64898/2025.12.03.692003 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by-nd |
| locations[0].pdf_url | |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by-nd |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.64898/2025.12.03.692003 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5076323286 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-4986-4446 |
| authorships[0].author.display_name | Shiwen Ni |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Shiwen Ni |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5085846056 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8368-916X |
| authorships[1].author.display_name | Shuaimin Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Shuaimin Li |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5115592225 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Shijian Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Shijian Wang |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5112585623 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Xinyu Bi |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Xinping Bi |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5104667309 |
| authorships[4].author.orcid | https://orcid.org/0009-0003-5197-9482 |
| authorships[4].author.display_name | Y. Li |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Yitai Li |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5090872672 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-7075-5711 |
| authorships[5].author.display_name | Chai Phei Gan |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Chengguang Gan |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5120721826 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | iarui Jin |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | iarui Jin |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100672889 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-2420-4804 |
| authorships[7].author.display_name | Yuan Lü |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Yuan Lu |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5101561494 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-2104-8420 |
| authorships[8].author.display_name | Min Yang |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Min Yang |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5100348960 |
| authorships[9].author.orcid | https://orcid.org/0000-0003-3067-4674 |
| authorships[9].author.display_name | Teng Wang |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Teng Wang |
| authorships[9].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.64898/2025.12.03.692003 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-12-08T00:00:00 |
| display_name | Decoding Prokaryotic Whole Genomes with a Product-Contextualized Large Language Model |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-10T02:45:41.426853 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.64898/2025.12.03.692003 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by-nd |
| best_oa_location.pdf_url | |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by-nd |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.64898/2025.12.03.692003 |
| primary_location.id | doi:10.64898/2025.12.03.692003 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by-nd |
| primary_location.pdf_url | |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by-nd |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.64898/2025.12.03.692003 |
| publication_date | 2025-12-05 |
| publication_year | 2025 |
| referenced_works | https://openalex.org/W4401103147, https://openalex.org/W4404349982, https://openalex.org/W4405995672, https://openalex.org/W4403921247, https://openalex.org/W4400921533, https://openalex.org/W4387966979, https://openalex.org/W4404821554, https://openalex.org/W3081278968, https://openalex.org/W4298006192, https://openalex.org/W2470052926, https://openalex.org/W2766648240, https://openalex.org/W2173732482, https://openalex.org/W2048176942, https://openalex.org/W4383816978, https://openalex.org/W4388410058, https://openalex.org/W2942757983, https://openalex.org/W2140647036, https://openalex.org/W2137119243, https://openalex.org/W2943829031, https://openalex.org/W2177415512, https://openalex.org/W3094213757, https://openalex.org/W3125845881, https://openalex.org/W4411754635, https://openalex.org/W4403924738, https://openalex.org/W2083706719, https://openalex.org/W3000025696 |
| referenced_works_count | 26 |
| abstract_inverted_index.a | 43, 73, 109 |
| abstract_inverted_index.It | 99 |
| abstract_inverted_index.We | 40 |
| abstract_inverted_index.at | 21 |
| abstract_inverted_index.by | 30 |
| abstract_inverted_index.in | 81 |
| abstract_inverted_index.of | 14 |
| abstract_inverted_index.on | 50 |
| abstract_inverted_index.and | 18, 37, 90, 104, 111, 117 |
| abstract_inverted_index.are | 28 |
| abstract_inverted_index.for | 5, 114 |
| abstract_inverted_index.the | 3, 96 |
| abstract_inverted_index.yet | 7 |
| abstract_inverted_index.also | 100 |
| abstract_inverted_index.full | 9 |
| abstract_inverted_index.gene | 60, 85, 91 |
| abstract_inverted_index.high | 34 |
| abstract_inverted_index.host | 83 |
| abstract_inverted_index.into | 65 |
| abstract_inverted_index.poor | 38 |
| abstract_inverted_index.that | 68 |
| abstract_inverted_index.with | 59, 95 |
| abstract_inverted_index.(LLM) | 48 |
| abstract_inverted_index.LLMs. | 98 |
| abstract_inverted_index.Using | 72 |
| abstract_inverted_index.cost, | 36 |
| abstract_inverted_index.large | 45 |
| abstract_inverted_index.life, | 6 |
| abstract_inverted_index.model | 47 |
| abstract_inverted_index.other | 97 |
| abstract_inverted_index.short | 31 |
| abstract_inverted_index.their | 8 |
| abstract_inverted_index.(gLMs) | 27 |
| abstract_inverted_index.49,250 | 51 |
| abstract_inverted_index.encode | 2 |
| abstract_inverted_index.genome | 24, 88, 106 |
| abstract_inverted_index.models | 12, 26 |
| abstract_inverted_index.scale. | 22 |
| abstract_inverted_index.Genomes | 1 |
| abstract_inverted_index.capable | 13 |
| abstract_inverted_index.context | 17, 32 |
| abstract_inverted_index.design, | 107 |
| abstract_inverted_index.enables | 101 |
| abstract_inverted_index.genomes | 64 |
| abstract_inverted_index.leading | 79 |
| abstract_inverted_index.limited | 29 |
| abstract_inverted_index.meaning | 20 |
| abstract_inverted_index.minimal | 105 |
| abstract_inverted_index.plasmid | 82 |
| abstract_inverted_index.present | 41 |
| abstract_inverted_index.product | 61 |
| abstract_inverted_index.trained | 49 |
| abstract_inverted_index.Abstract | 0 |
| abstract_inverted_index.Existing | 23 |
| abstract_inverted_index.achieves | 78 |
| abstract_inverted_index.biology. | 119 |
| abstract_inverted_index.compared | 94 |
| abstract_inverted_index.decoding | 116 |
| abstract_inverted_index.function | 86 |
| abstract_inverted_index.genomes. | 54 |
| abstract_inverted_index.language | 25, 46 |
| abstract_inverted_index.preserve | 69 |
| abstract_inverted_index.replaces | 56 |
| abstract_inverted_index.requires | 11 |
| abstract_inverted_index.scalable | 110 |
| abstract_inverted_index.training | 75 |
| abstract_inverted_index.windows, | 33 |
| abstract_inverted_index.GenSyntax | 55, 77 |
| abstract_inverted_index.annotated | 52 |
| abstract_inverted_index.assembly, | 89 |
| abstract_inverted_index.capturing | 15 |
| abstract_inverted_index.framework | 113 |
| abstract_inverted_index.phenotype | 102 |
| abstract_inverted_index.strategy, | 76 |
| abstract_inverted_index.synthetic | 118 |
| abstract_inverted_index.two-stage | 74 |
| abstract_inverted_index.GenSyntax, | 42 |
| abstract_inverted_index.assessment | 93 |
| abstract_inverted_index.functional | 19, 70 |
| abstract_inverted_index.long-range | 16 |
| abstract_inverted_index.nucleotide | 57 |
| abstract_inverted_index.prediction | 103 |
| abstract_inverted_index.semantics. | 71 |
| abstract_inverted_index.“genetic | 66 |
| abstract_inverted_index.performance | 80 |
| abstract_inverted_index.prediction, | 87 |
| abstract_inverted_index.prokaryotic | 53 |
| abstract_inverted_index.descriptors, | 62 |
| abstract_inverted_index.essentiality | 92 |
| abstract_inverted_index.establishing | 108 |
| abstract_inverted_index.genome-scale | 115 |
| abstract_inverted_index.instructions | 4 |
| abstract_inverted_index.tokenization | 58 |
| abstract_inverted_index.transforming | 63 |
| abstract_inverted_index.computational | 35 |
| abstract_inverted_index.interpretable | 112 |
| abstract_inverted_index.paragraphs” | 67 |
| abstract_inverted_index.interpretation | 10 |
| abstract_inverted_index.identification, | 84 |
| abstract_inverted_index.interpretability. | 39 |
| abstract_inverted_index.product-contextualized | 44 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 10 |
| citation_normalized_percentile | |
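The `abstract_inverted_index.*` rows in the table map each word to the zero-based positions where it occurs in the abstract. A minimal sketch of turning such a mapping back into running text follows; the function name `rebuild_abstract` is hypothetical, and the `sample` dictionary is only a small subset of the rows above (positions 0 to 6), not the full index.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Reconstruct text from a word -> positions mapping."""
    by_position: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit words in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the index rows above; later positions are omitted here.
sample = {
    "Abstract": [0],
    "Genomes": [1],
    "encode": [2],
    "the": [3],
    "instructions": [4],
    "for": [5],
    "life,": [6],
}
text = rebuild_abstract(sample)
print(text)
```

Note that position 0 holds the literal word "Abstract" in this record, so a consumer may want to drop it before displaying the text.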