Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure
2025 · Open Access · DOI: https://doi.org/10.65215/2qt5jb81
Tokenization is a critical design choice in genomic language modeling. Widely used schemes (character-level encoding, fixed-length $k$-mers, and greedy subword algorithms such as BPE) show intrinsic limitations on DNA that are magnified by the small four-letter alphabet. To address these limitations, we adapt Ladderpath, an Algorithmic Information Theory method that identifies nested and hierarchical repetitions through optimal information reuse, into a tokenizer tailored for genomic sequences. Integrating this tokenizer into an 86-million-parameter Transformer yields the Ladderpath Tokenized Model (LTM), which surpasses the best existing models, including those several times larger, on 17 of 21 benchmarks. Comparisons with TF-IDF and other frequency-based baselines show that these gains extend beyond simple motif-frequency statistics. LTM's internal representations further exhibit biologically meaningful organization: token embeddings form coherent clusters, and sequence embeddings group promoters, enhancers, and histone-mark-associated regions without task-specific supervision, revealing an emergent structure of functional sequence classes. These findings show that strengthening the information-theoretic basis of tokenization provides a complementary path to architectural innovations and model scaling, enabling more compact and biologically aligned genomic foundation models.
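To make the three baseline schemes named in the abstract concrete, the following minimal sketch tokenizes a short repetitive DNA string with each of them. It is an illustration only: the function names (`char_tokens`, `kmer_tokens`, `bpe_tokens`) and the toy sequence are assumptions introduced here, the BPE routine is a simplified single-sequence version, and nothing below represents Ladderpath or the authors' tokenizer.

```python
from collections import Counter


def char_tokens(seq):
    """Character-level encoding: one token per nucleotide."""
    return list(seq)


def kmer_tokens(seq, k=3):
    """Non-overlapping fixed-length k-mers; a short trailing remainder is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def bpe_tokens(seq, num_merges=3):
    """Toy greedy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats, so further merges gain nothing
            break
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


seq = "ACGTACGTACGT"  # hypothetical toy sequence: the unit ACGT repeated three times
print(char_tokens(seq))     # 12 single-letter tokens
print(kmer_tokens(seq, 3))  # ['ACG', 'TAC', 'GTA', 'CGT']
print(bpe_tokens(seq))      # ['ACGT', 'ACGT', 'ACGT']
```

On this toy input the fixed 3-mer frame cuts the repeated unit at a different offset each time, so the repetition is invisible at the token level, while character-level encoding yields the longest token sequence. The greedy merges happen to recover the repeat here because the input is perfectly periodic; real genomic sequence offers no such guarantee.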
- Type: article
- Landing Page: https://doi.org/10.65215/2qt5jb81
- OA Status: gold
- OpenAlex ID: https://openalex.org/W4417230746
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417230746 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.65215/2qt5jb81 (Digital Object Identifier)
- Title: Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure (work title)
- Type: article (OpenAlex work type)
- Publication year: 2025
- Publication date: 2025-12-11
- Authors: Yiping Wang, Jing Wang, Junhao Zhu, Fangyi Zhai, Zhu Hu, Ziwei Dai, Zengru Di, Da Zhou, Yu Liu (listed in order)
- Landing page: https://doi.org/10.65215/2qt5jb81 (publisher landing page)
- Open access: Yes (a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://doi.org/10.65215/2qt5jb81 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4417230746 |
|---|---|
| doi | https://doi.org/10.65215/2qt5jb81 |
| ids.doi | https://doi.org/10.65215/2qt5jb81 |
| ids.openalex | https://openalex.org/W4417230746 |
| fwci | |
| type | article |
| title | Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | doi:10.65215/2qt5jb81 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by-nc-nd |
| locations[0].pdf_url | |
| locations[0].version | acceptedVersion |
| locations[0].raw_type | posted-content |
| locations[0].license_id | https://openalex.org/licenses/cc-by-nc-nd |
| locations[0].is_accepted | True |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://doi.org/10.65215/2qt5jb81 |
| indexed_in | crossref |
| authorships[0].author.id | https://openalex.org/A5100317707 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-7730-8906 |
| authorships[0].author.display_name | Yiping Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Yiping Wang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100610512 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-5160-5171 |
| authorships[1].author.display_name | Jing Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Jing Wang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5009070781 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Junhao Zhu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Junhao Zhu |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5020121490 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Fangyi Zhai |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Fengyao Zhai |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100754393 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-4669-5245 |
| authorships[4].author.display_name | Zhu Hu |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Hu Zhu |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5084388646 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Ziwei Dai |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Ziwei Dai |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5102710973 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5240-298X |
| authorships[6].author.display_name | Zengru Di |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zengru Di |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5018955596 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-0272-6644 |
| authorships[7].author.display_name | Da Zhou |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Da Zhou |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5063833400 |
| authorships[8].author.orcid | https://orcid.org/0000-0003-2640-6490 |
| authorships[8].author.display_name | Yu Liu |
| authorships[8].countries | CN |
| authorships[8].affiliations[0].institution_ids | https://openalex.org/I25254941 |
| authorships[8].affiliations[0].raw_affiliation_string | Beijing Normal University |
| authorships[8].institutions[0].id | https://openalex.org/I25254941 |
| authorships[8].institutions[0].ror | https://ror.org/022k4wk35 |
| authorships[8].institutions[0].type | education |
| authorships[8].institutions[0].lineage | https://openalex.org/I25254941 |
| authorships[8].institutions[0].country_code | CN |
| authorships[8].institutions[0].display_name | Beijing Normal University |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Yu Liu |
| authorships[8].is_corresponding | False |
| authorships[8].raw_affiliation_strings | Beijing Normal University |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.65215/2qt5jb81 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-12-11T00:00:00 |
| display_name | Compression-Based Tokenization Improves Language Modeling of Hierarchical Genomic Structure |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-12T23:16:27.785689 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | doi:10.65215/2qt5jb81 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by-nc-nd |
| best_oa_location.pdf_url | |
| best_oa_location.version | acceptedVersion |
| best_oa_location.raw_type | posted-content |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by-nc-nd |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.65215/2qt5jb81 |
| primary_location.id | doi:10.65215/2qt5jb81 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by-nc-nd |
| primary_location.pdf_url | |
| primary_location.version | acceptedVersion |
| primary_location.raw_type | posted-content |
| primary_location.license_id | https://openalex.org/licenses/cc-by-nc-nd |
| primary_location.is_accepted | True |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://doi.org/10.65215/2qt5jb81 |
| publication_date | 2025-12-11 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 2, 57, 150 |
| abstract_inverted_index.17 | 86 |
| abstract_inverted_index.21 | 88 |
| abstract_inverted_index.To | 35 |
| abstract_inverted_index.an | 41, 67, 132 |
| abstract_inverted_index.as | 21 |
| abstract_inverted_index.by | 30 |
| abstract_inverted_index.in | 6 |
| abstract_inverted_index.is | 1 |
| abstract_inverted_index.of | 87, 135, 147 |
| abstract_inverted_index.on | 25 |
| abstract_inverted_index.to | 153 |
| abstract_inverted_index.we | 38 |
| abstract_inverted_index.DNA | 26 |
| abstract_inverted_index.and | 16, 49, 93, 119, 125, 156, 162 |
| abstract_inverted_index.are | 28 |
| abstract_inverted_index.for | 60 |
| abstract_inverted_index.the | 31, 71, 78, 144 |
| abstract_inverted_index.best | 79 |
| abstract_inverted_index.form | 116 |
| abstract_inverted_index.into | 56, 66 |
| abstract_inverted_index.more | 160 |
| abstract_inverted_index.path | 152 |
| abstract_inverted_index.show | 97, 141 |
| abstract_inverted_index.such | 20 |
| abstract_inverted_index.that | 27, 46, 98, 142 |
| abstract_inverted_index.this | 64 |
| abstract_inverted_index.used | 11 |
| abstract_inverted_index.with | 91 |
| abstract_inverted_index.LTM's | 106 |
| abstract_inverted_index.Model | 74 |
| abstract_inverted_index.These | 139 |
| abstract_inverted_index.adapt | 39 |
| abstract_inverted_index.basis | 146 |
| abstract_inverted_index.gains | 100 |
| abstract_inverted_index.group | 122 |
| abstract_inverted_index.model | 157 |
| abstract_inverted_index.other | 94 |
| abstract_inverted_index.small | 32 |
| abstract_inverted_index.these | 99 |
| abstract_inverted_index.this, | 37 |
| abstract_inverted_index.those | 82 |
| abstract_inverted_index.times | 84 |
| abstract_inverted_index.token | 114 |
| abstract_inverted_index.which | 76 |
| abstract_inverted_index.(LTM), | 75 |
| abstract_inverted_index.TF-IDF | 92 |
| abstract_inverted_index.Theory | 44 |
| abstract_inverted_index.Widely | 10 |
| abstract_inverted_index.beyond | 102 |
| abstract_inverted_index.choice | 5 |
| abstract_inverted_index.design | 4 |
| abstract_inverted_index.extend | 101 |
| abstract_inverted_index.greedy | 17 |
| abstract_inverted_index.method | 45 |
| abstract_inverted_index.nested | 48 |
| abstract_inverted_index.reuse, | 55 |
| abstract_inverted_index.simple | 103 |
| abstract_inverted_index.yields | 70 |
| abstract_inverted_index.address | 36 |
| abstract_inverted_index.aligned | 164 |
| abstract_inverted_index.compact | 161 |
| abstract_inverted_index.exhibit | 110 |
| abstract_inverted_index.further | 109 |
| abstract_inverted_index.genomic | 7, 61, 165 |
| abstract_inverted_index.models. | 167 |
| abstract_inverted_index.optimal | 53 |
| abstract_inverted_index.regions | 127 |
| abstract_inverted_index.several | 83 |
| abstract_inverted_index.subword | 18 |
| abstract_inverted_index.through | 52 |
| abstract_inverted_index.without | 128 |
| abstract_inverted_index.classes. | 138 |
| abstract_inverted_index.coherent | 117 |
| abstract_inverted_index.critical | 3 |
| abstract_inverted_index.emergent | 133 |
| abstract_inverted_index.enabling | 159 |
| abstract_inverted_index.existing | 80 |
| abstract_inverted_index.findings | 140 |
| abstract_inverted_index.internal | 107 |
| abstract_inverted_index.language | 8 |
| abstract_inverted_index.provides | 149 |
| abstract_inverted_index.scaling, | 158 |
| abstract_inverted_index.sequence | 120, 137 |
| abstract_inverted_index.tailored | 59 |
| abstract_inverted_index.$k$-mers, | 15 |
| abstract_inverted_index.Tokenized | 73 |
| abstract_inverted_index.alphabet. | 34 |
| abstract_inverted_index.baselines | 96 |
| abstract_inverted_index.clusters, | 118 |
| abstract_inverted_index.encoding, | 13 |
| abstract_inverted_index.intrinsic | 23 |
| abstract_inverted_index.magnified | 29 |
| abstract_inverted_index.modeling. | 9 |
| abstract_inverted_index.revealing | 131 |
| abstract_inverted_index.structure | 134 |
| abstract_inverted_index.surpasses | 77 |
| abstract_inverted_index.tokenizer | 58, 65 |
| abstract_inverted_index.BPE---show | 22 |
| abstract_inverted_index.Ladderpath | 72 |
| abstract_inverted_index.algorithms | 19 |
| abstract_inverted_index.embeddings | 115, 121 |
| abstract_inverted_index.enhancers, | 124 |
| abstract_inverted_index.foundation | 166 |
| abstract_inverted_index.functional | 136 |
| abstract_inverted_index.identifies | 47 |
| abstract_inverted_index.meaningful | 112 |
| abstract_inverted_index.promoters, | 123 |
| abstract_inverted_index.sequences. | 62 |
| abstract_inverted_index.Algorithmic | 42 |
| abstract_inverted_index.Comparisons | 90 |
| abstract_inverted_index.Information | 43 |
| abstract_inverted_index.Integrating | 63 |
| abstract_inverted_index.Ladderpath, | 40 |
| abstract_inverted_index.Transformer | 69 |
| abstract_inverted_index.benchmarks. | 89 |
| abstract_inverted_index.four-letter | 33 |
| abstract_inverted_index.information | 54 |
| abstract_inverted_index.innovations | 155 |
| abstract_inverted_index.larger---on | 85 |
| abstract_inverted_index.limitations | 24 |
| abstract_inverted_index.repetitions | 51 |
| abstract_inverted_index.statistics. | 105 |
| abstract_inverted_index.Tokenization | 0 |
| abstract_inverted_index.biologically | 111, 163 |
| abstract_inverted_index.fixed-length | 14 |
| abstract_inverted_index.hierarchical | 50 |
| abstract_inverted_index.supervision, | 130 |
| abstract_inverted_index.tokenization | 148 |
| abstract_inverted_index.architectural | 154 |
| abstract_inverted_index.complementary | 151 |
| abstract_inverted_index.organization: | 113 |
| abstract_inverted_index.strengthening | 143 |
| abstract_inverted_index.task-specific | 129 |
| abstract_inverted_index.frequency-based | 95 |
| abstract_inverted_index.motif-frequency | 104 |
| abstract_inverted_index.representations | 108 |
| abstract_inverted_index.models---including | 81 |
| abstract_inverted_index.86-million-parameter | 68 |
| abstract_inverted_index.information-theoretic | 145 |
| abstract_inverted_index.histone-mark-associated | 126 |
| abstract_inverted_index.schemes---character-level | 12 |
| cited_by_percentile_year | |
| countries_distinct_count | 1 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
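
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each word to the positions where it occurs. As a minimal sketch of how readable text can be recovered from that structure, the snippet below assumes the index is available as a Python dict of word to list of integer positions (the table shows the positions as comma-separated values). The helper name `rebuild_abstract` is hypothetical, and the sample holds only the first ten positions copied from the rows above.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a word -> positions map by placing each word at its positions."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Join words in ascending position order; missing positions are simply skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Truncated sample: positions 0-9 only, taken from the payload table.
sample = {
    "Tokenization": [0], "is": [1], "a": [2], "critical": [3], "design": [4],
    "choice": [5], "in": [6], "genomic": [7], "language": [8], "modeling.": [9],
}
print(rebuild_abstract(sample))
# Tokenization is a critical design choice in genomic language modeling.
```

Words that occur more than once, such as `and` at positions 16, 49, 93 and so on, carry several positions in the full index, and the same loop places each occurrence.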