Understanding and Mitigating Tokenization Bias in Language Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2406.16829
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
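The bias described in the abstract can be seen without any language model. The following is a minimal sketch, not the authors' algorithm: it assumes a hypothetical three-token vocabulary (`AA`, `A`, `B`), a two-state Markov chain with made-up stay probabilities, and a greedy longest-match encoder standing in for maximum prefix encoding. All function names are illustrative. It compares what follows the character `A` in the raw text with what follows the token `A` in the encoded text.

```python
import random

# Hypothetical toy vocabulary: two single-character tokens plus one merged token.
VOCAB = ("AA", "A", "B")


def mpe_encode(text, vocab=VOCAB):
    """Greedy maximum prefix encoding: always take the longest matching token."""
    ordered = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for tok in ordered:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens


def sample_chain(n, p_stay_a=0.6, p_stay_b=0.7, seed=0):
    """Sample n characters from a two-state Markov chain over 'A' and 'B'.

    The stay probabilities are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    state, chars = "A", []
    for _ in range(n):
        chars.append(state)
        p_stay = p_stay_a if state == "A" else p_stay_b
        if rng.random() >= p_stay:
            state = "B" if state == "A" else "A"
    return "".join(chars)


def followers(seq, symbol):
    """Relative frequency of each item that directly follows `symbol` in seq."""
    counts = {}
    for cur, nxt in zip(seq, seq[1:]):
        if cur == symbol:
            counts[nxt] = counts.get(nxt, 0) + 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


text = sample_chain(20_000)
print("characters after 'A':", followers(text, "A"))
print("tokens after 'A':    ", followers(mpe_encode(text), "A"))
```

At the character level, `A` is followed by `A` at roughly the assumed stay probability. At the token level, the token `A` is only ever followed by `B`: whenever another `A` comes next, the greedy encoder would have emitted `AA` instead. A model trained on the encoded data would therefore learn a next-token distribution after `A` that differs from the character-level transition probability, however much data it sees.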
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2406.16829, https://arxiv.org/pdf/2406.16829
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4400024841
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4400024841 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2406.16829 (Digital Object Identifier)
- Title: Understanding and Mitigating Tokenization Bias in Language Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-06-24 (full publication date if available)
- Authors: Buu Phan, Marton Havasi, Matthew J. Muckley, Karen Ullrich (list of authors in order)
- Landing page: https://arxiv.org/abs/2406.16829 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2406.16829 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2406.16829 (direct OA link when available)
- Concepts: Lexical analysis, Computer science, Natural language processing, Linguistics, Philosophy (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4400024841 |
| doi | https://doi.org/10.48550/arxiv.2406.16829 |
| ids.doi | https://doi.org/10.48550/arxiv.2406.16829 |
| ids.openalex | https://openalex.org/W4400024841 |
| fwci | |
| type | preprint |
| title | Understanding and Mitigating Tokenization Bias in Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9897000193595886 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9836000204086304 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C176982825 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7642172574996948 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q835922 |
| concepts[0].display_name | Lexical analysis |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5312609076499939 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.4401792585849762 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C41895202 |
| concepts[3].level | 1 |
| concepts[3].score | 0.4319281280040741 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[3].display_name | Linguistics |
| concepts[4].id | https://openalex.org/C138885662 |
| concepts[4].level | 0 |
| concepts[4].score | 0.07568886876106262 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[4].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/lexical-analysis |
| keywords[0].score | 0.7642172574996948 |
| keywords[0].display_name | Lexical analysis |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5312609076499939 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.4401792585849762 |
| keywords[2].display_name | Natural language processing |
| keywords[3].id | https://openalex.org/keywords/linguistics |
| keywords[3].score | 0.4319281280040741 |
| keywords[3].display_name | Linguistics |
| keywords[4].id | https://openalex.org/keywords/philosophy |
| keywords[4].score | 0.07568886876106262 |
| keywords[4].display_name | Philosophy |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2406.16829 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2406.16829 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2406.16829 |
| locations[1].id | doi:10.48550/arxiv.2406.16829 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2406.16829 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5085652556 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Buu Phan |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Phan, Buu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5099497279 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Marton Havasi |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Havasi, Marton |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5029739727 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-6525-8817 |
| authorships[2].author.display_name | Matthew J. Muckley |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Muckley, Matthew |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5058031547 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Karen Ullrich |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Ullrich, Karen |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2406.16829 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Understanding and Mitigating Tokenization Bias in Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9897000193595886 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W4300598845, https://openalex.org/W2601638452, https://openalex.org/W2285263069, https://openalex.org/W4319309671, https://openalex.org/W4376107815, https://openalex.org/W4319309603, https://openalex.org/W1599985958, https://openalex.org/W1748623649 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2406.16829 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2406.16829 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2406.16829 |
| primary_location.id | pmh:oai:arXiv.org:2406.16829 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2406.16829 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2406.16829 |
| publication_date | 2024-06-24 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 21, 50, 74, 119, 130, 143 |
| abstract_inverted_index.As | 118 |
| abstract_inverted_index.To | 62 |
| abstract_inverted_index.We | 34, 134 |
| abstract_inverted_index.as | 11, 41, 101, 153 |
| abstract_inverted_index.be | 55 |
| abstract_inverted_index.do | 91 |
| abstract_inverted_index.in | 113 |
| abstract_inverted_index.it | 147 |
| abstract_inverted_index.of | 23, 104, 116, 139, 159 |
| abstract_inverted_index.on | 7, 86 |
| abstract_inverted_index.or | 60 |
| abstract_inverted_index.to | 27, 77, 155 |
| abstract_inverted_index.we | 72, 121 |
| abstract_inverted_index.Our | 89 |
| abstract_inverted_index.and | 5, 46, 97 |
| abstract_inverted_index.any | 82 |
| abstract_inverted_index.are | 3 |
| abstract_inverted_index.can | 125 |
| abstract_inverted_index.for | 31, 67 |
| abstract_inverted_index.not | 92 |
| abstract_inverted_index.one | 14, 124 |
| abstract_inverted_index.our | 140 |
| abstract_inverted_index.the | 17, 28, 95, 98, 102, 110, 114, 137, 150, 156, 164 |
| abstract_inverted_index.MPE. | 117 |
| abstract_inverted_index.bias | 52 |
| abstract_inverted_index.case | 115 |
| abstract_inverted_index.each | 68 |
| abstract_inverted_index.from | 81, 129 |
| abstract_inverted_index.into | 20, 163 |
| abstract_inverted_index.list | 22 |
| abstract_inverted_index.more | 58 |
| abstract_inverted_index.must | 15 |
| abstract_inverted_index.show | 35, 122 |
| abstract_inverted_index.such | 40 |
| abstract_inverted_index.that | 36, 53, 123 |
| abstract_inverted_index.this | 64 |
| abstract_inverted_index.with | 57, 109 |
| abstract_inverted_index.(MPE) | 45 |
| abstract_inverted_index.data. | 61, 88 |
| abstract_inverted_index.known | 10 |
| abstract_inverted_index.model | 84, 105 |
| abstract_inverted_index.novel | 75 |
| abstract_inverted_index.runs, | 106 |
| abstract_inverted_index.units | 9 |
| abstract_inverted_index.where | 146 |
| abstract_inverted_index.(BPE), | 48 |
| abstract_inverted_index.above, | 71 |
| abstract_inverted_index.before | 25 |
| abstract_inverted_index.cannot | 54 |
| abstract_inverted_index.encode | 16 |
| abstract_inverted_index.induce | 49 |
| abstract_inverted_index.length | 112 |
| abstract_inverted_index.method | 141, 158 |
| abstract_inverted_index.model, | 96 |
| abstract_inverted_index.model. | 133, 166 |
| abstract_inverted_index.models | 2, 30 |
| abstract_inverted_index.number | 103 |
| abstract_inverted_index.obtain | 78 |
| abstract_inverted_index.prefix | 43 |
| abstract_inverted_index.scales | 107 |
| abstract_inverted_index.scheme | 70 |
| abstract_inverted_index.setup, | 145 |
| abstract_inverted_index.string | 19 |
| abstract_inverted_index.tokens | 24, 162 |
| abstract_inverted_index.verify | 136 |
| abstract_inverted_index.counter | 63 |
| abstract_inverted_index.defined | 100 |
| abstract_inverted_index.maximum | 42 |
| abstract_inverted_index.methods | 90 |
| abstract_inverted_index.operate | 6 |
| abstract_inverted_index.opposed | 154 |
| abstract_inverted_index.passing | 26 |
| abstract_inverted_index.popular | 37 |
| abstract_inverted_index.propose | 73 |
| abstract_inverted_index.require | 93 |
| abstract_inverted_index.result, | 120 |
| abstract_inverted_index.subword | 8 |
| abstract_inverted_index.through | 142 |
| abstract_inverted_index.tokens. | 12 |
| abstract_inverted_index.trained | 85 |
| abstract_inverted_index.behavior | 128 |
| abstract_inverted_index.directly | 160 |
| abstract_inverted_index.encoding | 38, 44, 69 |
| abstract_inverted_index.language | 1, 29, 83, 132, 165 |
| abstract_inverted_index.linearly | 108 |
| abstract_inverted_index.problem, | 66 |
| abstract_inverted_index.recovers | 149 |
| abstract_inverted_index.sampling | 51 |
| abstract_inverted_index.schemes, | 39 |
| abstract_inverted_index.sequence | 111 |
| abstract_inverted_index.simulate | 126 |
| abstract_inverted_index.training | 59 |
| abstract_inverted_index.unbiased | 79 |
| abstract_inverted_index.algorithm | 76 |
| abstract_inverted_index.estimates | 80 |
| abstract_inverted_index.mitigated | 56 |
| abstract_inverted_index.prompting | 161 |
| abstract_inverted_index.tokenized | 87, 131 |
| abstract_inverted_index.universal | 65 |
| abstract_inverted_index.accurately | 148 |
| abstract_inverted_index.finetuning | 94 |
| abstract_inverted_index.next-token | 32 |
| abstract_inverted_index.token-free | 127 |
| abstract_inverted_index.transition | 151 |
| abstract_inverted_index.complexity, | 99 |
| abstract_inverted_index.correctness | 138 |
| abstract_inverted_index.empirically | 135 |
| abstract_inverted_index.prediction. | 33 |
| abstract_inverted_index.Markov-chain | 144 |
| abstract_inverted_index.conditioning | 18 |
| abstract_inverted_index.conventional | 157 |
| abstract_inverted_index.Specifically, | 13 |
| abstract_inverted_index.autoregressive | 4 |
| abstract_inverted_index.probabilities, | 152 |
| abstract_inverted_index.State-of-the-art | 0 |
| abstract_inverted_index.byte-pair-encoding | 47 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |
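
The `abstract_inverted_index.*` rows in the table map each word of the abstract to the zero-based positions where it occurs. As a rough sketch, the abstract text can be rebuilt by placing every word at its positions and joining them in order. The helper name `rebuild_abstract` and the dictionary layout (row keys with the `abstract_inverted_index.` prefix stripped) are assumptions made for this illustration, not part of any OpenAlex client library.

```python
def rebuild_abstract(inverted_index):
    """Place every word at each of its positions, then join in position order."""
    placed = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            placed[pos] = word
    return " ".join(placed[pos] for pos in sorted(placed))


# A few entries copied from the table rows above (word -> positions).
sample = {
    "State-of-the-art": [0],
    "language": [1, 29, 83, 132, 165],
    "models": [2, 30],
    "are": [3],
    "autoregressive": [4],
    "and": [5, 46, 97],
    "operate": [6],
    "on": [7, 86],
}

# Keep only the first eight positions so the partial sample stays contiguous.
opening = rebuild_abstract({w: [p for p in ps if p < 8] for w, ps in sample.items()})
print(opening)  # State-of-the-art language models are autoregressive and operate on
```

Feeding the full set of rows into the same function would yield the complete abstract, since every position from 0 to 166 appears exactly once in the index.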