Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2409.04701
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term late in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
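The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes the long-context model has already produced one contextualized vector per token for the whole document; the function name `late_chunk_pool`, the toy vectors, and the token spans are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings over each half-open [start, end) token span.

    The token embeddings are assumed to come from a single forward pass over
    the full text, so every vector already reflects the surrounding chunks.
    """
    n_tokens = len(token_embeddings)
    chunk_embeddings: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span ({start}, {end}) for {n_tokens} tokens")
        window = token_embeddings[start:end]
        dim = len(window[0])
        # Average each dimension across the tokens that fall in this chunk.
        chunk_embeddings.append(
            [sum(tok[d] for tok in window) / len(window) for d in range(dim)]
        )
    return chunk_embeddings


# Toy stand-in for the transformer output (4 tokens, 2 dimensions); a real
# model would supply these vectors.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
spans = [(0, 2), (2, 4)]  # hypothetical chunk boundaries, in token indices
chunks = late_chunk_pool(token_embeddings, spans)
print(chunks)  # [[2.0, 1.0], [6.0, 5.0]]
```

By contrast, the conventional approach would run the model separately on each chunk's text, so the vectors being pooled would carry no information from neighbouring chunks.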
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2409.04701, https://arxiv.org/pdf/2409.04701
- OA Status: green
- Cited By: 4
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403882995
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403882995 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2409.04701 (Digital Object Identifier)
- Title: Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-09-07 (full publication date if available)
- Authors: Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao (list of authors in order)
- Landing page: https://arxiv.org/abs/2409.04701 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2409.04701 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2409.04701 (direct OA link when available)
- Concepts: Chunking (psychology), Embedding, Context (archaeology), Computer science, Artificial intelligence, Natural language processing, History, Archaeology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 4 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 4 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4403882995 |
| doi | https://doi.org/10.48550/arxiv.2409.04701 |
| ids.doi | https://doi.org/10.48550/arxiv.2409.04701 |
| ids.openalex | https://openalex.org/W4403882995 |
| fwci | |
| type | preprint |
| title | Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11980 |
| topics[0].field.id | https://openalex.org/fields/33 |
| topics[0].field.display_name | Social Sciences |
| topics[0].score | 0.4650000035762787 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3313 |
| topics[0].subfield.display_name | Transportation |
| topics[0].display_name | Human Mobility and Location-Based Analysis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C203357204 |
| concepts[0].level | 2 |
| concepts[0].score | 0.769059419631958 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1089605 |
| concepts[0].display_name | Chunking (psychology) |
| concepts[1].id | https://openalex.org/C41608201 |
| concepts[1].level | 2 |
| concepts[1].score | 0.738793671131134 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q980509 |
| concepts[1].display_name | Embedding |
| concepts[2].id | https://openalex.org/C2779343474 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6626158952713013 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[2].display_name | Context (archaeology) |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5334805250167847 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.4723430871963501 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C204321447 |
| concepts[5].level | 1 |
| concepts[5].score | 0.41162747144699097 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[5].display_name | Natural language processing |
| concepts[6].id | https://openalex.org/C95457728 |
| concepts[6].level | 0 |
| concepts[6].score | 0.16120001673698425 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q309 |
| concepts[6].display_name | History |
| concepts[7].id | https://openalex.org/C166957645 |
| concepts[7].level | 1 |
| concepts[7].score | 0.04565304517745972 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q23498 |
| concepts[7].display_name | Archaeology |
| keywords[0].id | https://openalex.org/keywords/chunking |
| keywords[0].score | 0.769059419631958 |
| keywords[0].display_name | Chunking (psychology) |
| keywords[1].id | https://openalex.org/keywords/embedding |
| keywords[1].score | 0.738793671131134 |
| keywords[1].display_name | Embedding |
| keywords[2].id | https://openalex.org/keywords/context |
| keywords[2].score | 0.6626158952713013 |
| keywords[2].display_name | Context (archaeology) |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5334805250167847 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.4723430871963501 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/natural-language-processing |
| keywords[5].score | 0.41162747144699097 |
| keywords[5].display_name | Natural language processing |
| keywords[6].id | https://openalex.org/keywords/history |
| keywords[6].score | 0.16120001673698425 |
| keywords[6].display_name | History |
| keywords[7].id | https://openalex.org/keywords/archaeology |
| keywords[7].score | 0.04565304517745972 |
| keywords[7].display_name | Archaeology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2409.04701 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2409.04701 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2409.04701 |
| locations[1].id | doi:10.48550/arxiv.2409.04701 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2409.04701 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5080524539 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-2195-4300 |
| authorships[0].author.display_name | Michael Günther |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Günther, Michael |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5019723720 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-7354-0241 |
| authorships[1].author.display_name | Isabelle Mohr |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Mohr, Isabelle |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5061741619 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Derek Leslie Williams |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Williams, Daniel James |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100742939 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5635-7971 |
| authorships[3].author.display_name | Bo Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Bo |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5101913009 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-9555-4342 |
| authorships[4].author.display_name | Han Xiao |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Xiao, Han |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2409.04701 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11980 |
| primary_topic.field.id | https://openalex.org/fields/33 |
| primary_topic.field.display_name | Social Sciences |
| primary_topic.score | 0.4650000035762787 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3313 |
| primary_topic.subfield.display_name | Transportation |
| primary_topic.display_name | Human Mobility and Location-Based Analysis |
| related_works | https://openalex.org/W2384729545, https://openalex.org/W2198395236, https://openalex.org/W2800417007, https://openalex.org/W147604216, https://openalex.org/W2161080928, https://openalex.org/W4245487161, https://openalex.org/W2090755435, https://openalex.org/W2039036070, https://openalex.org/W2153813398, https://openalex.org/W3204019825 |
| cited_by_count | 4 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 4 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2409.04701 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2409.04701 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2409.04701 |
| primary_location.id | pmh:oai:arXiv.org:2409.04701 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2409.04701 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2409.04701 |
| publication_date | 2024-09-07 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.- | 102 |
| abstract_inverted_index.a | 69, 136, 158 |
| abstract_inverted_index.In | 64 |
| abstract_inverted_index.To | 148 |
| abstract_inverted_index.as | 21 |
| abstract_inverted_index.be | 28, 133 |
| abstract_inverted_index.in | 30, 50, 61, 107 |
| abstract_inverted_index.is | 129 |
| abstract_inverted_index.of | 7, 86, 139, 153 |
| abstract_inverted_index.to | 27, 81, 120, 132, 135 |
| abstract_inverted_index.we | 67, 156 |
| abstract_inverted_index.The | 110, 127 |
| abstract_inverted_index.all | 84 |
| abstract_inverted_index.and | 9, 42, 97, 143 |
| abstract_inverted_index.are | 24 |
| abstract_inverted_index.can | 53 |
| abstract_inverted_index.for | 162 |
| abstract_inverted_index.its | 108 |
| abstract_inverted_index.the | 22, 31, 87, 94, 104, 115, 151 |
| abstract_inverted_index.use | 1 |
| abstract_inverted_index.way | 52 |
| abstract_inverted_index.Many | 0 |
| abstract_inverted_index.from | 57 |
| abstract_inverted_index.full | 116 |
| abstract_inverted_index.into | 39 |
| abstract_inverted_index.just | 98 |
| abstract_inverted_index.late | 73, 106, 154 |
| abstract_inverted_index.less | 25 |
| abstract_inverted_index.long | 77, 88 |
| abstract_inverted_index.lose | 54 |
| abstract_inverted_index.mean | 100 |
| abstract_inverted_index.term | 105 |
| abstract_inverted_index.text | 19, 37 |
| abstract_inverted_index.them | 44 |
| abstract_inverted_index.this | 51, 65 |
| abstract_inverted_index.wide | 137 |
| abstract_inverted_index.with | 17, 90 |
| abstract_inverted_index.after | 93 |
| abstract_inverted_index.cases | 2 |
| abstract_inverted_index.chunk | 47, 112 |
| abstract_inverted_index.dense | 10 |
| abstract_inverted_index.embed | 83 |
| abstract_inverted_index.first | 82 |
| abstract_inverted_index.hence | 103 |
| abstract_inverted_index.model | 96 |
| abstract_inverted_index.novel | 70 |
| abstract_inverted_index.often | 14, 35 |
| abstract_inverted_index.range | 138 |
| abstract_inverted_index.split | 36 |
| abstract_inverted_index.text, | 8, 89 |
| abstract_inverted_index.which | 75 |
| abstract_inverted_index.works | 144 |
| abstract_inverted_index.across | 123 |
| abstract_inverted_index.before | 99 |
| abstract_inverted_index.better | 16 |
| abstract_inverted_index.called | 72 |
| abstract_inverted_index.chunks | 41 |
| abstract_inverted_index.encode | 43 |
| abstract_inverted_index.enough | 131 |
| abstract_inverted_index.likely | 26 |
| abstract_inverted_index.method | 71, 128 |
| abstract_inverted_index.models | 80, 142 |
| abstract_inverted_index.paper, | 66 |
| abstract_inverted_index.tasks. | 126 |
| abstract_inverted_index.tokens | 85 |
| abstract_inverted_index.applied | 92, 134 |
| abstract_inverted_index.capture | 114 |
| abstract_inverted_index.chunks, | 59 |
| abstract_inverted_index.context | 78 |
| abstract_inverted_index.created | 49 |
| abstract_inverted_index.further | 149 |
| abstract_inverted_index.generic | 130 |
| abstract_inverted_index.leading | 119 |
| abstract_inverted_index.models. | 164 |
| abstract_inverted_index.naming. | 109 |
| abstract_inverted_index.perform | 15 |
| abstract_inverted_index.pooling | 101 |
| abstract_inverted_index.propose | 157 |
| abstract_inverted_index.require | 3 |
| abstract_inverted_index.results | 122 |
| abstract_inverted_index.shorter | 18 |
| abstract_inverted_index.smaller | 5, 40 |
| abstract_inverted_index.systems | 13 |
| abstract_inverted_index.various | 124 |
| abstract_inverted_index.without | 145 |
| abstract_inverted_index.However, | 46 |
| abstract_inverted_index.approach | 161 |
| abstract_inverted_index.chunking | 91 |
| abstract_inverted_index.increase | 150 |
| abstract_inverted_index.portions | 6 |
| abstract_inverted_index.superior | 121 |
| abstract_inverted_index.chunking, | 74, 155 |
| abstract_inverted_index.dedicated | 159 |
| abstract_inverted_index.documents | 38 |
| abstract_inverted_index.embedding | 79, 141, 163 |
| abstract_inverted_index.introduce | 68 |
| abstract_inverted_index.leverages | 76 |
| abstract_inverted_index.resulting | 60, 111 |
| abstract_inverted_index.retrieval | 12, 125 |
| abstract_inverted_index.segments, | 20 |
| abstract_inverted_index.semantics | 23 |
| abstract_inverted_index.training. | 147 |
| abstract_inverted_index.additional | 146 |
| abstract_inverted_index.contextual | 55, 117 |
| abstract_inverted_index.embeddings | 48, 113 |
| abstract_inverted_index.retrieving | 4 |
| abstract_inverted_index.embeddings. | 32 |
| abstract_inverted_index.fine-tuning | 160 |
| abstract_inverted_index.information | 56 |
| abstract_inverted_index.separately. | 45 |
| abstract_inverted_index.sub-optimal | 62 |
| abstract_inverted_index.surrounding | 58 |
| abstract_inverted_index.transformer | 95 |
| abstract_inverted_index.information, | 118 |
| abstract_inverted_index.long-context | 140 |
| abstract_inverted_index.vector-based | 11 |
| abstract_inverted_index.Consequently, | 33 |
| abstract_inverted_index.effectiveness | 152 |
| abstract_inverted_index.practitioners | 34 |
| abstract_inverted_index.over-compressed | 29 |
| abstract_inverted_index.representations. | 63 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile | |
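The `abstract_inverted_index.*` rows in the payload map each word of the abstract to the positions where it occurs. A minimal sketch of turning such an index back into running text follows. It assumes the index is available as a mapping from word to a list of integer positions; the helper name `rebuild_abstract` is hypothetical, and the sample contains only the entries for positions 0 to 8, copied by hand from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Reassemble text from a word -> positions mapping."""
    by_position: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt only: each word's later positions (e.g. "smaller" at 40) are omitted.
sample_index = {
    "Many": [0],
    "use": [1],
    "cases": [2],
    "require": [3],
    "retrieving": [4],
    "smaller": [5],
    "portions": [6],
    "of": [7],
    "text,": [8],
}
opening = rebuild_abstract(sample_index)
print(opening)  # Many use cases require retrieving smaller portions of text,
```

Applied to the full index, the same function should reproduce the abstract shown at the top of the page, including the standalone hyphen stored at position 102.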