GEM: Empowering LLM for both Embedding Generation and Language Understanding
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2506.04344
Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval-augmented generation (RAG), still rely on separate embedding models to generate text embeddings, which can complicate the system and introduce discrepancies between how the embedding model and the LLM understand the query. To address this limitation, we propose a simple self-supervised approach, Generative Embedding large language Model (GEM), that enables any large decoder-only LLM to generate high-quality text embeddings while maintaining its original text generation and reasoning capabilities. Our method inserts new special token(s) into a text body and generates a summarization embedding of the text by manipulating the attention mask. The method can be easily integrated into the post-training or fine-tuning stage of any existing LLM. We demonstrate the effectiveness of our approach by applying it to two popular LLM families, ranging from 1B to 8B parameters, and evaluating the transformed models on both text embedding benchmarks (MTEB) and NLP benchmarks (MMLU). The results show that our proposed method significantly improves the original LLMs on MTEB while having a minimal impact on MMLU. These strong results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
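The abstract says only that special tokens are inserted into the text and that the attention mask is manipulated; it does not give the exact mask pattern. The sketch below shows one plausible reading, under the assumption that tokens after the special tokens are blocked from the raw preceding text and can reach it only through the special tokens, which would push those tokens to summarize the text. The function names `build_summary_mask` and `special_positions` and the three-segment layout are hypothetical, not taken from the paper.

```python
from typing import List


def build_summary_mask(prefix_len: int, n_special: int, suffix_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): [prefix text][special tokens][suffix text].
    """
    total = prefix_len + n_special + suffix_len
    special_end = prefix_len + n_special
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: never attend to future positions
            # Assumption: suffix tokens cannot see the raw prefix, so any
            # information about it must flow through the special tokens.
            if i >= special_end and j < prefix_len:
                continue
            mask[i][j] = True
    return mask


def special_positions(prefix_len: int, n_special: int) -> List[int]:
    """Indices whose final hidden states would be read out as the embedding."""
    return list(range(prefix_len, prefix_len + n_special))


if __name__ == "__main__":
    # 3 prefix tokens, 1 special token, 2 suffix tokens.
    for row in build_summary_mask(3, 1, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch the prefix and special-token rows keep the ordinary causal pattern, so standard next-token training on the prefix is unaffected; only the suffix rows differ from a plain causal mask.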
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2506.04344, https://arxiv.org/pdf/2506.04344
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416131795
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416131795 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2506.04344 (Digital Object Identifier)
- Title: GEM: Empowering LLM for both Embedding Generation and Language Understanding (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-06-04 (full publication date if available)
- Authors: C. Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Lizhu Zhang, Xiangjun Fan (list of authors in order)
- Landing page: https://arxiv.org/abs/2506.04344 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2506.04344 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2506.04344 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4416131795 |
| doi | https://doi.org/10.48550/arxiv.2506.04344 |
| ids.doi | https://doi.org/10.48550/arxiv.2506.04344 |
| ids.openalex | https://openalex.org/W4416131795 |
| fwci | |
| type | preprint |
| title | GEM: Empowering LLM for both Embedding Generation and Language Understanding |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2506.04344 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2506.04344 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2506.04344 |
| locations[1].id | doi:10.48550/arxiv.2506.04344 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2506.04344 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5091183429 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | C. Zhang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhang, Caojin |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5115597562 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-7882-5489 |
| authorships[1].author.display_name | Qiang Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Qiang |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100655785 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-7199-9139 |
| authorships[2].author.display_name | Ke Li |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Li, Ke |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5008151704 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Sai Vidyaranya Nuthalapati |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Nuthalapati, Sai Vidyaranya |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5053072892 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Benyu Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Benyu |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5021569903 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Lizhu Zhang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zhang, Lizhu |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5001542440 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Xiangjun Fan |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Fan, Xiangjun |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2506.04344 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | GEM: Empowering LLM for both Embedding Generation and Language Understanding |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T05:54:34.280854 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2506.04344 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2506.04344 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2506.04344 |
| primary_location.id | pmh:oai:arXiv.org:2506.04344 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2506.04344 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2506.04344 |
| publication_date | 2025-06-04 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 64, 101, 184 |
| abstract_inverted_index.1B | 149 |
| abstract_inverted_index.8B | 151 |
| abstract_inverted_index.To | 58 |
| abstract_inverted_index.We | 132 |
| abstract_inverted_index.be | 119 |
| abstract_inverted_index.by | 111, 139 |
| abstract_inverted_index.in | 9, 47 |
| abstract_inverted_index.it | 141 |
| abstract_inverted_index.of | 49, 108, 128, 136 |
| abstract_inverted_index.on | 31, 158, 180, 187 |
| abstract_inverted_index.or | 124 |
| abstract_inverted_index.to | 35, 80, 142, 150 |
| abstract_inverted_index.we | 62 |
| abstract_inverted_index.LLM | 79, 145 |
| abstract_inverted_index.NLP | 165, 208 |
| abstract_inverted_index.Our | 94, 189 |
| abstract_inverted_index.The | 168 |
| abstract_inverted_index.and | 11, 44, 56, 91, 104, 153, 164 |
| abstract_inverted_index.any | 76, 129 |
| abstract_inverted_index.can | 40, 196 |
| abstract_inverted_index.its | 87 |
| abstract_inverted_index.new | 97 |
| abstract_inverted_index.our | 137, 172, 194 |
| abstract_inverted_index.the | 42, 50, 53, 109, 113, 134, 155, 177 |
| abstract_inverted_index.two | 143 |
| abstract_inverted_index.LLMs | 179, 198 |
| abstract_inverted_index.MTEB | 181 |
| abstract_inverted_index.This | 116 |
| abstract_inverted_index.both | 159 |
| abstract_inverted_index.fine | 125 |
| abstract_inverted_index.from | 148 |
| abstract_inverted_index.have | 5 |
| abstract_inverted_index.into | 100, 122 |
| abstract_inverted_index.many | 22 |
| abstract_inverted_index.rely | 30 |
| abstract_inverted_index.show | 170 |
| abstract_inverted_index.text | 17, 37, 83, 89, 102, 110, 160, 201 |
| abstract_inverted_index.that | 74, 171, 193 |
| abstract_inverted_index.they | 15 |
| abstract_inverted_index.this | 60 |
| abstract_inverted_index.with | 199 |
| abstract_inverted_index.LLMs. | 57, 131 |
| abstract_inverted_index.Large | 0 |
| abstract_inverted_index.MMLU. | 188 |
| abstract_inverted_index.Model | 72 |
| abstract_inverted_index.body, | 103 |
| abstract_inverted_index.could | 118 |
| abstract_inverted_index.e.g., | 24 |
| abstract_inverted_index.given | 19 |
| abstract_inverted_index.large | 70, 77 |
| abstract_inverted_index.mask. | 115 |
| abstract_inverted_index.model | 55 |
| abstract_inverted_index.query | 51 |
| abstract_inverted_index.still | 29 |
| abstract_inverted_index.their | 206 |
| abstract_inverted_index.where | 14 |
| abstract_inverted_index.which | 39 |
| abstract_inverted_index.while | 85, 182, 204 |
| abstract_inverted_index.(GEM), | 73 |
| abstract_inverted_index.(LLMs) | 4 |
| abstract_inverted_index.(MTEB) | 163 |
| abstract_inverted_index.(RAG), | 28 |
| abstract_inverted_index.easily | 120 |
| abstract_inverted_index.having | 183 |
| abstract_inverted_index.impact | 186 |
| abstract_inverted_index.method | 95, 117, 174 |
| abstract_inverted_index.models | 3, 34, 157 |
| abstract_inverted_index.simple | 65 |
| abstract_inverted_index.stages | 127 |
| abstract_inverted_index.strong | 190 |
| abstract_inverted_index.system | 43 |
| abstract_inverted_index.tasks, | 13 |
| abstract_inverted_index.tuning | 126 |
| abstract_inverted_index.(MMLU). | 167 |
| abstract_inverted_index.address | 59 |
| abstract_inverted_index.between | 52 |
| abstract_inverted_index.empower | 197 |
| abstract_inverted_index.enables | 75 |
| abstract_inverted_index.inserts | 96 |
| abstract_inverted_index.minimal | 185 |
| abstract_inverted_index.popular | 144 |
| abstract_inverted_index.propose | 63 |
| abstract_inverted_index.ranging | 147 |
| abstract_inverted_index.results | 169, 191 |
| abstract_inverted_index.special | 98 |
| abstract_inverted_index.success | 8 |
| abstract_inverted_index.However, | 21 |
| abstract_inverted_index.achieved | 6 |
| abstract_inverted_index.applying | 140 |
| abstract_inverted_index.approach | 138, 195 |
| abstract_inverted_index.existing | 130 |
| abstract_inverted_index.generate | 16, 36, 81 |
| abstract_inverted_index.improves | 176 |
| abstract_inverted_index.indicate | 192 |
| abstract_inverted_index.language | 2, 71 |
| abstract_inverted_index.original | 88, 178, 207 |
| abstract_inverted_index.proposed | 173 |
| abstract_inverted_index.separate | 32 |
| abstract_inverted_index.token(s) | 99 |
| abstract_inverted_index.Embedding | 69 |
| abstract_inverted_index.approach, | 67 |
| abstract_inverted_index.attention | 114 |
| abstract_inverted_index.augmented | 26 |
| abstract_inverted_index.embedding | 33, 54, 107, 161, 202 |
| abstract_inverted_index.families, | 146 |
| abstract_inverted_index.generates | 105 |
| abstract_inverted_index.introduce | 45 |
| abstract_inverted_index.reasoning | 12, 92 |
| abstract_inverted_index.responses | 18 |
| abstract_inverted_index.retrieval | 25 |
| abstract_inverted_index.Generative | 68 |
| abstract_inverted_index.benchmarks | 162, 166 |
| abstract_inverted_index.complicate | 41 |
| abstract_inverted_index.embeddings | 84 |
| abstract_inverted_index.evaluating | 154 |
| abstract_inverted_index.generation | 10, 27, 90 |
| abstract_inverted_index.integrated | 121 |
| abstract_inverted_index.remarkable | 7 |
| abstract_inverted_index.demonstrate | 133 |
| abstract_inverted_index.embeddings, | 38 |
| abstract_inverted_index.limitation, | 61 |
| abstract_inverted_index.maintaining | 86, 205 |
| abstract_inverted_index.parameters, | 152 |
| abstract_inverted_index.performance | 209 |
| abstract_inverted_index.transformed | 156 |
| abstract_inverted_index.capabilities | 203 |
| abstract_inverted_index.decoder-only | 1, 78 |
| abstract_inverted_index.high-quality | 82 |
| abstract_inverted_index.manipulating | 112 |
| abstract_inverted_index.applications, | 23 |
| abstract_inverted_index.capabilities. | 93 |
| abstract_inverted_index.discrepancies | 46 |
| abstract_inverted_index.effectiveness | 135 |
| abstract_inverted_index.instructions. | 20 |
| abstract_inverted_index.post-training | 123 |
| abstract_inverted_index.significantly | 175 |
| abstract_inverted_index.summarization | 106 |
| abstract_inverted_index.understanding | 48 |
| abstract_inverted_index.self-supervised | 66 |
| abstract_inverted_index.state-of-the-art | 200 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile |
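
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the zero-based positions where it occurs. A minimal sketch of turning such a map back into running text follows; the helper name `rebuild_abstract` is hypothetical, and `sample` is a hand-copied excerpt covering only the first six positions from the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions map by sorting on position."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


if __name__ == "__main__":
    # Excerpt only: positions 0-5 as listed in the payload table.
    sample = {
        "Large": [0],
        "decoder-only": [1],
        "language": [2],
        "models": [3],
        "(LLMs)": [4],
        "have": [5],
    }
    print(rebuild_abstract(sample))
```

Because punctuation is stored as part of each token (for example `LLMs.` and `(MTEB)`), joining on single spaces is enough to recover the original wording.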