Weight subcloning: direct initialization of transformers using larger pretrained ones
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2312.09299
Training large transformer models from scratch for a target task requires large amounts of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with the weights of a pretrained model of the same size and specification, which speeds up convergence and training. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps. First, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, with significantly faster training than random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
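The two steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how neuron importance is computed or which blocks are removed, so the criterion used here (summed absolute weight per embedding index), the evenly spaced block selection, the reduction of each block to a single square matrix, and the function names `neuron_importance` and `subclone` are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # one square (d x d) weight matrix per block (simplification)


def neuron_importance(blocks: List[Matrix]) -> List[float]:
    """Score each embedding index by its summed |weight| over all blocks.

    ASSUMPTION: this magnitude criterion is a stand-in; the abstract does not
    define the ranking used in the paper.
    """
    d = len(blocks[0])
    scores = [0.0] * d
    for w in blocks:
        for i in range(d):
            row_mass = sum(abs(w[i][j]) for j in range(d))  # weights in row i
            col_mass = sum(abs(w[j][i]) for j in range(d))  # weights in column i
            scores[i] += row_mass + col_mass
    return scores


def subclone(
    blocks: List[Matrix], target_dim: int, target_depth: int
) -> Tuple[List[Matrix], List[int]]:
    """Return a smaller stack of blocks plus the kept embedding indices."""
    depth, d = len(blocks), len(blocks[0])
    if not (1 <= target_dim <= d and 1 <= target_depth <= depth):
        raise ValueError("target sizes must not exceed the pretrained sizes")

    # Step 1: rank embedding indices and keep the top target_dim of them.
    # ASSUMPTION: one shared index set is used for every block so that
    # dimensions line up from block to block.
    scores = neuron_importance(blocks)
    ranked = sorted(range(d), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:target_dim])
    sliced = [[[w[i][j] for j in keep] for i in keep] for w in blocks]

    # Step 2: remove blocks to match the target depth.
    # ASSUMPTION: evenly spaced blocks are kept, always including the first.
    if target_depth == 1:
        chosen = [0]
    else:
        chosen = [(k * (depth - 1)) // (target_depth - 1) for k in range(target_depth)]
    return [sliced[k] for k in chosen], keep
```

Calling `subclone(blocks, target_dim, target_depth)` on a list of pretrained block matrices returns the reduced blocks, which would then serve as the initialization for ordinary training of the smaller network.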
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2312.09299, https://arxiv.org/pdf/2312.09299
- OA Status: green
- Cited By: 2
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4389911345
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4389911345 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2312.09299 (Digital Object Identifier)
- Title: Weight subcloning: direct initialization of transformers using larger pretrained ones (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-12-14 (full publication date if available)
- Authors: Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari (list of authors in order)
- Landing page: https://arxiv.org/abs/2312.09299 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2312.09299 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2312.09299 (direct OA link when available)
- Concepts: Initialization, Computer science, Transformer, Artificial intelligence, Language model, Machine learning, Voltage, Quantum mechanics, Physics, Programming language (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 2 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1, 2024: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4389911345 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2312.09299 |
| ids.doi | https://doi.org/10.48550/arxiv.2312.09299 |
| ids.openalex | https://openalex.org/W4389911345 |
| fwci | |
| type | preprint |
| title | Weight subcloning: direct initialization of transformers using larger pretrained ones |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11689 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9882000088691711 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Adversarial Robustness in Machine Learning |
| topics[1].id | https://openalex.org/T12026 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9247999787330627 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Explainable Artificial Intelligence (XAI) |
| topics[2].id | https://openalex.org/T12357 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9010000228881836 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Digital Media Forensic Detection |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C114466953 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8658191561698914 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6034165 |
| concepts[0].display_name | Initialization |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.7842246294021606 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C66322947 |
| concepts[2].level | 3 |
| concepts[2].score | 0.7393799424171448 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[2].display_name | Transformer |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5908859372138977 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C137293760 |
| concepts[4].level | 2 |
| concepts[4].score | 0.41654258966445923 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[4].display_name | Language model |
| concepts[5].id | https://openalex.org/C119857082 |
| concepts[5].level | 1 |
| concepts[5].score | 0.4106833338737488 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[5].display_name | Machine learning |
| concepts[6].id | https://openalex.org/C165801399 |
| concepts[6].level | 2 |
| concepts[6].score | 0.10994237661361694 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[6].display_name | Voltage |
| concepts[7].id | https://openalex.org/C62520636 |
| concepts[7].level | 1 |
| concepts[7].score | 0.0 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[7].display_name | Quantum mechanics |
| concepts[8].id | https://openalex.org/C121332964 |
| concepts[8].level | 0 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[8].display_name | Physics |
| concepts[9].id | https://openalex.org/C199360897 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[9].display_name | Programming language |
| keywords[0].id | https://openalex.org/keywords/initialization |
| keywords[0].score | 0.8658191561698914 |
| keywords[0].display_name | Initialization |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.7842246294021606 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/transformer |
| keywords[2].score | 0.7393799424171448 |
| keywords[2].display_name | Transformer |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.5908859372138977 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/language-model |
| keywords[4].score | 0.41654258966445923 |
| keywords[4].display_name | Language model |
| keywords[5].id | https://openalex.org/keywords/machine-learning |
| keywords[5].score | 0.4106833338737488 |
| keywords[5].display_name | Machine learning |
| keywords[6].id | https://openalex.org/keywords/voltage |
| keywords[6].score | 0.10994237661361694 |
| keywords[6].display_name | Voltage |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2312.09299 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2312.09299 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2312.09299 |
| locations[1].id | doi:10.48550/arxiv.2312.09299 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2312.09299 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5033081533 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-1404-527X |
| authorships[0].author.display_name | Mohammad Samragh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Samragh, Mohammad |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5050499655 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5510-518X |
| authorships[1].author.display_name | Mehrdad Farajtabar |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Farajtabar, Mehrdad |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5074132108 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-5420-4725 |
| authorships[2].author.display_name | Sachin Mehta |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Mehta, Sachin |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5071825172 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-0425-7797 |
| authorships[3].author.display_name | Raviteja Vemulapalli |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Vemulapalli, Raviteja |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5036601505 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5975-5158 |
| authorships[4].author.display_name | Fartash Faghri |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Faghri, Fartash |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5113880378 |
| authorships[5].author.orcid | https://orcid.org/0009-0007-7838-1623 |
| authorships[5].author.display_name | Devang Naik |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Naik, Devang |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5028613002 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Oncel Tuzel |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Tuzel, Oncel |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5056246621 |
| authorships[7].author.orcid | https://orcid.org/0000-0001-9606-3687 |
| authorships[7].author.display_name | Mohammad Rastegari |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Rastegari, Mohammad |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2312.09299 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Weight subcloning: direct initialization of transformers using larger pretrained ones |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11689 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9882000088691711 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Adversarial Robustness in Machine Learning |
| related_works | https://openalex.org/W2961085424, https://openalex.org/W4306674287, https://openalex.org/W3046775127, https://openalex.org/W3107602296, https://openalex.org/W3170094116, https://openalex.org/W4386462264, https://openalex.org/W4364306694, https://openalex.org/W4312192474, https://openalex.org/W4283697347, https://openalex.org/W4210805261 |
| cited_by_count | 2 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2312.09299 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2312.09299 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2312.09299 |
| primary_location.id | pmh:oai:arXiv.org:2312.09299 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2312.09299 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2312.09299 |
| publication_date | 2023-12-14 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 7, 34, 67, 77, 162 |
| abstract_inverted_index.4x | 183 |
| abstract_inverted_index.In | 62 |
| abstract_inverted_index.It | 118 |
| abstract_inverted_index.an | 105 |
| abstract_inverted_index.by | 27, 94 |
| abstract_inverted_index.if | 52 |
| abstract_inverted_index.in | 137, 155, 172, 189 |
| abstract_inverted_index.is | 15, 60, 161 |
| abstract_inverted_index.no | 53 |
| abstract_inverted_index.of | 12, 21, 33, 37, 56, 76, 91, 120, 153 |
| abstract_inverted_index.on | 107 |
| abstract_inverted_index.to | 43, 72, 80, 111, 130, 149, 165, 176 |
| abstract_inverted_index.we | 65, 125, 142, 181 |
| abstract_inverted_index.For | 179 |
| abstract_inverted_index.Our | 83 |
| abstract_inverted_index.The | 18, 159 |
| abstract_inverted_index.and | 14, 41, 47, 192 |
| abstract_inverted_index.for | 6, 186, 196 |
| abstract_inverted_index.key | 122 |
| abstract_inverted_index.per | 135 |
| abstract_inverted_index.the | 29, 38, 45, 57, 74, 89, 108, 113, 132, 138, 146, 151, 156 |
| abstract_inverted_index.two | 121 |
| abstract_inverted_index.yet | 69 |
| abstract_inverted_index.data | 13 |
| abstract_inverted_index.from | 4, 98, 145 |
| abstract_inverted_index.lots | 11 |
| abstract_inverted_index.next | 197 |
| abstract_inverted_index.same | 39 |
| abstract_inverted_index.size | 40, 59 |
| abstract_inverted_index.task | 9 |
| abstract_inverted_index.this | 25, 63 |
| abstract_inverted_index.what | 51 |
| abstract_inverted_index.with | 31 |
| abstract_inverted_index.Then, | 141 |
| abstract_inverted_index.gains | 169 |
| abstract_inverted_index.image | 190 |
| abstract_inverted_index.large | 1 |
| abstract_inverted_index.layer | 136 |
| abstract_inverted_index.match | 150 |
| abstract_inverted_index.model | 30, 36, 55, 79, 110, 148 |
| abstract_inverted_index.ready | 164 |
| abstract_inverted_index.speed | 174 |
| abstract_inverted_index.their | 96 |
| abstract_inverted_index.token | 198 |
| abstract_inverted_index.usual | 19 |
| abstract_inverted_index.which | 168 |
| abstract_inverted_index.Weight | 102 |
| abstract_inverted_index.blocks | 144 |
| abstract_inverted_index.called | 85 |
| abstract_inverted_index.faster | 184 |
| abstract_inverted_index.first, | 124 |
| abstract_inverted_index.larger | 99 |
| abstract_inverted_index.layers | 154 |
| abstract_inverted_index.model. | 117, 140 |
| abstract_inverted_index.models | 3, 194 |
| abstract_inverted_index.neuron | 127 |
| abstract_inverted_index.number | 152 |
| abstract_inverted_index.obtain | 112 |
| abstract_inverted_index.paper, | 64 |
| abstract_inverted_index.random | 177 |
| abstract_inverted_index.remove | 143 |
| abstract_inverted_index.result | 160 |
| abstract_inverted_index.simple | 68 |
| abstract_inverted_index.speed. | 49 |
| abstract_inverted_index.steps: | 123 |
| abstract_inverted_index.target | 8 |
| abstract_inverted_index.vision | 187 |
| abstract_inverted_index.weight | 86 |
| abstract_inverted_index.achieve | 182 |
| abstract_inverted_index.models. | 101 |
| abstract_inverted_index.network | 163 |
| abstract_inverted_index.ranking | 129 |
| abstract_inverted_index.scratch | 5 |
| abstract_inverted_index.smaller | 81 |
| abstract_inverted_index.undergo | 166 |
| abstract_inverted_index.weights | 32, 97 |
| abstract_inverted_index.However, | 50 |
| abstract_inverted_index.Training | 0 |
| abstract_inverted_index.approach | 84 |
| abstract_inverted_index.compared | 175 |
| abstract_inverted_index.consists | 119 |
| abstract_inverted_index.decrease | 131 |
| abstract_inverted_index.designed | 195 |
| abstract_inverted_index.increase | 44 |
| abstract_inverted_index.involves | 104 |
| abstract_inverted_index.language | 193 |
| abstract_inverted_index.learning | 23 |
| abstract_inverted_index.network. | 158 |
| abstract_inverted_index.practice | 20 |
| abstract_inverted_index.required | 58 |
| abstract_inverted_index.requires | 10 |
| abstract_inverted_index.training | 48, 90, 173, 185 |
| abstract_inverted_index.transfer | 22, 73 |
| abstract_inverted_index.challenge | 26 |
| abstract_inverted_index.dimension | 134 |
| abstract_inverted_index.effective | 70 |
| abstract_inverted_index.embedding | 133 |
| abstract_inverted_index.expedites | 88 |
| abstract_inverted_index.instance, | 180 |
| abstract_inverted_index.introduce | 66, 126 |
| abstract_inverted_index.knowledge | 75 |
| abstract_inverted_index.operation | 106 |
| abstract_inverted_index.overcomes | 24 |
| abstract_inverted_index.technique | 71 |
| abstract_inverted_index.training, | 167 |
| abstract_inverted_index.variants. | 82 |
| abstract_inverted_index.available? | 61 |
| abstract_inverted_index.demanding. | 17 |
| abstract_inverted_index.equivalent | 114 |
| abstract_inverted_index.importance | 128 |
| abstract_inverted_index.pretrained | 35, 54, 78, 100, 109, 139 |
| abstract_inverted_index.subcloning | 87, 103 |
| abstract_inverted_index.convergence | 46 |
| abstract_inverted_index.initialized | 115 |
| abstract_inverted_index.prediction. | 199 |
| abstract_inverted_index.scaled-down | 92, 116, 157 |
| abstract_inverted_index.significant | 170 |
| abstract_inverted_index.transformer | 2, 147 |
| abstract_inverted_index.improvements | 171 |
| abstract_inverted_index.initializing | 28, 95 |
| abstract_inverted_index.transformers | 93, 188 |
| abstract_inverted_index.specification | 42 |
| abstract_inverted_index.classification | 191 |
| abstract_inverted_index.computationally | 16 |
| abstract_inverted_index.initialization. | 178 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile |
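
The `abstract_inverted_index.*` rows in the table above map each word of the abstract to its zero-based positions. A minimal sketch of turning such a mapping back into running text is shown below; the function name `rebuild_abstract` is hypothetical, and the sample dictionary is a small excerpt of the rows above with the position lists truncated to the first six words.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Place every word at each of its positions, then join in position order."""
    by_position: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt of the rows above, truncated to positions 0-5 for illustration.
sample = {
    "Training": [0],
    "large": [1],
    "transformer": [2],
    "models": [3],
    "from": [4],
    "scratch": [5],
}
print(rebuild_abstract(sample))
```

Applied to the complete set of rows, the same function should reproduce the abstract as stored in the record, since every position from 0 to 199 appears exactly once.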