The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2306.17759
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
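The two modifications named in the abstract (centering the Softmax output at identity, and scaling the logits by a width-dependent temperature) can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' code: the form `I + softmax(logits / tau) - (1/m) * ones`, the function names, the toy logits, and the choice `tau = tau0 * sqrt(width)` are all assumptions made for illustration.

```python
import math


def softmax_row(logits, tau):
    """Numerically stable softmax of one row of logits at temperature tau."""
    scaled = [x / tau for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shaped_attention(logits, tau):
    """Hypothetical shaped attention matrix: I + softmax(logits / tau) - 1/m.

    Subtracting the uniform matrix (every entry 1/m) centers the softmax
    output, so that for large tau the result stays close to the identity.
    """
    m = len(logits)
    out = []
    for i, row in enumerate(logits):
        probs = softmax_row(row, tau)
        out.append([(1.0 if i == j else 0.0) + p - 1.0 / m
                    for j, p in enumerate(probs)])
    return out


# Toy 3-token example; the logits and the temperature rule are assumptions.
logits = [[0.3, -1.2, 0.5], [1.0, 0.0, -0.7], [-0.4, 0.9, 0.2]]
width = 4096
tau0 = 1.0
tau = tau0 * math.sqrt(width)  # width-dependent temperature (assumed form)
A = shaped_attention(logits, tau)
for row in A:
    print([round(a, 4) for a in row])
```

In this sketch each row of the result still sums to one, and a larger width pushes the matrix closer to the identity, which is the sense in which the Softmax output is "centered at identity".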
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2306.17759 (PDF: https://arxiv.org/pdf/2306.17759)
- OA Status: green
- Cited By: 3
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4383047452
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4383047452 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2306.17759 (Digital Object Identifier)
- Title: The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-06-30 (full publication date if available)
- Authors: Lorenzo Noci, C. F. Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy (list of authors in order)
- Landing page: https://arxiv.org/abs/2306.17759 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2306.17759 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2306.17759 (direct OA link when available)
- Concepts: Softmax function, Mathematics, Covariance, Covariance matrix, Applied mathematics, Statistical physics, Transformer, Mathematical analysis, Computer science, Physics, Statistics, Artificial intelligence, Quantum mechanics, Artificial neural network, Voltage (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 3 (total citation count in OpenAlex)
- Citations by year (recent): 2024: 3 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4383047452 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2306.17759 |
| ids.doi | https://doi.org/10.48550/arxiv.2306.17759 |
| ids.openalex | https://openalex.org/W4383047452 |
| fwci | |
| type | preprint |
| title | The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11206 |
| topics[0].field.id | https://openalex.org/fields/31 |
| topics[0].field.display_name | Physics and Astronomy |
| topics[0].score | 0.9869999885559082 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3109 |
| topics[0].subfield.display_name | Statistical and Nonlinear Physics |
| topics[0].display_name | Model Reduction and Neural Networks |
| topics[1].id | https://openalex.org/T10320 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9800999760627747 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Neural Networks and Applications |
| topics[2].id | https://openalex.org/T11612 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9746999740600586 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Stochastic Gradient Optimization Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C188441871 |
| concepts[0].level | 3 |
| concepts[0].score | 0.6185610294342041 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q7554146 |
| concepts[0].display_name | Softmax function |
| concepts[1].id | https://openalex.org/C33923547 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5807034373283386 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[1].display_name | Mathematics |
| concepts[2].id | https://openalex.org/C178650346 |
| concepts[2].level | 2 |
| concepts[2].score | 0.5767205357551575 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q201984 |
| concepts[2].display_name | Covariance |
| concepts[3].id | https://openalex.org/C185142706 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4846058785915375 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1134404 |
| concepts[3].display_name | Covariance matrix |
| concepts[4].id | https://openalex.org/C28826006 |
| concepts[4].level | 1 |
| concepts[4].score | 0.47024860978126526 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q33521 |
| concepts[4].display_name | Applied mathematics |
| concepts[5].id | https://openalex.org/C121864883 |
| concepts[5].level | 1 |
| concepts[5].score | 0.4501030445098877 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q677916 |
| concepts[5].display_name | Statistical physics |
| concepts[6].id | https://openalex.org/C66322947 |
| concepts[6].level | 3 |
| concepts[6].score | 0.4328423738479614 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[6].display_name | Transformer |
| concepts[7].id | https://openalex.org/C134306372 |
| concepts[7].level | 1 |
| concepts[7].score | 0.37716376781463623 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q7754 |
| concepts[7].display_name | Mathematical analysis |
| concepts[8].id | https://openalex.org/C41008148 |
| concepts[8].level | 0 |
| concepts[8].score | 0.298976868391037 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[8].display_name | Computer science |
| concepts[9].id | https://openalex.org/C121332964 |
| concepts[9].level | 0 |
| concepts[9].score | 0.23081916570663452 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[9].display_name | Physics |
| concepts[10].id | https://openalex.org/C105795698 |
| concepts[10].level | 1 |
| concepts[10].score | 0.21958240866661072 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q12483 |
| concepts[10].display_name | Statistics |
| concepts[11].id | https://openalex.org/C154945302 |
| concepts[11].level | 1 |
| concepts[11].score | 0.1439715027809143 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[11].display_name | Artificial intelligence |
| concepts[12].id | https://openalex.org/C62520636 |
| concepts[12].level | 1 |
| concepts[12].score | 0.131077378988266 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[12].display_name | Quantum mechanics |
| concepts[13].id | https://openalex.org/C50644808 |
| concepts[13].level | 2 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q192776 |
| concepts[13].display_name | Artificial neural network |
| concepts[14].id | https://openalex.org/C165801399 |
| concepts[14].level | 2 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[14].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/softmax-function |
| keywords[0].score | 0.6185610294342041 |
| keywords[0].display_name | Softmax function |
| keywords[1].id | https://openalex.org/keywords/mathematics |
| keywords[1].score | 0.5807034373283386 |
| keywords[1].display_name | Mathematics |
| keywords[2].id | https://openalex.org/keywords/covariance |
| keywords[2].score | 0.5767205357551575 |
| keywords[2].display_name | Covariance |
| keywords[3].id | https://openalex.org/keywords/covariance-matrix |
| keywords[3].score | 0.4846058785915375 |
| keywords[3].display_name | Covariance matrix |
| keywords[4].id | https://openalex.org/keywords/applied-mathematics |
| keywords[4].score | 0.47024860978126526 |
| keywords[4].display_name | Applied mathematics |
| keywords[5].id | https://openalex.org/keywords/statistical-physics |
| keywords[5].score | 0.4501030445098877 |
| keywords[5].display_name | Statistical physics |
| keywords[6].id | https://openalex.org/keywords/transformer |
| keywords[6].score | 0.4328423738479614 |
| keywords[6].display_name | Transformer |
| keywords[7].id | https://openalex.org/keywords/mathematical-analysis |
| keywords[7].score | 0.37716376781463623 |
| keywords[7].display_name | Mathematical analysis |
| keywords[8].id | https://openalex.org/keywords/computer-science |
| keywords[8].score | 0.298976868391037 |
| keywords[8].display_name | Computer science |
| keywords[9].id | https://openalex.org/keywords/physics |
| keywords[9].score | 0.23081916570663452 |
| keywords[9].display_name | Physics |
| keywords[10].id | https://openalex.org/keywords/statistics |
| keywords[10].score | 0.21958240866661072 |
| keywords[10].display_name | Statistics |
| keywords[11].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[11].score | 0.1439715027809143 |
| keywords[11].display_name | Artificial intelligence |
| keywords[12].id | https://openalex.org/keywords/quantum-mechanics |
| keywords[12].score | 0.131077378988266 |
| keywords[12].display_name | Quantum mechanics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2306.17759 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2306.17759 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2306.17759 |
| locations[1].id | doi:10.48550/arxiv.2306.17759 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2306.17759 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5090198684 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Lorenzo Noci |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Noci, Lorenzo |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5111003170 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | C. F. Li |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Li, Chuning |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5039934689 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-4427-3118 |
| authorships[2].author.display_name | Mufan Bill Li |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Li, Mufan Bill |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5031304561 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Bobby He |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | He, Bobby |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5045413165 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-4057-7165 |
| authorships[4].author.display_name | Thomas Hofmann |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Hofmann, Thomas |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5054711904 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Chris J. Maddison |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Maddison, Chris |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5110275739 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Daniel M. Roy |
| authorships[6].author_position | last |
| authorships[6].raw_author_name | Roy, Daniel M. |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2306.17759 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-07-04T00:00:00 |
| display_name | The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11206 |
| primary_topic.field.id | https://openalex.org/fields/31 |
| primary_topic.field.display_name | Physics and Astronomy |
| primary_topic.score | 0.9869999885559082 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3109 |
| primary_topic.subfield.display_name | Statistical and Nonlinear Physics |
| primary_topic.display_name | Model Reduction and Neural Networks |
| related_works | https://openalex.org/W3107204728, https://openalex.org/W4287591324, https://openalex.org/W4226420367, https://openalex.org/W2980176872, https://openalex.org/W2962876041, https://openalex.org/W3090555870, https://openalex.org/W3108503355, https://openalex.org/W2886934452, https://openalex.org/W1489099099, https://openalex.org/W2024369332 |
| cited_by_count | 3 |
| counts_by_year[0].year | 2024 |
| counts_by_year[0].cited_by_count | 3 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2306.17759 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2306.17759 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2306.17759 |
| primary_location.id | pmh:oai:arXiv.org:2306.17759 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2306.17759 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2306.17759 |
| publication_date | 2023-06-30 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 12, 31, 57, 69, 92, 130, 168 |
| abstract_inverted_index.In | 0 |
| abstract_inverted_index.To | 67 |
| abstract_inverted_index.We | 45, 96, 177 |
| abstract_inverted_index.as | 11 |
| abstract_inverted_index.at | 48, 84 |
| abstract_inverted_index.be | 54, 118 |
| abstract_inverted_index.by | 20, 56, 63, 79, 91 |
| abstract_inverted_index.in | 39, 155 |
| abstract_inverted_index.is | 77, 138 |
| abstract_inverted_index.of | 7, 23, 30, 43, 100, 111, 124, 129, 152, 172 |
| abstract_inverted_index.to | 14 |
| abstract_inverted_index.we | 25, 160 |
| abstract_inverted_index.SDE | 132, 166 |
| abstract_inverted_index.The | 127 |
| abstract_inverted_index.aid | 123 |
| abstract_inverted_index.and | 86, 115, 145 |
| abstract_inverted_index.can | 53, 117 |
| abstract_inverted_index.for | 141, 183 |
| abstract_inverted_index.how | 108 |
| abstract_inverted_index.the | 4, 8, 16, 21, 27, 40, 50, 64, 73, 81, 88, 98, 101, 104, 109, 113, 122, 135, 149, 165, 173, 179 |
| abstract_inverted_index.SDE, | 106 |
| abstract_inverted_index.both | 112 |
| abstract_inverted_index.coin | 178 |
| abstract_inverted_index.deep | 1, 156 |
| abstract_inverted_index.even | 140 |
| abstract_inverted_index.good | 170 |
| abstract_inverted_index.name | 180 |
| abstract_inverted_index.rank | 153 |
| abstract_inverted_index.show | 46 |
| abstract_inverted_index.skip | 37 |
| abstract_inverted_index.that | 47, 134, 164 |
| abstract_inverted_index.thus | 147 |
| abstract_inverted_index.very | 142 |
| abstract_inverted_index.with | 36, 121 |
| abstract_inverted_index.(SDE) | 61 |
| abstract_inverted_index.depth | 144 |
| abstract_inverted_index.drift | 114 |
| abstract_inverted_index.large | 143 |
| abstract_inverted_index.limit | 42 |
| abstract_inverted_index.model | 35 |
| abstract_inverted_index.proxy | 13 |
| abstract_inverted_index.scale | 110 |
| abstract_inverted_index.show, | 161 |
| abstract_inverted_index.study | 26 |
| abstract_inverted_index.these | 184 |
| abstract_inverted_index.issues | 151 |
| abstract_inverted_index.limit, | 72 |
| abstract_inverted_index.logits | 90 |
| abstract_inverted_index.matrix | 6, 29 |
| abstract_inverted_index.model. | 176 |
| abstract_inverted_index.output | 83 |
| abstract_inverted_index.ratio. | 66 |
| abstract_inverted_index.serves | 10 |
| abstract_inverted_index.shaped | 181 |
| abstract_inverted_index.stable | 131 |
| abstract_inverted_index.width, | 146 |
| abstract_inverted_index.Softmax | 82, 89 |
| abstract_inverted_index.achieve | 68 |
| abstract_inverted_index.examine | 15, 97 |
| abstract_inverted_index.implies | 133 |
| abstract_inverted_index.indexed | 62 |
| abstract_inverted_index.models. | 158 |
| abstract_inverted_index.network | 102 |
| abstract_inverted_index.scaling | 87 |
| abstract_inverted_index.showing | 107 |
| abstract_inverted_index.success | 22 |
| abstract_inverted_index.theory, | 3 |
| abstract_inverted_index.through | 103, 162 |
| abstract_inverted_index.Finally, | 159 |
| abstract_inverted_index.equation | 60 |
| abstract_inverted_index.learning | 2 |
| abstract_inverted_index.limiting | 51 |
| abstract_inverted_index.modified | 32, 78 |
| abstract_inverted_index.provides | 167 |
| abstract_inverted_index.residual | 125 |
| abstract_inverted_index.Motivated | 19 |
| abstract_inverted_index.attention | 34, 75, 157 |
| abstract_inverted_index.centering | 80 |
| abstract_inverted_index.described | 55 |
| abstract_inverted_index.diffusion | 116 |
| abstract_inverted_index.elegantly | 119 |
| abstract_inverted_index.existence | 128 |
| abstract_inverted_index.identity, | 85 |
| abstract_inverted_index.mechanism | 76 |
| abstract_inverted_index.network's | 17 |
| abstract_inverted_index.notorious | 150 |
| abstract_inverted_index.stability | 99 |
| abstract_inverted_index.structure | 137 |
| abstract_inverted_index.controlled | 120 |
| abstract_inverted_index.covariance | 5, 28, 136 |
| abstract_inverted_index.degeneracy | 154 |
| abstract_inverted_index.parameter. | 95 |
| abstract_inverted_index.preventing | 148 |
| abstract_inverted_index.stochastic | 58, 71 |
| abstract_inverted_index.Transformer | 182 |
| abstract_inverted_index.connections | 38 |
| abstract_inverted_index.description | 171 |
| abstract_inverted_index.finite-size | 175 |
| abstract_inverted_index.temperature | 94 |
| abstract_inverted_index.connections. | 126 |
| abstract_inverted_index.differential | 59 |
| abstract_inverted_index.distribution | 52 |
| abstract_inverted_index.proportional | 41 |
| abstract_inverted_index.simulations, | 163 |
| abstract_inverted_index.surprisingly | 169 |
| abstract_inverted_index.well-defined | 70 |
| abstract_inverted_index.Softmax-based | 33 |
| abstract_inverted_index.Transformer's | 74 |
| abstract_inverted_index.Transformers, | 24 |
| abstract_inverted_index.architectural | 185 |
| abstract_inverted_index.corresponding | 105, 174 |
| abstract_inverted_index.trainability. | 18 |
| abstract_inverted_index.well-behaved, | 139 |
| abstract_inverted_index.depth-to-width | 65 |
| abstract_inverted_index.initialization | 49 |
| abstract_inverted_index.modifications. | 186 |
| abstract_inverted_index.representations | 9 |
| abstract_inverted_index.width-dependent | 93 |
| abstract_inverted_index.infinite-depth-and-width. | 44 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile | |
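The `abstract_inverted_index.*` rows in the payload map each word of the abstract to the positions where it occurs. A minimal sketch of how such an index can be turned back into running text is given below; the function name `rebuild_abstract` and the `sample` dictionary are illustrative, and the sample keeps only positions 0 to 6 from the rows in the table.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a mapping of word -> list of word positions."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt of the index above, restricted to positions 0 to 6.
sample = {
    "In": [0],
    "deep": [1],
    "learning": [2],
    "theory,": [3],
    "the": [4],
    "covariance": [5],
    "matrix": [6],
}

print(rebuild_abstract(sample))
```

Applied to the full set of rows, the same loop should reproduce the abstract shown near the top of this record, since punctuation is stored as part of each word.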