How Smooth Is Attention?
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2312.14820
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robustness and expressive power - is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
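The local Lipschitz constant discussed in the abstract can be probed numerically. The sketch below is not the authors' code: it assumes a single attention head with identity query, key and value matrices and a $1/\sqrt{d}$ score scaling, and the helper names (`self_attention`, `local_lipschitz_lower_bound`) are hypothetical. It perturbs an input along random directions and records the largest observed ratio of output change to input change in Frobenius norm. Because only a few random directions are tried, the result is a lower bound on the local Lipschitz constant at that input, not its exact value.

```python
import math
import random


def self_attention(X):
    """Single-head self-attention on a list of n token vectors.

    Assumption: query, key and value matrices are the identity,
    and scores are scaled by 1/sqrt(d).
    """
    d = len(X[0])
    out = []
    for xi in X:
        scores = [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(d) for xj in X]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * xj[k] for w, xj in zip(weights, X)) / total for k in range(d)]
        )
    return out


def frobenius(A):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))


def local_lipschitz_lower_bound(X, trials=50, eps=1e-5, seed=0):
    """Largest finite-difference ratio ||f(X + eps*H) - f(X)|| / ||eps*H||
    over random Gaussian directions H (a lower bound on the local constant)."""
    rng = random.Random(seed)
    fX = self_attention(X)
    best = 0.0
    for _ in range(trials):
        H = [[rng.gauss(0.0, 1.0) for _ in row] for row in X]
        Xp = [[x + eps * h for x, h in zip(rx, rh)] for rx, rh in zip(X, H)]
        fXp = self_attention(Xp)
        diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(fXp, fX)]
        best = max(best, frobenius(diff) / (eps * frobenius(H)))
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (2, 8, 32):
        X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)]
        print(n, local_lipschitz_lower_bound(X))
```

With a single token the softmax weight is 1 and the map reduces to the identity, so the estimate is 1 up to rounding error, which is a convenient sanity check. Random Gaussian inputs are not worst-case inputs, so this probe should not be expected to reproduce the $\sqrt{n}$ behaviour stated in the abstract.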
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2312.14820
- PDF: https://arxiv.org/pdf/2312.14820
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4390215368
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4390215368 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2312.14820 (Digital Object Identifier)
- Title: How Smooth Is Attention? (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-12-22 (full publication date if available)
- Authors: Valérie Castin, Pierre Ablin, Gabriel Peyré (list of authors in order)
- Landing page: https://arxiv.org/abs/2312.14820 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2312.14820 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2312.14820 (direct OA link when available)
- Concepts: Lipschitz continuity, Robustness (evolution), Upper and lower bounds, Computer science, Probability measure, Mathematics, Measure (data warehouse), Constant (computer programming), Artificial neural network, Mathematical optimization, Artificial intelligence, Discrete mathematics, Mathematical analysis, Gene, Biochemistry, Programming language, Database, Chemistry (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4390215368 |
| doi | https://doi.org/10.48550/arxiv.2312.14820 |
| ids.doi | https://doi.org/10.48550/arxiv.2312.14820 |
| ids.openalex | https://openalex.org/W4390215368 |
| fwci | |
| type | preprint |
| title | How Smooth Is Attention? |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11689 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9998000264167786 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Adversarial Robustness in Machine Learning |
| topics[1].id | https://openalex.org/T11612 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9850999712944031 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Stochastic Gradient Optimization Techniques |
| topics[2].id | https://openalex.org/T10775 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9779000282287598 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Generative Adversarial Networks and Image Synthesis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C22324862 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8446624279022217 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q652707 |
| concepts[0].display_name | Lipschitz continuity |
| concepts[1].id | https://openalex.org/C63479239 |
| concepts[1].level | 3 |
| concepts[1].score | 0.7057240009307861 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q7353546 |
| concepts[1].display_name | Robustness (evolution) |
| concepts[2].id | https://openalex.org/C77553402 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6179380416870117 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q13222579 |
| concepts[2].display_name | Upper and lower bounds |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.5644579529762268 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C21031990 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4899963438510895 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q355020 |
| concepts[4].display_name | Probability measure |
| concepts[5].id | https://openalex.org/C33923547 |
| concepts[5].level | 0 |
| concepts[5].score | 0.45121151208877563 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[5].display_name | Mathematics |
| concepts[6].id | https://openalex.org/C2780009758 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4502890408039093 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q6804172 |
| concepts[6].display_name | Measure (data warehouse) |
| concepts[7].id | https://openalex.org/C2777027219 |
| concepts[7].level | 2 |
| concepts[7].score | 0.4218902289867401 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q1284190 |
| concepts[7].display_name | Constant (computer programming) |
| concepts[8].id | https://openalex.org/C50644808 |
| concepts[8].level | 2 |
| concepts[8].score | 0.4146007001399994 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q192776 |
| concepts[8].display_name | Artificial neural network |
| concepts[9].id | https://openalex.org/C126255220 |
| concepts[9].level | 1 |
| concepts[9].score | 0.3433939814567566 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q141495 |
| concepts[9].display_name | Mathematical optimization |
| concepts[10].id | https://openalex.org/C154945302 |
| concepts[10].level | 1 |
| concepts[10].score | 0.21924886107444763 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[10].display_name | Artificial intelligence |
| concepts[11].id | https://openalex.org/C118615104 |
| concepts[11].level | 1 |
| concepts[11].score | 0.19924893975257874 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q121416 |
| concepts[11].display_name | Discrete mathematics |
| concepts[12].id | https://openalex.org/C134306372 |
| concepts[12].level | 1 |
| concepts[12].score | 0.13576307892799377 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7754 |
| concepts[12].display_name | Mathematical analysis |
| concepts[13].id | https://openalex.org/C104317684 |
| concepts[13].level | 2 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q7187 |
| concepts[13].display_name | Gene |
| concepts[14].id | https://openalex.org/C55493867 |
| concepts[14].level | 1 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q7094 |
| concepts[14].display_name | Biochemistry |
| concepts[15].id | https://openalex.org/C199360897 |
| concepts[15].level | 1 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[15].display_name | Programming language |
| concepts[16].id | https://openalex.org/C77088390 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q8513 |
| concepts[16].display_name | Database |
| concepts[17].id | https://openalex.org/C185592680 |
| concepts[17].level | 0 |
| concepts[17].score | 0.0 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[17].display_name | Chemistry |
| keywords[0].id | https://openalex.org/keywords/lipschitz-continuity |
| keywords[0].score | 0.8446624279022217 |
| keywords[0].display_name | Lipschitz continuity |
| keywords[1].id | https://openalex.org/keywords/robustness |
| keywords[1].score | 0.7057240009307861 |
| keywords[1].display_name | Robustness (evolution) |
| keywords[2].id | https://openalex.org/keywords/upper-and-lower-bounds |
| keywords[2].score | 0.6179380416870117 |
| keywords[2].display_name | Upper and lower bounds |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.5644579529762268 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/probability-measure |
| keywords[4].score | 0.4899963438510895 |
| keywords[4].display_name | Probability measure |
| keywords[5].id | https://openalex.org/keywords/mathematics |
| keywords[5].score | 0.45121151208877563 |
| keywords[5].display_name | Mathematics |
| keywords[6].id | https://openalex.org/keywords/measure |
| keywords[6].score | 0.4502890408039093 |
| keywords[6].display_name | Measure (data warehouse) |
| keywords[7].id | https://openalex.org/keywords/constant |
| keywords[7].score | 0.4218902289867401 |
| keywords[7].display_name | Constant (computer programming) |
| keywords[8].id | https://openalex.org/keywords/artificial-neural-network |
| keywords[8].score | 0.4146007001399994 |
| keywords[8].display_name | Artificial neural network |
| keywords[9].id | https://openalex.org/keywords/mathematical-optimization |
| keywords[9].score | 0.3433939814567566 |
| keywords[9].display_name | Mathematical optimization |
| keywords[10].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[10].score | 0.21924886107444763 |
| keywords[10].display_name | Artificial intelligence |
| keywords[11].id | https://openalex.org/keywords/discrete-mathematics |
| keywords[11].score | 0.19924893975257874 |
| keywords[11].display_name | Discrete mathematics |
| keywords[12].id | https://openalex.org/keywords/mathematical-analysis |
| keywords[12].score | 0.13576307892799377 |
| keywords[12].display_name | Mathematical analysis |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2312.14820 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2312.14820 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2312.14820 |
| locations[1].id | doi:10.48550/arxiv.2312.14820 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2312.14820 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5093581632 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Valérie Castin |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Castin, Valérie |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5042340163 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4277-5202 |
| authorships[1].author.display_name | Pierre Ablin |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ablin, Pierre |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5058651667 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-4477-0387 |
| authorships[2].author.display_name | Gabriel Peyré |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Peyré, Gabriel |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2312.14820 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-12-26T00:00:00 |
| display_name | How Smooth Is Attention? |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11689 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9998000264167786 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Adversarial Robustness in Machine Learning |
| related_works | https://openalex.org/W3185235544, https://openalex.org/W2897842840, https://openalex.org/W2911623553, https://openalex.org/W4297791327, https://openalex.org/W2397777611, https://openalex.org/W2052725501, https://openalex.org/W3084156746, https://openalex.org/W4287670495, https://openalex.org/W2158905189, https://openalex.org/W4319874546 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2312.14820 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2312.14820 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2312.14820 |
| primary_location.id | pmh:oai:arXiv.org:2312.14820 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2312.14820 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2312.14820 |
| publication_date | 2023-12-22 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.- | 24, 37 |
| abstract_inverted_index.a | 42, 102, 144 |
| abstract_inverted_index.In | 77 |
| abstract_inverted_index.We | 40 |
| abstract_inverted_index.an | 140 |
| abstract_inverted_index.as | 134 |
| abstract_inverted_index.at | 5 |
| abstract_inverted_index.be | 128 |
| abstract_inverted_index.by | 98 |
| abstract_inverted_index.in | 18, 51, 87 |
| abstract_inverted_index.is | 38, 96, 109, 120, 159 |
| abstract_inverted_index.it | 29 |
| abstract_inverted_index.of | 8, 16, 20, 45, 49, 58, 71, 84, 94, 151, 162 |
| abstract_inverted_index.on | 66, 167 |
| abstract_inverted_index.to | 31, 101, 127, 133 |
| abstract_inverted_index.up | 100 |
| abstract_inverted_index.we | 79, 131, 138 |
| abstract_inverted_index.$n$ | 62, 86, 119 |
| abstract_inverted_index.Our | 153, 165 |
| abstract_inverted_index.and | 1, 34, 63, 74, 105, 143, 161, 169, 173 |
| abstract_inverted_index.any | 88 |
| abstract_inverted_index.are | 4, 26, 149 |
| abstract_inverted_index.for | 82, 111, 123, 156 |
| abstract_inverted_index.its | 21 |
| abstract_inverted_index.key | 27 |
| abstract_inverted_index.our | 13, 176 |
| abstract_inverted_index.the | 6, 46, 56, 59, 67, 91, 116, 124, 135 |
| abstract_inverted_index.too | 121 |
| abstract_inverted_index.$n$. | 152 |
| abstract_inverted_index.BERT | 172 |
| abstract_inverted_index.When | 115 |
| abstract_inverted_index.both | 72 |
| abstract_inverted_index.set, | 90 |
| abstract_inverted_index.show | 80 |
| abstract_inverted_index.that | 81, 106 |
| abstract_inverted_index.this | 107 |
| abstract_inverted_index.when | 28 |
| abstract_inverted_index.GPT-2 | 174 |
| abstract_inverted_index.bound | 108, 126, 142, 147 |
| abstract_inverted_index.comes | 30 |
| abstract_inverted_index.heart | 7 |
| abstract_inverted_index.large | 122 |
| abstract_inverted_index.layer | 64 |
| abstract_inverted_index.local | 68 |
| abstract_inverted_index.lower | 146 |
| abstract_inverted_index.novel | 160 |
| abstract_inverted_index.power | 36 |
| abstract_inverted_index.refer | 132 |
| abstract_inverted_index.study | 44 |
| abstract_inverted_index.tight | 110 |
| abstract_inverted_index.upper | 141 |
| abstract_inverted_index.which | 25, 130, 148 |
| abstract_inverted_index.Still, | 12 |
| abstract_inverted_index.factor | 104 |
| abstract_inverted_index.impact | 57 |
| abstract_inverted_index.inputs | 83 |
| abstract_inverted_index.length | 61, 85, 118 |
| abstract_inverted_index.masked | 2, 75, 157 |
| abstract_inverted_index.tight, | 129 |
| abstract_inverted_index.bounded | 97 |
| abstract_inverted_index.compact | 89 |
| abstract_inverted_index.provide | 41, 139 |
| abstract_inverted_index.regime, | 137 |
| abstract_inverted_index.several | 52 |
| abstract_inverted_index.support | 175 |
| abstract_inverted_index.constant | 48, 70, 93, 103 |
| abstract_inverted_index.detailed | 43 |
| abstract_inverted_index.lengths. | 114 |
| abstract_inverted_index.matching | 145 |
| abstract_inverted_index.previous | 125 |
| abstract_inverted_index.randomly | 170 |
| abstract_inverted_index.sequence | 60, 113, 117 |
| abstract_inverted_index.success. | 11 |
| abstract_inverted_index.unmasked | 73 |
| abstract_inverted_index.Lipschitz | 22, 47, 69, 92 |
| abstract_inverted_index.analyzing | 32 |
| abstract_inverted_index.findings. | 178 |
| abstract_inverted_index.framework | 155 |
| abstract_inverted_index.interest. | 164 |
| abstract_inverted_index.practical | 53 |
| abstract_inverted_index.$\sqrt{n}$ | 99 |
| abstract_inverted_index.attention, | 17 |
| abstract_inverted_index.discussing | 55 |
| abstract_inverted_index.expressive | 35 |
| abstract_inverted_index.mean-field | 136, 154 |
| abstract_inverted_index.particular | 19 |
| abstract_inverted_index.pretrained | 168 |
| abstract_inverted_index.properties | 23 |
| abstract_inverted_index.reasonable | 112 |
| abstract_inverted_index.robustness | 33 |
| abstract_inverted_index.scenarios, | 54 |
| abstract_inverted_index.experiments | 166 |
| abstract_inverted_index.incomplete. | 39 |
| abstract_inverted_index.independent | 150, 163 |
| abstract_inverted_index.initialized | 171 |
| abstract_inverted_index.outstanding | 10 |
| abstract_inverted_index.particular, | 78 |
| abstract_inverted_index.theoretical | 177 |
| abstract_inverted_index.mathematical | 14 |
| abstract_inverted_index.Transformers' | 9 |
| abstract_inverted_index.normalization | 65 |
| abstract_inverted_index.understanding | 15 |
| abstract_inverted_index.Self-attention | 0 |
| abstract_inverted_index.self-attention | 3, 50, 95, 158 |
| abstract_inverted_index.self-attention. | 76 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile |
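
The `abstract_inverted_index.*` rows above store the abstract as a map from each word to the positions where it occurs. A minimal sketch of how such an index can be turned back into running text follows. The function name `rebuild_abstract` is hypothetical, and `sample` is a hand-copied subset of the rows above covering only positions 0 to 11, not the full payload.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {word: [positions]} inverted index."""
    positions = {}
    for word, indices in inverted_index.items():
        for i in indices:
            positions[i] = word
    # Emit the words in increasing position order, separated by spaces.
    return " ".join(positions[i] for i in sorted(positions))


# Subset of the table rows above, restricted to positions 0..11.
sample = {
    "Self-attention": [0],
    "and": [1],
    "masked": [2],
    "self-attention": [3],
    "are": [4],
    "at": [5],
    "the": [6],
    "heart": [7],
    "of": [8],
    "Transformers'": [9],
    "outstanding": [10],
    "success.": [11],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample))
    # prints: Self-attention and masked self-attention are at the heart of Transformers' outstanding success.
```

Sorting by position rather than by word is what restores the original order. Repeated words such as "of" simply contribute several positions to the same dictionary.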