Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2401.10536
Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows to aggregate multi-scale emotion features for speech emotion recognition (SER), called Speech Swin-Transformer. Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, each composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is used to explore local inter-frame emotional information across the frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of the Transformer from frame level to segment level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
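The three operations named in the abstract (local windows over frame patches, shifted windows, and patch merging) can be illustrated with a minimal sketch of the index bookkeeping along the time axis. This is not the authors' implementation: the function names, the window size, the half-window shift, and merging by concatenating adjacent patches are all assumptions borrowed from the general Swin design, and the attention computation itself is omitted.

```python
from typing import List

# Hypothetical representation: one frame patch = one feature vector.
Frame = List[float]


def window_partition(frames: List[Frame], window: int) -> List[List[Frame]]:
    """Split a time-ordered sequence of frame patches into non-overlapping windows."""
    if window <= 0 or len(frames) % window != 0:
        raise ValueError("sequence length must be a positive multiple of the window size")
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def cyclic_shift(frames: List[Frame], shift: int) -> List[Frame]:
    """Roll the sequence left by `shift` so that re-partitioned windows straddle
    the previous window boundaries. (Assumption: in the original Swin design the
    wrapped-around positions are additionally masked during attention; that
    masking is not shown here.)"""
    if not frames:
        return []
    shift %= len(frames)
    return frames[shift:] + frames[:shift]


def patch_merge(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` adjacent patches into one coarser patch,
    shortening the sequence and widening each patch's temporal extent."""
    if factor <= 0 or len(frames) % factor != 0:
        raise ValueError("sequence length must be a positive multiple of the merge factor")
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, len(frames), factor)
    ]


# Toy input: 8 frame patches with a single feature each (the time index itself).
frames = [[float(t)] for t in range(8)]
regular_windows = window_partition(frames, 4)                    # boundary between t=3 and t=4
shifted_windows = window_partition(cyclic_shift(frames, 2), 4)   # first window covers t=2..5
merged = patch_merge(frames, 2)                                  # 4 patches, 2 features each
```

In the toy input, frames 3 and 4 fall into different regular windows but share the first shifted window, which is the boundary compensation the abstract describes. Merging halves the sequence length, so a window of the same size at the next stage spans twice as many original frames.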
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2401.10536, https://arxiv.org/pdf/2401.10536
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4391124075
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4391124075 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2401.10536 (Digital Object Identifier)
- Title: Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-01-19 (full publication date if available)
- Authors: Yong Wang, Cheng Lu, Hailun Lian, Yan Zhao, Björn W. Schuller, Yuan Zong, Wenming Zheng (list of authors in order)
- Landing page: https://arxiv.org/abs/2401.10536 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2401.10536 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2401.10536 (direct OA link when available)
- Concepts: Transformer, Computer science, Speech recognition, Phrase, Utterance, Artificial intelligence, Natural language processing, Engineering, Voltage, Electrical engineering (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2024: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4391124075 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2401.10536 |
| ids.doi | https://doi.org/10.48550/arxiv.2401.10536 |
| ids.openalex | https://openalex.org/W4391124075 |
| fwci | |
| type | preprint |
| title | Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10860 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9865000247955322 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Speech and Audio Processing |
| topics[1].id | https://openalex.org/T10667 |
| topics[1].field.id | https://openalex.org/fields/32 |
| topics[1].field.display_name | Psychology |
| topics[1].score | 0.9807000160217285 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/3205 |
| topics[1].subfield.display_name | Experimental and Cognitive Psychology |
| topics[1].display_name | Emotion and Mood Recognition |
| topics[2].id | https://openalex.org/T10201 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9775000214576721 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Speech Recognition and Synthesis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C66322947 |
| concepts[0].level | 3 |
| concepts[0].score | 0.6594070792198181 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[0].display_name | Transformer |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6491532325744629 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C28490314 |
| concepts[2].level | 1 |
| concepts[2].score | 0.571292519569397 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[2].display_name | Speech recognition |
| concepts[3].id | https://openalex.org/C2776224158 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4181937873363495 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q187931 |
| concepts[3].display_name | Phrase |
| concepts[4].id | https://openalex.org/C2775852435 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4116540253162384 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q258403 |
| concepts[4].display_name | Utterance |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3871661424636841 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C204321447 |
| concepts[6].level | 1 |
| concepts[6].score | 0.36367470026016235 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[6].display_name | Natural language processing |
| concepts[7].id | https://openalex.org/C127413603 |
| concepts[7].level | 0 |
| concepts[7].score | 0.18623274564743042 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[7].display_name | Engineering |
| concepts[8].id | https://openalex.org/C165801399 |
| concepts[8].level | 2 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[8].display_name | Voltage |
| concepts[9].id | https://openalex.org/C119599485 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q43035 |
| concepts[9].display_name | Electrical engineering |
| keywords[0].id | https://openalex.org/keywords/transformer |
| keywords[0].score | 0.6594070792198181 |
| keywords[0].display_name | Transformer |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.6491532325744629 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/speech-recognition |
| keywords[2].score | 0.571292519569397 |
| keywords[2].display_name | Speech recognition |
| keywords[3].id | https://openalex.org/keywords/phrase |
| keywords[3].score | 0.4181937873363495 |
| keywords[3].display_name | Phrase |
| keywords[4].id | https://openalex.org/keywords/utterance |
| keywords[4].score | 0.4116540253162384 |
| keywords[4].display_name | Utterance |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.3871661424636841 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/natural-language-processing |
| keywords[6].score | 0.36367470026016235 |
| keywords[6].display_name | Natural language processing |
| keywords[7].id | https://openalex.org/keywords/engineering |
| keywords[7].score | 0.18623274564743042 |
| keywords[7].display_name | Engineering |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2401.10536 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2401.10536 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2401.10536 |
| locations[1].id | doi:10.48550/arxiv.2401.10536 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2401.10536 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100424416 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1572-068X |
| authorships[0].author.display_name | Yong Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wang, Yong |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5054796879 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1477-1020 |
| authorships[1].author.display_name | Cheng Lu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Lu, Cheng |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5103280008 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-0519-8568 |
| authorships[2].author.display_name | Hailun Lian |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Lian, Hailun |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100727732 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-4577-7078 |
| authorships[3].author.display_name | Yan Zhao |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zhao, Yan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5043060302 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-6478-8699 |
| authorships[4].author.display_name | Björn W. Schuller |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Schuller, Björn |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5027316177 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-0839-8792 |
| authorships[5].author.display_name | Yuan Zong |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zong, Yuan |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5029771864 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-7764-5179 |
| authorships[6].author.display_name | Wenming Zheng |
| authorships[6].author_position | last |
| authorships[6].raw_author_name | Zheng, Wenming |
| authorships[6].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2401.10536 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-01-23T00:00:00 |
| display_name | Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10860 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9865000247955322 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Speech and Audio Processing |
| related_works | https://openalex.org/W2529301793, https://openalex.org/W2384121599, https://openalex.org/W2038083449, https://openalex.org/W3177678247, https://openalex.org/W2333799855, https://openalex.org/W1999617572, https://openalex.org/W2944572343, https://openalex.org/W2351687372, https://openalex.org/W3016124757, https://openalex.org/W3034520363 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2024 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2401.10536 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2401.10536 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2401.10536 |
| primary_location.id | pmh:oai:arXiv.org:2401.10536 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2401.10536 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2401.10536 |
| publication_date | 2024-01-19 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 41, 87, 94, 118, 136 |
| abstract_inverted_index.In | 17 |
| abstract_inverted_index.by | 8, 149 |
| abstract_inverted_index.in | 5, 71, 92 |
| abstract_inverted_index.is | 22, 98 |
| abstract_inverted_index.of | 27, 76, 89, 109, 130, 154 |
| abstract_inverted_index.on | 15 |
| abstract_inverted_index.to | 48, 100, 122, 140, 158 |
| abstract_inverted_index.we | 62, 115, 134 |
| abstract_inverted_index.and | 33 |
| abstract_inverted_index.are | 83 |
| abstract_inverted_index.for | 53, 124, 145 |
| abstract_inverted_index.has | 1 |
| abstract_inverted_index.its | 10 |
| abstract_inverted_index.our | 164 |
| abstract_inverted_index.the | 65, 72, 128, 151, 169 |
| abstract_inverted_index.Swin | 90 |
| abstract_inverted_index.also | 116 |
| abstract_inverted_index.each | 110 |
| abstract_inverted_index.from | 156 |
| abstract_inverted_index.into | 68 |
| abstract_inverted_index.near | 127 |
| abstract_inverted_index.that | 163 |
| abstract_inverted_index.then | 84 |
| abstract_inverted_index.this | 38 |
| abstract_inverted_index.time | 73 |
| abstract_inverted_index.with | 45 |
| abstract_inverted_index.After | 113 |
| abstract_inverted_index.These | 80 |
| abstract_inverted_index.above | 36 |
| abstract_inverted_index.based | 14 |
| abstract_inverted_index.field | 153 |
| abstract_inverted_index.first | 63 |
| abstract_inverted_index.frame | 78, 107 |
| abstract_inverted_index.local | 95, 102 |
| abstract_inverted_index.paper | 39 |
| abstract_inverted_index.patch | 125, 137 |
| abstract_inverted_index.stack | 88 |
| abstract_inverted_index.that, | 114 |
| abstract_inverted_index.using | 86 |
| abstract_inverted_index.which | 93 |
| abstract_inverted_index.word, | 31 |
| abstract_inverted_index.(SER), | 57 |
| abstract_inverted_index.Speech | 59, 166 |
| abstract_inverted_index.across | 24, 106 |
| abstract_inverted_index.called | 58 |
| abstract_inverted_index.design | 117 |
| abstract_inverted_index.divide | 64 |
| abstract_inverted_index.employ | 135 |
| abstract_inverted_index.patch. | 112 |
| abstract_inverted_index.scales | 26 |
| abstract_inverted_index.speech | 18, 28, 43, 54, 66, 147 |
| abstract_inverted_index.vision | 7 |
| abstract_inverted_index.window | 96, 120 |
| abstract_inverted_index.Drawing | 35 |
| abstract_inverted_index.blocks, | 91 |
| abstract_inverted_index.domain, | 74 |
| abstract_inverted_index.e.\,g., | 30 |
| abstract_inverted_index.emotion | 51, 55 |
| abstract_inverted_index.encoded | 85 |
| abstract_inverted_index.explore | 101 |
| abstract_inverted_index.feature | 12 |
| abstract_inverted_index.merging | 138 |
| abstract_inverted_index.patches | 70, 82, 108 |
| abstract_inverted_index.phrase, | 32 |
| abstract_inverted_index.results | 161 |
| abstract_inverted_index.segment | 111, 131 |
| abstract_inverted_index.shifted | 46, 119 |
| abstract_inverted_index.success | 4 |
| abstract_inverted_index.windows | 47 |
| abstract_inverted_index.Finally, | 133 |
| abstract_inverted_index.composed | 75 |
| abstract_inverted_index.computer | 6 |
| abstract_inverted_index.features | 52, 144 |
| abstract_inverted_index.methods. | 171 |
| abstract_inverted_index.multiple | 77 |
| abstract_inverted_index.patches. | 79, 132 |
| abstract_inverted_index.presents | 40 |
| abstract_inverted_index.proposed | 165 |
| abstract_inverted_index.signals, | 19 |
| abstract_inverted_index.utilized | 99 |
| abstract_inverted_index.aggregate | 49, 141 |
| abstract_inverted_index.different | 25 |
| abstract_inverted_index.emotional | 20, 104, 143 |
| abstract_inverted_index.expanding | 150 |
| abstract_inverted_index.features, | 29 |
| abstract_inverted_index.operation | 139 |
| abstract_inverted_index.receptive | 152 |
| abstract_inverted_index.boundaries | 129 |
| abstract_inverted_index.compensate | 123 |
| abstract_inverted_index.leveraging | 9 |
| abstract_inverted_index.remarkable | 3 |
| abstract_inverted_index.utterance. | 34 |
| abstract_inverted_index.Transformer | 44, 97, 121, 155 |
| abstract_inverted_index.demonstrate | 162 |
| abstract_inverted_index.distributed | 23 |
| abstract_inverted_index.frame-level | 157 |
| abstract_inverted_index.information | 21, 105 |
| abstract_inverted_index.inter-frame | 103 |
| abstract_inverted_index.multi-scale | 50 |
| abstract_inverted_index.outperforms | 168 |
| abstract_inverted_index.recognition | 56 |
| abstract_inverted_index.spectrogram | 67 |
| abstract_inverted_index.Experimental | 160 |
| abstract_inverted_index.Transformer. | 16 |
| abstract_inverted_index.correlations | 126 |
| abstract_inverted_index.demonstrated | 2 |
| abstract_inverted_index.hierarchical | 11, 42, 146 |
| abstract_inverted_index.inspiration, | 37 |
| abstract_inverted_index.Specifically, | 61 |
| abstract_inverted_index.segment-level | 69, 81, 142 |
| abstract_inverted_index.representation | 13, 148 |
| abstract_inverted_index.segment-level. | 159 |
| abstract_inverted_index.Swin-Transformer | 0, 167 |
| abstract_inverted_index.state-of-the-art | 170 |
| abstract_inverted_index.Swin-Transformer. | 60 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/16 |
| sustainable_development_goals[0].score | 0.41999998688697815 |
| sustainable_development_goals[0].display_name | Peace, Justice and strong institutions |
| citation_normalized_percentile |
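
The `abstract_inverted_index.*` rows in the table map each token to the word positions where it occurs in the abstract. As a minimal sketch, assuming the rows have already been loaded into a dictionary from token to position list (the function name `rebuild_abstract` and the small sample dictionary are illustrative, not part of any OpenAlex client), the running text can be recovered by sorting tokens by position:

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild running text from a token -> positions mapping."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Join tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[p] for p in sorted(by_position))


# Positions 0-7 copied from the table rows above (later positions of "in"
# and "Swin-Transformer" are left out to keep the sample short).
sample = {
    "Swin-Transformer": [0],
    "has": [1],
    "demonstrated": [2],
    "remarkable": [3],
    "success": [4],
    "in": [5],
    "computer": [6],
    "vision": [7],
}
opening = rebuild_abstract(sample)
```

Punctuation stays attached to its token in the index (for example `Transformer.` and `features,`), so joining with spaces is enough to reproduce the original sentence breaks.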