What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2401.17632
Self-supervised learning (SSL) has attracted increasing attention for learning meaningful speech representations. Speech SSL models, such as WavLM, use masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives aimed primarily at speaker representation. Understanding how these models represent information is essential for improving model efficiency and effectiveness. While speech SSL has been analyzed extensively, there has been limited investigation into what information speaker SSL captures and how its representations differ from those of speech SSL or fully supervised speaker models. This paper addresses these fundamental questions. We explore the models' capacity to capture various speech properties by applying SUPERB probing tasks to speech and speaker SSL models. We also examine which layers are predominantly used for each task to identify differences in how speech is represented. Furthermore, we directly compare layers to measure their similarity within and across models. Our analysis reveals that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models appear to be partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation.
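The two layer-wise analyses mentioned in the abstract can be illustrated with a small sketch. It rests on assumptions that the abstract does not confirm: that per-task layer usage is read from softmax-normalized weights over layer outputs, as is common in SUPERB-style probing, and that layer similarity is measured with linear CKA, used here only as a stand-in because the abstract does not name the measure. The function names and the random features are hypothetical.

```python
import numpy as np


def softmax_layer_weights(logits: np.ndarray) -> np.ndarray:
    """Normalize one learnable scalar per layer into weights that sum to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def weighted_layer_sum(layers: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Combine features of shape (num_layers, frames, dim) into (frames, dim)."""
    weights = softmax_layer_weights(logits)
    return np.tensordot(weights, layers, axes=1)


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between feature matrices of shape (frames, dim_x) and (frames, dim_y)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-in for real model outputs: 4 layers, 50 frames, 8 dims.
rng = np.random.default_rng(0)
layers = rng.standard_normal((4, 50, 8))

# Untrained (all-zero) logits give uniform weights, i.e. a plain layer average.
pooled = weighted_layer_sum(layers, np.zeros(4))

# Pairwise layer similarity matrix; the same function also compares layers
# taken from two different models, provided they share the frame axis.
similarity = np.array([[linear_cka(a, b) for b in layers] for a in layers])
```

In a probing setup of this kind, the trained weights would indicate which layers a task relies on, and the similarity matrix would show which layers encode comparable information.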
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2401.17632
- PDF: https://arxiv.org/pdf/2401.17632
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4391462745
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4391462745
- DOI: https://doi.org/10.48550/arxiv.2401.17632
- Title: What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
- Type: preprint
- Language: en
- Publication year: 2024
- Publication date: 2024-01-31
- Authors (in order): Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima
- Landing page: https://arxiv.org/abs/2401.17632
- PDF URL: https://arxiv.org/pdf/2401.17632
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2401.17632
- Concepts: Layer (electronics), Computer science, Speech recognition, Speaker recognition, Natural language processing, Artificial intelligence, Chemistry, Organic chemistry
- Cited by: 0
- Related works (count): 10
Full payload
| id | https://openalex.org/W4391462745 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2401.17632 |
| ids.doi | https://doi.org/10.48550/arxiv.2401.17632 |
| ids.openalex | https://openalex.org/W4391462745 |
| fwci | |
| type | preprint |
| title | What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10201 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9653000235557556 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Speech Recognition and Synthesis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2779227376 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6043485403060913 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6505497 |
| concepts[0].display_name | Layer (electronics) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6015930771827698 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C28490314 |
| concepts[2].level | 1 |
| concepts[2].score | 0.527520477771759 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[2].display_name | Speech recognition |
| concepts[3].id | https://openalex.org/C133892786 |
| concepts[3].level | 2 |
| concepts[3].score | 0.42147624492645264 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1145189 |
| concepts[3].display_name | Speaker recognition |
| concepts[4].id | https://openalex.org/C204321447 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3856649696826935 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[4].display_name | Natural language processing |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.36971521377563477 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C185592680 |
| concepts[6].level | 0 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[6].display_name | Chemistry |
| concepts[7].id | https://openalex.org/C178790620 |
| concepts[7].level | 1 |
| concepts[7].score | 0.0 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11351 |
| concepts[7].display_name | Organic chemistry |
| keywords[0].id | https://openalex.org/keywords/layer |
| keywords[0].score | 0.6043485403060913 |
| keywords[0].display_name | Layer (electronics) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.6015930771827698 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/speech-recognition |
| keywords[2].score | 0.527520477771759 |
| keywords[2].display_name | Speech recognition |
| keywords[3].id | https://openalex.org/keywords/speaker-recognition |
| keywords[3].score | 0.42147624492645264 |
| keywords[3].display_name | Speaker recognition |
| keywords[4].id | https://openalex.org/keywords/natural-language-processing |
| keywords[4].score | 0.3856649696826935 |
| keywords[4].display_name | Natural language processing |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.36971521377563477 |
| keywords[5].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2401.17632 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2401.17632 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2401.17632 |
| locations[1].id | doi:10.48550/arxiv.2401.17632 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2401.17632 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5033975068 |
| authorships[0].author.orcid | https://orcid.org/0009-0003-4322-4127 |
| authorships[0].author.display_name | Takanori Ashihara |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ashihara, Takanori |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5023868166 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5175-7834 |
| authorships[1].author.display_name | Marc Delcroix |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Delcroix, Marc |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5087290011 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-1942-7250 |
| authorships[2].author.display_name | Takafumi Moriya |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Moriya, Takafumi |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5104231303 |
| authorships[3].author.orcid | https://orcid.org/0009-0000-0884-2200 |
| authorships[3].author.display_name | Kohei Matsuura |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Matsuura, Kohei |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5112536171 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Taichi Asami |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Asami, Taichi |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5068604686 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Yusuke Ijima |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Ijima, Yusuke |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2401.17632 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10201 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9653000235557556 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Speech Recognition and Synthesis |
| related_works | https://openalex.org/W4297807400, https://openalex.org/W1491159402, https://openalex.org/W4313854686, https://openalex.org/W321304764, https://openalex.org/W2249138175, https://openalex.org/W2611678594, https://openalex.org/W3162054169, https://openalex.org/W1813780412, https://openalex.org/W289407349, https://openalex.org/W2029134149 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2401.17632 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2401.17632 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2401.17632 |
| primary_location.id | pmh:oai:arXiv.org:2401.17632 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2401.17632 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2401.17632 |
| publication_date | 2024-01-31 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.1) | 153 |
| abstract_inverted_index.2) | 167 |
| abstract_inverted_index.3) | 183 |
| abstract_inverted_index.In | 26 |
| abstract_inverted_index.We | 94, 115 |
| abstract_inverted_index.as | 16 |
| abstract_inverted_index.be | 175 |
| abstract_inverted_index.by | 32, 103 |
| abstract_inverted_index.in | 129, 178 |
| abstract_inverted_index.is | 49, 132, 160 |
| abstract_inverted_index.of | 61, 170 |
| abstract_inverted_index.or | 83 |
| abstract_inverted_index.to | 22, 98, 109, 126, 139, 156, 163, 188 |
| abstract_inverted_index.we | 135 |
| abstract_inverted_index.Our | 149 |
| abstract_inverted_index.SSL | 13, 29, 73, 82, 113, 172, 185 |
| abstract_inverted_index.and | 55, 75, 111, 146, 182 |
| abstract_inverted_index.are | 120 |
| abstract_inverted_index.but | 192 |
| abstract_inverted_index.for | 7, 40, 51, 123 |
| abstract_inverted_index.has | 3, 65 |
| abstract_inverted_index.how | 44, 76, 130 |
| abstract_inverted_index.its | 77 |
| abstract_inverted_index.the | 58, 96, 141, 154 |
| abstract_inverted_index.SSL, | 63 |
| abstract_inverted_index.This | 88 |
| abstract_inverted_index.also | 116 |
| abstract_inverted_index.been | 66 |
| abstract_inverted_index.each | 124 |
| abstract_inverted_index.from | 80 |
| abstract_inverted_index.into | 69 |
| abstract_inverted_index.more | 194 |
| abstract_inverted_index.such | 15 |
| abstract_inverted_index.task | 125 |
| abstract_inverted_index.tend | 187 |
| abstract_inverted_index.that | 152 |
| abstract_inverted_index.what | 70 |
| abstract_inverted_index.(SSL) | 2 |
| abstract_inverted_index.adopt | 35 |
| abstract_inverted_index.model | 53 |
| abstract_inverted_index.other | 84 |
| abstract_inverted_index.paper | 89 |
| abstract_inverted_index.tasks | 108 |
| abstract_inverted_index.there | 64 |
| abstract_inverted_index.these | 45, 91 |
| abstract_inverted_index.which | 118 |
| abstract_inverted_index.would | 174 |
| abstract_inverted_index.SUPERB | 105 |
| abstract_inverted_index.Speech | 12 |
| abstract_inverted_index.Unlike | 57 |
| abstract_inverted_index.WavLM, | 17 |
| abstract_inverted_index.across | 147 |
| abstract_inverted_index.direct | 137 |
| abstract_inverted_index.employ | 18 |
| abstract_inverted_index.encode | 23 |
| abstract_inverted_index.layers | 119, 144, 169 |
| abstract_inverted_index.masked | 19 |
| abstract_inverted_index.models | 46, 173, 186 |
| abstract_inverted_index.partly | 176 |
| abstract_inverted_index.speech | 10, 62, 81, 101, 110, 131, 171 |
| abstract_inverted_index.within | 145 |
| abstract_inverted_index.between | 143 |
| abstract_inverted_index.capture | 99 |
| abstract_inverted_index.conduct | 136 |
| abstract_inverted_index.content | 158 |
| abstract_inverted_index.differs | 79 |
| abstract_inverted_index.examine | 117 |
| abstract_inverted_index.exhibit | 193 |
| abstract_inverted_index.explore | 95 |
| abstract_inverted_index.limited | 67 |
| abstract_inverted_index.measure | 140 |
| abstract_inverted_index.models, | 14, 30, 34 |
| abstract_inverted_index.models. | 87, 114, 148 |
| abstract_inverted_index.probing | 107 |
| abstract_inverted_index.speaker | 28, 41, 72, 86, 112, 165, 184, 196 |
| abstract_inverted_index.unveils | 151 |
| abstract_inverted_index.various | 59, 100 |
| abstract_inverted_index.analyses | 60 |
| abstract_inverted_index.analysis | 150 |
| abstract_inverted_index.applying | 104 |
| abstract_inverted_index.capacity | 97, 155 |
| abstract_inverted_index.captures | 74 |
| abstract_inverted_index.enhanced | 164 |
| abstract_inverted_index.identify | 127 |
| abstract_inverted_index.learning | 1, 8 |
| abstract_inverted_index.refining | 52 |
| abstract_inverted_index.somewhat | 161 |
| abstract_inverted_index.specific | 168 |
| abstract_inverted_index.training | 21, 37 |
| abstract_inverted_index.utilized | 122 |
| abstract_inverted_index.addresses | 90 |
| abstract_inverted_index.attention | 6 |
| abstract_inverted_index.attracted | 4 |
| abstract_inverted_index.capturing | 179 |
| abstract_inverted_index.contrast, | 27 |
| abstract_inverted_index.disregard | 189 |
| abstract_inverted_index.essential | 50 |
| abstract_inverted_index.increased | 5 |
| abstract_inverted_index.primarily | 39 |
| abstract_inverted_index.represent | 47, 157 |
| abstract_inverted_index.unrelated | 162 |
| abstract_inverted_index.DINO-based | 33 |
| abstract_inverted_index.efficiency | 54 |
| abstract_inverted_index.evaluation | 106 |
| abstract_inverted_index.linguistic | 180, 190 |
| abstract_inverted_index.meaningful | 9 |
| abstract_inverted_index.objectives | 38 |
| abstract_inverted_index.prediction | 20 |
| abstract_inverted_index.properties | 102 |
| abstract_inverted_index.questions. | 93 |
| abstract_inverted_index.comparisons | 138 |
| abstract_inverted_index.differences | 128 |
| abstract_inverted_index.exemplified | 31 |
| abstract_inverted_index.fundamental | 92 |
| abstract_inverted_index.information | 48, 71, 159, 191 |
| abstract_inverted_index.specialized | 177 |
| abstract_inverted_index.Furthermore, | 134 |
| abstract_inverted_index.information, | 181 |
| abstract_inverted_index.represented. | 133 |
| abstract_inverted_index.similarities | 142 |
| abstract_inverted_index.Understanding | 43 |
| abstract_inverted_index.investigation | 68 |
| abstract_inverted_index.predominantly | 121 |
| abstract_inverted_index.sophisticated | 195 |
| abstract_inverted_index.effectiveness. | 56 |
| abstract_inverted_index.representation | 78 |
| abstract_inverted_index.Self-supervised | 0 |
| abstract_inverted_index.general-purpose | 24 |
| abstract_inverted_index.representation, | 166 |
| abstract_inverted_index.representation. | 42, 197 |
| abstract_inverted_index.utterance-level | 36 |
| abstract_inverted_index.fully-supervised | 85 |
| abstract_inverted_index.representations. | 11, 25 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
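
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of how the running text could be rebuilt from such a map follows. It assumes the positions arrive as integer lists (the table shows them comma-separated), and the function name `rebuild_abstract` is hypothetical. The excerpt covers only positions 0 to 11, so tokens that also occur later in the abstract list just their early positions here.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Place each token at every position listed for it, then join in order."""
    positions: dict[int, str] = {}
    for token, indices in inverted_index.items():
        for index in indices:
            positions[index] = token
    return " ".join(positions[i] for i in sorted(positions))


# Excerpt of the rows above, restricted to positions 0-11 of the abstract.
excerpt = {
    "Self-supervised": [0],
    "learning": [1, 8],
    "(SSL)": [2],
    "has": [3],
    "attracted": [4],
    "increased": [5],
    "attention": [6],
    "for": [7],
    "meaningful": [9],
    "speech": [10],
    "representations.": [11],
}

first_sentence = rebuild_abstract(excerpt)
print(first_sentence)
```

Applied to the full set of rows, the same function would yield the complete abstract, since every position from 0 to 197 appears exactly once in the index.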