What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2401.17632

Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these models represent information is essential for refining model efficiency and effectiveness. Unlike the various analyses of speech SSL, there has been limited investigation into what information speaker SSL captures and how its representation differs from speech SSL or other fully-supervised speaker models. This paper addresses these fundamental questions. We explore the capacity to capture various speech properties by applying SUPERB evaluation probing tasks to speech and speaker SSL models. We also examine which layers are predominantly utilized for each task to identify differences in how speech is represented. Furthermore, we conduct direct comparisons to measure the similarities between layers within and across models. Our analysis unveils that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models would be partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation.
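
The layer-usage analysis mentioned in the abstract can be illustrated with a small sketch. It assumes the common SUPERB-style setup in which a probing head consumes a softmax-weighted sum of all layer outputs, so the normalized weights hint at which layers a task relies on. The abstract does not spell out this mechanism, and the function names and toy numbers below are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, raw_weights):
    """Combine per-layer feature vectors into a single vector.

    layer_features: list of L vectors (one per layer), each of length D.
    raw_weights: list of L unnormalized scalars (learned in a real probe).
    """
    weights = softmax(raw_weights)
    dim = len(layer_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_features))
        for d in range(dim)
    ]


# Toy example: 3 layers with 2-dimensional features (made-up values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw = [0.0, 0.0, 2.0]  # hypothetical learned scores favouring the last layer
layer_weights = softmax(raw)
combined = weighted_layer_sum(layers, raw)

# Index of the layer with the largest normalized weight.
dominant_layer = max(range(len(layer_weights)), key=layer_weights.__getitem__)
```

In a real probe the raw scores would be trained jointly with the task head. Here they are fixed by hand only to show how the normalized weights are read off.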

Related Topics

Computer Science

Artificial Intelligence

Chemistry

Organic Chemistry

Concepts

Layer (electronics), Computer science, Speech recognition, Speaker recognition, Natural language processing, Artificial intelligence, Chemistry, Organic chemistry

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2401.17632
PDF: https://arxiv.org/pdf/2401.17632
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4391462745

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4391462745
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2401.17632
Digital Object Identifier

Title: What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2024
Year of publication

Publication date: 2024-01-31
Full publication date if available

Authors: Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima
List of authors in order

Landing page: https://arxiv.org/abs/2401.17632
Publisher landing page

PDF URL: https://arxiv.org/pdf/2401.17632
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2401.17632
Direct OA link when available

Concepts: Layer (electronics), Computer science, Speech recognition, Speaker recognition, Natural language processing, Artificial intelligence, Chemistry, Organic chemistry
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4391462745
doi	https://doi.org/10.48550/arxiv.2401.17632
ids.doi	https://doi.org/10.48550/arxiv.2401.17632
ids.openalex	https://openalex.org/W4391462745
fwci
type	preprint
title	What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10201
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9653000235557556
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Speech Recognition and Synthesis
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C2779227376
concepts[0].level	2
concepts[0].score	0.6043485403060913
concepts[0].wikidata	https://www.wikidata.org/wiki/Q6505497
concepts[0].display_name	Layer (electronics)
concepts[1].id	https://openalex.org/C41008148
concepts[1].level	0
concepts[1].score	0.6015930771827698
concepts[1].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[1].display_name	Computer science
concepts[2].id	https://openalex.org/C28490314
concepts[2].level	1
concepts[2].score	0.527520477771759
concepts[2].wikidata	https://www.wikidata.org/wiki/Q189436
concepts[2].display_name	Speech recognition
concepts[3].id	https://openalex.org/C133892786
concepts[3].level	2
concepts[3].score	0.42147624492645264
concepts[3].wikidata	https://www.wikidata.org/wiki/Q1145189
concepts[3].display_name	Speaker recognition
concepts[4].id	https://openalex.org/C204321447
concepts[4].level	1
concepts[4].score	0.3856649696826935
concepts[4].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[4].display_name	Natural language processing
concepts[5].id	https://openalex.org/C154945302
concepts[5].level	1
concepts[5].score	0.36971521377563477
concepts[5].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[5].display_name	Artificial intelligence
concepts[6].id	https://openalex.org/C185592680
concepts[6].level	0
concepts[6].score	0.0
concepts[6].wikidata	https://www.wikidata.org/wiki/Q2329
concepts[6].display_name	Chemistry
concepts[7].id	https://openalex.org/C178790620
concepts[7].level	1
concepts[7].score	0.0
concepts[7].wikidata	https://www.wikidata.org/wiki/Q11351
concepts[7].display_name	Organic chemistry
keywords[0].id	https://openalex.org/keywords/layer
keywords[0].score	0.6043485403060913
keywords[0].display_name	Layer (electronics)
keywords[1].id	https://openalex.org/keywords/computer-science
keywords[1].score	0.6015930771827698
keywords[1].display_name	Computer science
keywords[2].id	https://openalex.org/keywords/speech-recognition
keywords[2].score	0.527520477771759
keywords[2].display_name	Speech recognition
keywords[3].id	https://openalex.org/keywords/speaker-recognition
keywords[3].score	0.42147624492645264
keywords[3].display_name	Speaker recognition
keywords[4].id	https://openalex.org/keywords/natural-language-processing
keywords[4].score	0.3856649696826935
keywords[4].display_name	Natural language processing
keywords[5].id	https://openalex.org/keywords/artificial-intelligence
keywords[5].score	0.36971521377563477
keywords[5].display_name	Artificial intelligence
language	en
locations[0].id	pmh:oai:arXiv.org:2401.17632
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2401.17632
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2401.17632
locations[1].id	doi:10.48550/arxiv.2401.17632
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2401.17632
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5033975068
authorships[0].author.orcid	https://orcid.org/0009-0003-4322-4127
authorships[0].author.display_name	Takanori Ashihara
authorships[0].author_position	first
authorships[0].raw_author_name	Ashihara, Takanori
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5023868166
authorships[1].author.orcid	https://orcid.org/0000-0002-5175-7834
authorships[1].author.display_name	Marc Delcroix
authorships[1].author_position	middle
authorships[1].raw_author_name	Delcroix, Marc
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5087290011
authorships[2].author.orcid	https://orcid.org/0000-0003-1942-7250
authorships[2].author.display_name	Takafumi Moriya
authorships[2].author_position	middle
authorships[2].raw_author_name	Moriya, Takafumi
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5104231303
authorships[3].author.orcid	https://orcid.org/0009-0000-0884-2200
authorships[3].author.display_name	Kohei Matsuura
authorships[3].author_position	middle
authorships[3].raw_author_name	Matsuura, Kohei
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5112536171
authorships[4].author.orcid
authorships[4].author.display_name	Taichi Asami
authorships[4].author_position	middle
authorships[4].raw_author_name	Asami, Taichi
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5068604686
authorships[5].author.orcid
authorships[5].author.display_name	Yusuke Ijima
authorships[5].author_position	last
authorships[5].raw_author_name	Ijima, Yusuke
authorships[5].is_corresponding	False
has_content.pdf	True
has_content.grobid_xml	True
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2401.17632
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
has_fulltext	True
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10201
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9653000235557556
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Speech Recognition and Synthesis
related_works	https://openalex.org/W4297807400, https://openalex.org/W1491159402, https://openalex.org/W4313854686, https://openalex.org/W321304764, https://openalex.org/W2249138175, https://openalex.org/W2611678594, https://openalex.org/W3162054169, https://openalex.org/W1813780412, https://openalex.org/W289407349, https://openalex.org/W2029134149
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2401.17632
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2401.17632
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2401.17632
primary_location.id	pmh:oai:arXiv.org:2401.17632
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2401.17632
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2401.17632
publication_date	2024-01-31
publication_year	2024
referenced_works_count	0
abstract_inverted_index.1)	153
abstract_inverted_index.2)	167
abstract_inverted_index.3)	183
abstract_inverted_index.In	26
abstract_inverted_index.We	94, 115
abstract_inverted_index.as	16
abstract_inverted_index.be	175
abstract_inverted_index.by	32, 103
abstract_inverted_index.in	129, 178
abstract_inverted_index.is	49, 132, 160
abstract_inverted_index.of	61, 170
abstract_inverted_index.or	83
abstract_inverted_index.to	22, 98, 109, 126, 139, 156, 163, 188
abstract_inverted_index.we	135
abstract_inverted_index.Our	149
abstract_inverted_index.SSL	13, 29, 73, 82, 113, 172, 185
abstract_inverted_index.and	55, 75, 111, 146, 182
abstract_inverted_index.are	120
abstract_inverted_index.but	192
abstract_inverted_index.for	7, 40, 51, 123
abstract_inverted_index.has	3, 65
abstract_inverted_index.how	44, 76, 130
abstract_inverted_index.its	77
abstract_inverted_index.the	58, 96, 141, 154
abstract_inverted_index.SSL,	63
abstract_inverted_index.This	88
abstract_inverted_index.also	116
abstract_inverted_index.been	66
abstract_inverted_index.each	124
abstract_inverted_index.from	80
abstract_inverted_index.into	69
abstract_inverted_index.more	194
abstract_inverted_index.such	15
abstract_inverted_index.task	125
abstract_inverted_index.tend	187
abstract_inverted_index.that	152
abstract_inverted_index.what	70
abstract_inverted_index.(SSL)	2
abstract_inverted_index.adopt	35
abstract_inverted_index.model	53
abstract_inverted_index.other	84
abstract_inverted_index.paper	89
abstract_inverted_index.tasks	108
abstract_inverted_index.there	64
abstract_inverted_index.these	45, 91
abstract_inverted_index.which	118
abstract_inverted_index.would	174
abstract_inverted_index.SUPERB	105
abstract_inverted_index.Speech	12
abstract_inverted_index.Unlike	57
abstract_inverted_index.WavLM,	17
abstract_inverted_index.across	147
abstract_inverted_index.direct	137
abstract_inverted_index.employ	18
abstract_inverted_index.encode	23
abstract_inverted_index.layers	119, 144, 169
abstract_inverted_index.masked	19
abstract_inverted_index.models	46, 173, 186
abstract_inverted_index.partly	176
abstract_inverted_index.speech	10, 62, 81, 101, 110, 131, 171
abstract_inverted_index.within	145
abstract_inverted_index.between	143
abstract_inverted_index.capture	99
abstract_inverted_index.conduct	136
abstract_inverted_index.content	158
abstract_inverted_index.differs	79
abstract_inverted_index.examine	117
abstract_inverted_index.exhibit	193
abstract_inverted_index.explore	95
abstract_inverted_index.limited	67
abstract_inverted_index.measure	140
abstract_inverted_index.models,	14, 30, 34
abstract_inverted_index.models.	87, 114, 148
abstract_inverted_index.probing	107
abstract_inverted_index.speaker	28, 41, 72, 86, 112, 165, 184, 196
abstract_inverted_index.unveils	151
abstract_inverted_index.various	59, 100
abstract_inverted_index.analyses	60
abstract_inverted_index.analysis	150
abstract_inverted_index.applying	104
abstract_inverted_index.capacity	97, 155
abstract_inverted_index.captures	74
abstract_inverted_index.enhanced	164
abstract_inverted_index.identify	127
abstract_inverted_index.learning	1, 8
abstract_inverted_index.refining	52
abstract_inverted_index.somewhat	161
abstract_inverted_index.specific	168
abstract_inverted_index.training	21, 37
abstract_inverted_index.utilized	122
abstract_inverted_index.addresses	90
abstract_inverted_index.attention	6
abstract_inverted_index.attracted	4
abstract_inverted_index.capturing	179
abstract_inverted_index.contrast,	27
abstract_inverted_index.disregard	189
abstract_inverted_index.essential	50
abstract_inverted_index.increased	5
abstract_inverted_index.primarily	39
abstract_inverted_index.represent	47, 157
abstract_inverted_index.unrelated	162
abstract_inverted_index.DINO-based	33
abstract_inverted_index.efficiency	54
abstract_inverted_index.evaluation	106
abstract_inverted_index.linguistic	180, 190
abstract_inverted_index.meaningful	9
abstract_inverted_index.objectives	38
abstract_inverted_index.prediction	20
abstract_inverted_index.properties	102
abstract_inverted_index.questions.	93
abstract_inverted_index.comparisons	138
abstract_inverted_index.differences	128
abstract_inverted_index.exemplified	31
abstract_inverted_index.fundamental	92
abstract_inverted_index.information	48, 71, 159, 191
abstract_inverted_index.specialized	177
abstract_inverted_index.Furthermore,	134
abstract_inverted_index.information,	181
abstract_inverted_index.represented.	133
abstract_inverted_index.similarities	142
abstract_inverted_index.Understanding	43
abstract_inverted_index.investigation	68
abstract_inverted_index.predominantly	121
abstract_inverted_index.sophisticated	195
abstract_inverted_index.effectiveness.	56
abstract_inverted_index.representation	78
abstract_inverted_index.Self-supervised	0
abstract_inverted_index.general-purpose	24
abstract_inverted_index.representation,	166
abstract_inverted_index.representation.	42, 197
abstract_inverted_index.utterance-level	36
abstract_inverted_index.fully-supervised	85
abstract_inverted_index.representations.	11, 25
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	6
citation_normalized_percentile
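
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal Python sketch of turning such a map back into running text follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary copies only positions 0 to 11 (the first sentence) from the payload above, omitting later positions of the same tokens.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a {token: [positions]} inverted index."""
    positioned = []
    for token, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, token))
    # Sorting by position restores the original word order.
    positioned.sort()
    return " ".join(token for _, token in positioned)


# Positions 0-11 only, taken from the payload rows above.
sample = {
    "Self-supervised": [0],
    "learning": [1, 8],
    "(SSL)": [2],
    "has": [3],
    "attracted": [4],
    "increased": [5],
    "attention": [6],
    "for": [7],
    "meaningful": [9],
    "speech": [10],
    "representations.": [11],
}

first_sentence = rebuild_abstract(sample)
print(first_sentence)
```

Punctuation stays attached to its token in the index (for example `representations.`), so joining on single spaces is enough to recover the sentence.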