StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

Kazuki Yamauchi, Yusuke Ijima, Yuki Saito

2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2311.16509

We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Most conventional techniques for para-/non-linguistic information recognition focus on category classification or intensity estimation of pre-defined labels, and they cannot explain the reasoning behind a recognition result in an interpretable manner. StyleCap is a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning. StyleCap is trained on paired data of speech and natural language descriptions. We train neural networks that convert a speech representation vector into prefix vectors, which are fed into a large language model (LLM)-based text decoder. We explore which text decoder and which speech feature representation are suitable for this new task. The experimental results demonstrate that StyleCap, when it leverages richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation, improves the accuracy and diversity of the generated speaking-style captions. Samples of speaking-style captions generated by StyleCap are publicly available.
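
The prefix-based conditioning described in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the dimensions, the single linear projection, and the random weights are hypothetical stand-ins, not the networks, speech features, or text decoder used in the paper.

```python
import random


def make_mapping_network(speech_dim, prefix_len, llm_dim, seed=0):
    """Create hypothetical weights: one (llm_dim x speech_dim) matrix per prefix slot."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(speech_dim)] for _ in range(llm_dim)]
        for _ in range(prefix_len)
    ]


def speech_to_prefix(speech_vec, weights):
    """Project one speech representation vector into prefix_len vectors of size llm_dim."""
    return [
        [sum(w * x for w, x in zip(row, speech_vec)) for row in slot]
        for slot in weights
    ]


def build_decoder_input(prefix_vectors, token_embeddings):
    """Prepend the prefix vectors to the embedded caption tokens."""
    return prefix_vectors + token_embeddings


# Stand-in for an utterance-level speech feature (assumed size 8).
speech_vec = [0.1] * 8
weights = make_mapping_network(speech_dim=8, prefix_len=4, llm_dim=16)
prefix = speech_to_prefix(speech_vec, weights)

# Stand-in for three embedded caption tokens in the decoder's embedding space.
tokens = [[0.0] * 16 for _ in range(3)]
decoder_input = build_decoder_input(prefix, tokens)

# Sequence length is prefix_len + number of tokens; width is llm_dim.
print(len(decoder_input), len(decoder_input[0]))
```

In this sketch the text decoder would see four prefix positions that depend only on the speech vector, followed by the caption tokens. In a trained system the projection weights would be learned from the paired speech and description data rather than sampled at random.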

Related Topics

Computer Science

Artificial Intelligence

Concepts

Computer science, Closed captioning, Natural language processing, Speech recognition, Sentence, Artificial intelligence, Style (visual arts), Focus (optics), Natural language, Representation (politics), Natural language generation, Feature (linguistics), Linguistics, Philosophy, Image (mathematics), History, Physics, Politics, Optics, Political science, Law, Archaeology

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2311.16509
PDF: https://arxiv.org/pdf/2311.16509
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4389156698

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4389156698

Canonical identifier for this work in OpenAlex
DOI: https://doi.org/10.48550/arxiv.2311.16509

Digital Object Identifier
Title: StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

Work title
Type: preprint

OpenAlex work type
Language: en

Primary language
Publication year: 2023

Year of publication
Publication date: 2023-11-28

Full publication date if available
Authors: Kazuki Yamauchi, Yusuke Ijima, Yuki Saito

List of authors in order
Landing page: https://arxiv.org/abs/2311.16509

Publisher landing page
PDF URL: https://arxiv.org/pdf/2311.16509

Direct link to full text PDF
Open access: Yes

Whether a free full text is available
OA status: green

Open access status per OpenAlex
OA URL: https://arxiv.org/pdf/2311.16509

Direct OA link when available
Concepts: Computer science, Closed captioning, Natural language processing, Speech recognition, Sentence, Artificial intelligence, Style (visual arts), Focus (optics), Natural language, Representation (politics), Natural language generation, Feature (linguistics), Linguistics, Philosophy, Image (mathematics), History, Physics, Politics, Optics, Political science, Law, Archaeology

Top concepts (fields/topics) attached by OpenAlex
Cited by: 0

Total citation count in OpenAlex
Related works (count): 10

Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4389156698
doi	https://doi.org/10.48550/arxiv.2311.16509
ids.doi	https://doi.org/10.48550/arxiv.2311.16509
ids.openalex	https://openalex.org/W4389156698
fwci
type	preprint
title	StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10181
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9983000159263611
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Natural Language Processing Techniques
topics[1].id	https://openalex.org/T12031
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9972000122070312
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Speech and dialogue systems
topics[2].id	https://openalex.org/T10028
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.9950000047683716
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Topic Modeling
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C41008148
concepts[0].level	0
concepts[0].score	0.8147604465484619
concepts[0].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[0].display_name	Computer science
concepts[1].id	https://openalex.org/C157657479
concepts[1].level	3
concepts[1].score	0.7064263820648193
concepts[1].wikidata	https://www.wikidata.org/wiki/Q2367247
concepts[1].display_name	Closed captioning
concepts[2].id	https://openalex.org/C204321447
concepts[2].level	1
concepts[2].score	0.6739928722381592
concepts[2].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[2].display_name	Natural language processing
concepts[3].id	https://openalex.org/C28490314
concepts[3].level	1
concepts[3].score	0.6077100038528442
concepts[3].wikidata	https://www.wikidata.org/wiki/Q189436
concepts[3].display_name	Speech recognition
concepts[4].id	https://openalex.org/C2777530160
concepts[4].level	2
concepts[4].score	0.5815175175666809
concepts[4].wikidata	https://www.wikidata.org/wiki/Q41796
concepts[4].display_name	Sentence
concepts[5].id	https://openalex.org/C154945302
concepts[5].level	1
concepts[5].score	0.5727705955505371
concepts[5].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[5].display_name	Artificial intelligence
concepts[6].id	https://openalex.org/C2776445246
concepts[6].level	2
concepts[6].score	0.5426627993583679
concepts[6].wikidata	https://www.wikidata.org/wiki/Q1792644
concepts[6].display_name	Style (visual arts)
concepts[7].id	https://openalex.org/C192209626
concepts[7].level	2
concepts[7].score	0.5323460698127747
concepts[7].wikidata	https://www.wikidata.org/wiki/Q190909
concepts[7].display_name	Focus (optics)
concepts[8].id	https://openalex.org/C195324797
concepts[8].level	2
concepts[8].score	0.5276447534561157
concepts[8].wikidata	https://www.wikidata.org/wiki/Q33742
concepts[8].display_name	Natural language
concepts[9].id	https://openalex.org/C2776359362
concepts[9].level	3
concepts[9].score	0.43999525904655457
concepts[9].wikidata	https://www.wikidata.org/wiki/Q2145286
concepts[9].display_name	Representation (politics)
concepts[10].id	https://openalex.org/C2776187449
concepts[10].level	3
concepts[10].score	0.4383416175842285
concepts[10].wikidata	https://www.wikidata.org/wiki/Q1513879
concepts[10].display_name	Natural language generation
concepts[11].id	https://openalex.org/C2776401178
concepts[11].level	2
concepts[11].score	0.4251898229122162
concepts[11].wikidata	https://www.wikidata.org/wiki/Q12050496
concepts[11].display_name	Feature (linguistics)
concepts[12].id	https://openalex.org/C41895202
concepts[12].level	1
concepts[12].score	0.2886231541633606
concepts[12].wikidata	https://www.wikidata.org/wiki/Q8162
concepts[12].display_name	Linguistics
concepts[13].id	https://openalex.org/C138885662
concepts[13].level	0
concepts[13].score	0.0
concepts[13].wikidata	https://www.wikidata.org/wiki/Q5891
concepts[13].display_name	Philosophy
concepts[14].id	https://openalex.org/C115961682
concepts[14].level	2
concepts[14].score	0.0
concepts[14].wikidata	https://www.wikidata.org/wiki/Q860623
concepts[14].display_name	Image (mathematics)
concepts[15].id	https://openalex.org/C95457728
concepts[15].level	0
concepts[15].score	0.0
concepts[15].wikidata	https://www.wikidata.org/wiki/Q309
concepts[15].display_name	History
concepts[16].id	https://openalex.org/C121332964
concepts[16].level	0
concepts[16].score	0.0
concepts[16].wikidata	https://www.wikidata.org/wiki/Q413
concepts[16].display_name	Physics
concepts[17].id	https://openalex.org/C94625758
concepts[17].level	2
concepts[17].score	0.0
concepts[17].wikidata	https://www.wikidata.org/wiki/Q7163
concepts[17].display_name	Politics
concepts[18].id	https://openalex.org/C120665830
concepts[18].level	1
concepts[18].score	0.0
concepts[18].wikidata	https://www.wikidata.org/wiki/Q14620
concepts[18].display_name	Optics
concepts[19].id	https://openalex.org/C17744445
concepts[19].level	0
concepts[19].score	0.0
concepts[19].wikidata	https://www.wikidata.org/wiki/Q36442
concepts[19].display_name	Political science
concepts[20].id	https://openalex.org/C199539241
concepts[20].level	1
concepts[20].score	0.0
concepts[20].wikidata	https://www.wikidata.org/wiki/Q7748
concepts[20].display_name	Law
concepts[21].id	https://openalex.org/C166957645
concepts[21].level	1
concepts[21].score	0.0
concepts[21].wikidata	https://www.wikidata.org/wiki/Q23498
concepts[21].display_name	Archaeology
keywords[0].id	https://openalex.org/keywords/computer-science
keywords[0].score	0.8147604465484619
keywords[0].display_name	Computer science
keywords[1].id	https://openalex.org/keywords/closed-captioning
keywords[1].score	0.7064263820648193
keywords[1].display_name	Closed captioning
keywords[2].id	https://openalex.org/keywords/natural-language-processing
keywords[2].score	0.6739928722381592
keywords[2].display_name	Natural language processing
keywords[3].id	https://openalex.org/keywords/speech-recognition
keywords[3].score	0.6077100038528442
keywords[3].display_name	Speech recognition
keywords[4].id	https://openalex.org/keywords/sentence
keywords[4].score	0.5815175175666809
keywords[4].display_name	Sentence
keywords[5].id	https://openalex.org/keywords/artificial-intelligence
keywords[5].score	0.5727705955505371
keywords[5].display_name	Artificial intelligence
keywords[6].id	https://openalex.org/keywords/style
keywords[6].score	0.5426627993583679
keywords[6].display_name	Style (visual arts)
keywords[7].id	https://openalex.org/keywords/focus
keywords[7].score	0.5323460698127747
keywords[7].display_name	Focus (optics)
keywords[8].id	https://openalex.org/keywords/natural-language
keywords[8].score	0.5276447534561157
keywords[8].display_name	Natural language
keywords[9].id	https://openalex.org/keywords/representation
keywords[9].score	0.43999525904655457
keywords[9].display_name	Representation (politics)
keywords[10].id	https://openalex.org/keywords/natural-language-generation
keywords[10].score	0.4383416175842285
keywords[10].display_name	Natural language generation
keywords[11].id	https://openalex.org/keywords/feature
keywords[11].score	0.4251898229122162
keywords[11].display_name	Feature (linguistics)
keywords[12].id	https://openalex.org/keywords/linguistics
keywords[12].score	0.2886231541633606
keywords[12].display_name	Linguistics
language	en
locations[0].id	pmh:oai:arXiv.org:2311.16509
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license	cc-by-sa
locations[0].pdf_url	https://arxiv.org/pdf/2311.16509
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id	https://openalex.org/licenses/cc-by-sa
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2311.16509
locations[1].id	doi:10.48550/arxiv.2311.16509
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2311.16509
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5108574971
authorships[0].author.orcid
authorships[0].author.display_name	Kazuki Yamauchi
authorships[0].author_position	first
authorships[0].raw_author_name	Yamauchi, Kazuki
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5068604686
authorships[1].author.orcid
authorships[1].author.display_name	Yusuke Ijima
authorships[1].author_position	middle
authorships[1].raw_author_name	Ijima, Yusuke
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5083394213
authorships[2].author.orcid	https://orcid.org/0000-0002-7967-2613
authorships[2].author.display_name	Yuki Saito
authorships[2].author_position	last
authorships[2].raw_author_name	Saito, Yuki
authorships[2].is_corresponding	False
has_content.pdf	True
has_content.grobid_xml	True
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2311.16509
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2023-11-30T00:00:00
display_name	StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10181
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9983000159263611
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Natural Language Processing Techniques
related_works	https://openalex.org/W2050523636, https://openalex.org/W3009270862, https://openalex.org/W2152921782, https://openalex.org/W382594479, https://openalex.org/W2470045054, https://openalex.org/W2575772232, https://openalex.org/W2151245229, https://openalex.org/W2140902089, https://openalex.org/W2030298461, https://openalex.org/W1510553545
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2311.16509
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license	cc-by-sa
best_oa_location.pdf_url	https://arxiv.org/pdf/2311.16509
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id	https://openalex.org/licenses/cc-by-sa
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2311.16509
primary_location.id	pmh:oai:arXiv.org:2311.16509
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license	cc-by-sa
primary_location.pdf_url	https://arxiv.org/pdf/2311.16509
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id	https://openalex.org/licenses/cc-by-sa
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2311.16509
publication_date	2023-11-28
publication_year	2023
referenced_works_count	0
abstract_inverted_index.a	3, 52, 87, 98
abstract_inverted_index.We	0, 81, 105
abstract_inverted_index.an	47, 56, 107
abstract_inverted_index.by	157
abstract_inverted_index.in	14, 46
abstract_inverted_index.is	51, 70
abstract_inverted_index.of	10, 18, 34, 42, 75, 148, 153
abstract_inverted_index.on	26
abstract_inverted_index.or	30
abstract_inverted_index.to	5
abstract_inverted_index.The	120
abstract_inverted_index.and	77, 111, 139, 146
abstract_inverted_index.are	95, 160
abstract_inverted_index.fed	96
abstract_inverted_index.for	21, 59, 116, 130
abstract_inverted_index.new	118
abstract_inverted_index.our	125, 158
abstract_inverted_index.the	27, 31, 40, 43, 131, 144
abstract_inverted_index.LLMs	129
abstract_inverted_index.data	74
abstract_inverted_index.from	63
abstract_inverted_index.into	91, 97
abstract_inverted_index.most	17
abstract_inverted_index.step	54
abstract_inverted_index.text	103, 109, 132
abstract_inverted_index.that	85, 94, 124
abstract_inverted_index.they	37
abstract_inverted_index.this	117
abstract_inverted_index.with	72
abstract_inverted_index.(SSL)	137
abstract_inverted_index.first	53
abstract_inverted_index.focus	25
abstract_inverted_index.i.e.,	65
abstract_inverted_index.large	99
abstract_inverted_index.model	101
abstract_inverted_index.task.	119
abstract_inverted_index.train	82
abstract_inverted_index.cannot	38
abstract_inverted_index.method	4, 58
abstract_inverted_index.neural	83
abstract_inverted_index.paired	73
abstract_inverted_index.prefix	92
abstract_inverted_index.result	45
abstract_inverted_index.richer	128
abstract_inverted_index.speech	76, 88, 112, 134
abstract_inverted_index.styles	12
abstract_inverted_index.vector	90
abstract_inverted_index.Samples	152
abstract_inverted_index.convert	86
abstract_inverted_index.decoder	110
abstract_inverted_index.explore	106
abstract_inverted_index.feature	113
abstract_inverted_index.labels,	36
abstract_inverted_index.manner.	49
abstract_inverted_index.natural	7, 78
abstract_inverted_index.prompts	62
abstract_inverted_index.propose	1
abstract_inverted_index.provide	39
abstract_inverted_index.results	122
abstract_inverted_index.speech,	64
abstract_inverted_index.speech.	15
abstract_inverted_index.towards	55
abstract_inverted_index.trained	71
abstract_inverted_index.vectors	93
abstract_inverted_index.Although	16
abstract_inverted_index.StyleCap	50, 69, 126, 159
abstract_inverted_index.accuracy	145
abstract_inverted_index.captions	155
abstract_inverted_index.category	28
abstract_inverted_index.decoder,	133
abstract_inverted_index.decoder.	104
abstract_inverted_index.generate	6
abstract_inverted_index.improves	143
abstract_inverted_index.language	8, 79, 100
abstract_inverted_index.learning	136
abstract_inverted_index.networks	84
abstract_inverted_index.publicly	161
abstract_inverted_index.sentence	140
abstract_inverted_index.speaking	11
abstract_inverted_index.suitable	115
abstract_inverted_index.StyleCap,	2
abstract_inverted_index.appearing	13
abstract_inverted_index.automatic	66
abstract_inverted_index.captions.	151
abstract_inverted_index.diversity	147
abstract_inverted_index.features,	138
abstract_inverted_index.generated	149, 156
abstract_inverted_index.intensity	32
abstract_inverted_index.reasoning	41
abstract_inverted_index.available.	162
abstract_inverted_index.end-to-end	57
abstract_inverted_index.estimation	33
abstract_inverted_index.generating	60
abstract_inverted_index.leveraging	127
abstract_inverted_index.rephrasing	141
abstract_inverted_index.techniques	20
abstract_inverted_index.(LLM)-based	102
abstract_inverted_index.appropriate	108
abstract_inverted_index.captioning.	68
abstract_inverted_index.demonstrate	123
abstract_inverted_index.information	23
abstract_inverted_index.pre-defined	35
abstract_inverted_index.recognition	24, 44
abstract_inverted_index.augmentation	142
abstract_inverted_index.conventional	19
abstract_inverted_index.descriptions	9
abstract_inverted_index.experimental	121
abstract_inverted_index.descriptions.	80
abstract_inverted_index.interpretable	48
abstract_inverted_index.classification	29
abstract_inverted_index.representation	89, 114
abstract_inverted_index.speaking-style	61, 67, 150, 154
abstract_inverted_index.self-supervised	135
abstract_inverted_index.para-/non-linguistic	22
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	3
sustainable_development_goals[0].id	https://metadata.un.org/sdg/4
sustainable_development_goals[0].score	0.8199999928474426
sustainable_development_goals[0].display_name	Quality Education
citation_normalized_percentile
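
The abstract_inverted_index rows in the payload above store the abstract as a map from each word to the positions at which it occurs. As a minimal sketch, the text can be rebuilt by placing every word at its positions and reading them back in order. The function name rebuild_abstract and the sample dictionary are illustrative; the sample keeps only positions 0 to 15 from the payload, which cover the first sentence.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a mapping of word -> list of positions."""
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word
    # Read the words back in ascending position order.
    return " ".join(positions[i] for i in sorted(positions))


# Subset of the payload rows, restricted to positions 0..15 (first sentence only).
sample = {
    "We": [0],
    "propose": [1],
    "StyleCap,": [2],
    "a": [3],
    "method": [4],
    "to": [5],
    "generate": [6],
    "natural": [7],
    "language": [8],
    "descriptions": [9],
    "of": [10],
    "speaking": [11],
    "styles": [12],
    "appearing": [13],
    "in": [14],
    "speech.": [15],
}

print(rebuild_abstract(sample))
# prints: We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech.
```

Passing the complete set of rows, with all positions for each word, would rebuild the full abstract in the same way.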