Exploring modality-agnostic representations for music classification
Ho-Hsiang Wu, Magdalena Fuentes, Juan Pablo Bello · 2021 · Open Access
DOI: https://doi.org/10.48550/arxiv.2106.01149
Music information is often conveyed or recorded across multiple data modalities including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input, and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case when there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best-performing system in a zero-shot setting. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
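The idea in the abstract, modality-specific encoders aligned into one shared space so that a single classifier can accept embeddings from either modality, can be illustrated compactly. The following is a minimal sketch, not the authors' exact model: it assumes a PyTorch setup, illustrative feature dimensions (512-d audio, 2048-d image), an assumed 12-class instrument taxonomy, and a generic cross-modal contrastive objective standing in for whatever retrieval loss the paper actually uses.

```python
# Minimal sketch (illustrative, not the paper's architecture): two encoders map
# audio and image features into a shared embedding space, a cross-modal
# contrastive loss aligns paired examples, and one classifier then works on
# embeddings regardless of which modality produced them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's features into the shared, L2-normalized embedding space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def cross_modal_contrastive_loss(z_audio, z_image, temperature=0.07):
    """Pull paired audio/image embeddings together, push mismatched pairs apart."""
    logits = z_audio @ z_image.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_audio.size(0))           # i-th audio matches i-th image
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Modality-agnostic classifier: it only ever sees shared-space embeddings.
audio_enc, image_enc = Encoder(in_dim=512), Encoder(in_dim=2048)
classifier = nn.Linear(128, 12)                        # assumed 12 instrument classes

# Pretext step: align the two modalities on paired (audio, image) data.
audio_feats, image_feats = torch.randn(8, 512), torch.randn(8, 2048)
loss = cross_modal_contrastive_loss(audio_enc(audio_feats), image_enc(image_feats))

# Downstream step: the same classifier accepts embeddings from either encoder,
# which is what allows training on one modality and classifying another.
logits_from_audio = classifier(audio_enc(audio_feats))
logits_from_image = classifier(image_enc(image_feats))
```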
Record details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2106.01149
- PDF: https://arxiv.org/pdf/2106.01149
- OA status: green
- Cited by: 4
- References: 31
- Related works: 10
- OpenAlex ID: https://openalex.org/W3172444685
OpenAlex record
- OpenAlex ID: https://openalex.org/W3172444685 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2106.01149 (Digital Object Identifier)
- Title: Exploring modality-agnostic representations for music classification (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2021 (year of publication)
- Publication date: 2021-06-02 (full publication date if available)
- Authors: Ho-Hsiang Wu, Magdalena Fuentes, Juan Pablo Bello (authors in order)
- Landing page: https://arxiv.org/abs/2106.01149 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2106.01149 (direct link to the full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2106.01149 (direct OA link when available)
- Concepts: Modality (human–computer interaction), Computer science, Psychology, Natural language processing, Artificial intelligence (top fields/topics attached by OpenAlex)
- Cited by: 4 (total citation count in OpenAlex)
- Citations by year (recent): 2024: 1, 2023: 3 (per-year citation counts, last 5 years)
- References (count): 31 (number of works referenced by this work)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
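The full record behind these fields can be retrieved directly from the public OpenAlex API using the work ID above. A minimal sketch with Python's requests library follows; the /works/{id} endpoint and the fields read from the response match OpenAlex's documented schema, while the contact email in the User-Agent header is a placeholder you would replace with your own.

```python
# Fetch this work's full OpenAlex record; the endpoint is public and needs no
# API key (OpenAlex asks that a contact email be included for its polite pool).
import requests

WORK_ID = "W3172444685"
resp = requests.get(
    f"https://api.openalex.org/works/{WORK_ID}",
    headers={"User-Agent": "article-metadata-script (mailto:you@example.com)"},
    timeout=30,
)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])       # title
print(work["publication_date"])   # 2021-06-02
print(work["cited_by_count"])     # citation count at query time
```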
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W3172444685 |
| doi | https://doi.org/10.48550/arxiv.2106.01149 |
| ids.doi | https://doi.org/10.48550/arxiv.2106.01149 |
| ids.mag | 3172444685 |
| ids.openalex | https://openalex.org/W3172444685 |
| fwci | |
| type | preprint |
| title | Exploring modality-agnostic representations for music classification |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11309 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 1.0 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Music and Audio Processing |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9925000071525574 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T10201 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9912999868392944 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Speech Recognition and Synthesis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2780226545 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8031980991363525 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6888030 |
| concepts[0].display_name | Modality (human–computer interaction) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.4829553961753845 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C15744967 |
| concepts[2].level | 0 |
| concepts[2].score | 0.36694014072418213 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[2].display_name | Psychology |
| concepts[3].id | https://openalex.org/C204321447 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3405355215072632 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[3].display_name | Natural language processing |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.32367485761642456 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| keywords[0].id | https://openalex.org/keywords/modality |
| keywords[0].score | 0.8031980991363525 |
| keywords[0].display_name | Modality (human–computer interaction) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.4829553961753845 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/psychology |
| keywords[2].score | 0.36694014072418213 |
| keywords[2].display_name | Psychology |
| keywords[3].id | https://openalex.org/keywords/natural-language-processing |
| keywords[3].score | 0.3405355215072632 |
| keywords[3].display_name | Natural language processing |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.32367485761642456 |
| keywords[4].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2106.01149 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2106.01149 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2106.01149 |
| locations[1].id | doi:10.48550/arxiv.2106.01149 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2106.01149 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5035643647 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1102-074X |
| authorships[0].author.display_name | Ho-Hsiang Wu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ho-Hsiang Wu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5021235229 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4506-6639 |
| authorships[1].author.display_name | Magdalena Fuentes |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Magdalena Fuentes |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5031398497 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-8561-5204 |
| authorships[2].author.display_name | Juan Pablo Bello |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Juan Pablo Bello |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2106.01149 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Exploring modality-agnostic representations for music classification |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11309 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 1.0 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Music and Audio Processing |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2385859805, https://openalex.org/W2530972254, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W3204019825 |
| cited_by_count | 4 |
| counts_by_year[0].year | 2024 |
| counts_by_year[0].cited_by_count | 1 |
| counts_by_year[1].year | 2023 |
| counts_by_year[1].cited_by_count | 3 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2106.01149 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2106.01149 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2106.01149 |
| primary_location.id | pmh:oai:arXiv.org:2106.01149 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2106.01149 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2106.01149 |
| publication_date | 2021-06-02 |
| publication_year | 2021 |
| referenced_works | https://openalex.org/W2619383789, https://openalex.org/W2963115079, https://openalex.org/W2890267272, https://openalex.org/W2526050071, https://openalex.org/W2990325209, https://openalex.org/W2963350250, https://openalex.org/W2990796920, https://openalex.org/W3162583214, https://openalex.org/W2157364932, https://openalex.org/W2593116425, https://openalex.org/W2890559714, https://openalex.org/W2990245503, https://openalex.org/W2619329613, https://openalex.org/W2996266053, https://openalex.org/W272962585, https://openalex.org/W2108598243, https://openalex.org/W2146104196, https://openalex.org/W2619697695, https://openalex.org/W2890913619, https://openalex.org/W2906289885, https://openalex.org/W2194775991, https://openalex.org/W2962835968, https://openalex.org/W2990387939, https://openalex.org/W3103014337, https://openalex.org/W2152790380, https://openalex.org/W1931639407, https://openalex.org/W2963988212, https://openalex.org/W1514535095, https://openalex.org/W2842511635, https://openalex.org/W2939574508, https://openalex.org/W2138621090 |
| referenced_works_count | 31 |
| abstract_inverted_index | (word-by-position index of the abstract shown above; omitted here for brevity) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.7599999904632568 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
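OpenAlex stores abstracts in the abstract_inverted_index field noted in the payload, which maps each word to the positions where it occurs; rebuilding the plain text is a small exercise. The sketch below is illustrative: the helper name is ours, and the real index would come from the record fetched via the API call shown earlier.

```python
# Rebuild a plain-text abstract from an OpenAlex abstract_inverted_index,
# which maps each word to the list of positions where it occurs.
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# Tiny hand-made example (the real index has hundreds of entries):
tiny_index = {"Music": [0], "information": [1], "is": [2], "often": [3], "conveyed": [4]}
print(reconstruct_abstract(tiny_index))   # -> "Music information is often conveyed"
```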