Exploring modality-agnostic representations for music classification
Ho-Hsiang Wu, Magdalena Fuentes, Juan Pablo Bello · 2021 · Open Access
DOI: https://doi.org/10.48550/arxiv.2106.01149
Music information is often conveyed or recorded across multiple data modalities including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal works require multiple coexisting modalities given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input, and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case when there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best-performing system in a zero-shot setting. We provide a detailed analysis of experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
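The idea in the abstract, modality-specific encoders aligned into one shared space so that a single classifier can accept embeddings from either modality, can be illustrated compactly. The following is a minimal sketch, not the authors' exact model: it assumes a PyTorch setup, illustrative feature dimensions (512-d audio, 2048-d image), an assumed 12-class instrument taxonomy, and a generic cross-modal contrastive objective standing in for whatever retrieval loss the paper actually uses.

```python
# Minimal sketch (illustrative, not the paper's architecture): two encoders map
# audio and image features into a shared embedding space, a cross-modal
# contrastive loss aligns paired examples, and one classifier then works on
# embeddings regardless of which modality produced them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps one modality's features into the shared, L2-normalized embedding space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def cross_modal_contrastive_loss(z_audio, z_image, temperature=0.07):
    """Pull paired audio/image embeddings together, push mismatched pairs apart."""
    logits = z_audio @ z_image.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_audio.size(0))           # i-th audio matches i-th image
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Modality-agnostic classifier: it only ever sees shared-space embeddings.
audio_enc, image_enc = Encoder(in_dim=512), Encoder(in_dim=2048)
classifier = nn.Linear(128, 12)                        # assumed 12 instrument classes

# Pretext step: align the two modalities on paired (audio, image) data.
audio_feats, image_feats = torch.randn(8, 512), torch.randn(8, 2048)
loss = cross_modal_contrastive_loss(audio_enc(audio_feats), image_enc(image_feats))

# Downstream step: the same classifier accepts embeddings from either encoder,
# which is what allows training on one modality and classifying another.
logits_from_audio = classifier(audio_enc(audio_feats))
logits_from_image = classifier(image_enc(image_feats))
```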
Record details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2106.01149
- PDF: https://arxiv.org/pdf/2106.01149
- OA status: green
- Cited by: 4
- References: 31
- Related works: 10
- OpenAlex ID: https://openalex.org/W3172444685
OpenAlex record
- OpenAlex ID: https://openalex.org/W3172444685 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2106.01149 (Digital Object Identifier)
- Title: Exploring modality-agnostic representations for music classification (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2021 (year of publication)
- Publication date: 2021-06-02 (full publication date if available)
- Authors: Ho-Hsiang Wu, Magdalena Fuentes, Juan Pablo Bello (authors in order)
- Landing page: https://arxiv.org/abs/2106.01149 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2106.01149 (direct link to the full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2106.01149 (direct OA link when available)
- Concepts: Modality (human–computer interaction), Computer science, Psychology, Natural language processing, Artificial intelligence (top fields/topics attached by OpenAlex)
- Cited by: 4 (total citation count in OpenAlex)
- Citations by year (recent): 2024: 1, 2023: 3 (per-year citation counts, last 5 years)
- References (count): 31 (number of works referenced by this work)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
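The full record behind these fields can be retrieved directly from the public OpenAlex API using the work ID above. A minimal sketch with Python's requests library follows; the /works/{id} endpoint and the fields read from the response match OpenAlex's documented schema, while the contact email in the User-Agent header is a placeholder you would replace with your own.

```python
# Fetch this work's full OpenAlex record; the endpoint is public and needs no
# API key (OpenAlex asks that a contact email be included for its polite pool).
import requests

WORK_ID = "W3172444685"
resp = requests.get(
    f"https://api.openalex.org/works/{WORK_ID}",
    headers={"User-Agent": "article-metadata-script (mailto:you@example.com)"},
    timeout=30,
)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])       # title
print(work["publication_date"])   # 2021-06-02
print(work["cited_by_count"])     # citation count at query time
```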
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W3172444685 |
| doi | https://doi.org/10.48550/arxiv.2106.01149 |
| ids.doi | https://doi.org/10.48550/arxiv.2106.01149 |
| ids.mag | 3172444685 |
| ids.openalex | https://openalex.org/W3172444685 |
| fwci | |
| type | preprint |
| title | Exploring modality-agnostic representations for music classification |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11309 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 1.0 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Music and Audio Processing |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9925000071525574 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T10201 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9912999868392944 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Speech Recognition and Synthesis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2780226545 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8031980991363525 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6888030 |
| concepts[0].display_name | Modality (human–computer interaction) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.4829553961753845 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C15744967 |
| concepts[2].level | 0 |
| concepts[2].score | 0.36694014072418213 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[2].display_name | Psychology |
| concepts[3].id | https://openalex.org/C204321447 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3405355215072632 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[3].display_name | Natural language processing |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.32367485761642456 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| keywords[0].id | https://openalex.org/keywords/modality |
| keywords[0].score | 0.8031980991363525 |
| keywords[0].display_name | Modality (human–computer interaction) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.4829553961753845 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/psychology |
| keywords[2].score | 0.36694014072418213 |
| keywords[2].display_name | Psychology |
| keywords[3].id | https://openalex.org/keywords/natural-language-processing |
| keywords[3].score | 0.3405355215072632 |
| keywords[3].display_name | Natural language processing |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.32367485761642456 |
| keywords[4].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2106.01149 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2106.01149 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2106.01149 |
| locations[1].id | doi:10.48550/arxiv.2106.01149 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2106.01149 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5035643647 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1102-074X |
| authorships[0].author.display_name | Ho-Hsiang Wu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ho-Hsiang Wu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5021235229 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4506-6639 |
| authorships[1].author.display_name | Magdalena Fuentes |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Magdalena Fuentes |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5031398497 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-8561-5204 |
| authorships[2].author.display_name | Juan Pablo Bello |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Juan Pablo Bello |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2106.01149 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Exploring modality-agnostic representations for music classification |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11309 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 1.0 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Music and Audio Processing |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2385859805, https://openalex.org/W2530972254, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W3204019825 |
| cited_by_count | 4 |
| counts_by_year[0].year | 2024 |
| counts_by_year[0].cited_by_count | 1 |
| counts_by_year[1].year | 2023 |
| counts_by_year[1].cited_by_count | 3 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2106.01149 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2106.01149 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2106.01149 |
| primary_location.id | pmh:oai:arXiv.org:2106.01149 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2106.01149 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2106.01149 |
| publication_date | 2021-06-02 |
| publication_year | 2021 |
| referenced_works | https://openalex.org/W2619383789, https://openalex.org/W2963115079, https://openalex.org/W2890267272, https://openalex.org/W2526050071, https://openalex.org/W2990325209, https://openalex.org/W2963350250, https://openalex.org/W2990796920, https://openalex.org/W3162583214, https://openalex.org/W2157364932, https://openalex.org/W2593116425, https://openalex.org/W2890559714, https://openalex.org/W2990245503, https://openalex.org/W2619329613, https://openalex.org/W2996266053, https://openalex.org/W272962585, https://openalex.org/W2108598243, https://openalex.org/W2146104196, https://openalex.org/W2619697695, https://openalex.org/W2890913619, https://openalex.org/W2906289885, https://openalex.org/W2194775991, https://openalex.org/W2962835968, https://openalex.org/W2990387939, https://openalex.org/W3103014337, https://openalex.org/W2152790380, https://openalex.org/W1931639407, https://openalex.org/W2963988212, https://openalex.org/W1514535095, https://openalex.org/W2842511635, https://openalex.org/W2939574508, https://openalex.org/W2138621090 |
| referenced_works_count | 31 |
| abstract_inverted_index | (word-by-position index of the abstract shown above; omitted here for brevity) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.7599999904632568 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
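OpenAlex stores abstracts in the abstract_inverted_index field noted in the payload, which maps each word to the positions where it occurs; rebuilding the plain text is a small exercise. The sketch below is illustrative: the helper name is ours, and the real index would come from the record fetched via the API call shown earlier.

```python
# Rebuild a plain-text abstract from an OpenAlex abstract_inverted_index,
# which maps each word to the list of positions where it occurs.
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# Tiny hand-made example (the real index has hundreds of entries):
tiny_index = {"Music": [0], "information": [1], "is": [2], "often": [3], "conveyed": [4]}
print(reconstruct_abstract(tiny_index))   # -> "Music information is often conveyed"
```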