Compositional Audio Representation Learning
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2409.09619
Human auditory perception is compositional in nature -- we identify auditory streams from auditory scenes with multiple sound events. However, such auditory scenes are typically represented using clip-level representations that do not disentangle the constituent sound sources. In this work, we learn source-centric audio representations where each sound source is represented using a distinct, disentangled source embedding in the audio representation. We propose two novel approaches to learning source-centric audio representations: a supervised model guided by classification and an unsupervised model guided by feature reconstruction, both of which outperform the baselines. We thoroughly evaluate the design choices of both approaches using an audio classification task. We find that supervision is beneficial to learn source-centric representations, and that reconstructing audio features is more useful than reconstructing spectrograms to learn unsupervised source-centric representations. Leveraging source-centric models can help unlock the potential of greater interpretability and more flexible decoding in machine listening.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2409.09619, https://arxiv.org/pdf/2409.09619
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403668229
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403668229 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2409.09619 (Digital Object Identifier)
- Title: Compositional Audio Representation Learning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-09-15
- Authors: Sripathi Sridhar, Mark Cartwright (in order)
- Landing page: https://arxiv.org/abs/2409.09619 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2409.09619 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2409.09619 (direct OA link)
- Concepts: Representation (politics), Computer science, Natural language processing, Political science, Law, Politics (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403668229 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2409.09619 |
| ids.doi | https://doi.org/10.48550/arxiv.2409.09619 |
| ids.openalex | https://openalex.org/W4403668229 |
| fwci | |
| type | preprint |
| title | Compositional Audio Representation Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11309 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9986000061035156 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Music and Audio Processing |
| topics[1].id | https://openalex.org/T10201 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9575999975204468 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Speech Recognition and Synthesis |
| topics[2].id | https://openalex.org/T10860 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9509000182151794 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1711 |
| topics[2].subfield.display_name | Signal Processing |
| topics[2].display_name | Speech and Audio Processing |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776359362 |
| concepts[0].level | 3 |
| concepts[0].score | 0.6567932963371277 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q2145286 |
| concepts[0].display_name | Representation (politics) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.493990957736969 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3369273543357849 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C17744445 |
| concepts[3].level | 0 |
| concepts[3].score | 0.09465962648391724 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q36442 |
| concepts[3].display_name | Political science |
| concepts[4].id | https://openalex.org/C199539241 |
| concepts[4].level | 1 |
| concepts[4].score | 0.0 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q7748 |
| concepts[4].display_name | Law |
| concepts[5].id | https://openalex.org/C94625758 |
| concepts[5].level | 2 |
| concepts[5].score | 0.0 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q7163 |
| concepts[5].display_name | Politics |
| keywords[0].id | https://openalex.org/keywords/representation |
| keywords[0].score | 0.6567932963371277 |
| keywords[0].display_name | Representation (politics) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.493990957736969 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.3369273543357849 |
| keywords[2].display_name | Natural language processing |
| keywords[3].id | https://openalex.org/keywords/political-science |
| keywords[3].score | 0.09465962648391724 |
| keywords[3].display_name | Political science |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2409.09619 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by-sa |
| locations[0].pdf_url | https://arxiv.org/pdf/2409.09619 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by-sa |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2409.09619 |
| locations[1].id | doi:10.48550/arxiv.2409.09619 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2409.09619 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5011188265 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-9761-3564 |
| authorships[0].author.display_name | Sripathi Sridhar |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Sridhar, Sripathi |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5056532548 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5908-390X |
| authorships[1].author.display_name | Mark Cartwright |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Cartwright, Mark |
| authorships[1].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2409.09619 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Compositional Audio Representation Learning |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11309 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9986000061035156 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Music and Audio Processing |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2409.09619 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by-sa |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2409.09619 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by-sa |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2409.09619 |
| primary_location.id | pmh:oai:arXiv.org:2409.09619 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by-sa |
| primary_location.pdf_url | https://arxiv.org/pdf/2409.09619 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by-sa |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2409.09619 |
| publication_date | 2024-09-15 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 52, 71 |
| abstract_inverted_index.-- | 7 |
| abstract_inverted_index.In | 37 |
| abstract_inverted_index.We | 61, 91, 105 |
| abstract_inverted_index.an | 78, 101 |
| abstract_inverted_index.by | 75, 82 |
| abstract_inverted_index.do | 30 |
| abstract_inverted_index.in | 5, 57, 146 |
| abstract_inverted_index.is | 3, 49, 109, 120 |
| abstract_inverted_index.of | 86, 97, 139 |
| abstract_inverted_index.to | 66, 111, 126 |
| abstract_inverted_index.we | 8, 40 |
| abstract_inverted_index.and | 77, 115, 142 |
| abstract_inverted_index.are | 23 |
| abstract_inverted_index.can | 134 |
| abstract_inverted_index.not | 31 |
| abstract_inverted_index.the | 33, 58, 89, 94, 137 |
| abstract_inverted_index.two | 63 |
| abstract_inverted_index.both | 85, 98 |
| abstract_inverted_index.each | 46 |
| abstract_inverted_index.find | 106 |
| abstract_inverted_index.from | 12 |
| abstract_inverted_index.help | 135 |
| abstract_inverted_index.more | 121, 143 |
| abstract_inverted_index.such | 20 |
| abstract_inverted_index.than | 123 |
| abstract_inverted_index.that | 29, 107, 116 |
| abstract_inverted_index.this | 38 |
| abstract_inverted_index.with | 15 |
| abstract_inverted_index.Human | 0 |
| abstract_inverted_index.audio | 43, 59, 69, 102, 118 |
| abstract_inverted_index.learn | 41, 112, 127 |
| abstract_inverted_index.model | 73, 80 |
| abstract_inverted_index.novel | 64 |
| abstract_inverted_index.sound | 17, 35, 47 |
| abstract_inverted_index.task. | 104 |
| abstract_inverted_index.using | 26, 51, 100 |
| abstract_inverted_index.where | 45 |
| abstract_inverted_index.which | 87 |
| abstract_inverted_index.work, | 39 |
| abstract_inverted_index.design | 95 |
| abstract_inverted_index.guided | 74, 81 |
| abstract_inverted_index.models | 133 |
| abstract_inverted_index.nature | 6 |
| abstract_inverted_index.scenes | 14, 22 |
| abstract_inverted_index.source | 48, 55 |
| abstract_inverted_index.unlock | 136 |
| abstract_inverted_index.useful | 122 |
| abstract_inverted_index.choices | 96 |
| abstract_inverted_index.events. | 18 |
| abstract_inverted_index.feature | 83 |
| abstract_inverted_index.greater | 140 |
| abstract_inverted_index.machine | 147 |
| abstract_inverted_index.propose | 62 |
| abstract_inverted_index.streams | 11 |
| abstract_inverted_index.However, | 19 |
| abstract_inverted_index.auditory | 1, 10, 13, 21 |
| abstract_inverted_index.decoding | 145 |
| abstract_inverted_index.evaluate | 93 |
| abstract_inverted_index.features | 119 |
| abstract_inverted_index.flexible | 144 |
| abstract_inverted_index.identify | 9 |
| abstract_inverted_index.learning | 67 |
| abstract_inverted_index.multiple | 16 |
| abstract_inverted_index.sources. | 36 |
| abstract_inverted_index.distinct, | 53 |
| abstract_inverted_index.embedding | 56 |
| abstract_inverted_index.potential | 138 |
| abstract_inverted_index.typically | 24 |
| abstract_inverted_index.Leveraging | 131 |
| abstract_inverted_index.approaches | 65, 99 |
| abstract_inverted_index.baselines. | 90 |
| abstract_inverted_index.beneficial | 110 |
| abstract_inverted_index.clip-level | 27 |
| abstract_inverted_index.listening. | 148 |
| abstract_inverted_index.outperform | 88 |
| abstract_inverted_index.perception | 2 |
| abstract_inverted_index.supervised | 72 |
| abstract_inverted_index.thoroughly | 92 |
| abstract_inverted_index.constituent | 34 |
| abstract_inverted_index.disentangle | 32 |
| abstract_inverted_index.represented | 25, 50 |
| abstract_inverted_index.supervision | 108 |
| abstract_inverted_index.disentangled | 54 |
| abstract_inverted_index.spectrograms | 125 |
| abstract_inverted_index.unsupervised | 79, 128 |
| abstract_inverted_index.compositional | 4 |
| abstract_inverted_index.classification | 76, 103 |
| abstract_inverted_index.reconstructing | 117, 124 |
| abstract_inverted_index.source-centric | 42, 68, 113, 129, 132 |
| abstract_inverted_index.reconstruction, | 84 |
| abstract_inverted_index.representation. | 60 |
| abstract_inverted_index.representations | 28, 44 |
| abstract_inverted_index.interpretability | 141 |
| abstract_inverted_index.representations, | 114 |
| abstract_inverted_index.representations. | 130 |
| abstract_inverted_index.representations: | 70 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile | |
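
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. The plain text can be recovered by placing every token at its positions and joining them in order. The following is a minimal sketch of that idea, not an official OpenAlex client: the function name `rebuild_abstract` and the variable `sample_index` are hypothetical, and `sample_index` holds only the entries for positions 0 through 18, copied from the rows above.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    # Invert the map: position -> token.
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Join the tokens in ascending position order.
    return " ".join(by_position[p] for p in sorted(by_position))


# Subset of the payload rows above (positions 0..18 only, an illustrative cut).
sample_index = {
    "Human": [0],
    "auditory": [1, 10, 13],
    "perception": [2],
    "is": [3],
    "compositional": [4],
    "in": [5],
    "nature": [6],
    "--": [7],
    "we": [8],
    "identify": [9],
    "streams": [11],
    "from": [12],
    "scenes": [14],
    "with": [15],
    "multiple": [16],
    "sound": [17],
    "events.": [18],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample_index))
```

Run on this subset, the sketch should reproduce the first sentence of the abstract. Passing the complete set of rows would yield the full text, assuming no positions are missing from the index.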