How to Listen? Rethinking Visual Sound Localization
2022 · Open Access · DOI: https://doi.org/10.48550/arxiv.2204.05156
Localizing visual sounds consists of locating the position of objects that emit sound within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated on datasets that mostly contain a single dominant visible object, and proposed models usually require localization modules during training or dedicated sampling strategies, but it remains unclear how these design choices affect the adaptability of these methods to more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their components affect the model's performance, namely the encoders' architecture, the loss function, and the localization strategy. Furthermore, we study the interaction between these decisions, the model performance, and the data by examining evaluation datasets that span different difficulties and characteristics, and we discuss the implications of such decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2204.05156, https://arxiv.org/pdf/2204.05156
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4223603255
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4223603255 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2204.05156 (Digital Object Identifier)
- Title: How to Listen? Rethinking Visual Sound Localization (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2022 (year of publication)
- Publication date: 2022-04-11 (full publication date if available)
- Authors: Ho-Hsiang Wu, Magdalena Fuentes, Prem Seetharaman, Juan Pablo Bello (list of authors in order)
- Landing page: https://arxiv.org/abs/2204.05156 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2204.05156 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2204.05156 (direct OA link when available)
- Concepts: Computer science, Context (archaeology), Architecture, Adaptability, Human–computer interaction, Object (grammar), Artificial intelligence, Geography, Biology, Archaeology, Ecology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4223603255 |
| doi | https://doi.org/10.48550/arxiv.2204.05156 |
| ids.doi | https://doi.org/10.48550/arxiv.2204.05156 |
| ids.openalex | https://openalex.org/W4223603255 |
| fwci | |
| type | preprint |
| title | How to Listen? Rethinking Visual Sound Localization |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11309 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9994000196456909 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Music and Audio Processing |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9991000294685364 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T11349 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.972000002861023 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Music Technology and Sound Studies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7795695662498474 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C2779343474 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5479483604431152 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[1].display_name | Context (archaeology) |
| concepts[2].id | https://openalex.org/C123657996 |
| concepts[2].level | 2 |
| concepts[2].score | 0.48174571990966797 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q12271 |
| concepts[2].display_name | Architecture |
| concepts[3].id | https://openalex.org/C177606310 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4711933135986328 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q5674297 |
| concepts[3].display_name | Adaptability |
| concepts[4].id | https://openalex.org/C107457646 |
| concepts[4].level | 1 |
| concepts[4].score | 0.4565006196498871 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[4].display_name | Human–computer interaction |
| concepts[5].id | https://openalex.org/C2781238097 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4277232885360718 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q175026 |
| concepts[5].display_name | Object (grammar) |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.4153529107570648 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| concepts[7].id | https://openalex.org/C205649164 |
| concepts[7].level | 0 |
| concepts[7].score | 0.1408005654811859 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[7].display_name | Geography |
| concepts[8].id | https://openalex.org/C86803240 |
| concepts[8].level | 0 |
| concepts[8].score | 0.0 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[8].display_name | Biology |
| concepts[9].id | https://openalex.org/C166957645 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q23498 |
| concepts[9].display_name | Archaeology |
| concepts[10].id | https://openalex.org/C18903297 |
| concepts[10].level | 1 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q7150 |
| concepts[10].display_name | Ecology |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7795695662498474 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/context |
| keywords[1].score | 0.5479483604431152 |
| keywords[1].display_name | Context (archaeology) |
| keywords[2].id | https://openalex.org/keywords/architecture |
| keywords[2].score | 0.48174571990966797 |
| keywords[2].display_name | Architecture |
| keywords[3].id | https://openalex.org/keywords/adaptability |
| keywords[3].score | 0.4711933135986328 |
| keywords[3].display_name | Adaptability |
| keywords[4].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[4].score | 0.4565006196498871 |
| keywords[4].display_name | Human–computer interaction |
| keywords[5].id | https://openalex.org/keywords/object |
| keywords[5].score | 0.4277232885360718 |
| keywords[5].display_name | Object (grammar) |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.4153529107570648 |
| keywords[6].display_name | Artificial intelligence |
| keywords[7].id | https://openalex.org/keywords/geography |
| keywords[7].score | 0.1408005654811859 |
| keywords[7].display_name | Geography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2204.05156 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2204.05156 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2204.05156 |
| locations[1].id | doi:10.48550/arxiv.2204.05156 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2204.05156 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5035643647 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1102-074X |
| authorships[0].author.display_name | Ho-Hsiang Wu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wu, Ho-Hsiang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5021235229 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4506-6639 |
| authorships[1].author.display_name | Magdalena Fuentes |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Fuentes, Magdalena |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5023673004 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Prem Seetharaman |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Seetharaman, Prem |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5031398497 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-8561-5204 |
| authorships[3].author.display_name | Juan Pablo Bello |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Bello, Juan Pablo |
| authorships[3].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2204.05156 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | How to Listen? Rethinking Visual Sound Localization |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11309 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9994000196456909 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Music and Audio Processing |
| related_works | https://openalex.org/W2357124094, https://openalex.org/W2387399993, https://openalex.org/W2389739210, https://openalex.org/W2348924972, https://openalex.org/W2365736347, https://openalex.org/W2047454415, https://openalex.org/W2070040999, https://openalex.org/W2387293848, https://openalex.org/W2250140200, https://openalex.org/W3121791438 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2204.05156 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2204.05156 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2204.05156 |
| primary_location.id | pmh:oai:arXiv.org:2204.05156 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2204.05156 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2204.05156 |
| publication_date | 2022-04-11 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 18, 47, 77 |
| abstract_inverted_index.In | 89 |
| abstract_inverted_index.It | 16 |
| abstract_inverted_index.an | 14 |
| abstract_inverted_index.as | 32 |
| abstract_inverted_index.by | 136 |
| abstract_inverted_index.in | 25, 79, 85, 154 |
| abstract_inverted_index.is | 17 |
| abstract_inverted_index.it | 69 |
| abstract_inverted_index.of | 8, 59, 82, 151, 157 |
| abstract_inverted_index.on | 4 |
| abstract_inverted_index.or | 64 |
| abstract_inverted_index.we | 92, 123 |
| abstract_inverted_index.Our | 160 |
| abstract_inverted_index.and | 28, 35, 52, 101, 118, 133, 145, 147, 162, 167 |
| abstract_inverted_index.are | 40, 165 |
| abstract_inverted_index.but | 68 |
| abstract_inverted_index.for | 97, 170 |
| abstract_inverted_index.how | 72, 103 |
| abstract_inverted_index.the | 6, 57, 80, 108, 112, 115, 119, 125, 130, 134, 149, 155 |
| abstract_inverted_index.area | 21 |
| abstract_inverted_index.code | 161 |
| abstract_inverted_index.emit | 11 |
| abstract_inverted_index.into | 138 |
| abstract_inverted_index.loss | 116 |
| abstract_inverted_index.made | 168 |
| abstract_inverted_index.more | 86 |
| abstract_inverted_index.play | 76 |
| abstract_inverted_index.role | 78 |
| abstract_inverted_index.such | 31, 152 |
| abstract_inverted_index.that | 10 |
| abstract_inverted_index.this | 90 |
| abstract_inverted_index.with | 22, 43 |
| abstract_inverted_index.data, | 135 |
| abstract_inverted_index.model | 95, 131, 163 |
| abstract_inverted_index.sound | 12, 99 |
| abstract_inverted_index.study | 124 |
| abstract_inverted_index.their | 104 |
| abstract_inverted_index.these | 73, 83, 128 |
| abstract_inverted_index.urban | 29, 36 |
| abstract_inverted_index.work, | 91 |
| abstract_inverted_index.works | 39 |
| abstract_inverted_index.affect | 107 |
| abstract_inverted_index.design | 74 |
| abstract_inverted_index.during | 62 |
| abstract_inverted_index.having | 45 |
| abstract_inverted_index.image. | 15 |
| abstract_inverted_index.models | 54 |
| abstract_inverted_index.mostly | 46 |
| abstract_inverted_index.namely | 111 |
| abstract_inverted_index.single | 48 |
| abstract_inverted_index.sounds | 2 |
| abstract_inverted_index.visual | 1, 98 |
| abstract_inverted_index.within | 13 |
| abstract_inverted_index.analyze | 93 |
| abstract_inverted_index.between | 127 |
| abstract_inverted_index.choices | 75, 96 |
| abstract_inverted_index.context | 156 |
| abstract_inverted_index.digging | 137 |
| abstract_inverted_index.discuss | 102, 148 |
| abstract_inverted_index.further | 171 |
| abstract_inverted_index.growing | 19 |
| abstract_inverted_index.methods | 84 |
| abstract_inverted_index.model's | 109 |
| abstract_inverted_index.modules | 61 |
| abstract_inverted_index.natural | 27 |
| abstract_inverted_index.object, | 51 |
| abstract_inverted_index.objects | 9 |
| abstract_inverted_index.remains | 70 |
| abstract_inverted_index.require | 56 |
| abstract_inverted_index.unclear | 71 |
| abstract_inverted_index.usually | 41, 55 |
| abstract_inverted_index.various | 94 |
| abstract_inverted_index.visible | 50 |
| abstract_inverted_index.weights | 164 |
| abstract_inverted_index.Previous | 38 |
| abstract_inverted_index.consists | 3 |
| abstract_inverted_index.datasets | 44, 141 |
| abstract_inverted_index.dominant | 49 |
| abstract_inverted_index.function | 117 |
| abstract_inverted_index.locating | 5 |
| abstract_inverted_index.position | 7 |
| abstract_inverted_index.proposed | 53 |
| abstract_inverted_index.research | 20 |
| abstract_inverted_index.sampling | 66 |
| abstract_inverted_index.spanning | 142 |
| abstract_inverted_index.traffic. | 37 |
| abstract_inverted_index.training | 63 |
| abstract_inverted_index.wildlife | 33 |
| abstract_inverted_index.available | 169 |
| abstract_inverted_index.decisions | 153 |
| abstract_inverted_index.dedicated | 65 |
| abstract_inverted_index.different | 105, 139, 143 |
| abstract_inverted_index.encoders' | 113 |
| abstract_inverted_index.evaluated | 42 |
| abstract_inverted_index.migration | 34 |
| abstract_inverted_index.potential | 23 |
| abstract_inverted_index.strategy. | 121 |
| abstract_inverted_index.Localizing | 0 |
| abstract_inverted_index.components | 106 |
| abstract_inverted_index.decisions, | 129 |
| abstract_inverted_index.evaluation | 140 |
| abstract_inverted_index.monitoring | 26 |
| abstract_inverted_index.real-world | 158 |
| abstract_inverted_index.scenarios. | 88 |
| abstract_inverted_index.challenging | 87 |
| abstract_inverted_index.interaction | 126 |
| abstract_inverted_index.strategies, | 67 |
| abstract_inverted_index.Furthermore, | 122 |
| abstract_inverted_index.adaptability | 81 |
| abstract_inverted_index.applications | 24 |
| abstract_inverted_index.difficulties | 144 |
| abstract_inverted_index.implications | 150 |
| abstract_inverted_index.introduction | 58 |
| abstract_inverted_index.localization | 60, 100, 120 |
| abstract_inverted_index.open-sourced | 166 |
| abstract_inverted_index.performance, | 110, 132 |
| abstract_inverted_index.applications. | 159, 172 |
| abstract_inverted_index.architecture, | 114 |
| abstract_inverted_index.environments, | 30 |
| abstract_inverted_index.characteristics, | 146 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/11 |
| sustainable_development_goals[0].score | 0.8199999928474426 |
| sustainable_development_goals[0].display_name | Sustainable cities and communities |
| citation_normalized_percentile | |
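
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each word to the positions where it occurs. As a minimal sketch of how such an index could be turned back into running text, the Python below places each word at its positions and joins them in order. The function name `rebuild_abstract` is hypothetical (it is not part of any OpenAlex client), and the `excerpt` dictionary is a hand-copied subset of the rows above, limited to positions 16 through 21.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions map (hypothetical helper)."""
    # Invert the mapping: position -> word.
    words_by_position: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    # Emit the words in ascending position order, separated by spaces.
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


# Excerpt of the payload rows, restricted to positions 16-21 (assumption:
# the other positions listed for "a" are left out to keep the example short).
excerpt = {
    "It": [16],
    "is": [17],
    "a": [18],
    "growing": [19],
    "research": [20],
    "area": [21],
}

print(rebuild_abstract(excerpt))  # It is a growing research area
```

Passing the complete set of `abstract_inverted_index` rows instead of the excerpt should yield the full abstract text as indexed in the record, with punctuation attached to words exactly as it appears in the keys.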