Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
2023 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2307.13236
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods are limited by the small receptive field of convolution and by inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2307.13236, https://arxiv.org/pdf/2307.13236
- OA Status: green
- Cited By: 5
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4385292115
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4385292115 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2307.13236 (Digital Object Identifier)
- Title: Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023 (year of publication)
- Publication date: 2023-07-25 (full publication date if available)
- Authors: Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang (list of authors in order)
- Landing page: https://arxiv.org/abs/2307.13236 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2307.13236 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2307.13236 (direct OA link when available)
- Concepts: Computer science, Segmentation, Transformer, Audio visual, Artificial intelligence, Salient, Speech recognition, Computer vision, Multimedia, Quantum mechanics, Physics, Voltage (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 5 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2, 2024: 3 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4385292115 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2307.13236 |
| ids.doi | https://doi.org/10.48550/arxiv.2307.13236 |
| ids.openalex | https://openalex.org/W4385292115 |
| fwci | |
| type | preprint |
| title | Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10860 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9995999932289124 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Speech and Audio Processing |
| topics[1].id | https://openalex.org/T11309 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9991000294685364 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Music and Audio Processing |
| topics[2].id | https://openalex.org/T10283 |
| topics[2].field.id | https://openalex.org/fields/28 |
| topics[2].field.display_name | Neuroscience |
| topics[2].score | 0.9936000108718872 |
| topics[2].domain.id | https://openalex.org/domains/1 |
| topics[2].domain.display_name | Life Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2805 |
| topics[2].subfield.display_name | Cognitive Neuroscience |
| topics[2].display_name | Hearing Loss and Rehabilitation |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.8517049551010132 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C89600930 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6198641657829285 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q1423946 |
| concepts[1].display_name | Segmentation |
| concepts[2].id | https://openalex.org/C66322947 |
| concepts[2].level | 3 |
| concepts[2].score | 0.5651654005050659 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[2].display_name | Transformer |
| concepts[3].id | https://openalex.org/C3017588708 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5587079524993896 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q758901 |
| concepts[3].display_name | Audio visual |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.5071156024932861 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C2780719617 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4857045114040375 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1030752 |
| concepts[5].display_name | Salient |
| concepts[6].id | https://openalex.org/C28490314 |
| concepts[6].level | 1 |
| concepts[6].score | 0.45910289883613586 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[6].display_name | Speech recognition |
| concepts[7].id | https://openalex.org/C31972630 |
| concepts[7].level | 1 |
| concepts[7].score | 0.38088735938072205 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[7].display_name | Computer vision |
| concepts[8].id | https://openalex.org/C49774154 |
| concepts[8].level | 1 |
| concepts[8].score | 0.19375252723693848 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q131765 |
| concepts[8].display_name | Multimedia |
| concepts[9].id | https://openalex.org/C62520636 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[9].display_name | Quantum mechanics |
| concepts[10].id | https://openalex.org/C121332964 |
| concepts[10].level | 0 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[10].display_name | Physics |
| concepts[11].id | https://openalex.org/C165801399 |
| concepts[11].level | 2 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[11].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.8517049551010132 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/segmentation |
| keywords[1].score | 0.6198641657829285 |
| keywords[1].display_name | Segmentation |
| keywords[2].id | https://openalex.org/keywords/transformer |
| keywords[2].score | 0.5651654005050659 |
| keywords[2].display_name | Transformer |
| keywords[3].id | https://openalex.org/keywords/audio-visual |
| keywords[3].score | 0.5587079524993896 |
| keywords[3].display_name | Audio visual |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.5071156024932861 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/salient |
| keywords[5].score | 0.4857045114040375 |
| keywords[5].display_name | Salient |
| keywords[6].id | https://openalex.org/keywords/speech-recognition |
| keywords[6].score | 0.45910289883613586 |
| keywords[6].display_name | Speech recognition |
| keywords[7].id | https://openalex.org/keywords/computer-vision |
| keywords[7].score | 0.38088735938072205 |
| keywords[7].display_name | Computer vision |
| keywords[8].id | https://openalex.org/keywords/multimedia |
| keywords[8].score | 0.19375252723693848 |
| keywords[8].display_name | Multimedia |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2307.13236 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2307.13236 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2307.13236 |
| locations[1].id | doi:10.48550/arxiv.2307.13236 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2307.13236 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101682289 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-2583-8881 |
| authorships[0].author.display_name | Jinxiang Liu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Liu, Jinxiang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100774574 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8472-7677 |
| authorships[1].author.display_name | Chen Ju |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ju, Chen |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5102703207 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Chaofan Ma |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Ma, Chaofan |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100645705 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-3196-2347 |
| authorships[3].author.display_name | Yanfeng Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Yanfeng |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100445356 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2883-1087 |
| authorships[4].author.display_name | Yu Wang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Wang, Yu |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100342828 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5390-9053 |
| authorships[5].author.display_name | Ya Zhang |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Zhang, Ya |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2307.13236 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10860 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9995999932289124 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Speech and Audio Processing |
| related_works | https://openalex.org/W2329500892, https://openalex.org/W2271369634, https://openalex.org/W3147472394, https://openalex.org/W2047100085, https://openalex.org/W2350550760, https://openalex.org/W28991112, https://openalex.org/W578794879, https://openalex.org/W2370726991, https://openalex.org/W2625296515, https://openalex.org/W3137890128 |
| cited_by_count | 5 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 3 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2307.13236 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2307.13236 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2307.13236 |
| primary_location.id | pmh:oai:arXiv.org:2307.13236 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2307.13236 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2307.13236 |
| publication_date | 2023-07-25 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 49, 65 |
| abstract_inverted_index.To | 43 |
| abstract_inverted_index.an | 81 |
| abstract_inverted_index.in | 14, 124 |
| abstract_inverted_index.is | 8 |
| abstract_inverted_index.of | 2, 35, 40, 75, 95 |
| abstract_inverted_index.on | 92, 101 |
| abstract_inverted_index.to | 9, 30, 55 |
| abstract_inverted_index.we | 47, 79 |
| abstract_inverted_index.The | 0 |
| abstract_inverted_index.and | 37, 73, 119, 126 |
| abstract_inverted_index.due | 29 |
| abstract_inverted_index.our | 62, 114 |
| abstract_inverted_index.the | 3, 11, 15, 26, 31, 57, 89, 93, 96 |
| abstract_inverted_index.yet | 107 |
| abstract_inverted_index.deep | 71 |
| abstract_inverted_index.goal | 1 |
| abstract_inverted_index.have | 25 |
| abstract_inverted_index.show | 112 |
| abstract_inverted_index.task | 7 |
| abstract_inverted_index.that | 69, 86, 113 |
| abstract_inverted_index.(AVS) | 6 |
| abstract_inverted_index.audio | 19, 102 |
| abstract_inverted_index.based | 100 |
| abstract_inverted_index.cues. | 20 |
| abstract_inverted_index.field | 34 |
| abstract_inverted_index.focus | 91 |
| abstract_inverted_index.helps | 88 |
| abstract_inverted_index.model | 90 |
| abstract_inverted_index.novel | 50 |
| abstract_inverted_index.small | 32 |
| abstract_inverted_index.task. | 58 |
| abstract_inverted_index.these | 45 |
| abstract_inverted_index.using | 18 |
| abstract_inverted_index.video | 16 |
| abstract_inverted_index.while | 104 |
| abstract_inverted_index.(AuTR) | 54 |
| abstract_inverted_index.Unlike | 59 |
| abstract_inverted_index.better | 121 |
| abstract_inverted_index.devise | 80 |
| abstract_inverted_index.frames | 17 |
| abstract_inverted_index.fusion | 39, 72 |
| abstract_inverted_index.method | 115 |
| abstract_inverted_index.silent | 106 |
| abstract_inverted_index.tackle | 56 |
| abstract_inverted_index.ability | 123 |
| abstract_inverted_index.current | 22 |
| abstract_inverted_index.decoder | 85 |
| abstract_inverted_index.enables | 70 |
| abstract_inverted_index.issues, | 46 |
| abstract_inverted_index.methods | 24, 118 |
| abstract_inverted_index.objects | 13, 99 |
| abstract_inverted_index.propose | 48 |
| abstract_inverted_index.results | 111 |
| abstract_inverted_index.salient | 108 |
| abstract_inverted_index.segment | 10 |
| abstract_inverted_index.However, | 21 |
| abstract_inverted_index.approach | 63 |
| abstract_inverted_index.existing | 60 |
| abstract_inverted_index.methods, | 61 |
| abstract_inverted_index.objects. | 109 |
| abstract_inverted_index.open-set | 127 |
| abstract_inverted_index.overcome | 44 |
| abstract_inverted_index.previous | 117 |
| abstract_inverted_index.signals, | 103 |
| abstract_inverted_index.sounding | 12, 98 |
| abstract_inverted_index.features. | 42, 77 |
| abstract_inverted_index.receptive | 33 |
| abstract_inverted_index.explicitly | 87 |
| abstract_inverted_index.inadequate | 38 |
| abstract_inverted_index.introduces | 64 |
| abstract_inverted_index.multimodal | 66 |
| abstract_inverted_index.pinpointed | 97 |
| abstract_inverted_index.scenarios. | 128 |
| abstract_inverted_index.aggregation | 74 |
| abstract_inverted_index.audio-aware | 82 |
| abstract_inverted_index.convolution | 36 |
| abstract_inverted_index.limitations | 28 |
| abstract_inverted_index.multi-sound | 125 |
| abstract_inverted_index.outperforms | 116 |
| abstract_inverted_index.performance | 27 |
| abstract_inverted_index.transformer | 67, 84 |
| abstract_inverted_index.Experimental | 110 |
| abstract_inverted_index.Furthermore, | 78 |
| abstract_inverted_index.architecture | 68 |
| abstract_inverted_index.audio-visual | 4, 41, 76 |
| abstract_inverted_index.demonstrates | 120 |
| abstract_inverted_index.disregarding | 105 |
| abstract_inverted_index.fusion-based | 23 |
| abstract_inverted_index.segmentation | 5, 94 |
| abstract_inverted_index.generalization | 122 |
| abstract_inverted_index.query-enhanced | 52, 83 |
| abstract_inverted_index.\textbf{Au}dio-aware | 51 |
| abstract_inverted_index.\textbf{TR}ansformer | 53 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
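
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text is shown here, assuming the rows have already been loaded into a plain Python dict. The function name `rebuild_abstract` and the `sample` dict are illustrative, not part of any OpenAlex client; `sample` is a hand-copied subset covering only positions 0 through 8 of the rows above.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    # Invert the map: one token per word position.
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in position order; missing positions are simply skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Subset of the payload rows above, restricted to positions 0..8 (assumption:
# the full index would be passed in the same shape).
sample = {
    "The": [0],
    "goal": [1],
    "of": [2],
    "the": [3],
    "audio-visual": [4],
    "segmentation": [5],
    "(AVS)": [6],
    "task": [7],
    "is": [8],
}

if __name__ == "__main__":
    print(rebuild_abstract(sample))
```

Because tokens are stored verbatim, markup from the source abstract (for example the `\textbf{Au}dio-aware` token at position 51) comes back unchanged and would need separate cleanup.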