Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2505.19938
Audio-visual zero-shot learning (ZSL) has been extensively researched for its capability to classify video data from classes unseen during training. Nevertheless, current methodologies often struggle with background scene biases and inadequate motion detail. This paper proposes a novel dual-stream Multi-Timescale Motion-Decoupled Spiking Transformer (MDST++), which decouples contextual semantic information from sparse dynamic motion information. A recurrent joint learning unit is proposed to extract contextual semantic information and capture joint knowledge across various modalities to understand the environment of actions. By converting RGB images to events, our method captures motion information more accurately and mitigates background scene biases. Moreover, we introduce a discrepancy analysis block to model audio motion information. To enhance the robustness of spiking neural networks (SNNs) in extracting temporal and motion cues, we dynamically adjust the threshold of Leaky Integrate-and-Fire (LIF) neurons based on global motion and contextual semantic information. Our experiments validate the effectiveness of MDST++, demonstrating its consistent superiority over state-of-the-art methods on mainstream benchmarks. Additionally, incorporating motion and multi-timescale information significantly improves HM and ZSL accuracy by 26.2% and 39.9%, respectively.
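The dynamic-threshold mechanism described in the abstract can be illustrated with a toy Leaky Integrate-and-Fire update. The following is a minimal, hypothetical sketch in PyTorch, not the authors' implementation; the decay `alpha`, baseline threshold `base_thresh`, modulation gain `beta`, and the scalar `mod` (standing in for the paper's motion- and semantics-derived signal) are illustrative assumptions:

```python
import torch

def lif_step(v, x, mod, alpha=0.9, base_thresh=1.0, beta=0.5):
    """One toy Leaky Integrate-and-Fire step with a modulated threshold.

    v   : membrane potential from the previous timestep
    x   : input current at this timestep
    mod : scalar in [0, 1] summarizing global motion / contextual semantic
          salience (hypothetical stand-in for the paper's modulation signal)
    """
    v = alpha * v + x                  # leaky integration of the input
    thresh = base_thresh + beta * mod  # threshold shifts with the global signal
    spike = (v >= thresh).float()      # emit a spike where the potential crosses it
    v = v * (1.0 - spike)              # hard reset for neurons that fired
    return v, spike

# Toy usage: 4 neurons over 10 timesteps with a fixed modulation value.
v = torch.zeros(4)
for t in range(10):
    x = torch.rand(4)
    v, s = lif_step(v, x, mod=0.3)
```

The point is only the shape of the rule: the firing threshold moves with a global signal rather than staying fixed, which is how the abstract describes adapting LIF neurons to motion and context.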
Related Topics: Neural Networks and Reservoir Computing · Speech and Audio Processing · Hearing Loss and Rehabilitation

- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.19938
- PDF: https://arxiv.org/pdf/2505.19938
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414587896
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414587896 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.19938 (Digital Object Identifier)
- Title: Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-05-26
- Authors: Wenrui Li, Penghong Wang, Xingtao Wang, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian (in listed order)
- Landing page: https://arxiv.org/abs/2505.19938 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.19938 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.19938 (direct OA link)
- Cited by: 0 (total citation count in OpenAlex)
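The fields above mirror the OpenAlex work record. As a minimal sketch, the same record can be retrieved from the public OpenAlex API (assuming network access; the endpoint and the two field names used here are standard OpenAlex, but the exact schema is subject to the API's current version):

```python
import json
import urllib.request

# Fetch the OpenAlex record for this work (public API, no key required).
# The work ID is taken from the metadata above.
url = "https://api.openalex.org/works/W4414587896"
with urllib.request.urlopen(url) as resp:
    work = json.load(resp)

print(work["display_name"])           # paper title
print(work["open_access"]["oa_url"])  # direct OA link, if any
```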
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414587896 |
| doi | https://doi.org/10.48550/arxiv.2505.19938 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.19938 |
| ids.openalex | https://openalex.org/W4414587896 |
| fwci | |
| type | preprint |
| title | Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T12611 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.993399977684021 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Neural Networks and Reservoir Computing |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9882000088691711 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T10283 |
| topics[2].field.id | https://openalex.org/fields/28 |
| topics[2].field.display_name | Neuroscience |
| topics[2].score | 0.9842000007629395 |
| topics[2].domain.id | https://openalex.org/domains/1 |
| topics[2].domain.display_name | Life Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2805 |
| topics[2].subfield.display_name | Cognitive Neuroscience |
| topics[2].display_name | Hearing Loss and Rehabilitation |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.19938 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.19938 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.19938 |
| locations[1].id | doi:10.48550/arxiv.2505.19938 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.19938 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100739626 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-0635-7919 |
| authorships[0].author.display_name | Wenrui Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Wenrui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5033038967 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5401-5676 |
| authorships[1].author.display_name | Penghong Wang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Wang, Penghong |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5022103187 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-5763-2493 |
| authorships[2].author.display_name | Xingtao Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, Xingtao |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100636655 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-3330-783X |
| authorships[3].author.display_name | Wangmeng Zuo |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zuo, Wangmeng |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5079412089 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-9660-3636 |
| authorships[4].author.display_name | Xiaopeng Fan |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Fan, Xiaopeng |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5023918894 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-2978-5935 |
| authorships[5].author.display_name | Yonghong Tian |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Tian, Yonghong |
| authorships[5].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.19938 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Multi-Timescale Motion-Decoupled Spiking Transformer for Audio-Visual Zero-Shot Learning |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T12611 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.993399977684021 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Neural Networks and Reservoir Computing |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.19938 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.19938 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.19938 |
| primary_location.id | pmh:oai:arXiv.org:2505.19938 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.19938 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.19938 |
| publication_date | 2025-05-26 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (token-to-position map duplicating the abstract given above; see the reconstruction sketch after this table) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile |
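OpenAlex ships the abstract as an `abstract_inverted_index`: a map from each token to the word positions where it appears (the rows summarized in the table above). A minimal sketch of reconstructing the running text from such an index:

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex abstract_inverted_index,
    which maps each token to the word positions where it occurs."""
    positions = {}
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = token
    # Join tokens in position order to recover the original word sequence.
    return " ".join(positions[i] for i in sorted(positions))

# Toy example in the same shape as the payload's index:
idx = {"Audio-visual": [0], "zero-shot": [1], "learning": [2]}
print(reconstruct_abstract(idx))  # "Audio-visual zero-shot learning"
# With the fetch sketch above: reconstruct_abstract(work["abstract_inverted_index"])
```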