BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2411.14100
Spoken term detection (STD) is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching, limiting its practicality. To address these challenges, we propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens. This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms. Our approach focuses on generating consistent token sequences across varying utterances of the same term. We also propose a bidirectional state space modeling within the Mamba encoder, trained in a self-supervised learning framework, to learn contextual frame-level features that are further encoded into discrete tokens. Our analysis shows that our speech tokens exhibit greater speaker invariance than those from existing tokenizers, making them more suitable for STD tasks. Empirical evaluation on LibriSpeech and TIMIT databases indicates that our method outperforms existing STD baselines while being more efficient.
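To illustrate why discrete tokens allow fast, text-style retrieval, here is a minimal sketch of an n-gram inverted index over token sequences. It is not the authors' retrieval pipeline: the token ids, the utterance names, and the helpers `ngrams`, `build_index`, and `search` are all hypothetical, and the n-gram overlap score is only one simple choice among many.

```python
from collections import defaultdict


def ngrams(tokens, n=3):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(utterances, n=3):
    """Map each token n-gram to the set of utterance ids that contain it."""
    index = defaultdict(set)
    for utt_id, tokens in utterances.items():
        for gram in ngrams(tokens, n):
            index[gram].add(utt_id)
    return index


def search(index, query_tokens, n=3):
    """Rank utterances by the fraction of distinct query n-grams they contain."""
    grams = set(ngrams(query_tokens, n))
    if not grams:
        return []  # query shorter than n tokens: nothing to look up
    hits = defaultdict(int)
    for gram in grams:
        for utt_id in index.get(gram, ()):
            hits[utt_id] += 1
    scored = [(utt_id, count / len(grams)) for utt_id, count in hits.items()]
    # Best score first; ties broken by utterance id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical token ids; a real system would obtain them from a speech tokenizer.
utterances = {
    "utt_a": [7, 7, 12, 3, 44, 9, 9, 21],
    "utt_b": [5, 12, 3, 44, 18, 2],
    "utt_c": [30, 31, 32, 33],
}
index = build_index(utterances)
results = search(index, [12, 3, 44, 9])
```

Each query costs a handful of dictionary lookups rather than a frame-by-frame alignment against every stored utterance, which is the efficiency argument the abstract makes against DTW-based template matching. The sketch only works if different utterances of the same term yield near-identical token sequences, which is why the abstract stresses token consistency and speaker invariance.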
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.14100, https://arxiv.org/pdf/2411.14100
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4404652703
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4404652703
- DOI: https://doi.org/10.48550/arxiv.2411.14100
- Title: BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection
- Type: preprint
- Language: en
- Publication year: 2024
- Publication date: 2024-11-21
- Authors: Anup K. Singh, Kris Demuynck, Vipul Arora
- Landing page: https://arxiv.org/abs/2411.14100
- PDF URL: https://arxiv.org/pdf/2411.14100
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2411.14100
- Concepts: Term (time), Lexical analysis, Speech recognition, Computer science, Natural language processing, Physics, Quantum mechanics
- Cited by: 0
- Related works (count): 10
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4404652703 |
| doi | https://doi.org/10.48550/arxiv.2411.14100 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.14100 |
| ids.openalex | https://openalex.org/W4404652703 |
| fwci | |
| type | preprint |
| title | BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10201 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9945999979972839 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Speech Recognition and Synthesis |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9843999743461609 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T10403 |
| topics[2].field.id | https://openalex.org/fields/32 |
| topics[2].field.display_name | Psychology |
| topics[2].score | 0.9771000146865845 |
| topics[2].domain.id | https://openalex.org/domains/2 |
| topics[2].domain.display_name | Social Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/3205 |
| topics[2].subfield.display_name | Experimental and Cognitive Psychology |
| topics[2].display_name | Phonetics and Phonology Research |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C61797465 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7382351160049438 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1188986 |
| concepts[0].display_name | Term (time) |
| concepts[1].id | https://openalex.org/C176982825 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6860363483428955 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q835922 |
| concepts[1].display_name | Lexical analysis |
| concepts[2].id | https://openalex.org/C28490314 |
| concepts[2].level | 1 |
| concepts[2].score | 0.6246493458747864 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[2].display_name | Speech recognition |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.6095845103263855 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C204321447 |
| concepts[4].level | 1 |
| concepts[4].score | 0.30111372470855713 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[4].display_name | Natural language processing |
| concepts[5].id | https://openalex.org/C121332964 |
| concepts[5].level | 0 |
| concepts[5].score | 0.06803804636001587 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[5].display_name | Physics |
| concepts[6].id | https://openalex.org/C62520636 |
| concepts[6].level | 1 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[6].display_name | Quantum mechanics |
| keywords[0].id | https://openalex.org/keywords/term |
| keywords[0].score | 0.7382351160049438 |
| keywords[0].display_name | Term (time) |
| keywords[1].id | https://openalex.org/keywords/lexical-analysis |
| keywords[1].score | 0.6860363483428955 |
| keywords[1].display_name | Lexical analysis |
| keywords[2].id | https://openalex.org/keywords/speech-recognition |
| keywords[2].score | 0.6246493458747864 |
| keywords[2].display_name | Speech recognition |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.6095845103263855 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/natural-language-processing |
| keywords[4].score | 0.30111372470855713 |
| keywords[4].display_name | Natural language processing |
| keywords[5].id | https://openalex.org/keywords/physics |
| keywords[5].score | 0.06803804636001587 |
| keywords[5].display_name | Physics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.14100 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.14100 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.14100 |
| locations[1].id | doi:10.48550/arxiv.2411.14100 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.14100 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101680403 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3653-1618 |
| authorships[0].author.display_name | Anup K. Singh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Singh, Anup |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5046536366 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-8525-7160 |
| authorships[1].author.display_name | Kris Demuynck |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Demuynck, Kris |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5011121139 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-1207-1258 |
| authorships[2].author.display_name | Vipul Arora |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Arora, Vipul |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.14100 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-11-24T00:00:00 |
| display_name | BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10201 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9945999979972839 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Speech Recognition and Synthesis |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W4300598845, https://openalex.org/W2601638452, https://openalex.org/W2285263069, https://openalex.org/W4376107815, https://openalex.org/W4319309671, https://openalex.org/W4319309603, https://openalex.org/W1599985958 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.14100 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.14100 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.14100 |
| primary_location.id | pmh:oai:arXiv.org:2411.14100 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.14100 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.14100 |
| publication_date | 2024-11-21 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 28, 70, 81 |
| abstract_inverted_index.To | 22 |
| abstract_inverted_index.We | 67 |
| abstract_inverted_index.by | 7 |
| abstract_inverted_index.in | 80 |
| abstract_inverted_index.is | 4 |
| abstract_inverted_index.of | 63 |
| abstract_inverted_index.on | 9, 55, 122 |
| abstract_inverted_index.to | 85 |
| abstract_inverted_index.we | 26 |
| abstract_inverted_index.Our | 52, 97 |
| abstract_inverted_index.STD | 118, 133 |
| abstract_inverted_index.and | 12, 47, 124 |
| abstract_inverted_index.are | 91 |
| abstract_inverted_index.for | 117 |
| abstract_inverted_index.its | 20 |
| abstract_inverted_index.our | 101, 129 |
| abstract_inverted_index.the | 13, 64, 76 |
| abstract_inverted_index.This | 39 |
| abstract_inverted_index.also | 68 |
| abstract_inverted_index.fast | 41 |
| abstract_inverted_index.from | 110 |
| abstract_inverted_index.into | 34, 94 |
| abstract_inverted_index.more | 115, 137 |
| abstract_inverted_index.same | 65 |
| abstract_inverted_index.term | 1 |
| abstract_inverted_index.than | 108 |
| abstract_inverted_index.that | 31, 90, 100, 128 |
| abstract_inverted_index.them | 114 |
| abstract_inverted_index.(STD) | 3 |
| abstract_inverted_index.Mamba | 77 |
| abstract_inverted_index.TIMIT | 125 |
| abstract_inverted_index.being | 136 |
| abstract_inverted_index.learn | 86 |
| abstract_inverted_index.novel | 29 |
| abstract_inverted_index.often | 5 |
| abstract_inverted_index.shows | 99 |
| abstract_inverted_index.space | 73 |
| abstract_inverted_index.state | 72 |
| abstract_inverted_index.term. | 66 |
| abstract_inverted_index.these | 24 |
| abstract_inverted_index.those | 109 |
| abstract_inverted_index.token | 58 |
| abstract_inverted_index.using | 43 |
| abstract_inverted_index.while | 135 |
| abstract_inverted_index.Spoken | 0 |
| abstract_inverted_index.across | 60 |
| abstract_inverted_index.making | 113 |
| abstract_inverted_index.method | 130 |
| abstract_inverted_index.search | 45 |
| abstract_inverted_index.speech | 33, 102 |
| abstract_inverted_index.tasks. | 119 |
| abstract_inverted_index.terms. | 51 |
| abstract_inverted_index.tokens | 103 |
| abstract_inverted_index.within | 75 |
| abstract_inverted_index.address | 23 |
| abstract_inverted_index.encoded | 93 |
| abstract_inverted_index.encodes | 32 |
| abstract_inverted_index.exhibit | 104 |
| abstract_inverted_index.focuses | 54 |
| abstract_inverted_index.further | 92 |
| abstract_inverted_index.greater | 105 |
| abstract_inverted_index.handles | 49 |
| abstract_inverted_index.propose | 27, 69 |
| abstract_inverted_index.speaker | 106 |
| abstract_inverted_index.tokens. | 38, 96 |
| abstract_inverted_index.trained | 79 |
| abstract_inverted_index.varying | 61 |
| abstract_inverted_index.analysis | 98 |
| abstract_inverted_index.approach | 30, 53 |
| abstract_inverted_index.discrete | 95 |
| abstract_inverted_index.encoder, | 78 |
| abstract_inverted_index.existing | 111, 132 |
| abstract_inverted_index.features | 11, 89 |
| abstract_inverted_index.hindered | 6 |
| abstract_inverted_index.learning | 83 |
| abstract_inverted_index.limiting | 19 |
| abstract_inverted_index.modeling | 74 |
| abstract_inverted_index.reliance | 8 |
| abstract_inverted_index.semantic | 37 |
| abstract_inverted_index.suitable | 116 |
| abstract_inverted_index.template | 17 |
| abstract_inverted_index.DTW-based | 16 |
| abstract_inverted_index.Empirical | 120 |
| abstract_inverted_index.baselines | 134 |
| abstract_inverted_index.databases | 126 |
| abstract_inverted_index.detection | 2 |
| abstract_inverted_index.discrete, | 35 |
| abstract_inverted_index.indicates | 127 |
| abstract_inverted_index.intensive | 15 |
| abstract_inverted_index.matching, | 18 |
| abstract_inverted_index.retrieval | 42 |
| abstract_inverted_index.sequences | 59 |
| abstract_inverted_index.algorithms | 46 |
| abstract_inverted_index.consistent | 57 |
| abstract_inverted_index.contextual | 87 |
| abstract_inverted_index.efficient. | 138 |
| abstract_inverted_index.evaluation | 121 |
| abstract_inverted_index.framework, | 84 |
| abstract_inverted_index.generating | 56 |
| abstract_inverted_index.invariance | 107 |
| abstract_inverted_index.text-based | 44 |
| abstract_inverted_index.utterances | 62 |
| abstract_inverted_index.LibriSpeech | 123 |
| abstract_inverted_index.challenges, | 25 |
| abstract_inverted_index.effectively | 48 |
| abstract_inverted_index.facilitates | 40 |
| abstract_inverted_index.frame-level | 10, 88 |
| abstract_inverted_index.outperforms | 131 |
| abstract_inverted_index.tokenizers, | 112 |
| abstract_inverted_index.bidirectional | 71 |
| abstract_inverted_index.practicality. | 21 |
| abstract_inverted_index.computationally | 14 |
| abstract_inverted_index.self-supervised | 82 |
| abstract_inverted_index.speaker-agnostic | 36 |
| abstract_inverted_index.out-of-vocabulary | 50 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
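
The `abstract_inverted_index.*` rows in the table above store the abstract as a mapping from each word to the positions where it occurs. A minimal sketch of turning such a mapping back into running text follows; the helper name `rebuild_abstract` is hypothetical, and `excerpt` holds only the first eight words, copied from the rows above, not the full index.

```python
def rebuild_abstract(inverted_index):
    """Place every word at each of its positions, then join in position order."""
    positioned = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned[pos] = word
    return " ".join(positioned[pos] for pos in sorted(positioned))


# Excerpt of the index above (positions 0 to 7 only).
excerpt = {
    "Spoken": [0], "term": [1], "detection": [2], "(STD)": [3],
    "is": [4], "often": [5], "hindered": [6], "by": [7],
}
opening = rebuild_abstract(excerpt)
```

Punctuation stays attached to its word in the index (for example `(STD)` and `term.`), so joining on single spaces is enough to recover the sentence text.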