End-to-End Speaker-Dependent Voice Activity Detection

Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu

2020 · Open Access · DOI: https://doi.org/10.48550/arxiv.2009.09906

Voice activity detection (VAD) is an essential pre-processing step for tasks such as automatic speech recognition (ASR) and speaker recognition. Its basic goal is to remove silent segments from an audio recording, while a more general VAD system could remove all irrelevant segments, such as noise and even unwanted speech from non-target speakers. We define the task of detecting only the speech of the target speaker as speaker-dependent voice activity detection (SDVAD). This task is quite common in real applications and is usually implemented by performing speaker verification (SV) on audio segments extracted by VAD. In this paper, we propose an end-to-end neural-network-based approach to this problem that explicitly incorporates the speaker identity into the modeling process. Moreover, inference can be performed online, which leads to low system latency. Experiments are carried out on a conversational telephone dataset generated from the Switchboard corpus. Results show that our proposed online approach achieves significantly better performance than the usual VAD/SV system in terms of both frame accuracy and F-score. We also use our previously proposed segment-level metric for a more comprehensive analysis.
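The abstract reports results in terms of frame accuracy and F-score. As a rough illustration of how such frame-level scores are commonly computed (this is not the authors' evaluation code), the sketch below assumes binary per-frame labels, where 1 marks target-speaker speech and 0 marks everything else. The function name `frame_metrics` and the example label sequences are hypothetical.

```python
from typing import Dict, Sequence


def frame_metrics(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute frame accuracy, precision, recall and F-score for binary labels.

    Assumption: 1 = target-speaker speech, 0 = anything else
    (silence, noise, non-target speech).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one frame is required")

    # Count the four outcomes of a binary decision per frame.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    tn = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 0)

    accuracy = (tp + tn) / len(reference)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical eight-frame example, for illustration only.
reference = [0, 1, 1, 1, 0, 0, 1, 0]
predicted = [0, 1, 1, 0, 0, 1, 1, 0]
print(frame_metrics(reference, predicted))
```

In this toy example there are three true positives, one false positive, one false negative and three true negatives, so accuracy, precision, recall and F-score all come out to 0.75. Whether the paper weights precision and recall equally is not stated in the abstract; F1 is assumed here.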

Related Topics

Computer Science

Process (Computing)

Artificial Intelligence

Economics

Management

Concepts

Speech recognition, Computer science, Voice activity detection, Speaker diarisation, Speaker recognition, Latency (audio), Task (project management), Inference, Speech processing, Frame (networking), Process (computing), Artificial intelligence, Telecommunications, Economics, Operating system, Management

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2009.09906
PDF: https://arxiv.org/pdf/2009.09906
OA Status: green
Cited By: 2
References: 20
Related Works: 10
OpenAlex ID: https://openalex.org/W3087422378

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W3087422378
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2009.09906
Digital Object Identifier

Title: End-to-End Speaker-Dependent Voice Activity Detection
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2020
Year of publication

Publication date: 2020-09-21
Full publication date if available

Authors: Yefei Chen, Shuai Wang, Yanmin Qian, Kai Yu
List of authors in order

Landing page: https://arxiv.org/abs/2009.09906
Publisher landing page

PDF URL: https://arxiv.org/pdf/2009.09906
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2009.09906
Direct OA link when available

Concepts: Speech recognition, Computer science, Voice activity detection, Speaker diarisation, Speaker recognition, Latency (audio), Task (project management), Inference, Speech processing, Frame (networking), Process (computing), Artificial intelligence, Telecommunications, Economics, Operating system, Management
Top concepts (fields/topics) attached by OpenAlex

Cited by: 2
Total citation count in OpenAlex

Citations by year (recent): 2021: 2
Per-year citation counts (last 5 years)

References (count): 20
Number of works referenced by this work

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W3087422378
doi	https://doi.org/10.48550/arxiv.2009.09906
ids.doi	https://doi.org/10.48550/arxiv.2009.09906
ids.mag	3087422378
ids.openalex	https://openalex.org/W3087422378
fwci
type	preprint
title	End-to-End Speaker-Dependent Voice Activity Detection
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10201
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9998999834060669
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Speech Recognition and Synthesis
topics[1].id	https://openalex.org/T10860
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9998999834060669
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1711
topics[1].subfield.display_name	Signal Processing
topics[1].display_name	Speech and Audio Processing
topics[2].id	https://openalex.org/T11309
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.9980000257492065
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1711
topics[2].subfield.display_name	Signal Processing
topics[2].display_name	Music and Audio Processing
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C28490314
concepts[0].level	1
concepts[0].score	0.820976972579956
concepts[0].wikidata	https://www.wikidata.org/wiki/Q189436
concepts[0].display_name	Speech recognition
concepts[1].id	https://openalex.org/C41008148
concepts[1].level	0
concepts[1].score	0.815491795539856
concepts[1].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[1].display_name	Computer science
concepts[2].id	https://openalex.org/C204201278
concepts[2].level	3
concepts[2].score	0.7618936896324158
concepts[2].wikidata	https://www.wikidata.org/wiki/Q1332614
concepts[2].display_name	Voice activity detection
concepts[3].id	https://openalex.org/C149838564
concepts[3].level	3
concepts[3].score	0.7297342419624329
concepts[3].wikidata	https://www.wikidata.org/wiki/Q7574248
concepts[3].display_name	Speaker diarisation
concepts[4].id	https://openalex.org/C133892786
concepts[4].level	2
concepts[4].score	0.6609026193618774
concepts[4].wikidata	https://www.wikidata.org/wiki/Q1145189
concepts[4].display_name	Speaker recognition
concepts[5].id	https://openalex.org/C82876162
concepts[5].level	2
concepts[5].score	0.5993220806121826
concepts[5].wikidata	https://www.wikidata.org/wiki/Q17096504
concepts[5].display_name	Latency (audio)
concepts[6].id	https://openalex.org/C2780451532
concepts[6].level	2
concepts[6].score	0.5513314604759216
concepts[6].wikidata	https://www.wikidata.org/wiki/Q759676
concepts[6].display_name	Task (project management)
concepts[7].id	https://openalex.org/C2776214188
concepts[7].level	2
concepts[7].score	0.5020081996917725
concepts[7].wikidata	https://www.wikidata.org/wiki/Q408386
concepts[7].display_name	Inference
concepts[8].id	https://openalex.org/C61328038
concepts[8].level	2
concepts[8].score	0.443744957447052
concepts[8].wikidata	https://www.wikidata.org/wiki/Q3358061
concepts[8].display_name	Speech processing
concepts[9].id	https://openalex.org/C126042441
concepts[9].level	2
concepts[9].score	0.42494115233421326
concepts[9].wikidata	https://www.wikidata.org/wiki/Q1324888
concepts[9].display_name	Frame (networking)
concepts[10].id	https://openalex.org/C98045186
concepts[10].level	2
concepts[10].score	0.41453057527542114
concepts[10].wikidata	https://www.wikidata.org/wiki/Q205663
concepts[10].display_name	Process (computing)
concepts[11].id	https://openalex.org/C154945302
concepts[11].level	1
concepts[11].score	0.34192168712615967
concepts[11].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[11].display_name	Artificial intelligence
concepts[12].id	https://openalex.org/C76155785
concepts[12].level	1
concepts[12].score	0.0
concepts[12].wikidata	https://www.wikidata.org/wiki/Q418
concepts[12].display_name	Telecommunications
concepts[13].id	https://openalex.org/C162324750
concepts[13].level	0
concepts[13].score	0.0
concepts[13].wikidata	https://www.wikidata.org/wiki/Q8134
concepts[13].display_name	Economics
concepts[14].id	https://openalex.org/C111919701
concepts[14].level	1
concepts[14].score	0.0
concepts[14].wikidata	https://www.wikidata.org/wiki/Q9135
concepts[14].display_name	Operating system
concepts[15].id	https://openalex.org/C187736073
concepts[15].level	1
concepts[15].score	0.0
concepts[15].wikidata	https://www.wikidata.org/wiki/Q2920921
concepts[15].display_name	Management
keywords[0].id	https://openalex.org/keywords/speech-recognition
keywords[0].score	0.820976972579956
keywords[0].display_name	Speech recognition
keywords[1].id	https://openalex.org/keywords/computer-science
keywords[1].score	0.815491795539856
keywords[1].display_name	Computer science
keywords[2].id	https://openalex.org/keywords/voice-activity-detection
keywords[2].score	0.7618936896324158
keywords[2].display_name	Voice activity detection
keywords[3].id	https://openalex.org/keywords/speaker-diarisation
keywords[3].score	0.7297342419624329
keywords[3].display_name	Speaker diarisation
keywords[4].id	https://openalex.org/keywords/speaker-recognition
keywords[4].score	0.6609026193618774
keywords[4].display_name	Speaker recognition
keywords[5].id	https://openalex.org/keywords/latency
keywords[5].score	0.5993220806121826
keywords[5].display_name	Latency (audio)
keywords[6].id	https://openalex.org/keywords/task
keywords[6].score	0.5513314604759216
keywords[6].display_name	Task (project management)
keywords[7].id	https://openalex.org/keywords/inference
keywords[7].score	0.5020081996917725
keywords[7].display_name	Inference
keywords[8].id	https://openalex.org/keywords/speech-processing
keywords[8].score	0.443744957447052
keywords[8].display_name	Speech processing
keywords[9].id	https://openalex.org/keywords/frame
keywords[9].score	0.42494115233421326
keywords[9].display_name	Frame (networking)
keywords[10].id	https://openalex.org/keywords/process
keywords[10].score	0.41453057527542114
keywords[10].display_name	Process (computing)
keywords[11].id	https://openalex.org/keywords/artificial-intelligence
keywords[11].score	0.34192168712615967
keywords[11].display_name	Artificial intelligence
language	en
locations[0].id	pmh:oai:arXiv.org:2009.09906
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2009.09906
locations[0].version	submittedVersion
locations[0].raw_type
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2009.09906
locations[1].id	doi:10.48550/arxiv.2009.09906
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2009.09906
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5012756702
authorships[0].author.orcid	https://orcid.org/0000-0003-1414-2045
authorships[0].author.display_name	Yefei Chen
authorships[0].author_position	first
authorships[0].raw_author_name	Yefei Chen
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5100328312
authorships[1].author.orcid	https://orcid.org/0000-0002-7897-2024
authorships[1].author.display_name	Shuai Wang
authorships[1].author_position	middle
authorships[1].raw_author_name	Shuai Wang
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5100341993
authorships[2].author.orcid	https://orcid.org/0000-0002-0314-3790
authorships[2].author.display_name	Yanmin Qian
authorships[2].author_position	middle
authorships[2].raw_author_name	Yanmin Qian
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5043098653
authorships[3].author.orcid	https://orcid.org/0000-0002-7102-9826
authorships[3].author.display_name	Kai Yu
authorships[3].author_position	last
authorships[3].raw_author_name	Kai Yu
authorships[3].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2009.09906
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	End-to-End Speaker-Dependent Voice Activity Detection
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10201
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9998999834060669
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Speech Recognition and Synthesis
related_works	https://openalex.org/W2206035908, https://openalex.org/W2162158162, https://openalex.org/W4247736853, https://openalex.org/W1493012537, https://openalex.org/W1999004162, https://openalex.org/W2175373321, https://openalex.org/W2125642021, https://openalex.org/W1521049138, https://openalex.org/W2938358845, https://openalex.org/W2997340161
cited_by_count	2
counts_by_year[0].year	2021
counts_by_year[0].cited_by_count	2
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2009.09906
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2009.09906
best_oa_location.version	submittedVersion
best_oa_location.raw_type
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2009.09906
primary_location.id	pmh:oai:arXiv.org:2009.09906
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2009.09906
primary_location.version	submittedVersion
primary_location.raw_type
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2009.09906
publication_date	2020-09-21
publication_year	2020
referenced_works	https://openalex.org/W2069095950, https://openalex.org/W2106214098, https://openalex.org/W1999454387, https://openalex.org/W2079623482, https://openalex.org/W2032474878, https://openalex.org/W1991899119, https://openalex.org/W296042737, https://openalex.org/W2401833940, https://openalex.org/W2408468399, https://openalex.org/W2032362923, https://openalex.org/W2048497537, https://openalex.org/W2395750323, https://openalex.org/W2283417180, https://openalex.org/W2150769028, https://openalex.org/W2623155250, https://openalex.org/W2059203007, https://openalex.org/W2406262283, https://openalex.org/W2130426352, https://openalex.org/W2651834199, https://openalex.org/W2403186097
referenced_works_count	20
abstract_inverted_index.A	20
abstract_inverted_index.a	32, 139, 181
abstract_inverted_index.In	94
abstract_inverted_index.We	53, 172
abstract_inverted_index.an	5, 29, 99, 125
abstract_inverted_index.as	12, 44, 66
abstract_inverted_index.be	122
abstract_inverted_index.by	83
abstract_inverted_index.in	77, 124, 164
abstract_inverted_index.is	4, 23, 74
abstract_inverted_index.of	166
abstract_inverted_index.on	88, 138
abstract_inverted_index.to	24, 105, 130
abstract_inverted_index.we	97
abstract_inverted_index.VAD	35
abstract_inverted_index.all	39
abstract_inverted_index.and	17, 46, 80, 170
abstract_inverted_index.are	135
abstract_inverted_index.can	121
abstract_inverted_index.for	9, 180
abstract_inverted_index.low	131
abstract_inverted_index.our	151, 175
abstract_inverted_index.out	137
abstract_inverted_index.the	40, 55, 60, 63, 112, 116, 145, 160
abstract_inverted_index.(SV)	87
abstract_inverted_index.This	72
abstract_inverted_index.VAD.	93
abstract_inverted_index.also	173
abstract_inverted_index.both	167
abstract_inverted_index.even	47
abstract_inverted_index.from	50, 62, 92, 144
abstract_inverted_index.goal	22
abstract_inverted_index.into	115
abstract_inverted_index.more	33, 182
abstract_inverted_index.only	58
abstract_inverted_index.real	78
abstract_inverted_index.show	149
abstract_inverted_index.step	8
abstract_inverted_index.such	11, 43
abstract_inverted_index.task	73
abstract_inverted_index.than	159
abstract_inverted_index.that	150
abstract_inverted_index.this	95, 107
abstract_inverted_index.used	174
abstract_inverted_index.(ASR)	16
abstract_inverted_index.(VAD)	3
abstract_inverted_index.Voice	0
abstract_inverted_index.audio	89
abstract_inverted_index.based	103
abstract_inverted_index.basic	21
abstract_inverted_index.could	37
abstract_inverted_index.frame	168
abstract_inverted_index.leads	129
abstract_inverted_index.noise	45
abstract_inverted_index.quite	75
abstract_inverted_index.takes	111
abstract_inverted_index.task,	56
abstract_inverted_index.tasks	10
abstract_inverted_index.terms	165
abstract_inverted_index.usual	161
abstract_inverted_index.voice	68
abstract_inverted_index.which	57, 109, 128
abstract_inverted_index.while	31
abstract_inverted_index.VAD/SV	162
abstract_inverted_index.audio,	30
abstract_inverted_index.better	157
abstract_inverted_index.common	76
abstract_inverted_index.define	54
abstract_inverted_index.metric	179
abstract_inverted_index.neural	101
abstract_inverted_index.online	126, 153
abstract_inverted_index.paper,	96
abstract_inverted_index.remove	25, 38
abstract_inverted_index.silent	26
abstract_inverted_index.speech	14, 49, 61
abstract_inverted_index.system	36, 132, 163
abstract_inverted_index.target	64
abstract_inverted_index.within	28
abstract_inverted_index.Results	148
abstract_inverted_index.address	106
abstract_inverted_index.carried	136
abstract_inverted_index.corpus.	147
abstract_inverted_index.dataset	142
abstract_inverted_index.detects	59
abstract_inverted_index.general	34
abstract_inverted_index.network	102
abstract_inverted_index.propose	98
abstract_inverted_index.speaker	18, 85, 113
abstract_inverted_index.usually	81
abstract_inverted_index.(SDVAD).	71
abstract_inverted_index.F-score.	171
abstract_inverted_index.accuracy	169
abstract_inverted_index.achieves	155
abstract_inverted_index.activity	1, 69
abstract_inverted_index.approach	104, 154
abstract_inverted_index.fashion,	127
abstract_inverted_index.identity	114
abstract_inverted_index.latency.	133
abstract_inverted_index.modeling	117
abstract_inverted_index.problem,	108
abstract_inverted_index.process.	118
abstract_inverted_index.proposed	152, 177
abstract_inverted_index.segments	27, 42, 90
abstract_inverted_index.speaker,	65
abstract_inverted_index.unwanted	48
abstract_inverted_index.Moreover,	119
abstract_inverted_index.analysis.	184
abstract_inverted_index.automatic	13
abstract_inverted_index.detection	2, 70
abstract_inverted_index.essential	6
abstract_inverted_index.extracted	91
abstract_inverted_index.generated	143
abstract_inverted_index.inference	120
abstract_inverted_index.performed	123
abstract_inverted_index.speakers.	52
abstract_inverted_index.telephone	141
abstract_inverted_index.end-to-end	100
abstract_inverted_index.explicitly	110
abstract_inverted_index.irrelevant	41
abstract_inverted_index.non-target	51
abstract_inverted_index.performing	84
abstract_inverted_index.previously	176
abstract_inverted_index.Experiments	134
abstract_inverted_index.Switchboard	146
abstract_inverted_index.implemented	82
abstract_inverted_index.performance	158
abstract_inverted_index.recognition	15
abstract_inverted_index.applications	79
abstract_inverted_index.recognition.	19
abstract_inverted_index.verification	86
abstract_inverted_index.comprehensive	183
abstract_inverted_index.segment-level	178
abstract_inverted_index.significantly	156
abstract_inverted_index.conversational	140
abstract_inverted_index.pre-processing	7
abstract_inverted_index.speaker-dependent	67
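The `abstract_inverted_index` entries above map each token of the abstract to the word positions at which it occurs. A minimal sketch of how the plain-text abstract can be rebuilt from such a mapping follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary is only a short excerpt of the entries listed above, with each token limited to its first position.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild running text from a token -> positions mapping."""
    # Invert the mapping: one token per word position.
    token_at: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            token_at[position] = token
    # Emit the tokens in position order, separated by single spaces.
    return " ".join(token_at[i] for i in sorted(token_at))


# Excerpt of the payload above (positions 0 to 8 only).
sample = {
    "Voice": [0],
    "activity": [1],
    "detection": [2],
    "(VAD)": [3],
    "is": [4],
    "an": [5],
    "essential": [6],
    "pre-processing": [7],
    "step": [8],
}

print(rebuild_abstract(sample))
# prints "Voice activity detection (VAD) is an essential pre-processing step"
```

Applied to the full index, the same function should yield the abstract shown at the top of the page, since punctuation is stored as part of each token.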
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	4
sustainable_development_goals[0].id	https://metadata.un.org/sdg/16
sustainable_development_goals[0].score	0.5400000214576721
sustainable_development_goals[0].display_name	Peace, Justice and strong institutions
citation_normalized_percentile