E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
2022 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2204.10749
Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real-world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
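The key idea in the abstract, conditioning the segmentation decision on both acoustic evidence and the decoded text, can be illustrated with a toy streaming loop. This is a minimal conceptual sketch only, not the authors' model: the frame "energies", the threshold, and the `acoustic_silence_score` / `looks_like_sentence_end` helpers are hypothetical stand-ins for the learned components of the streaming Conformer RNN-T described in the paper.

```python
# Toy illustration (not the paper's implementation): a streaming loop that
# decides segment boundaries from BOTH an acoustic silence score and the
# decoded-text context, instead of acoustics alone as a VAD would.

def acoustic_silence_score(frame_energy: float) -> float:
    """Hypothetical stand-in: higher means the frame looks like non-speech."""
    return 1.0 if frame_energy < 0.1 else 0.0

def looks_like_sentence_end(hypothesis: str) -> bool:
    """Hypothetical stand-in for the semantic cue a decoder could provide."""
    return hypothesis.rstrip().endswith(("o'clock", "today", "thanks"))

def segment_stream(frames, decoded_words, silence_thresh=0.5):
    """Emit a segment boundary only when acoustics AND text agree."""
    hypothesis, segments = "", []
    for energy, word in zip(frames, decoded_words):
        if word:
            hypothesis += (" " if hypothesis else "") + word
        silent = acoustic_silence_score(energy) >= silence_thresh
        # A pure VAD would cut here on `silent` alone, splitting
        # "set an alarm for ... 5 o'clock" at the hesitation.
        if silent and looks_like_sentence_end(hypothesis):
            segments.append(hypothesis)
            hypothesis = ""
    if hypothesis:
        segments.append(hypothesis)
    return segments

if __name__ == "__main__":
    frames = [0.9, 0.8, 0.9, 0.8, 0.05, 0.05, 0.9, 0.9]   # per-frame "energy"
    words  = ["set", "an", "alarm", "for", "", "", "5", "o'clock"]
    print(segment_stream(frames, words))
    # -> ['set an alarm for 5 o'clock']  (the mid-utterance pause is not a boundary)
```

In the paper itself the boundary decision is a prediction of the streaming end-to-end ASR model rather than hand-written rules, which is what lets it use richer acoustic features and the decoded text at negligible extra cost.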
At a glance
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2204.10749
- PDF: https://arxiv.org/pdf/2204.10749
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4360601747
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4360601747 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2204.10749 (Digital Object Identifier)
- Title: E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
- Type: preprint (OpenAlex work type)
- Language: en
- Publication year: 2022
- Publication date: 2022-04-22
- Authors (in order): W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu
- Landing page: https://arxiv.org/abs/2204.10749
- PDF URL: https://arxiv.org/pdf/2204.10749
- Open access: yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2204.10749
- Concepts (top concepts attached by OpenAlex): Computer science, Speech recognition, Segmentation, Sentence, Market segmentation, Latency (audio), Set (abstract data type), False alarm, Decoding methods, End-to-end principle, ALARM, Artificial intelligence, Algorithm, Telecommunications, Marketing, Materials science, Composite material, Business, Programming language
- Cited by: 0 (total citation count in OpenAlex)
- Related works: 10 (other works algorithmically related by OpenAlex)
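The fields above come from the public OpenAlex API, and the full record for this work can be retrieved directly from it. A minimal sketch using the `requests` package (assuming network access; the work ID and the printed field names are the ones shown on this page):

```python
# Minimal sketch: fetch this work's OpenAlex record and print a few fields.
import requests

WORK_ID = "W4360601747"
resp = requests.get(f"https://api.openalex.org/works/{WORK_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])           # title
print(work["publication_date"])       # 2022-04-22
print(work["open_access"]["oa_url"])  # direct OA link, if any
```

The JSON returned by this call is the source of the full payload listed below.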
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4360601747 |
| doi | https://doi.org/10.48550/arxiv.2204.10749 |
| ids.doi | https://doi.org/10.48550/arxiv.2204.10749 |
| ids.openalex | https://openalex.org/W4360601747 |
| fwci | |
| type | preprint |
| title | E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10201 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9998000264167786 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Speech Recognition and Synthesis |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9983000159263611 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T11309 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9976000189781189 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1711 |
| topics[2].subfield.display_name | Signal Processing |
| topics[2].display_name | Music and Audio Processing |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7312777042388916 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C28490314 |
| concepts[1].level | 1 |
| concepts[1].score | 0.6845752596855164 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[1].display_name | Speech recognition |
| concepts[2].id | https://openalex.org/C89600930 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6802265644073486 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q1423946 |
| concepts[2].display_name | Segmentation |
| concepts[3].id | https://openalex.org/C2777530160 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5759625434875488 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q41796 |
| concepts[3].display_name | Sentence |
| concepts[4].id | https://openalex.org/C125308379 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5214880704879761 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q363057 |
| concepts[4].display_name | Market segmentation |
| concepts[5].id | https://openalex.org/C82876162 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4913095533847809 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q17096504 |
| concepts[5].display_name | Latency (audio) |
| concepts[6].id | https://openalex.org/C177264268 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4585984945297241 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[6].display_name | Set (abstract data type) |
| concepts[7].id | https://openalex.org/C2776836416 |
| concepts[7].level | 2 |
| concepts[7].score | 0.45825234055519104 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q1364844 |
| concepts[7].display_name | False alarm |
| concepts[8].id | https://openalex.org/C57273362 |
| concepts[8].level | 2 |
| concepts[8].score | 0.4573310315608978 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q576722 |
| concepts[8].display_name | Decoding methods |
| concepts[9].id | https://openalex.org/C74296488 |
| concepts[9].level | 2 |
| concepts[9].score | 0.43564122915267944 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q2527392 |
| concepts[9].display_name | End-to-end principle |
| concepts[10].id | https://openalex.org/C2779119184 |
| concepts[10].level | 2 |
| concepts[10].score | 0.4156178832054138 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q294350 |
| concepts[10].display_name | ALARM |
| concepts[11].id | https://openalex.org/C154945302 |
| concepts[11].level | 1 |
| concepts[11].score | 0.33117783069610596 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[11].display_name | Artificial intelligence |
| concepts[12].id | https://openalex.org/C11413529 |
| concepts[12].level | 1 |
| concepts[12].score | 0.10815045237541199 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q8366 |
| concepts[12].display_name | Algorithm |
| concepts[13].id | https://openalex.org/C76155785 |
| concepts[13].level | 1 |
| concepts[13].score | 0.07899779081344604 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q418 |
| concepts[13].display_name | Telecommunications |
| concepts[14].id | https://openalex.org/C162853370 |
| concepts[14].level | 1 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q39809 |
| concepts[14].display_name | Marketing |
| concepts[15].id | https://openalex.org/C192562407 |
| concepts[15].level | 0 |
| concepts[15].score | 0.0 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q228736 |
| concepts[15].display_name | Materials science |
| concepts[16].id | https://openalex.org/C159985019 |
| concepts[16].level | 1 |
| concepts[16].score | 0.0 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q181790 |
| concepts[16].display_name | Composite material |
| concepts[17].id | https://openalex.org/C144133560 |
| concepts[17].level | 0 |
| concepts[17].score | 0.0 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q4830453 |
| concepts[17].display_name | Business |
| concepts[18].id | https://openalex.org/C199360897 |
| concepts[18].level | 1 |
| concepts[18].score | 0.0 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[18].display_name | Programming language |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7312777042388916 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/speech-recognition |
| keywords[1].score | 0.6845752596855164 |
| keywords[1].display_name | Speech recognition |
| keywords[2].id | https://openalex.org/keywords/segmentation |
| keywords[2].score | 0.6802265644073486 |
| keywords[2].display_name | Segmentation |
| keywords[3].id | https://openalex.org/keywords/sentence |
| keywords[3].score | 0.5759625434875488 |
| keywords[3].display_name | Sentence |
| keywords[4].id | https://openalex.org/keywords/market-segmentation |
| keywords[4].score | 0.5214880704879761 |
| keywords[4].display_name | Market segmentation |
| keywords[5].id | https://openalex.org/keywords/latency |
| keywords[5].score | 0.4913095533847809 |
| keywords[5].display_name | Latency (audio) |
| keywords[6].id | https://openalex.org/keywords/set |
| keywords[6].score | 0.4585984945297241 |
| keywords[6].display_name | Set (abstract data type) |
| keywords[7].id | https://openalex.org/keywords/false-alarm |
| keywords[7].score | 0.45825234055519104 |
| keywords[7].display_name | False alarm |
| keywords[8].id | https://openalex.org/keywords/decoding-methods |
| keywords[8].score | 0.4573310315608978 |
| keywords[8].display_name | Decoding methods |
| keywords[9].id | https://openalex.org/keywords/end-to-end-principle |
| keywords[9].score | 0.43564122915267944 |
| keywords[9].display_name | End-to-end principle |
| keywords[10].id | https://openalex.org/keywords/alarm |
| keywords[10].score | 0.4156178832054138 |
| keywords[10].display_name | ALARM |
| keywords[11].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[11].score | 0.33117783069610596 |
| keywords[11].display_name | Artificial intelligence |
| keywords[12].id | https://openalex.org/keywords/algorithm |
| keywords[12].score | 0.10815045237541199 |
| keywords[12].display_name | Algorithm |
| keywords[13].id | https://openalex.org/keywords/telecommunications |
| keywords[13].score | 0.07899779081344604 |
| keywords[13].display_name | Telecommunications |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2204.10749 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2204.10749 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2204.10749 |
| locations[1].id | doi:10.48550/arxiv.2204.10749 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2204.10749 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5091738469 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | W. Ronny Huang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Huang, W. Ronny |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5001306222 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Shuo-Yiin Chang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Chang, Shuo-yiin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5050133412 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | David Rybach |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Rybach, David |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5032640894 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-5331-6058 |
| authorships[3].author.display_name | Rohit Prabhavalkar |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Prabhavalkar, Rohit |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5070513394 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-4126-6556 |
| authorships[4].author.display_name | Tara N. Sainath |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Sainath, Tara N. |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5030888546 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Cyril Allauzen |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Allauzen, Cyril |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5037066965 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Cal Peyser |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Peyser, Cal |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5039693533 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-1733-4061 |
| authorships[7].author.display_name | Zhiyun Lu |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Lu, Zhiyun |
| authorships[7].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2204.10749 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10201 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9998000264167786 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Speech Recognition and Synthesis |
| related_works | https://openalex.org/W1584123598, https://openalex.org/W2731305060, https://openalex.org/W2372003537, https://openalex.org/W3179968364, https://openalex.org/W2732807254, https://openalex.org/W2587670262, https://openalex.org/W3037375888, https://openalex.org/W2366730739, https://openalex.org/W3121346907, https://openalex.org/W4379535633 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2204.10749 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2204.10749 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2204.10749 |
| primary_location.id | pmh:oai:arXiv.org:2204.10749 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2204.10749 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2204.10749 |
| publication_date | 2022-04-22 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| abstract_inverted_index | word → token-position index encoding the abstract shown at the top of this page (see the reconstruction sketch below the table) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/16 |
| sustainable_development_goals[0].score | 0.7699999809265137 |
| sustainable_development_goals[0].display_name | Peace, Justice and strong institutions |
| citation_normalized_percentile |
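OpenAlex stores abstracts as an inverted index (`abstract_inverted_index`: each word mapped to the list of token positions where it occurs) rather than as plain text. A short sketch of how that field can be turned back into the abstract string shown at the top of this page; the tiny `sample` index is taken from the first tokens of this paper's abstract.

```python
# Rebuild a plain-text abstract from OpenAlex's abstract_inverted_index field,
# which maps each word to the list of token positions where it occurs.
def reconstruct_abstract(inverted_index):
    position_to_word = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            position_to_word[pos] = word
    return " ".join(position_to_word[pos] for pos in sorted(position_to_word))

# Tiny example in the same format (first four tokens of this paper's abstract):
sample = {"Improving": [0], "the": [1], "performance": [2], "of": [3]}
print(reconstruct_abstract(sample))  # -> Improving the performance of
```

Applying the same function to the full `abstract_inverted_index` of this record reproduces the abstract quoted above.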