Learning Speaker-specific Lip-to-Speech Generation
2022 · Open Access
DOI: https://doi.org/10.48550/arxiv.2206.02050
Understanding lip movement and inferring speech from it is notoriously difficult for the average person. Accurate lip-reading benefits from various cues about the speaker and their contextual or environmental setting. Every speaker has a distinct accent and speaking style, which can be inferred from their visual and speech features. This work aims to learn the mapping between speech and the lip-movement sequences of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to a transformer in an auto-encoder setting and learn a joint embedding that exploits the temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior thus yields generated speech in the speaker's own speaking style. We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movement in an unconstrained natural setting. Extensive evaluation with qualitative and quantitative metrics, including human evaluation, shows that our method outperforms prior work on the Lip2Wav Chemistry dataset (a large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics, and marginally outperforms the state of the art on the GRID dataset.
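The abstract names the synchronization objective only as "deep metric learning". As a hedged illustration, a triplet-style margin loss is one common way to realize such an objective for audio-video sync; the minimal pure-Python sketch below is not the paper's implementation, and all function names and the margin value are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors (plain lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_sync_loss(video_emb, pos_audio_emb, neg_audio_emb, margin=1.0):
    """Triplet-style metric-learning loss: pull the in-sync audio embedding
    toward the video embedding and push an out-of-sync (time-shifted) audio
    embedding at least `margin` further away.
    """
    d_pos = l2_distance(video_emb, pos_audio_emb)
    d_neg = l2_distance(video_emb, neg_audio_emb)
    return max(0.0, d_pos - d_neg + margin)


# Toy usage: the negative is already far from the video embedding, so the
# hinge is inactive and the loss is zero.
loss = triplet_sync_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# loss -> 0.0
```

In a training loop, the "negative" audio would typically come from a temporally shifted window of the same clip, so minimizing this loss encourages embeddings that are sensitive to audio-video alignment.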
Work metadata
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2206.02050
- PDF: https://arxiv.org/pdf/2206.02050
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4281723261
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4281723261 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2206.02050
- Title: Learning Speaker-specific Lip-to-Speech Generation
- Type: preprint
- Language: en
- Publication year: 2022
- Publication date: 2022-06-04
- Authors: Munender Varshney, Ravindra Yadav, Vinay P. Namboodiri, Rajesh M. Hegde
- Landing page: https://arxiv.org/abs/2206.02050
- PDF URL: https://arxiv.org/pdf/2206.02050
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2206.02050
- Concepts: Computer science, Speech recognition, Vocabulary, Hidden Markov model, Artificial intelligence, Chunking (psychology), Discriminative model, Margin (machine learning), Natural language processing, Machine learning, Linguistics, Philosophy
- Cited by: 0
- Related works (count): 10
Full payload
| id | https://openalex.org/W4281723261 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2206.02050 |
| ids.doi | https://doi.org/10.48550/arxiv.2206.02050 |
| ids.openalex | https://openalex.org/W4281723261 |
| fwci | |
| type | preprint |
| title | Learning Speaker-specific Lip-to-Speech Generation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10860 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9994999766349792 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1711 |
| topics[0].subfield.display_name | Signal Processing |
| topics[0].display_name | Speech and Audio Processing |
| topics[1].id | https://openalex.org/T11448 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9700000286102295 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Face recognition and analysis |
| topics[2].id | https://openalex.org/T13289 |
| topics[2].field.id | https://openalex.org/fields/36 |
| topics[2].field.display_name | Health Professions |
| topics[2].score | 0.9129999876022339 |
| topics[2].domain.id | https://openalex.org/domains/4 |
| topics[2].domain.display_name | Health Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/3611 |
| topics[2].subfield.display_name | Pharmacy |
| topics[2].display_name | Infant Health and Development |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.8115311861038208 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C28490314 |
| concepts[1].level | 1 |
| concepts[1].score | 0.7135622501373291 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[1].display_name | Speech recognition |
| concepts[2].id | https://openalex.org/C2777601683 |
| concepts[2].level | 2 |
| concepts[2].score | 0.520641028881073 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q6499736 |
| concepts[2].display_name | Vocabulary |
| concepts[3].id | https://openalex.org/C23224414 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5153509974479675 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q176769 |
| concepts[3].display_name | Hidden Markov model |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.46791011095046997 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C203357204 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4672197103500366 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1089605 |
| concepts[5].display_name | Chunking (psychology) |
| concepts[6].id | https://openalex.org/C97931131 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4615843892097473 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q5282087 |
| concepts[6].display_name | Discriminative model |
| concepts[7].id | https://openalex.org/C774472 |
| concepts[7].level | 2 |
| concepts[7].score | 0.4523240029811859 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q6760393 |
| concepts[7].display_name | Margin (machine learning) |
| concepts[8].id | https://openalex.org/C204321447 |
| concepts[8].level | 1 |
| concepts[8].score | 0.34478959441185 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[8].display_name | Natural language processing |
| concepts[9].id | https://openalex.org/C119857082 |
| concepts[9].level | 1 |
| concepts[9].score | 0.1962626576423645 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[9].display_name | Machine learning |
| concepts[10].id | https://openalex.org/C41895202 |
| concepts[10].level | 1 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[10].display_name | Linguistics |
| concepts[11].id | https://openalex.org/C138885662 |
| concepts[11].level | 0 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[11].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.8115311861038208 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/speech-recognition |
| keywords[1].score | 0.7135622501373291 |
| keywords[1].display_name | Speech recognition |
| keywords[2].id | https://openalex.org/keywords/vocabulary |
| keywords[2].score | 0.520641028881073 |
| keywords[2].display_name | Vocabulary |
| keywords[3].id | https://openalex.org/keywords/hidden-markov-model |
| keywords[3].score | 0.5153509974479675 |
| keywords[3].display_name | Hidden Markov model |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.46791011095046997 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/chunking |
| keywords[5].score | 0.4672197103500366 |
| keywords[5].display_name | Chunking (psychology) |
| keywords[6].id | https://openalex.org/keywords/discriminative-model |
| keywords[6].score | 0.4615843892097473 |
| keywords[6].display_name | Discriminative model |
| keywords[7].id | https://openalex.org/keywords/margin |
| keywords[7].score | 0.4523240029811859 |
| keywords[7].display_name | Margin (machine learning) |
| keywords[8].id | https://openalex.org/keywords/natural-language-processing |
| keywords[8].score | 0.34478959441185 |
| keywords[8].display_name | Natural language processing |
| keywords[9].id | https://openalex.org/keywords/machine-learning |
| keywords[9].score | 0.1962626576423645 |
| keywords[9].display_name | Machine learning |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2206.02050 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2206.02050 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2206.02050 |
| locations[1].id | doi:10.48550/arxiv.2206.02050 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2206.02050 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5036559623 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3061-5757 |
| authorships[0].author.display_name | Munender Varshney |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Varshney, Munender |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5010648323 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4628-0688 |
| authorships[1].author.display_name | Ravindra Yadav |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Yadav, Ravindra |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5007109424 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-5262-9722 |
| authorships[2].author.display_name | Vinay P. Namboodiri |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Namboodiri, Vinay P. |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5085503354 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6142-7724 |
| authorships[3].author.display_name | Rajesh M. Hegde |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Hegde, Rajesh M |
| authorships[3].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2206.02050 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Learning Speaker-specific Lip-to-Speech Generation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10860 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9994999766349792 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1711 |
| primary_topic.subfield.display_name | Signal Processing |
| primary_topic.display_name | Speech and Audio Processing |
| related_works | https://openalex.org/W2384729545, https://openalex.org/W2198395236, https://openalex.org/W2800417007, https://openalex.org/W147604216, https://openalex.org/W4389116644, https://openalex.org/W2153315159, https://openalex.org/W3103844505, https://openalex.org/W2161080928, https://openalex.org/W259157601, https://openalex.org/W2167155152 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2206.02050 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2206.02050 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2206.02050 |
| primary_location.id | pmh:oai:arXiv.org:2206.02050 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2206.02050 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2206.02050 |
| publication_date | 2022-06-04 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-position index of the abstract; full text shown above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.8500000238418579 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
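The `abstract_inverted_index` field in the OpenAlex payload stores the abstract not as plain text but as a map from each word to the list of positions where it occurs. A small sketch of turning that format back into readable text, assuming the standard OpenAlex shape (the example index here is a hypothetical toy, not the real payload):

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain abstract text from an OpenAlex-style inverted index,
    i.e. a dict mapping each word to the list of positions it occupies."""
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    # Sort by position and join the words back into a single string.
    return " ".join(word for _, word in sorted(positions))


# Toy example index (hypothetical, not the real payload):
idx = {"the": [0, 3], "lip": [1], "moves": [2], "speaker": [4]}
text = reconstruct_abstract(idx)
# text -> "the lip moves the speaker"
```

OpenAlex distributes abstracts in this inverted form for licensing reasons, so this kind of reconstruction step is needed whenever the plain text is not already available elsewhere on the page.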