Prompting Large Language Models with Speech Recognition Abilities

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer

2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2307.11795

Large language models (LLMs) have proven themselves highly flexible, able to solve a wide range of generative tasks such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines by 18% and to perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder striding to generate fewer embeddings. The results show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
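
The prepending step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`stack_frames`, `project`, `build_llm_input`), the toy dimensions, and the random values are all assumptions made for illustration. In particular, frame stacking followed by a linear projection is just one plausible way to shorten the audio sequence and match the LLM width; the abstract does not say how striding is realized.

```python
import random

def stack_frames(frames, stride):
    """Concatenate every `stride` consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group are dropped.
    (Illustrative stand-in for a larger encoder stride; an assumption.)
    """
    usable = len(frames) - len(frames) % stride
    return [
        [x for frame in frames[i:i + stride] for x in frame]
        for i in range(0, usable, stride)
    ]

def project(vectors, weight):
    """Map each vector of length d_in to length d_out with a d_in x d_out matrix."""
    d_out = len(weight[0])
    return [
        [sum(v[k] * weight[k][j] for k in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]

def build_llm_input(audio_frames, token_ids, embedding_table, weight, stride):
    """Return the audio embeddings followed by the text token embeddings."""
    audio_embeddings = project(stack_frames(audio_frames, stride), weight)
    text_embeddings = [embedding_table[t] for t in token_ids]
    return audio_embeddings + text_embeddings  # audio is prepended to text

# Toy sizes and random values, chosen only for illustration.
rng = random.Random(0)
D_AUDIO, D_MODEL, STRIDE, VOCAB = 4, 6, 2, 10

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

audio_frames = rand_matrix(7, D_AUDIO)            # stand-in for encoder outputs
weight = rand_matrix(D_AUDIO * STRIDE, D_MODEL)   # hypothetical projection
embedding_table = rand_matrix(VOCAB, D_MODEL)     # hypothetical token embeddings
token_ids = [1, 5, 2]

sequence = build_llm_input(audio_frames, token_ids, embedding_table, weight, STRIDE)
print(len(sequence), "vectors of width", len(sequence[0]))
```

With a larger `STRIDE`, the same audio yields fewer prepended vectors, which is the trade-off the striding ablation in the abstract explores.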

Related Topics

Computer Science

Security Token

Artificial Intelligence

Computer Security

Concepts

Computer science, Encoder, Speech recognition, Automatic summarization, Security token, Language model, Natural language processing, Acoustic model, Artificial intelligence, Speech processing, Operating system, Computer security

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2307.11795
PDF: https://arxiv.org/pdf/2307.11795
OA Status: green
Cited By: 2
Related Works: 10
OpenAlex ID: https://openalex.org/W4385260920

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4385260920
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2307.11795
Digital Object Identifier

Title: Prompting Large Language Models with Speech Recognition Abilities
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2023
Year of publication

Publication date: 2023-07-21
Full publication date if available

Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
List of authors in order

Landing page: https://arxiv.org/abs/2307.11795
Publisher landing page

PDF URL: https://arxiv.org/pdf/2307.11795
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2307.11795
Direct OA link when available

Concepts: Computer science, Encoder, Speech recognition, Automatic summarization, Security token, Language model, Natural language processing, Acoustic model, Artificial intelligence, Speech processing, Operating system, Computer security
Top concepts (fields/topics) attached by OpenAlex

Cited by: 2
Total citation count in OpenAlex

Citations by year (recent): 2025: 1, 2024: 1
Per-year citation counts (last 5 years)

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4385260920
doi	https://doi.org/10.48550/arxiv.2307.11795
ids.doi	https://doi.org/10.48550/arxiv.2307.11795
ids.openalex	https://openalex.org/W4385260920
fwci
type	preprint
title	Prompting Large Language Models with Speech Recognition Abilities
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10201
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9987000226974487
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Speech Recognition and Synthesis
topics[1].id	https://openalex.org/T10028
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9987000226974487
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Topic Modeling
topics[2].id	https://openalex.org/T10181
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.9975000023841858
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Natural Language Processing Techniques
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C41008148
concepts[0].level	0
concepts[0].score	0.7731028199195862
concepts[0].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[0].display_name	Computer science
concepts[1].id	https://openalex.org/C118505674
concepts[1].level	2
concepts[1].score	0.735184371471405
concepts[1].wikidata	https://www.wikidata.org/wiki/Q42586063
concepts[1].display_name	Encoder
concepts[2].id	https://openalex.org/C28490314
concepts[2].level	1
concepts[2].score	0.6685544848442078
concepts[2].wikidata	https://www.wikidata.org/wiki/Q189436
concepts[2].display_name	Speech recognition
concepts[3].id	https://openalex.org/C170858558
concepts[3].level	2
concepts[3].score	0.618772566318512
concepts[3].wikidata	https://www.wikidata.org/wiki/Q1394144
concepts[3].display_name	Automatic summarization
concepts[4].id	https://openalex.org/C48145219
concepts[4].level	2
concepts[4].score	0.6089869737625122
concepts[4].wikidata	https://www.wikidata.org/wiki/Q1335365
concepts[4].display_name	Security token
concepts[5].id	https://openalex.org/C137293760
concepts[5].level	2
concepts[5].score	0.5435717701911926
concepts[5].wikidata	https://www.wikidata.org/wiki/Q3621696
concepts[5].display_name	Language model
concepts[6].id	https://openalex.org/C204321447
concepts[6].level	1
concepts[6].score	0.49280837178230286
concepts[6].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[6].display_name	Natural language processing
concepts[7].id	https://openalex.org/C155635449
concepts[7].level	3
concepts[7].score	0.4548299312591553
concepts[7].wikidata	https://www.wikidata.org/wiki/Q4674699
concepts[7].display_name	Acoustic model
concepts[8].id	https://openalex.org/C154945302
concepts[8].level	1
concepts[8].score	0.39188697934150696
concepts[8].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[8].display_name	Artificial intelligence
concepts[9].id	https://openalex.org/C61328038
concepts[9].level	2
concepts[9].score	0.285081684589386
concepts[9].wikidata	https://www.wikidata.org/wiki/Q3358061
concepts[9].display_name	Speech processing
concepts[10].id	https://openalex.org/C111919701
concepts[10].level	1
concepts[10].score	0.0
concepts[10].wikidata	https://www.wikidata.org/wiki/Q9135
concepts[10].display_name	Operating system
concepts[11].id	https://openalex.org/C38652104
concepts[11].level	1
concepts[11].score	0.0
concepts[11].wikidata	https://www.wikidata.org/wiki/Q3510521
concepts[11].display_name	Computer security
keywords[0].id	https://openalex.org/keywords/computer-science
keywords[0].score	0.7731028199195862
keywords[0].display_name	Computer science
keywords[1].id	https://openalex.org/keywords/encoder
keywords[1].score	0.735184371471405
keywords[1].display_name	Encoder
keywords[2].id	https://openalex.org/keywords/speech-recognition
keywords[2].score	0.6685544848442078
keywords[2].display_name	Speech recognition
keywords[3].id	https://openalex.org/keywords/automatic-summarization
keywords[3].score	0.618772566318512
keywords[3].display_name	Automatic summarization
keywords[4].id	https://openalex.org/keywords/security-token
keywords[4].score	0.6089869737625122
keywords[4].display_name	Security token
keywords[5].id	https://openalex.org/keywords/language-model
keywords[5].score	0.5435717701911926
keywords[5].display_name	Language model
keywords[6].id	https://openalex.org/keywords/natural-language-processing
keywords[6].score	0.49280837178230286
keywords[6].display_name	Natural language processing
keywords[7].id	https://openalex.org/keywords/acoustic-model
keywords[7].score	0.4548299312591553
keywords[7].display_name	Acoustic model
keywords[8].id	https://openalex.org/keywords/artificial-intelligence
keywords[8].score	0.39188697934150696
keywords[8].display_name	Artificial intelligence
keywords[9].id	https://openalex.org/keywords/speech-processing
keywords[9].score	0.285081684589386
keywords[9].display_name	Speech processing
language	en
locations[0].id	pmh:oai:arXiv.org:2307.11795
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license	other-oa
locations[0].pdf_url	https://arxiv.org/pdf/2307.11795
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id	https://openalex.org/licenses/other-oa
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2307.11795
locations[1].id	doi:10.48550/arxiv.2307.11795
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2307.11795
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5006814826
authorships[0].author.orcid
authorships[0].author.display_name	Yassir Fathullah
authorships[0].author_position	first
authorships[0].raw_author_name	Fathullah, Yassir
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5103012144
authorships[1].author.orcid	https://orcid.org/0000-0002-5796-8288
authorships[1].author.display_name	Chunyang Wu
authorships[1].author_position	middle
authorships[1].raw_author_name	Wu, Chunyang
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5045428440
authorships[2].author.orcid
authorships[2].author.display_name	Egor Lakomkin
authorships[2].author_position	middle
authorships[2].raw_author_name	Lakomkin, Egor
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5113970008
authorships[3].author.orcid
authorships[3].author.display_name	Junteng Jia
authorships[3].author_position	middle
authorships[3].raw_author_name	Jia, Junteng
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5047358828
authorships[4].author.orcid
authorships[4].author.display_name	Yuan Shangguan
authorships[4].author_position	middle
authorships[4].raw_author_name	Shangguan, Yuan
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5100343555
authorships[5].author.orcid	https://orcid.org/0009-0006-9192-0487
authorships[5].author.display_name	Ke Li
authorships[5].author_position	middle
authorships[5].raw_author_name	Li, Ke
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5103232491
authorships[6].author.orcid	https://orcid.org/0000-0001-9563-7351
authorships[6].author.display_name	Jinxi Guo
authorships[6].author_position	middle
authorships[6].raw_author_name	Guo, Jinxi
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5110635444
authorships[7].author.orcid
authorships[7].author.display_name	Wenhan Xiong
authorships[7].author_position	middle
authorships[7].raw_author_name	Xiong, Wenhan
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A5074237839
authorships[8].author.orcid
authorships[8].author.display_name	Jay Mahadeokar
authorships[8].author_position	middle
authorships[8].raw_author_name	Mahadeokar, Jay
authorships[8].is_corresponding	False
authorships[9].author.id	https://openalex.org/A5066166549
authorships[9].author.orcid
authorships[9].author.display_name	Ozlem Kalinli
authorships[9].author_position	middle
authorships[9].raw_author_name	Kalinli, Ozlem
authorships[9].is_corresponding	False
authorships[10].author.id	https://openalex.org/A5047073253
authorships[10].author.orcid
authorships[10].author.display_name	Christian Fuegen
authorships[10].author_position	middle
authorships[10].raw_author_name	Fuegen, Christian
authorships[10].is_corresponding	False
authorships[11].author.id	https://openalex.org/A5113773386
authorships[11].author.orcid
authorships[11].author.display_name	Mike Seltzer
authorships[11].author_position	last
authorships[11].raw_author_name	Seltzer, Mike
authorships[11].is_corresponding	False
has_content.pdf	True
has_content.grobid_xml	True
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2307.11795
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	Prompting Large Language Models with Speech Recognition Abilities
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10201
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9987000226974487
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Speech Recognition and Synthesis
related_works	https://openalex.org/W2126322296, https://openalex.org/W2163537793, https://openalex.org/W2916997151, https://openalex.org/W2781555308, https://openalex.org/W3021690593, https://openalex.org/W2125343999, https://openalex.org/W4200200210, https://openalex.org/W2161188302, https://openalex.org/W2888189389, https://openalex.org/W2949174760
cited_by_count	2
counts_by_year[0].year	2025
counts_by_year[0].cited_by_count	1
counts_by_year[1].year	2024
counts_by_year[1].cited_by_count	1
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2307.11795
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license	other-oa
best_oa_location.pdf_url	https://arxiv.org/pdf/2307.11795
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id	https://openalex.org/licenses/other-oa
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2307.11795
primary_location.id	pmh:oai:arXiv.org:2307.11795
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license	other-oa
primary_location.pdf_url	https://arxiv.org/pdf/2307.11795
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id	https://openalex.org/licenses/other-oa
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2307.11795
publication_date	2023-07-21
publication_year	2023
referenced_works_count	0
abstract_inverted_index.1	179
abstract_inverted_index.a	11, 37, 50, 92
abstract_inverted_index.By	47
abstract_inverted_index.In	25
abstract_inverted_index.an	66
abstract_inverted_index.as	18, 80
abstract_inverted_index.be	63, 73, 132
abstract_inverted_index.by	34, 106
abstract_inverted_index.in	75, 183
abstract_inverted_index.is	166, 172
abstract_inverted_index.it	42, 101
abstract_inverted_index.of	14, 32, 52, 177
abstract_inverted_index.on	85, 118, 195
abstract_inverted_index.or	174
abstract_inverted_index.to	9, 43, 55, 65, 102, 126, 137, 153, 193
abstract_inverted_index.up	143, 188
abstract_inverted_index.we	28, 122
abstract_inverted_index.18%	107
abstract_inverted_index.ASR	165
abstract_inverted_index.LLM	61, 130, 171
abstract_inverted_index.The	157
abstract_inverted_index.and	21, 72, 108, 147
abstract_inverted_index.are	181
abstract_inverted_index.can	62, 131
abstract_inverted_index.for	191
abstract_inverted_index.its	81, 139
abstract_inverted_index.the	30, 56, 60, 76, 96, 129, 144, 149, 170, 184, 189
abstract_inverted_index.LLMs	33, 192
abstract_inverted_index.able	8
abstract_inverted_index.even	168
abstract_inverted_index.from	159
abstract_inverted_index.have	3
abstract_inverted_index.into	95
abstract_inverted_index.open	97
abstract_inverted_index.same	78
abstract_inverted_index.show	89, 162
abstract_inverted_index.such	17
abstract_inverted_index.text	57
abstract_inverted_index.that	90, 163
abstract_inverted_index.this	26
abstract_inverted_index.used	74, 182
abstract_inverted_index.when	169, 175
abstract_inverted_index.wide	12
abstract_inverted_index.(ASR)	70
abstract_inverted_index.(MLS)	88
abstract_inverted_index.LLaMA	114
abstract_inverted_index.Large	0
abstract_inverted_index.audio	39, 145, 150, 185
abstract_inverted_index.being	115
abstract_inverted_index.exact	77
abstract_inverted_index.fewer	155
abstract_inverted_index.paper	27
abstract_inverted_index.range	13
abstract_inverted_index.small	38
abstract_inverted_index.solve	10
abstract_inverted_index.text.	120
abstract_inverted_index.these	160
abstract_inverted_index.token	58
abstract_inverted_index.allows	100
abstract_inverted_index.almost	178
abstract_inverted_index.audial	53
abstract_inverted_index.audio.	197
abstract_inverted_index.during	135
abstract_inverted_index.extend	29
abstract_inverted_index.frozen	134, 173
abstract_inverted_index.highly	6
abstract_inverted_index.manner	79
abstract_inverted_index.models	2
abstract_inverted_index.proven	4
abstract_inverted_index.second	180
abstract_inverted_index.speech	45, 68, 111
abstract_inverted_index.tasks,	16
abstract_inverted_index.English	119
abstract_inverted_index.despite	113
abstract_inverted_index.encoder	40, 94, 151, 186
abstract_inverted_index.opening	187
abstract_inverted_index.operate	194
abstract_inverted_index.perform	44, 109, 123
abstract_inverted_index.results	158
abstract_inverted_index.scaling	142
abstract_inverted_index.sourced	98
abstract_inverted_index.strides	176
abstract_inverted_index.studies	125, 161
abstract_inverted_index.system,	71
abstract_inverted_index.textual	82
abstract_inverted_index.trained	116
abstract_inverted_index.whether	128
abstract_inverted_index.LLaMA-7B	99
abstract_inverted_index.ablation	124
abstract_inverted_index.allowing	41
abstract_inverted_index.directly	35, 48
abstract_inverted_index.encoder,	146
abstract_inverted_index.generate	154
abstract_inverted_index.language	1
abstract_inverted_index.maintain	138
abstract_inverted_index.original	140
abstract_inverted_index.possible	167
abstract_inverted_index.question	23
abstract_inverted_index.sequence	51
abstract_inverted_index.striding	152
abstract_inverted_index.training	136
abstract_inverted_index.attaching	36
abstract_inverted_index.automatic	67
abstract_inverted_index.baselines	105
abstract_inverted_index.conformer	93
abstract_inverted_index.converted	64
abstract_inverted_index.flexible,	7
abstract_inverted_index.long-form	196
abstract_inverted_index.answering.	24
abstract_inverted_index.completely	133
abstract_inverted_index.embeddings	54
abstract_inverted_index.generative	15
abstract_inverted_index.increasing	148
abstract_inverted_index.open-ended	22
abstract_inverted_index.outperform	103
abstract_inverted_index.prepending	49
abstract_inverted_index.themselves	5
abstract_inverted_index.Experiments	84
abstract_inverted_index.LibriSpeech	87
abstract_inverted_index.abstractive	19
abstract_inverted_index.embeddings,	59
abstract_inverted_index.embeddings.	156
abstract_inverted_index.investigate	127
abstract_inverted_index.monolingual	104
abstract_inverted_index.possibility	190
abstract_inverted_index.recognition	69, 112
abstract_inverted_index.Furthermore,	121
abstract_inverted_index.Multilingual	86
abstract_inverted_index.capabilities	31
abstract_inverted_index.counterpart.	83
abstract_inverted_index.multilingual	110, 164
abstract_inverted_index.recognition.	46
abstract_inverted_index.capabilities,	141
abstract_inverted_index.incorporating	91
abstract_inverted_index.summarization	20
abstract_inverted_index.overwhelmingly	117
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	12
sustainable_development_goals[0].id	https://metadata.un.org/sdg/4
sustainable_development_goals[0].score	0.6200000047683716
sustainable_development_goals[0].display_name	Quality Education
citation_normalized_percentile
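
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each word to the positions where it occurs. A minimal sketch of how such rows could be turned back into running text follows; the helper names `parse_row` and `rebuild_abstract` are hypothetical, and the tab-separated row layout is assumed from how the payload is printed here.

```python
PREFIX = "abstract_inverted_index."

def parse_row(row):
    """Split one 'key<TAB>positions' payload row into (word, [positions])."""
    key, _, value = row.partition("\t")
    word = key[len(PREFIX):]
    return word, [int(p) for p in value.split(",")]

def rebuild_abstract(inverted_index):
    """Place each word at every position listed for it, then join in order."""
    by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    return " ".join(by_position[i] for i in sorted(by_position))

# Entries for positions 0-7, copied from the payload rows above.
sample = {
    "Large": [0],
    "language": [1],
    "models": [2],
    "have": [3],
    "proven": [4],
    "themselves": [5],
    "highly": [6],
    "flexible,": [7],
}
text = rebuild_abstract(sample)
print(text)
```

Feeding every `abstract_inverted_index.*` row through `parse_row` and collecting the results into one dictionary would recover the full abstract in the same way.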