Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
2024 · Open Access
DOI: https://doi.org/10.48550/arxiv.2404.08801
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including the complex exponential moving average (CEMA), the timestep normalization layer, the normalized attention mechanism, and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than the Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code: https://github.com/XuezheMax/megalodon
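For intuition, the complex exponential moving average at the heart of Megalodon can be pictured as a damped EMA whose decay is rotated in the complex plane before the state is projected back to the reals. Below is a minimal, generic sketch of that recurrence in plain Python; the scalar constants (alpha, delta, theta, eta) are hypothetical, and the paper's actual CEMA is a multi-dimensional, learned parameterization, so this is illustrative only.

```python
import cmath

def complex_ema(xs, alpha=0.5, delta=0.9, theta=0.3, eta=1.0 + 0.5j):
    """Illustrative complex exponential moving average over a sequence.

    A generic sketch of the idea behind CEMA (a damped EMA whose decay is
    rotated in the complex plane); the constants here are hypothetical and
    this is not the paper's exact multi-dimensional parameterization.
    """
    phase = cmath.exp(1j * theta)      # rotation applied at every step
    h = 0j                             # complex hidden state
    ys = []
    for x in xs:
        # blend the new input with the damped, rotated previous state
        h = alpha * phase * x + (1 - alpha * delta) * phase * h
        ys.append((eta * h).real)      # project back to a real output
    return ys

print(complex_ema([1.0, 0.0, 0.0, 0.0]))  # impulse response of the filter
```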
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2404.08801, https://arxiv.org/pdf/2404.08801
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4394906676
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4394906676 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2404.08801 (Digital Object Identifier)
- Title: Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-04-12 (full publication date if available)
- Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou (list of authors in order)
- Landing page: https://arxiv.org/abs/2404.08801 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2404.08801 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2404.08801 (direct OA link when available)
- Concepts: Inference, Context (archaeology), Computer science, Artificial intelligence, Psychology, History, Archaeology (top concepts attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
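The fields above (and the full payload below) mirror what the public OpenAlex works API returns for this record. A minimal sketch of fetching it, assuming the standard https://api.openalex.org endpoint and the requests library:

```python
import requests  # third-party HTTP client; any equivalent works

# Fetch the raw OpenAlex record for this work by its OpenAlex ID.
OPENALEX_ID = "W4394906676"
resp = requests.get(f"https://api.openalex.org/works/{OPENALEX_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])              # title
print(work["open_access"]["oa_status"])  # e.g. "green"
print(work["cited_by_count"])            # total citation count
```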
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4394906676 |
| doi | https://doi.org/10.48550/arxiv.2404.08801 |
| ids.doi | https://doi.org/10.48550/arxiv.2404.08801 |
| ids.openalex | https://openalex.org/W4394906676 |
| fwci | |
| type | preprint |
| title | Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9180999994277954 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10601 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9013000130653381 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Handwritten Text Recognition Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776214188 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7314821481704712 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[0].display_name | Inference |
| concepts[1].id | https://openalex.org/C2779343474 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6485859155654907 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[1].display_name | Context (archaeology) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.5210389494895935 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.35123568773269653 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C15744967 |
| concepts[4].level | 0 |
| concepts[4].score | 0.3364328145980835 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[4].display_name | Psychology |
| concepts[5].id | https://openalex.org/C95457728 |
| concepts[5].level | 0 |
| concepts[5].score | 0.08699268102645874 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q309 |
| concepts[5].display_name | History |
| concepts[6].id | https://openalex.org/C166957645 |
| concepts[6].level | 1 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q23498 |
| concepts[6].display_name | Archaeology |
| keywords[0].id | https://openalex.org/keywords/inference |
| keywords[0].score | 0.7314821481704712 |
| keywords[0].display_name | Inference |
| keywords[1].id | https://openalex.org/keywords/context |
| keywords[1].score | 0.6485859155654907 |
| keywords[1].display_name | Context (archaeology) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.5210389494895935 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.35123568773269653 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/psychology |
| keywords[4].score | 0.3364328145980835 |
| keywords[4].display_name | Psychology |
| keywords[5].id | https://openalex.org/keywords/history |
| keywords[5].score | 0.08699268102645874 |
| keywords[5].display_name | History |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2404.08801 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2404.08801 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2404.08801 |
| locations[1].id | doi:10.48550/arxiv.2404.08801 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2404.08801 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5078672329 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-7582-1653 |
| authorships[0].author.display_name | Xuezhe Ma |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Ma, Xuezhe |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100671724 |
| authorships[1].author.orcid | https://orcid.org/0009-0003-8421-1613 |
| authorships[1].author.display_name | Xiaomeng Yang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Yang, Xiaomeng |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5110635444 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Wenhan Xiong |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Xiong, Wenhan |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5113181027 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Beidi Chen |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Chen, Beidi |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100609003 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5832-1024 |
| authorships[4].author.display_name | Lili Yu |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Yu, Lili |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5100396911 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-4527-9610 |
| authorships[5].author.display_name | Hao Zhang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Zhang, Hao |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5000874697 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5284-477X |
| authorships[6].author.display_name | Jonathan May |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | May, Jonathan |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5067919401 |
| authorships[7].author.orcid | https://orcid.org/0009-0008-8296-0764 |
| authorships[7].author.display_name | Luke Zettlemoyer |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Zettlemoyer, Luke |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5024311574 |
| authorships[8].author.orcid | https://orcid.org/0000-0001-7300-8191 |
| authorships[8].author.display_name | Omer Levy |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Levy, Omer |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5110252779 |
| authorships[9].author.orcid | |
| authorships[9].author.display_name | Chunting Zhou |
| authorships[9].author_position | last |
| authorships[9].raw_author_name | Zhou, Chunting |
| authorships[9].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2404.08801 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9180999994277954 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W4391913857, https://openalex.org/W2350741829, https://openalex.org/W2530322880 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2404.08801 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2404.08801 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2404.08801 |
| primary_location.id | pmh:oai:arXiv.org:2404.08801 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2404.08801 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2404.08801 |
| publication_date | 2024-04-12 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-position index of the abstract; omitted here, see the abstract text above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 10 |
| citation_normalized_percentile |
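The abstract_inverted_index field stored in the payload maps each word of the abstract to the positions where it occurs; the plain text can be rebuilt by sorting those positions. A small sketch, assuming the standard OpenAlex inverted-index format (word to list of integer positions):

```python
def rebuild_abstract(inverted_index: dict) -> str:
    """Reconstruct abstract text from an OpenAlex abstract_inverted_index."""
    words_by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            words_by_position[pos] = word
    return " ".join(words_by_position[pos] for pos in sorted(words_by_position))

# Tiny example index covering the first three words of this abstract.
print(rebuild_abstract({"The": [0], "quadratic": [1], "complexity": [2]}))
```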