Heptapod: Language Modeling on Visual Signals
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2510.06673
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs **causal attention**, **eliminates reliance on classifier-free guidance (CFG)**, and **eschews the trend of semantic tokenizers**. Our key innovation is *next 2D distribution prediction*: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of the image at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
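To make the training objective concrete, below is a minimal PyTorch sketch of what "next 2D distribution prediction" could look like. Everything here is an illustrative assumption rather than the paper's implementation: the module names, the codebook size `V`, grid size `H`×`W`, model width `D`, trunk depth, and the naive unfactorized head are all stand-ins; only the shape of the objective (causal states predicting logits over the whole grid at every step) follows the abstract.

```python
# Illustrative sketch only: shapes, names, and hyperparameters are
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, H, W, D = 256, 8, 8, 256   # assumed codebook size, grid, model width
T = H * W                     # one discrete token per grid cell

class Next2DDistributionHead(nn.Module):
    """At each timestep, output logits over ALL H*W grid cells,
    i.e. a full 2D distribution instead of a single next token."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, T * V)

    def forward(self, h):                              # h: (B, T, D)
        return self.proj(h).view(h.size(0), T, T, V)   # (B, step, cell, vocab)

embed = nn.Embedding(V, D)
trunk = nn.TransformerEncoder(                         # causal Transformer trunk
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2)
head = Next2DDistributionHead()

# Discrete codes from a reconstruction-focused tokenizer (random stand-in).
tokens = torch.randint(0, V, (2, T))
causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
hidden = trunk(embed(tokens), mask=causal_mask)
logits = head(hidden)                                  # (B, T, T, V)

# At every step the target is the WHOLE grid, so the loss covers every
# cell (masked-autoencoding flavor), not just the next position.
target = tokens.unsqueeze(1).expand(-1, T, -1)         # (B, T, T)
loss = F.cross_entropy(logits.reshape(-1, V), target.reshape(-1))
loss.backward()
```

A real model would presumably factorize the per-step (T·V)-way head rather than use one dense projection; the sketch only illustrates how a causal sequential trunk can carry a holistic, grid-wide prediction target.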
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415316600 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2510.06673 (Digital Object Identifier)
- Title: Heptapod: Language Modeling on Visual Signals
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025
- Publication date: 2025-10-08
- Authors: Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yu‐Ping Wang, Yuxuan Wang
- Landing page: https://arxiv.org/abs/2510.06673
- PDF URL: https://arxiv.org/pdf/2510.06673
- Open access: yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2510.06673
- Cited by: 0 (total citation count in OpenAlex)
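For readers who want the unflattened record, it can be re-fetched from the OpenAlex REST API. A minimal Python sketch follows; the works endpoint is OpenAlex's documented API, the `mailto` parameter is optional polite-pool etiquette, and the contact address is a placeholder:

```python
# Fetch this work's raw record from the OpenAlex API.
# Endpoint form: https://api.openalex.org/works/{openalex_id}
import requests

work_id = "W4415316600"  # OpenAlex ID from the record above
resp = requests.get(
    f"https://api.openalex.org/works/{work_id}",
    params={"mailto": "you@example.com"},  # placeholder contact address
    timeout=30,
)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])              # Heptapod: Language Modeling on Visual Signals
print(work["doi"])                       # https://doi.org/10.48550/arxiv.2510.06673
print(work["open_access"]["oa_status"])  # green
```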
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415316600 |
| doi | https://doi.org/10.48550/arxiv.2510.06673 |
| ids.doi | https://doi.org/10.48550/arxiv.2510.06673 |
| ids.openalex | https://openalex.org/W4415316600 |
| fwci | |
| type | preprint |
| title | Heptapod: Language Modeling on Visual Signals |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.10899999737739563 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2510.06673 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2510.06673 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2510.06673 |
| locations[1].id | doi:10.48550/arxiv.2510.06673 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2510.06673 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5086260915 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-1813-1792 |
| authorships[0].author.display_name | Yongxin Zhu |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhu, Yongxin |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100362822 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8195-1582 |
| authorships[1].author.display_name | Jiawei Chen |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Chen, Jiawei |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5055175414 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Yuanzhe Chen |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chen, Yuanzhe |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100345062 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-9991-6892 |
| authorships[3].author.display_name | Zhuo Chen |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Chen, Zhuo |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5060687079 |
| authorships[4].author.orcid | https://orcid.org/0000-0002-6307-8580 |
| authorships[4].author.display_name | Dongya Jia |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Jia, Dongya |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5110372538 |
| authorships[5].author.orcid | https://orcid.org/0000-0003-3775-6883 |
| authorships[5].author.display_name | Jian Cong |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Cong, Jian |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5001776951 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-0285-6705 |
| authorships[6].author.display_name | Xiaobin Zhuang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zhuang, Xiaobin |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100339106 |
| authorships[7].author.orcid | https://orcid.org/0000-0001-9340-5864 |
| authorships[7].author.display_name | Yu‐Ping Wang |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Wang, Yuping |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5100375979 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-1649-6974 |
| authorships[8].author.display_name | Yuxuan Wang |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Wang, Yuxuan |
| authorships[8].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2510.06673 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-18T00:00:00 |
| display_name | Heptapod: Language Modeling on Visual Signals |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.10899999737739563 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2510.06673 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2510.06673 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2510.06673 |
| primary_location.id | pmh:oai:arXiv.org:2510.06673 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2510.06673 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2510.06673 |
| publication_date | 2025-10-08 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index | (machine-readable inverted index of the abstract; plain text shown above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
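One payload field worth decoding: OpenAlex does not store the abstract as plain text but as `abstract_inverted_index`, a map from each word to the positions where it occurs. A small sketch of how the plain abstract above can be rebuilt from it; the sample dict is a slice of this work's actual index:

```python
# Rebuild a plain-text abstract from OpenAlex's abstract_inverted_index,
# which maps each word to the list of positions where it occurs.
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))

# Slice of the index from this work's payload:
sample = {"We": [0], "introduce": [1], "Heptapod,": [2], "an": [3],
          "image": [4], "autoregressive": [5], "model": [6]}
print(reconstruct_abstract(sample))
# -> "We introduce Heptapod, an image autoregressive model"
```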