Compress image to patches for Vision Transformer

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.10120

The Vision Transformer (ViT) has made significant strides in computer vision. However, as model depth and input image resolution increase, the computational cost of training and running ViT models rises sharply. This paper proposes a hybrid model based on CNN and Vision Transformer, named CI2P-ViT. The model incorporates a module called CI2P, which uses the CompressAI encoder to compress images and then generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component of a ViT, so it integrates directly into existing ViT models. Compared with ViT-B/16, CI2P-ViT reduces the number of patches fed to the self-attention layers to a quarter of the original. This design significantly reduces the computational cost of the ViT model and also improves its accuracy by introducing the inductive bias of CNNs. When trained from scratch on the Animals-10 dataset, CI2P-ViT achieved an accuracy of 92.37%, a 3.3% improvement over the ViT-B/16 baseline. In addition, the model's computational cost, measured in floating-point operations (FLOPs), was reduced by 63.35%, and the model showed a 2-fold increase in training speed on identical hardware.
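The effect of quartering the patch count can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes a 224x224 input, the standard ViT-B hidden size of 768 and an MLP ratio of 4, ignores the class token, and counts multiply-accumulates in a single Transformer encoder block only. The helper names `num_patches` and `encoder_block_macs` are hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    side = image_size // patch_size
    return side * side


def encoder_block_macs(n_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one Transformer encoder block."""
    proj = 4 * n_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim        # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two linear layers of the MLP
    return proj + attn + mlp


# Assumed settings (not stated in the abstract): 224x224 input, ViT-B width.
IMAGE_SIZE = 224
PATCH_SIZE = 16
DIM = 768

vit_tokens = num_patches(IMAGE_SIZE, PATCH_SIZE)  # patches for ViT-B/16
ci2p_tokens = vit_tokens // 4                     # "a quarter of the original"
ratio = encoder_block_macs(ci2p_tokens, DIM) / encoder_block_macs(vit_tokens, DIM)

print(f"ViT-B/16 patches: {vit_tokens}")
print(f"CI2P-ViT patches: {ci2p_tokens}")
print(f"Per-block cost relative to ViT-B/16: {ratio:.3f}")
```

Under these assumptions the encoder blocks alone would cost a little under a quarter of the baseline. The reported overall reduction of 63.35% is smaller than that, which would be consistent with the CI2P front end adding cost of its own, although the abstract does not break the figure down.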

Related Topics

Transformer

Computer Vision

Artificial Intelligence

Computer Science

Engineering

Electrical Engineering

Voltage

Concepts

Transformer, Computer vision, Artificial intelligence, Image (mathematics), Computer science, Engineering, Electrical engineering, Voltage

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2502.10120
PDF: https://arxiv.org/pdf/2502.10120
OA Status: green
Related Works: 10
OpenAlex ID: https://openalex.org/W4407632581

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4407632581
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2502.10120
Digital Object Identifier

Title: Compress image to patches for Vision Transformer
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-02-14
Full publication date if available

Authors: Xinfeng Zhao, Yaoru Sun
List of authors in order

Landing page: https://arxiv.org/abs/2502.10120
Publisher landing page

PDF URL: https://arxiv.org/pdf/2502.10120
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2502.10120
Direct OA link when available

Concepts: Transformer, Computer vision, Artificial intelligence, Image (mathematics), Computer science, Engineering, Electrical engineering, Voltage
Top concepts (fields/topics) attached by OpenAlex

Cited by: 0
Total citation count in OpenAlex

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4407632581
doi	https://doi.org/10.48550/arxiv.2502.10120
ids.doi	https://doi.org/10.48550/arxiv.2502.10120
ids.openalex	https://openalex.org/W4407632581
fwci
type	preprint
title	Compress image to patches for Vision Transformer
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T13114
topics[0].field.id	https://openalex.org/fields/22
topics[0].field.display_name	Engineering
topics[0].score	0.9441999793052673
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/2214
topics[0].subfield.display_name	Media Technology
topics[0].display_name	Image Processing Techniques and Applications
topics[1].id	https://openalex.org/T11992
topics[1].field.id	https://openalex.org/fields/22
topics[1].field.display_name	Engineering
topics[1].score	0.9251999855041504
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/2208
topics[1].subfield.display_name	Electrical and Electronic Engineering
topics[1].display_name	CCD and CMOS Imaging Sensors
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C66322947
concepts[0].level	3
concepts[0].score	0.5535224676132202
concepts[0].wikidata	https://www.wikidata.org/wiki/Q11658
concepts[0].display_name	Transformer
concepts[1].id	https://openalex.org/C31972630
concepts[1].level	1
concepts[1].score	0.5527215003967285
concepts[1].wikidata	https://www.wikidata.org/wiki/Q844240
concepts[1].display_name	Computer vision
concepts[2].id	https://openalex.org/C154945302
concepts[2].level	1
concepts[2].score	0.49322885274887085
concepts[2].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[2].display_name	Artificial intelligence
concepts[3].id	https://openalex.org/C115961682
concepts[3].level	2
concepts[3].score	0.47320717573165894
concepts[3].wikidata	https://www.wikidata.org/wiki/Q860623
concepts[3].display_name	Image (mathematics)
concepts[4].id	https://openalex.org/C41008148
concepts[4].level	0
concepts[4].score	0.46167656779289246
concepts[4].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[4].display_name	Computer science
concepts[5].id	https://openalex.org/C127413603
concepts[5].level	0
concepts[5].score	0.18103134632110596
concepts[5].wikidata	https://www.wikidata.org/wiki/Q11023
concepts[5].display_name	Engineering
concepts[6].id	https://openalex.org/C119599485
concepts[6].level	1
concepts[6].score	0.10958865284919739
concepts[6].wikidata	https://www.wikidata.org/wiki/Q43035
concepts[6].display_name	Electrical engineering
concepts[7].id	https://openalex.org/C165801399
concepts[7].level	2
concepts[7].score	0.07948896288871765
concepts[7].wikidata	https://www.wikidata.org/wiki/Q25428
concepts[7].display_name	Voltage
keywords[0].id	https://openalex.org/keywords/transformer
keywords[0].score	0.5535224676132202
keywords[0].display_name	Transformer
keywords[1].id	https://openalex.org/keywords/computer-vision
keywords[1].score	0.5527215003967285
keywords[1].display_name	Computer vision
keywords[2].id	https://openalex.org/keywords/artificial-intelligence
keywords[2].score	0.49322885274887085
keywords[2].display_name	Artificial intelligence
keywords[3].id	https://openalex.org/keywords/image
keywords[3].score	0.47320717573165894
keywords[3].display_name	Image (mathematics)
keywords[4].id	https://openalex.org/keywords/computer-science
keywords[4].score	0.46167656779289246
keywords[4].display_name	Computer science
keywords[5].id	https://openalex.org/keywords/engineering
keywords[5].score	0.18103134632110596
keywords[5].display_name	Engineering
keywords[6].id	https://openalex.org/keywords/electrical-engineering
keywords[6].score	0.10958865284919739
keywords[6].display_name	Electrical engineering
keywords[7].id	https://openalex.org/keywords/voltage
keywords[7].score	0.07948896288871765
keywords[7].display_name	Voltage
language	en
locations[0].id	pmh:oai:arXiv.org:2502.10120
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2502.10120
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2502.10120
locations[1].id	doi:10.48550/arxiv.2502.10120
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license	cc-by
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id	https://openalex.org/licenses/cc-by
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2502.10120
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5031636414
authorships[0].author.orcid	https://orcid.org/0000-0001-8824-738X
authorships[0].author.display_name	Xinfeng Zhao
authorships[0].author_position	first
authorships[0].raw_author_name	Zhao, Xinfeng
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5041454001
authorships[1].author.orcid
authorships[1].author.display_name	Yaoru Sun
authorships[1].author_position	last
authorships[1].raw_author_name	Sun, Yaoru
authorships[1].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2502.10120
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	Compress image to patches for Vision Transformer
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T13114
primary_topic.field.id	https://openalex.org/fields/22
primary_topic.field.display_name	Engineering
primary_topic.score	0.9441999793052673
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/2214
primary_topic.subfield.display_name	Media Technology
primary_topic.display_name	Image Processing Techniques and Applications
related_works	https://openalex.org/W2772917594, https://openalex.org/W2036807459, https://openalex.org/W2058170566, https://openalex.org/W2755342338, https://openalex.org/W2166024367, https://openalex.org/W3116076068, https://openalex.org/W2229312674, https://openalex.org/W2951359407, https://openalex.org/W2079911747, https://openalex.org/W1969923398
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2502.10120
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2502.10120
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2502.10120
primary_location.id	pmh:oai:arXiv.org:2502.10120
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2502.10120
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2502.10120
publication_date	2025-02-14
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	45, 59, 74, 79, 117, 175, 201
abstract_inverted_index.an	169
abstract_inverted_index.as	15
abstract_inverted_index.by	142, 196
abstract_inverted_index.in	8, 90, 188, 204
abstract_inverted_index.is	154
abstract_inverted_index.it	199
abstract_inverted_index.of	11, 18, 24, 76, 81, 108, 119, 131, 148, 172
abstract_inverted_index.on	49, 163, 207
abstract_inverted_index.to	68, 102, 111, 116
abstract_inverted_index.up	162
abstract_inverted_index.CNN	50
abstract_inverted_index.The	0, 56, 150
abstract_inverted_index.ViT	37, 92, 99, 133, 151
abstract_inverted_index.and	21, 35, 51, 71, 198
abstract_inverted_index.but	135
abstract_inverted_index.can	84
abstract_inverted_index.has	4, 39, 105
abstract_inverted_index.not	124
abstract_inverted_index.per	191
abstract_inverted_index.the	9, 16, 19, 22, 25, 29, 65, 86, 91, 106, 112, 120, 128, 132, 139, 144, 160, 164, 179, 183
abstract_inverted_index.3.3%	176
abstract_inverted_index.CI2P	83
abstract_inverted_index.CNN.	149
abstract_inverted_index.This	42, 122
abstract_inverted_index.When	157
abstract_inverted_index.also	136
abstract_inverted_index.bias	146
abstract_inverted_index.cost	31, 130
abstract_inverted_index.from	159
abstract_inverted_index.into	97
abstract_inverted_index.made	5
abstract_inverted_index.only	125
abstract_inverted_index.over	178
abstract_inverted_index.rate	171
abstract_inverted_index.were	194
abstract_inverted_index.with	33
abstract_inverted_index.(ViT)	3
abstract_inverted_index.CI2P,	62
abstract_inverted_index.Patch	87
abstract_inverted_index.based	48
abstract_inverted_index.depth	17
abstract_inverted_index.field	10
abstract_inverted_index.input	26, 110
abstract_inverted_index.layer	114
abstract_inverted_index.model	20, 47, 57, 134
abstract_inverted_index.named	54
abstract_inverted_index.paper	43
abstract_inverted_index.which	63
abstract_inverted_index.2-fold	202
abstract_inverted_index.Vision	1, 52
abstract_inverted_index.called	61
abstract_inverted_index.design	123
abstract_inverted_index.ground	161
abstract_inverted_index.hybrid	46
abstract_inverted_index.images	27, 70
abstract_inverted_index.model,	93
abstract_inverted_index.models	38
abstract_inverted_index.module	60
abstract_inverted_index.number	107
abstract_inverted_index.second	192
abstract_inverted_index.series	80
abstract_inverted_index.surged	40
abstract_inverted_index.63.35%,	197
abstract_inverted_index.92.37%,	173
abstract_inverted_index.encoder	67
abstract_inverted_index.model's	140, 152, 184
abstract_inverted_index.models.	100
abstract_inverted_index.patches	77, 109
abstract_inverted_index.quarter	118
abstract_inverted_index.reduced	115
abstract_inverted_index.reduces	127
abstract_inverted_index.replace	85
abstract_inverted_index.running	36
abstract_inverted_index.strides	7
abstract_inverted_index.through	78
abstract_inverted_index.trained	158
abstract_inverted_index.vision.	13
abstract_inverted_index.(FLOPs),	193
abstract_inverted_index.CI2P-ViT	104, 167
abstract_inverted_index.Compared	101
abstract_inverted_index.However,	14
abstract_inverted_index.ViT-B/16	180
abstract_inverted_index.accuracy	141, 170
abstract_inverted_index.achieved	168
abstract_inverted_index.compress	69
abstract_inverted_index.computer	12
abstract_inverted_index.dataset,	166
abstract_inverted_index.enabling	94
abstract_inverted_index.enhances	138
abstract_inverted_index.existing	98
abstract_inverted_index.hardware	209
abstract_inverted_index.increase	203
abstract_inverted_index.markedly	155
abstract_inverted_index.measured	187
abstract_inverted_index.proposes	44
abstract_inverted_index.seamless	95
abstract_inverted_index.sequence	75
abstract_inverted_index.training	34, 205
abstract_inverted_index.utilizes	64
abstract_inverted_index.velocity	206
abstract_inverted_index.CI2P-ViT.	55
abstract_inverted_index.Embedding	88
abstract_inverted_index.ViT-B/16,	103
abstract_inverted_index.baseline.	181
abstract_inverted_index.component	89
abstract_inverted_index.enhanced.	156
abstract_inverted_index.exhibited	200
abstract_inverted_index.generates	73
abstract_inverted_index.identical	208
abstract_inverted_index.increase,	28
abstract_inverted_index.inductive	145
abstract_inverted_index.original.	121
abstract_inverted_index.precision	153
abstract_inverted_index.Animals-10	165
abstract_inverted_index.CompressAI	66
abstract_inverted_index.associated	32
abstract_inverted_index.diminished	195
abstract_inverted_index.operations	190
abstract_inverted_index.properties	147
abstract_inverted_index.resolution	23
abstract_inverted_index.Transformer	2
abstract_inverted_index.effectively	137
abstract_inverted_index.improvement	177
abstract_inverted_index.integration	96
abstract_inverted_index.introducing	143
abstract_inverted_index.operations,	186
abstract_inverted_index.significant	6
abstract_inverted_index.Transformer,	53
abstract_inverted_index.incorporates	58
abstract_inverted_index.representing	174
abstract_inverted_index.subsequently	72
abstract_inverted_index.Additionally,	182
abstract_inverted_index.computational	30, 129, 185
abstract_inverted_index.convolutions.	82
abstract_inverted_index.dramatically.	41
abstract_inverted_index.significantly	126
abstract_inverted_index.floating-point	189
abstract_inverted_index.self-attention	113
abstract_inverted_index.configurations.	210
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	2
citation_normalized_percentile
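
The `abstract_inverted_index` entries in the payload above map each token of the abstract to the word positions at which it occurs. A minimal sketch of turning such an index back into running text follows. It assumes the index has been loaded as a dictionary from token to a list of integer positions; the function name `rebuild_abstract` is hypothetical, and the sample is an excerpt of the payload with positions trimmed to the first eight words.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild text from a token -> positions mapping."""
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Join tokens in ascending position order; gaps in positions are skipped.
    return " ".join(by_position[i] for i in sorted(by_position))


# Excerpt of the payload, trimmed to positions 0-7 for illustration.
sample_index = {
    "The": [0],
    "Vision": [1],
    "Transformer": [2],
    "(ViT)": [3],
    "has": [4],
    "made": [5],
    "significant": [6],
    "strides": [7],
}

text = rebuild_abstract(sample_index)
print(text)
```

Because punctuation stays attached to the tokens (for example `vision.` and `CI2P,` in the payload), joining on single spaces is enough to recover the sentence boundaries of the original abstract.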