Compress image to patches for Vision Transformer
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.10120
The Vision Transformer (ViT) has made significant strides in computer vision. However, as model depth and input image resolution increase, the computational cost of training and running ViT models rises sharply. This paper proposes CI2P-ViT, a hybrid model based on a CNN and the Vision Transformer. The model incorporates a module called CI2P, which uses the CompressAI encoder to compress images and then generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component of a ViT, so it integrates into existing ViT models. Compared to ViT-B/16, CI2P-ViT feeds a quarter as many patches to the self-attention layers. This design significantly reduces the computational cost of the ViT model and also improves its accuracy by introducing the inductive bias of CNNs. When trained from scratch on the Animals-10 dataset, CI2P-ViT achieved an accuracy of 92.37%, a 3.3% improvement over the ViT-B/16 baseline. In addition, the model's computational cost, measured in floating-point operations (FLOPs), fell by 63.35%, and training ran twice as fast on identical hardware.
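The patch-count claim can be made concrete with a little arithmetic. The sketch below assumes a 224×224 input, a common ViT-B/16 setting that the abstract does not state, ignores the class token, and counts only the pairwise token interactions in one self-attention map. The function names are illustrative and not taken from the paper.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pair_count(num_tokens: int) -> int:
    """Token-pair interactions in one self-attention map (grows as N^2)."""
    return num_tokens * num_tokens


# Assumption: 224x224 input; the abstract does not give the resolution.
IMAGE_SIZE = 224
vit_patches = num_patches(IMAGE_SIZE, 16)  # ViT-B/16 patch embedding
ci2p_patches = vit_patches // 4            # "a quarter" of the patches

# Ratio of attention-map sizes after the reduction.
ratio = attention_pair_count(ci2p_patches) / attention_pair_count(vit_patches)
print(vit_patches, ci2p_patches, ratio)  # prints: 196 49 0.0625
```

Under these assumptions the attention map shrinks to one sixteenth of its original size. The overall FLOPs figure reported in the abstract also covers the remaining layers and the CI2P module itself, so it need not match this attention-only ratio.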
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.10120, https://arxiv.org/pdf/2502.10120
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4407632581
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4407632581 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.10120 (Digital Object Identifier)
- Title: Compress image to patches for Vision Transformer (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-02-14 (full publication date if available)
- Authors: Xinfeng Zhao, Yaoru Sun (list of authors in order)
- Landing page: https://arxiv.org/abs/2502.10120 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.10120 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.10120 (direct OA link when available)
- Concepts: Transformer, Computer vision, Artificial intelligence, Image (mathematics), Computer science, Engineering, Electrical engineering, Voltage (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4407632581 |
| doi | https://doi.org/10.48550/arxiv.2502.10120 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.10120 |
| ids.openalex | https://openalex.org/W4407632581 |
| fwci | |
| type | preprint |
| title | Compress image to patches for Vision Transformer |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T13114 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.9441999793052673 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2214 |
| topics[0].subfield.display_name | Media Technology |
| topics[0].display_name | Image Processing Techniques and Applications |
| topics[1].id | https://openalex.org/T11992 |
| topics[1].field.id | https://openalex.org/fields/22 |
| topics[1].field.display_name | Engineering |
| topics[1].score | 0.9251999855041504 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/2208 |
| topics[1].subfield.display_name | Electrical and Electronic Engineering |
| topics[1].display_name | CCD and CMOS Imaging Sensors |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C66322947 |
| concepts[0].level | 3 |
| concepts[0].score | 0.5535224676132202 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[0].display_name | Transformer |
| concepts[1].id | https://openalex.org/C31972630 |
| concepts[1].level | 1 |
| concepts[1].score | 0.5527215003967285 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[1].display_name | Computer vision |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.49322885274887085 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C115961682 |
| concepts[3].level | 2 |
| concepts[3].score | 0.47320717573165894 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[3].display_name | Image (mathematics) |
| concepts[4].id | https://openalex.org/C41008148 |
| concepts[4].level | 0 |
| concepts[4].score | 0.46167656779289246 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[4].display_name | Computer science |
| concepts[5].id | https://openalex.org/C127413603 |
| concepts[5].level | 0 |
| concepts[5].score | 0.18103134632110596 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11023 |
| concepts[5].display_name | Engineering |
| concepts[6].id | https://openalex.org/C119599485 |
| concepts[6].level | 1 |
| concepts[6].score | 0.10958865284919739 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q43035 |
| concepts[6].display_name | Electrical engineering |
| concepts[7].id | https://openalex.org/C165801399 |
| concepts[7].level | 2 |
| concepts[7].score | 0.07948896288871765 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[7].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/transformer |
| keywords[0].score | 0.5535224676132202 |
| keywords[0].display_name | Transformer |
| keywords[1].id | https://openalex.org/keywords/computer-vision |
| keywords[1].score | 0.5527215003967285 |
| keywords[1].display_name | Computer vision |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.49322885274887085 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/image |
| keywords[3].score | 0.47320717573165894 |
| keywords[3].display_name | Image (mathematics) |
| keywords[4].id | https://openalex.org/keywords/computer-science |
| keywords[4].score | 0.46167656779289246 |
| keywords[4].display_name | Computer science |
| keywords[5].id | https://openalex.org/keywords/engineering |
| keywords[5].score | 0.18103134632110596 |
| keywords[5].display_name | Engineering |
| keywords[6].id | https://openalex.org/keywords/electrical-engineering |
| keywords[6].score | 0.10958865284919739 |
| keywords[6].display_name | Electrical engineering |
| keywords[7].id | https://openalex.org/keywords/voltage |
| keywords[7].score | 0.07948896288871765 |
| keywords[7].display_name | Voltage |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.10120 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.10120 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.10120 |
| locations[1].id | doi:10.48550/arxiv.2502.10120 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.10120 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5031636414 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-8824-738X |
| authorships[0].author.display_name | Xinfeng Zhao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zhao, Xinfeng |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5041454001 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yaoru Sun |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Sun, Yaoru |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.10120 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Compress image to patches for Vision Transformer |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T13114 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.9441999793052673 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2214 |
| primary_topic.subfield.display_name | Media Technology |
| primary_topic.display_name | Image Processing Techniques and Applications |
| related_works | https://openalex.org/W2772917594, https://openalex.org/W2036807459, https://openalex.org/W2058170566, https://openalex.org/W2755342338, https://openalex.org/W2166024367, https://openalex.org/W3116076068, https://openalex.org/W2229312674, https://openalex.org/W2951359407, https://openalex.org/W2079911747, https://openalex.org/W1969923398 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.10120 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.10120 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.10120 |
| primary_location.id | pmh:oai:arXiv.org:2502.10120 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.10120 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.10120 |
| publication_date | 2025-02-14 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 45, 59, 74, 79, 117, 175, 201 |
| abstract_inverted_index.an | 169 |
| abstract_inverted_index.as | 15 |
| abstract_inverted_index.by | 142, 196 |
| abstract_inverted_index.in | 8, 90, 188, 204 |
| abstract_inverted_index.is | 154 |
| abstract_inverted_index.it | 199 |
| abstract_inverted_index.of | 11, 18, 24, 76, 81, 108, 119, 131, 148, 172 |
| abstract_inverted_index.on | 49, 163, 207 |
| abstract_inverted_index.to | 68, 102, 111, 116 |
| abstract_inverted_index.up | 162 |
| abstract_inverted_index.CNN | 50 |
| abstract_inverted_index.The | 0, 56, 150 |
| abstract_inverted_index.ViT | 37, 92, 99, 133, 151 |
| abstract_inverted_index.and | 21, 35, 51, 71, 198 |
| abstract_inverted_index.but | 135 |
| abstract_inverted_index.can | 84 |
| abstract_inverted_index.has | 4, 39, 105 |
| abstract_inverted_index.not | 124 |
| abstract_inverted_index.per | 191 |
| abstract_inverted_index.the | 9, 16, 19, 22, 25, 29, 65, 86, 91, 106, 112, 120, 128, 132, 139, 144, 160, 164, 179, 183 |
| abstract_inverted_index.3.3% | 176 |
| abstract_inverted_index.CI2P | 83 |
| abstract_inverted_index.CNN. | 149 |
| abstract_inverted_index.This | 42, 122 |
| abstract_inverted_index.When | 157 |
| abstract_inverted_index.also | 136 |
| abstract_inverted_index.bias | 146 |
| abstract_inverted_index.cost | 31, 130 |
| abstract_inverted_index.from | 159 |
| abstract_inverted_index.into | 97 |
| abstract_inverted_index.made | 5 |
| abstract_inverted_index.only | 125 |
| abstract_inverted_index.over | 178 |
| abstract_inverted_index.rate | 171 |
| abstract_inverted_index.were | 194 |
| abstract_inverted_index.with | 33 |
| abstract_inverted_index.(ViT) | 3 |
| abstract_inverted_index.CI2P, | 62 |
| abstract_inverted_index.Patch | 87 |
| abstract_inverted_index.based | 48 |
| abstract_inverted_index.depth | 17 |
| abstract_inverted_index.field | 10 |
| abstract_inverted_index.input | 26, 110 |
| abstract_inverted_index.layer | 114 |
| abstract_inverted_index.model | 20, 47, 57, 134 |
| abstract_inverted_index.named | 54 |
| abstract_inverted_index.paper | 43 |
| abstract_inverted_index.which | 63 |
| abstract_inverted_index.2-fold | 202 |
| abstract_inverted_index.Vision | 1, 52 |
| abstract_inverted_index.called | 61 |
| abstract_inverted_index.design | 123 |
| abstract_inverted_index.ground | 161 |
| abstract_inverted_index.hybrid | 46 |
| abstract_inverted_index.images | 27, 70 |
| abstract_inverted_index.model, | 93 |
| abstract_inverted_index.models | 38 |
| abstract_inverted_index.module | 60 |
| abstract_inverted_index.number | 107 |
| abstract_inverted_index.second | 192 |
| abstract_inverted_index.series | 80 |
| abstract_inverted_index.surged | 40 |
| abstract_inverted_index.63.35%, | 197 |
| abstract_inverted_index.92.37%, | 173 |
| abstract_inverted_index.encoder | 67 |
| abstract_inverted_index.model's | 140, 152, 184 |
| abstract_inverted_index.models. | 100 |
| abstract_inverted_index.patches | 77, 109 |
| abstract_inverted_index.quarter | 118 |
| abstract_inverted_index.reduced | 115 |
| abstract_inverted_index.reduces | 127 |
| abstract_inverted_index.replace | 85 |
| abstract_inverted_index.running | 36 |
| abstract_inverted_index.strides | 7 |
| abstract_inverted_index.through | 78 |
| abstract_inverted_index.trained | 158 |
| abstract_inverted_index.vision. | 13 |
| abstract_inverted_index.(FLOPs), | 193 |
| abstract_inverted_index.CI2P-ViT | 104, 167 |
| abstract_inverted_index.Compared | 101 |
| abstract_inverted_index.However, | 14 |
| abstract_inverted_index.ViT-B/16 | 180 |
| abstract_inverted_index.accuracy | 141, 170 |
| abstract_inverted_index.achieved | 168 |
| abstract_inverted_index.compress | 69 |
| abstract_inverted_index.computer | 12 |
| abstract_inverted_index.dataset, | 166 |
| abstract_inverted_index.enabling | 94 |
| abstract_inverted_index.enhances | 138 |
| abstract_inverted_index.existing | 98 |
| abstract_inverted_index.hardware | 209 |
| abstract_inverted_index.increase | 203 |
| abstract_inverted_index.markedly | 155 |
| abstract_inverted_index.measured | 187 |
| abstract_inverted_index.proposes | 44 |
| abstract_inverted_index.seamless | 95 |
| abstract_inverted_index.sequence | 75 |
| abstract_inverted_index.training | 34, 205 |
| abstract_inverted_index.utilizes | 64 |
| abstract_inverted_index.velocity | 206 |
| abstract_inverted_index.CI2P-ViT. | 55 |
| abstract_inverted_index.Embedding | 88 |
| abstract_inverted_index.ViT-B/16, | 103 |
| abstract_inverted_index.baseline. | 181 |
| abstract_inverted_index.component | 89 |
| abstract_inverted_index.enhanced. | 156 |
| abstract_inverted_index.exhibited | 200 |
| abstract_inverted_index.generates | 73 |
| abstract_inverted_index.identical | 208 |
| abstract_inverted_index.increase, | 28 |
| abstract_inverted_index.inductive | 145 |
| abstract_inverted_index.original. | 121 |
| abstract_inverted_index.precision | 153 |
| abstract_inverted_index.Animals-10 | 165 |
| abstract_inverted_index.CompressAI | 66 |
| abstract_inverted_index.associated | 32 |
| abstract_inverted_index.diminished | 195 |
| abstract_inverted_index.operations | 190 |
| abstract_inverted_index.properties | 147 |
| abstract_inverted_index.resolution | 23 |
| abstract_inverted_index.Transformer | 2 |
| abstract_inverted_index.effectively | 137 |
| abstract_inverted_index.improvement | 177 |
| abstract_inverted_index.integration | 96 |
| abstract_inverted_index.introducing | 143 |
| abstract_inverted_index.operations, | 186 |
| abstract_inverted_index.significant | 6 |
| abstract_inverted_index.Transformer, | 53 |
| abstract_inverted_index.incorporates | 58 |
| abstract_inverted_index.representing | 174 |
| abstract_inverted_index.subsequently | 72 |
| abstract_inverted_index.Additionally, | 182 |
| abstract_inverted_index.computational | 30, 129, 185 |
| abstract_inverted_index.convolutions. | 82 |
| abstract_inverted_index.dramatically. | 41 |
| abstract_inverted_index.significantly | 126 |
| abstract_inverted_index.floating-point | 189 |
| abstract_inverted_index.self-attention | 113 |
| abstract_inverted_index.configurations. | 210 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile |
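
The `abstract_inverted_index` rows in the table above store the abstract as a mapping from each word to the positions where it occurs. A minimal sketch of how such a mapping can be turned back into running text follows. The helper name `rebuild_abstract` is hypothetical, and the sample dictionary is a small excerpt restricted to positions 0 to 7 of the rows above.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions mapping."""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Excerpt of the table rows, keeping only positions 0-7.
sample = {
    "The": [0],
    "Vision": [1],
    "Transformer": [2],
    "(ViT)": [3],
    "has": [4],
    "made": [5],
    "significant": [6],
    "strides": [7],
}
print(rebuild_abstract(sample))
# prints: The Vision Transformer (ViT) has made significant strides
```

Applied to the full set of rows, the same function would yield the abstract exactly as OpenAlex indexed it, with punctuation attached to the words as stored.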