CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2404.15653
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7× acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at https://github.com/apple/corenet.
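To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a symmetric contrastive loss, which needs a B × B similarity matrix over the batch, with a classification loss that scores each image independently against a fixed label vocabulary. The abstract does not say how labels are derived from captions or which classification loss is used; the word-matching `caption_to_multi_hot` helper, the toy vocabulary, the binary cross-entropy objective, and all function names here are assumptions made for illustration.

```python
import math


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Every image is compared with every text, so the cost grows with B * B.
    """
    # B x B matrix of scaled dot-product similarities.
    sim = [
        [sum(x * y for x, y in zip(img, txt)) / temperature for txt in txt_emb]
        for img in img_emb
    ]

    def mean_row_cross_entropy(rows):
        # Cross-entropy where the matching pair sits on the diagonal.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[k]
        return total / len(rows)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (mean_row_cross_entropy(sim) + mean_row_cross_entropy(sim_t))


def caption_to_multi_hot(caption, vocab):
    """Hypothetical labeler: mark vocabulary words that occur in the caption."""
    words = set(caption.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]


def classification_loss(logits, targets):
    """Binary cross-entropy with logits, averaged over batch and vocabulary.

    Each image is scored on its own, so no pairwise similarities are needed.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


if __name__ == "__main__":
    vocab = ["dog", "cat", "beach"]  # toy vocabulary (assumption)
    targets = [caption_to_multi_hot("A dog on the beach", vocab)]
    logits = [[2.0, -1.5, 0.5]]  # stand-in for image-encoder outputs
    print(targets, classification_loss(logits, targets))
```

In this sketch the classification path needs no text encoder at training time and its per-image cost depends on the vocabulary size rather than on the batch size, which is the structural difference the abstract points to.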
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2404.15653, https://arxiv.org/pdf/2404.15653
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4395483126
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4395483126 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2404.15653 (Digital Object Identifier)
- Title: CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-04-24 (full publication date if available)
- Authors: Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari (list of authors in order)
- Landing page: https://arxiv.org/abs/2404.15653 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2404.15653 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2404.15653 (direct OA link when available)
- Concepts: Computer science, Scale (ratio), Information retrieval, Training set, Image (mathematics), Artificial intelligence, Pattern recognition (psychology), World Wide Web, Cartography, Geography (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4395483126 |
| doi | https://doi.org/10.48550/arxiv.2404.15653 |
| ids.doi | https://doi.org/10.48550/arxiv.2404.15653 |
| ids.openalex | https://openalex.org/W4395483126 |
| fwci | |
| type | preprint |
| title | CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10601 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9223999977111816 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Handwritten Text Recognition Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.6727643013000488 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C2778755073 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5963653922080994 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q10858537 |
| concepts[1].display_name | Scale (ratio) |
| concepts[2].id | https://openalex.org/C23123220 |
| concepts[2].level | 1 |
| concepts[2].score | 0.45667406916618347 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[2].display_name | Information retrieval |
| concepts[3].id | https://openalex.org/C51632099 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4392157196998596 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q3985153 |
| concepts[3].display_name | Training set |
| concepts[4].id | https://openalex.org/C115961682 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4370816648006439 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[4].display_name | Image (mathematics) |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3720581531524658 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C153180895 |
| concepts[6].level | 2 |
| concepts[6].score | 0.34096676111221313 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q7148389 |
| concepts[6].display_name | Pattern recognition (psychology) |
| concepts[7].id | https://openalex.org/C136764020 |
| concepts[7].level | 1 |
| concepts[7].score | 0.3284267783164978 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q466 |
| concepts[7].display_name | World Wide Web |
| concepts[8].id | https://openalex.org/C58640448 |
| concepts[8].level | 1 |
| concepts[8].score | 0.094379723072052 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q42515 |
| concepts[8].display_name | Cartography |
| concepts[9].id | https://openalex.org/C205649164 |
| concepts[9].level | 0 |
| concepts[9].score | 0.049579352140426636 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[9].display_name | Geography |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.6727643013000488 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/scale |
| keywords[1].score | 0.5963653922080994 |
| keywords[1].display_name | Scale (ratio) |
| keywords[2].id | https://openalex.org/keywords/information-retrieval |
| keywords[2].score | 0.45667406916618347 |
| keywords[2].display_name | Information retrieval |
| keywords[3].id | https://openalex.org/keywords/training-set |
| keywords[3].score | 0.4392157196998596 |
| keywords[3].display_name | Training set |
| keywords[4].id | https://openalex.org/keywords/image |
| keywords[4].score | 0.4370816648006439 |
| keywords[4].display_name | Image (mathematics) |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.3720581531524658 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/pattern-recognition |
| keywords[6].score | 0.34096676111221313 |
| keywords[6].display_name | Pattern recognition (psychology) |
| keywords[7].id | https://openalex.org/keywords/world-wide-web |
| keywords[7].score | 0.3284267783164978 |
| keywords[7].display_name | World Wide Web |
| keywords[8].id | https://openalex.org/keywords/cartography |
| keywords[8].score | 0.094379723072052 |
| keywords[8].display_name | Cartography |
| keywords[9].id | https://openalex.org/keywords/geography |
| keywords[9].score | 0.049579352140426636 |
| keywords[9].display_name | Geography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2404.15653 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2404.15653 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2404.15653 |
| locations[1].id | doi:10.48550/arxiv.2404.15653 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2404.15653 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5074132108 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-5420-4725 |
| authorships[0].author.display_name | Sachin Mehta |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Mehta, Sachin |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5012428670 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Maxwell Horton |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Horton, Maxwell |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5036601505 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-5975-5158 |
| authorships[2].author.display_name | Fartash Faghri |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Faghri, Fartash |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5095886473 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Mohammad Hossein Sekhavat |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Sekhavat, Mohammad Hossein |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5021900923 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Mahyar Najibi |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Najibi, Mahyar |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5050499655 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-5510-518X |
| authorships[5].author.display_name | Mehrdad Farajtabar |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Farajtabar, Mehrdad |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5028613002 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Oncel Tuzel |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Tuzel, Oncel |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5056246621 |
| authorships[7].author.orcid | https://orcid.org/0000-0001-9606-3687 |
| authorships[7].author.display_name | Mohammad Rastegari |
| authorships[7].author_position | last |
| authorships[7].raw_author_name | Rastegari, Mohammad |
| authorships[7].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2404.15653 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-04-26T00:00:00 |
| display_name | CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10601 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9223999977111816 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Handwritten Text Recognition Techniques |
| related_works | https://openalex.org/W4213212078, https://openalex.org/W2187227032, https://openalex.org/W2112788825, https://openalex.org/W1921169094, https://openalex.org/W1963735073, https://openalex.org/W4233129888, https://openalex.org/W366410996, https://openalex.org/W106707639, https://openalex.org/W2793742470, https://openalex.org/W2146247781 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2404.15653 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2404.15653 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2404.15653 |
| primary_location.id | pmh:oai:arXiv.org:2404.15653 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2404.15653 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2404.15653 |
| publication_date | 2024-04-24 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 5, 39, 60, 76 |
| abstract_inverted_index.as | 4, 59 |
| abstract_inverted_index.at | 124 |
| abstract_inverted_index.in | 25, 72, 80 |
| abstract_inverted_index.is | 122 |
| abstract_inverted_index.it | 64 |
| abstract_inverted_index.of | 16, 44 |
| abstract_inverted_index.on | 47, 56, 87 |
| abstract_inverted_index.to | 84 |
| abstract_inverted_index.we | 101 |
| abstract_inverted_index.Our | 111 |
| abstract_inverted_index.The | 51 |
| abstract_inverted_index.and | 18, 30, 99, 119 |
| abstract_inverted_index.for | 8, 68 |
| abstract_inverted_index.has | 2 |
| abstract_inverted_index.the | 14, 66, 104 |
| abstract_inverted_index.This | 36 |
| abstract_inverted_index.code | 113 |
| abstract_inverted_index.data | 58 |
| abstract_inverted_index.high | 108 |
| abstract_inverted_index.loss | 27 |
| abstract_inverted_index.need | 67 |
| abstract_inverted_index.text | 19, 31 |
| abstract_inverted_index.that | 103 |
| abstract_inverted_index.with | 115 |
| abstract_inverted_index.along | 114 |
| abstract_inverted_index.data. | 50, 89 |
| abstract_inverted_index.image | 17, 29 |
| abstract_inverted_index.loss, | 74 |
| abstract_inverted_index.model | 117 |
| abstract_inverted_index.novel | 40 |
| abstract_inverted_index.pairs | 32 |
| abstract_inverted_index.paper | 37 |
| abstract_inverted_index.poses | 33 |
| abstract_inverted_index.speed | 82 |
| abstract_inverted_index.task. | 62 |
| abstract_inverted_index.method | 7, 53, 106 |
| abstract_inverted_index.models | 46 |
| abstract_inverted_index.source | 112 |
| abstract_inverted_index.tasks, | 96 |
| abstract_inverted_index.vision | 45, 95 |
| abstract_inverted_index.visual | 11 |
| abstract_inverted_index.weakly | 41 |
| abstract_inverted_index.Through | 90 |
| abstract_inverted_index.between | 28 |
| abstract_inverted_index.diverse | 94 |
| abstract_inverted_index.emerged | 3 |
| abstract_inverted_index.recipes | 121 |
| abstract_inverted_index.through | 13 |
| abstract_inverted_index.weights | 118 |
| abstract_inverted_index.However, | 21 |
| abstract_inverted_index.compared | 83 |
| abstract_inverted_index.learning | 1, 9, 86 |
| abstract_inverted_index.pairwise | 22, 69 |
| abstract_inverted_index.presents | 38 |
| abstract_inverted_index.proposed | 52, 105 |
| abstract_inverted_index.quality. | 110 |
| abstract_inverted_index.reframes | 54 |
| abstract_inverted_index.spanning | 93 |
| abstract_inverted_index.training | 81, 120 |
| abstract_inverted_index.achieving | 75 |
| abstract_inverted_index.alignment | 15 |
| abstract_inverted_index.available | 123 |
| abstract_inverted_index.detection | 98 |
| abstract_inverted_index.effective | 10 |
| abstract_inverted_index.extensive | 91 |
| abstract_inverted_index.including | 97 |
| abstract_inverted_index.maintains | 107 |
| abstract_inverted_index.web-scale | 48, 88 |
| abstract_inverted_index.eliminates | 65 |
| abstract_inverted_index.image-text | 49, 57 |
| abstract_inverted_index.remarkable | 77 |
| abstract_inverted_index.similarity | 23, 70 |
| abstract_inverted_index.supervised | 42 |
| abstract_inverted_index.$2.7\times$ | 78 |
| abstract_inverted_index.Contrastive | 0 |
| abstract_inverted_index.challenges. | 35 |
| abstract_inverted_index.computation | 24 |
| abstract_inverted_index.contrastive | 26, 73, 85 |
| abstract_inverted_index.demonstrate | 102 |
| abstract_inverted_index.embeddings. | 20 |
| abstract_inverted_index.experiments | 92 |
| abstract_inverted_index.pre-trained | 116 |
| abstract_inverted_index.acceleration | 79 |
| abstract_inverted_index.computations | 71 |
| abstract_inverted_index.pre-training | 43, 55 |
| abstract_inverted_index.Consequently, | 63 |
| abstract_inverted_index.computational | 34 |
| abstract_inverted_index.segmentation, | 100 |
| abstract_inverted_index.classification | 61 |
| abstract_inverted_index.representation | 109 |
| abstract_inverted_index.transformative | 6 |
| abstract_inverted_index.representations | 12 |
| abstract_inverted_index.\url{https://github.com/apple/corenet}. | 125 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 8 |
| citation_normalized_percentile | |
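
The `abstract_inverted_index.*` rows in the table store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of turning such a mapping back into running text follows; the function name `rebuild_abstract` is an assumption, and the sample dictionary is only an excerpt holding positions 0 to 7 copied from the rows above, not the full index.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a token -> list-of-positions mapping."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt only: positions 0-7 taken from the abstract_inverted_index rows.
sample = {
    "Contrastive": [0],
    "learning": [1],
    "has": [2],
    "emerged": [3],
    "as": [4],
    "a": [5],
    "transformative": [6],
    "method": [7],
}

print(rebuild_abstract(sample))
# prints: Contrastive learning has emerged as a transformative method
```

Because punctuation stays attached to its token in the index (for example `data.` and `loss,`), joining on spaces is enough to recover the original sentence flow.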