Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
2022 · Open Access · DOI: https://doi.org/10.48550/arxiv.2207.00733
Recently, cross-modal pre-training has become a hotspot because of its wide application in downstream tasks such as retrieval, captioning, and question answering. However, existing methods adopt a single-stream pre-training model to learn a unified vision-language representation for cross-modal retrieval, which easily suffers from computational explosion. Moreover, although conventional double-stream structures are quite efficient, they still lack the vital cross-modal interactions, resulting in low performance. Motivated by these challenges, we put forward a Contrastive Cross-Modal Knowledge Sharing Pre-training method (COOKIE) to learn joint text-image representations. Structurally, COOKIE adopts the traditional double-stream structure for its acceptable time consumption. To overcome the inherent defects of the double-stream structure mentioned above, we elaborately design two effective modules. Concretely, the first module is a weight-sharing transformer built on top of the visual and textual encoders, aiming to semantically align text and image; this design enables the visual and textual paths to focus on the same semantics. The second is a set of three specially designed contrastive learning objectives, aiming to share knowledge between different models. The shared cross-modal knowledge greatly advances unimodal representation learning, promoting single-modal retrieval tasks. Extensive experiments on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, demonstrate the superiority of our pre-training model in both computational efficiency and accuracy.
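To make the described design concrete, here is a minimal, illustrative PyTorch sketch of a double-stream model with a weight-sharing transformer head trained with a symmetric contrastive (InfoNCE) loss over in-batch pairs. This is not the authors' implementation: the layer sizes, mean pooling, single shared layer, and temperature are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderWithSharedHead(nn.Module):
    """Sketch of a double-stream model whose top transformer layer is
    shared between the visual and textual paths (assumed architecture)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Stand-ins for the unimodal encoders; any sequence encoder works here.
        self.visual_encoder = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.textual_encoder = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        # One weight-sharing transformer layer applied to BOTH streams, so the
        # two modalities are projected by the same parameters.
        self.shared_head = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        v = self.shared_head(self.visual_encoder(image_tokens)).mean(dim=1)
        t = self.shared_head(self.textual_encoder(text_tokens)).mean(dim=1)
        return F.normalize(v, dim=-1), F.normalize(t, dim=-1)

def contrastive_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE over in-batch negatives: matched image-text
    pairs sit on the diagonal of the similarity matrix."""
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random features standing in for image regions / text tokens.
model = DualEncoderWithSharedHead()
v, t = model(torch.randn(4, 49, 512), torch.randn(4, 16, 512))
loss = contrastive_loss(v, t)
```

Because the shared head sees both modalities, its gradients mix visual and textual signals, which is one plausible reading of how such a design encourages the two paths to attend to the same semantics.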
Record details
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2207.00733
- PDF: https://arxiv.org/pdf/2207.00733
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4283823073
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4283823073 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2207.00733 (Digital Object Identifier)
- Title: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2022 (year of publication)
- Publication date: 2022-07-02 (full publication date if available)
- Authors: Keyu Wen, Zhenshan Tan, Qingrong Cheng, Cheng Chen, Xiaodong Gu (list of authors in order)
- Landing page: https://arxiv.org/abs/2207.00733 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2207.00733 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2207.00733 (direct OA link when available)
- Concepts: Computer science, Modal, Natural language processing, Artificial intelligence, Feature learning, GRASP, Closed captioning, Transformer, Image (mathematics), Polymer chemistry, Programming language, Quantum mechanics, Chemistry, Physics, Voltage (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4283823073 |
| doi | https://doi.org/10.48550/arxiv.2207.00733 |
| ids.doi | https://doi.org/10.48550/arxiv.2207.00733 |
| ids.openalex | https://openalex.org/W4283823073 |
| fwci | |
| type | preprint |
| title | Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 1.0 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10627 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9983999729156494 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Advanced Image and Video Retrieval Techniques |
| topics[2].id | https://openalex.org/T11307 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9854999780654907 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Domain Adaptation and Few-Shot Learning |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.8460104465484619 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C71139939 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6826146245002747 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q910194 |
| concepts[1].display_name | Modal |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.49586352705955505 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.49534526467323303 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C59404180 |
| concepts[4].level | 2 |
| concepts[4].score | 0.48526284098625183 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q17013334 |
| concepts[4].display_name | Feature learning |
| concepts[5].id | https://openalex.org/C171268870 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4510499835014343 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1486676 |
| concepts[5].display_name | GRASP |
| concepts[6].id | https://openalex.org/C157657479 |
| concepts[6].level | 3 |
| concepts[6].score | 0.43807587027549744 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q2367247 |
| concepts[6].display_name | Closed captioning |
| concepts[7].id | https://openalex.org/C66322947 |
| concepts[7].level | 3 |
| concepts[7].score | 0.4326328635215759 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q11658 |
| concepts[7].display_name | Transformer |
| concepts[8].id | https://openalex.org/C115961682 |
| concepts[8].level | 2 |
| concepts[8].score | 0.1610700786113739 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[8].display_name | Image (mathematics) |
| concepts[9].id | https://openalex.org/C188027245 |
| concepts[9].level | 1 |
| concepts[9].score | 0.0 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q750446 |
| concepts[9].display_name | Polymer chemistry |
| concepts[10].id | https://openalex.org/C199360897 |
| concepts[10].level | 1 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[10].display_name | Programming language |
| concepts[11].id | https://openalex.org/C62520636 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[11].display_name | Quantum mechanics |
| concepts[12].id | https://openalex.org/C185592680 |
| concepts[12].level | 0 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[12].display_name | Chemistry |
| concepts[13].id | https://openalex.org/C121332964 |
| concepts[13].level | 0 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[13].display_name | Physics |
| concepts[14].id | https://openalex.org/C165801399 |
| concepts[14].level | 2 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q25428 |
| concepts[14].display_name | Voltage |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.8460104465484619 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/modal |
| keywords[1].score | 0.6826146245002747 |
| keywords[1].display_name | Modal |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.49586352705955505 |
| keywords[2].display_name | Natural language processing |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.49534526467323303 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/feature-learning |
| keywords[4].score | 0.48526284098625183 |
| keywords[4].display_name | Feature learning |
| keywords[5].id | https://openalex.org/keywords/grasp |
| keywords[5].score | 0.4510499835014343 |
| keywords[5].display_name | GRASP |
| keywords[6].id | https://openalex.org/keywords/closed-captioning |
| keywords[6].score | 0.43807587027549744 |
| keywords[6].display_name | Closed captioning |
| keywords[7].id | https://openalex.org/keywords/transformer |
| keywords[7].score | 0.4326328635215759 |
| keywords[7].display_name | Transformer |
| keywords[8].id | https://openalex.org/keywords/image |
| keywords[8].score | 0.1610700786113739 |
| keywords[8].display_name | Image (mathematics) |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2207.00733 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2207.00733 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2207.00733 |
| locations[1].id | doi:10.48550/arxiv.2207.00733 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2207.00733 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5004061050 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-5048-9014 |
| authorships[0].author.display_name | Keyu Wen |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wen, Keyu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5064787764 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-3466-5417 |
| authorships[1].author.display_name | Zhenshan Tan |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Tan, Zhenshan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5101178676 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Qingrong Cheng |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Cheng, Qingrong |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100420600 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-3662-0263 |
| authorships[3].author.display_name | Cheng Chen |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Chen, Cheng |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5101294804 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Xiaodong Gu |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Gu, Xiaodong |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2207.00733 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-07-07T00:00:00 |
| display_name | Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 1.0 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W4210416330, https://openalex.org/W2775506363, https://openalex.org/W3088136942, https://openalex.org/W4290852288, https://openalex.org/W4310447809, https://openalex.org/W4200243030, https://openalex.org/W2800782462, https://openalex.org/W3209117276, https://openalex.org/W4388184981, https://openalex.org/W4323777661 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2207.00733 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2207.00733 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2207.00733 |
| primary_location.id | pmh:oai:arXiv.org:2207.00733 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2207.00733 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2207.00733 |
| publication_date | 2022-07-02 |
| publication_year | 2022 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-positions index of the abstract shown above; full listing omitted) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.75 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile |
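
OpenAlex stores abstracts as an inverted index mapping each word to the positions where it occurs, as in the `abstract_inverted_index` payload field above. A small Python sketch shows how to reconstruct the plain text; the toy `inv` dictionary below uses the first few entries actually present in this record's index.

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild an abstract from OpenAlex's abstract_inverted_index,
    which maps each word to the list of positions where it occurs."""
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positions))

# Toy example mirroring this record's payload.
inv = {"Recently,": [0], "the": [1], "cross-modal": [2],
       "pre-training": [3], "task": [4]}
print(reconstruct_abstract(inv))
# -> "Recently, the cross-modal pre-training task"
```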