Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation
2023 · Open Access · DOI: https://doi.org/10.1109/tmm.2023.3340062
Referring segmentation aims to segment a target object described by a natural language expression. The key challenges of this task are understanding the meaning of complex and ambiguous language expressions and determining the relevant regions in an image containing multiple objects by referring to the expression. Recent models have focused on early fusion with the language features at the intermediate stages of the vision encoder, but these approaches have the limitation that the language features cannot refer to the visual information. To address this issue, this paper proposes a novel architecture, Cross-aware early fusion with stage-divided Vision and Language Transformer encoders (CrossVLT), which allows both the language and vision encoders to perform early fusion, improving cross-modal context modeling. Unlike previous methods, this method enables the vision and language features to refer to each other's information at each stage, mutually enhancing the robustness of both encoders. Furthermore, unlike the conventional scheme that relies solely on high-level features for cross-modal alignment, the paper introduces a feature-based alignment scheme in which low-level to high-level features of the vision and language encoders all engage in cross-modal alignment. By aligning the intermediate cross-modal features at every encoder stage, this scheme leads to effective cross-modal fusion. The proposed approach is simple yet effective for referring image segmentation, and it outperforms previous state-of-the-art methods on three public benchmarks.
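The core idea in the abstract, that at each encoder stage the vision features attend to the language features *and* vice versa, can be illustrated with a minimal sketch. This is not the paper's implementation: the single-head attention without learned projections, the token counts, and the four-stage loop are all simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats):
    """Single-head cross-attention: queries from one modality,
    keys/values from the other (no learned projections, for brevity)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))  # (n_q, n_kv)
    return q_feats + attn @ kv_feats                   # residual update

def cross_aware_stage(vis, lang):
    """One stage of mutual early fusion: each encoder's features
    refer to the other modality's features."""
    vis_new = cross_attend(vis, lang)    # vision queries language
    lang_new = cross_attend(lang, vis)   # language queries vision
    return vis_new, lang_new

# Toy run: 16 visual tokens, 5 word tokens, shared 32-dim feature space.
rng = np.random.default_rng(0)
vis = rng.standard_normal((16, 32))
lang = rng.standard_normal((5, 32))
for _ in range(4):  # hypothetical four encoder stages
    vis, lang = cross_aware_stage(vis, lang)
print(vis.shape, lang.shape)  # (16, 32) (5, 32)
```

The point of the sketch is the symmetry: earlier early-fusion models injected language into the vision encoder only, whereas here both updates happen at every stage, which is what the abstract means by the features referring to each other's information.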
Key Facts
- Type: article
- Language: en
- Landing Page: https://doi.org/10.1109/tmm.2023.3340062
- OA Status: green
- Cited By: 19
- References: 45
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4389371459
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4389371459 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.1109/tmm.2023.3340062 (Digital Object Identifier)
- Title: Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation
- Type: article (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023
- Publication date: 2023-12-06 (full publication date)
- Authors: Yubin Cho, Hyunwoo Yu, Suk‐Ju Kang (in order)
- Landing page: https://doi.org/10.1109/tmm.2023.3340062 (publisher landing page)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2408.07539 (direct OA link)
- Concepts: Computer science, Encoder, Artificial intelligence, Computer vision, Segmentation, Natural language, Image segmentation, Robustness (evolution), Pattern recognition (psychology), Natural language processing, Operating system, Gene, Biochemistry, Chemistry (top concepts attached by OpenAlex)
- Cited by: 19 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 11, 2024: 8 (per-year counts, last 5 years)
- References (count): 45 (works referenced by this work)
- Related works (count): 10 (works algorithmically related by OpenAlex)
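The full record behind these fields can be retrieved from the OpenAlex REST API by appending the work's short ID to `https://api.openalex.org/works/`. A small helper for building that endpoint (the actual HTTP fetch, e.g. via `urllib.request` or `requests`, is left out to keep the sketch offline):

```python
from urllib.parse import urljoin

OPENALEX_API = "https://api.openalex.org/"

def work_api_url(openalex_id: str) -> str:
    """Turn a canonical OpenAlex work URL (or a bare work ID such as
    'W4389371459') into its JSON API endpoint."""
    short_id = openalex_id.rstrip("/").rsplit("/", 1)[-1]
    return urljoin(OPENALEX_API, f"works/{short_id}")

url = work_api_url("https://openalex.org/W4389371459")
print(url)  # https://api.openalex.org/works/W4389371459
```

A GET on that URL returns the JSON payload whose flattened keys appear in the table below (`authorships`, `locations`, `abstract_inverted_index`, and so on).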
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4389371459 |
| doi | https://doi.org/10.1109/tmm.2023.3340062 |
| ids.doi | https://doi.org/10.1109/tmm.2023.3340062 |
| ids.openalex | https://openalex.org/W4389371459 |
| fwci | 3.45739846 |
| type | article |
| title | Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation |
| awards[0].id | https://openalex.org/G6103761548 |
| awards[0].funder_id | https://openalex.org/F4320322120 |
| awards[0].display_name | |
| awards[0].funder_award_id | 2020M3H4A1A02084899 |
| awards[0].funder_display_name | National Research Foundation of Korea |
| awards[1].id | https://openalex.org/G1098145971 |
| awards[1].funder_id | https://openalex.org/F4320322120 |
| awards[1].display_name | |
| awards[1].funder_award_id | 2021R1A2C1004208 |
| awards[1].funder_display_name | National Research Foundation of Korea |
| awards[2].id | https://openalex.org/G5510059365 |
| awards[2].funder_id | https://openalex.org/F4320332195 |
| awards[2].display_name | |
| awards[2].funder_award_id | IO201218-08232-01 |
| awards[2].funder_display_name | Samsung |
| awards[3].id | https://openalex.org/G6194603363 |
| awards[3].funder_id | https://openalex.org/F4320328359 |
| awards[3].display_name | |
| awards[3].funder_award_id | IITP-2023-RS-2023-00260091 |
| awards[3].funder_display_name | Ministry of Science and ICT, South Korea |
| biblio.issue | |
| biblio.volume | 26 |
| biblio.last_page | 5833 |
| biblio.first_page | 5823 |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 1.0 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10036 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9994000196456909 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Advanced Neural Network Applications |
| topics[2].id | https://openalex.org/T11307 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9987999796867371 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Domain Adaptation and Few-Shot Learning |
| funders[0].id | https://openalex.org/F4320322120 |
| funders[0].ror | https://ror.org/013aysd81 |
| funders[0].display_name | National Research Foundation of Korea |
| funders[1].id | https://openalex.org/F4320328359 |
| funders[1].ror | https://ror.org/01wpjm123 |
| funders[1].display_name | Ministry of Science and ICT, South Korea |
| funders[2].id | https://openalex.org/F4320332195 |
| funders[2].ror | https://ror.org/04w3jy968 |
| funders[2].display_name | Samsung |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.8359405994415283 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C118505674 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7193880677223206 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q42586063 |
| concepts[1].display_name | Encoder |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.6096083521842957 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C31972630 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5021419525146484 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[3].display_name | Computer vision |
| concepts[4].id | https://openalex.org/C89600930 |
| concepts[4].level | 2 |
| concepts[4].score | 0.45959174633026123 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q1423946 |
| concepts[4].display_name | Segmentation |
| concepts[5].id | https://openalex.org/C195324797 |
| concepts[5].level | 2 |
| concepts[5].score | 0.45365846157073975 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q33742 |
| concepts[5].display_name | Natural language |
| concepts[6].id | https://openalex.org/C124504099 |
| concepts[6].level | 3 |
| concepts[6].score | 0.4297046661376953 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q56933 |
| concepts[6].display_name | Image segmentation |
| concepts[7].id | https://openalex.org/C63479239 |
| concepts[7].level | 3 |
| concepts[7].score | 0.41959455609321594 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q7353546 |
| concepts[7].display_name | Robustness (evolution) |
| concepts[8].id | https://openalex.org/C153180895 |
| concepts[8].level | 2 |
| concepts[8].score | 0.34320586919784546 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q7148389 |
| concepts[8].display_name | Pattern recognition (psychology) |
| concepts[9].id | https://openalex.org/C204321447 |
| concepts[9].level | 1 |
| concepts[9].score | 0.32223203778266907 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[9].display_name | Natural language processing |
| concepts[10].id | https://openalex.org/C111919701 |
| concepts[10].level | 1 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q9135 |
| concepts[10].display_name | Operating system |
| concepts[11].id | https://openalex.org/C104317684 |
| concepts[11].level | 2 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q7187 |
| concepts[11].display_name | Gene |
| concepts[12].id | https://openalex.org/C55493867 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7094 |
| concepts[12].display_name | Biochemistry |
| concepts[13].id | https://openalex.org/C185592680 |
| concepts[13].level | 0 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q2329 |
| concepts[13].display_name | Chemistry |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.8359405994415283 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/encoder |
| keywords[1].score | 0.7193880677223206 |
| keywords[1].display_name | Encoder |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.6096083521842957 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/computer-vision |
| keywords[3].score | 0.5021419525146484 |
| keywords[3].display_name | Computer vision |
| keywords[4].id | https://openalex.org/keywords/segmentation |
| keywords[4].score | 0.45959174633026123 |
| keywords[4].display_name | Segmentation |
| keywords[5].id | https://openalex.org/keywords/natural-language |
| keywords[5].score | 0.45365846157073975 |
| keywords[5].display_name | Natural language |
| keywords[6].id | https://openalex.org/keywords/image-segmentation |
| keywords[6].score | 0.4297046661376953 |
| keywords[6].display_name | Image segmentation |
| keywords[7].id | https://openalex.org/keywords/robustness |
| keywords[7].score | 0.41959455609321594 |
| keywords[7].display_name | Robustness (evolution) |
| keywords[8].id | https://openalex.org/keywords/pattern-recognition |
| keywords[8].score | 0.34320586919784546 |
| keywords[8].display_name | Pattern recognition (psychology) |
| keywords[9].id | https://openalex.org/keywords/natural-language-processing |
| keywords[9].score | 0.32223203778266907 |
| keywords[9].display_name | Natural language processing |
| language | en |
| locations[0].id | doi:10.1109/tmm.2023.3340062 |
| locations[0].is_oa | False |
| locations[0].source.id | https://openalex.org/S137030581 |
| locations[0].source.issn | 1520-9210, 1941-0077 |
| locations[0].source.type | journal |
| locations[0].source.is_oa | False |
| locations[0].source.issn_l | 1520-9210 |
| locations[0].source.is_core | True |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | IEEE Transactions on Multimedia |
| locations[0].source.host_organization | https://openalex.org/P4310319808 |
| locations[0].source.host_organization_name | Institute of Electrical and Electronics Engineers |
| locations[0].source.host_organization_lineage | https://openalex.org/P4310319808 |
| locations[0].source.host_organization_lineage_names | Institute of Electrical and Electronics Engineers |
| locations[0].license | |
| locations[0].pdf_url | |
| locations[0].version | publishedVersion |
| locations[0].raw_type | journal-article |
| locations[0].license_id | |
| locations[0].is_accepted | True |
| locations[0].is_published | True |
| locations[0].raw_source_name | IEEE Transactions on Multimedia |
| locations[0].landing_page_url | https://doi.org/10.1109/tmm.2023.3340062 |
| locations[1].id | pmh:oai:arXiv.org:2408.07539 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | https://arxiv.org/pdf/2408.07539 |
| locations[1].version | submittedVersion |
| locations[1].raw_type | text |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | False |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | http://arxiv.org/abs/2408.07539 |
| indexed_in | arxiv, crossref |
| authorships[0].author.id | https://openalex.org/A5060160801 |
| authorships[0].author.orcid | https://orcid.org/0009-0001-8604-5431 |
| authorships[0].author.display_name | Yubin Cho |
| authorships[0].countries | KR |
| authorships[0].affiliations[0].institution_ids | https://openalex.org/I148751991 |
| authorships[0].affiliations[0].raw_affiliation_string | School of Artificial Intelligence, Sogang University, Seoul, Republic of Korea |
| authorships[0].institutions[0].id | https://openalex.org/I148751991 |
| authorships[0].institutions[0].ror | https://ror.org/056tn4839 |
| authorships[0].institutions[0].type | education |
| authorships[0].institutions[0].lineage | https://openalex.org/I148751991 |
| authorships[0].institutions[0].country_code | KR |
| authorships[0].institutions[0].display_name | Sogang University |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Yubin Cho |
| authorships[0].is_corresponding | False |
| authorships[0].raw_affiliation_strings | School of Artificial Intelligence, Sogang University, Seoul, Republic of Korea |
| authorships[1].author.id | https://openalex.org/A5015387461 |
| authorships[1].author.orcid | https://orcid.org/0009-0009-4426-8272 |
| authorships[1].author.display_name | Hyunwoo Yu |
| authorships[1].countries | KR |
| authorships[1].affiliations[0].institution_ids | https://openalex.org/I148751991 |
| authorships[1].affiliations[0].raw_affiliation_string | School of Electronic Engineering, Sogang University, Seoul, Republic of Korea |
| authorships[1].institutions[0].id | https://openalex.org/I148751991 |
| authorships[1].institutions[0].ror | https://ror.org/056tn4839 |
| authorships[1].institutions[0].type | education |
| authorships[1].institutions[0].lineage | https://openalex.org/I148751991 |
| authorships[1].institutions[0].country_code | KR |
| authorships[1].institutions[0].display_name | Sogang University |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Hyunwoo Yu |
| authorships[1].is_corresponding | False |
| authorships[1].raw_affiliation_strings | School of Electronic Engineering, Sogang University, Seoul, Republic of Korea |
| authorships[2].author.id | https://openalex.org/A5084904773 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-4809-956X |
| authorships[2].author.display_name | Suk‐Ju Kang |
| authorships[2].countries | KR |
| authorships[2].affiliations[0].institution_ids | https://openalex.org/I148751991 |
| authorships[2].affiliations[0].raw_affiliation_string | School of Electronic Engineering, Sogang University, Seoul, Republic of Korea |
| authorships[2].institutions[0].id | https://openalex.org/I148751991 |
| authorships[2].institutions[0].ror | https://ror.org/056tn4839 |
| authorships[2].institutions[0].type | education |
| authorships[2].institutions[0].lineage | https://openalex.org/I148751991 |
| authorships[2].institutions[0].country_code | KR |
| authorships[2].institutions[0].display_name | Sogang University |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Suk-Ju Kang |
| authorships[2].is_corresponding | False |
| authorships[2].raw_affiliation_strings | School of Electronic Engineering, Sogang University, Seoul, Republic of Korea |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2408.07539 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Cross-Aware Early Fusion With Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 1.0 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W4390516098, https://openalex.org/W2181948922, https://openalex.org/W2384362569, https://openalex.org/W2142795561, https://openalex.org/W4205302943, https://openalex.org/W2770593030, https://openalex.org/W2561132942, https://openalex.org/W3154990682, https://openalex.org/W3155418658, https://openalex.org/W1522196789 |
| cited_by_count | 19 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 11 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 8 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2408.07539 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2408.07539 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2408.07539 |
| primary_location.id | doi:10.1109/tmm.2023.3340062 |
| primary_location.is_oa | False |
| primary_location.source.id | https://openalex.org/S137030581 |
| primary_location.source.issn | 1520-9210, 1941-0077 |
| primary_location.source.type | journal |
| primary_location.source.is_oa | False |
| primary_location.source.issn_l | 1520-9210 |
| primary_location.source.is_core | True |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | IEEE Transactions on Multimedia |
| primary_location.source.host_organization | https://openalex.org/P4310319808 |
| primary_location.source.host_organization_name | Institute of Electrical and Electronics Engineers |
| primary_location.source.host_organization_lineage | https://openalex.org/P4310319808 |
| primary_location.source.host_organization_lineage_names | Institute of Electrical and Electronics Engineers |
| primary_location.license | |
| primary_location.pdf_url | |
| primary_location.version | publishedVersion |
| primary_location.raw_type | journal-article |
| primary_location.license_id | |
| primary_location.is_accepted | True |
| primary_location.is_published | True |
| primary_location.raw_source_name | IEEE Transactions on Multimedia |
| primary_location.landing_page_url | https://doi.org/10.1109/tmm.2023.3340062 |
| publication_date | 2023-12-06 |
| publication_year | 2023 |
| referenced_works | https://openalex.org/W3201770677, https://openalex.org/W3216551675, https://openalex.org/W4200631575, https://openalex.org/W4283029876, https://openalex.org/W2302548814, https://openalex.org/W3143320354, https://openalex.org/W2980088508, https://openalex.org/W2973233205, https://openalex.org/W3023463084, https://openalex.org/W3003423830, https://openalex.org/W3156800342, https://openalex.org/W4318954130, https://openalex.org/W2899002405, https://openalex.org/W2507296351, https://openalex.org/W2999725795, https://openalex.org/W6797399245, https://openalex.org/W6842806116, https://openalex.org/W4382449692, https://openalex.org/W4312877428, https://openalex.org/W4312543911, https://openalex.org/W3172522282, https://openalex.org/W6791353385, https://openalex.org/W6790019176, https://openalex.org/W6798805250, https://openalex.org/W2946417913, https://openalex.org/W2888329843, https://openalex.org/W2798556392, https://openalex.org/W3034325957, https://openalex.org/W4224988000, https://openalex.org/W2293634267, https://openalex.org/W6684191040, https://openalex.org/W4309181071, https://openalex.org/W2896457183, https://openalex.org/W3138516171, https://openalex.org/W4385245566, https://openalex.org/W2964345792, https://openalex.org/W3034692043, https://openalex.org/W3108748824, https://openalex.org/W3035097537, https://openalex.org/W3169998662, https://openalex.org/W3187664142, https://openalex.org/W2489434015, https://openalex.org/W2963109634, https://openalex.org/W4386076034, https://openalex.org/W3099166112 |
| referenced_works_count | 45 |
| abstract_inverted_index | (word → token-position index of the abstract reproduced at the top of this page; full listing omitted) |
| cited_by_percentile_year.max | 99 |
| cited_by_percentile_year.min | 98 |
| countries_distinct_count | 1 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile.value | 0.92279725 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | True |
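For copyright reasons, OpenAlex ships abstracts only as the `abstract_inverted_index` field above: a mapping from each word to the list of token positions where it occurs. Rebuilding the plain text is a short exercise; the `toy` dictionary below uses a few entries in the same shape (matching the first tokens of this work's abstract), not the full index.

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild plain text from an OpenAlex abstract_inverted_index
    (mapping word -> list of token positions)."""
    positions = {pos: word
                 for word, locs in inverted_index.items()
                 for pos in locs}
    return " ".join(positions[i] for i in sorted(positions))

# Toy example in the same shape as the field above.
toy = {"Referring": [0], "segmentation": [1], "aims": [2],
       "to": [3], "segment": [4], "a": [5]}
print(reconstruct_abstract(toy))  # Referring segmentation aims to segment a
```

Applied to the full index, this reproduces the abstract shown at the top of the page word for word.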