ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.12249
Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT
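
The abstract names SimCLR-style contrastive learning as one training ingredient. As a rough illustration only, the sketch below implements the generic NT-Xent objective used by SimCLR in plain Python. It is not the authors' implementation: the function names, the temperature default, and the way paired "views" are formed are all assumptions, and this record does not say how ViConBERT builds its positive pairs or how the loss is combined with gloss-based distillation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """Generic NT-Xent loss (hypothetical helper, not from the paper).

    views_a[i] and views_b[i] are embeddings of two views of example i.
    Every other embedding in the batch acts as a negative.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the paired view
        # Similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy batch: correctly paired views should score a lower loss
# than views paired with the wrong partner.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a sense-aware setting, one plausible choice is to treat two contexts of the same word sense as a positive pair, but that pairing is an assumption here rather than something stated in the abstract.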
- Type: preprint
- Landing Page: http://arxiv.org/abs/2511.12249
- PDF: https://arxiv.org/pdf/2511.12249
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416354441
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416354441
- DOI: https://doi.org/10.48550/arxiv.2511.12249
- Title: ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
- Type: preprint
- Publication year: 2025
- Publication date: 2025-11-15
- Authors: Khang T. Huynh, Binh Thanh Nguyen
- Landing page: https://arxiv.org/abs/2511.12249
- PDF URL: https://arxiv.org/pdf/2511.12249
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2511.12249
- Cited by: 0
Full payload
| id | https://openalex.org/W4416354441 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2511.12249 |
| ids.doi | https://doi.org/10.48550/arxiv.2511.12249 |
| ids.openalex | https://openalex.org/W4416354441 |
| fwci | |
| type | preprint |
| title | ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2511.12249 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2511.12249 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2511.12249 |
| locations[1].id | doi:10.48550/arxiv.2511.12249 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2511.12249 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5026014342 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-2585-5756 |
| authorships[0].author.display_name | Khang T. Huynh |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Huynh, Khang T. |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5073338061 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-5925-5273 |
| authorships[1].author.display_name | Binh Thanh Nguyen |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Nguyen, Binh T. |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2511.12249 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-19T00:00:00 |
| display_name | ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T11:59:20.734326 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2511.12249 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2511.12249 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2511.12249 |
| primary_location.id | pmh:oai:arXiv.org:2511.12249 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2511.12249 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2511.12249 |
| publication_date | 2025-11-15 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.= | 104, 113, 119 |
| abstract_inverted_index.a | 51 |
| abstract_inverted_index.In | 45 |
| abstract_inverted_index.We | 72 |
| abstract_inverted_index.as | 12 |
| abstract_inverted_index.at | 140 |
| abstract_inverted_index.in | 2, 32, 85, 124 |
| abstract_inverted_index.on | 101, 110 |
| abstract_inverted_index.to | 26, 67 |
| abstract_inverted_index.we | 48 |
| abstract_inverted_index.(AP | 112 |
| abstract_inverted_index.(F1 | 103 |
| abstract_inverted_index.Our | 133 |
| abstract_inverted_index.WSD | 89, 102 |
| abstract_inverted_index.and | 17, 38, 64, 90, 106, 115, 129, 136 |
| abstract_inverted_index.are | 138 |
| abstract_inverted_index.but | 20 |
| abstract_inverted_index.for | 41, 54, 81 |
| abstract_inverted_index.has | 23 |
| abstract_inverted_index.its | 122 |
| abstract_inverted_index.rho | 118 |
| abstract_inverted_index.the | 76 |
| abstract_inverted_index.Word | 13 |
| abstract_inverted_index.also | 73 |
| abstract_inverted_index.been | 24 |
| abstract_inverted_index.both | 88, 126 |
| abstract_inverted_index.data | 137 |
| abstract_inverted_index.have | 6 |
| abstract_inverted_index.like | 29 |
| abstract_inverted_index.most | 21 |
| abstract_inverted_index.show | 95 |
| abstract_inverted_index.such | 11 |
| abstract_inverted_index.that | 59, 96 |
| abstract_inverted_index.this | 46 |
| abstract_inverted_index.word | 4, 70 |
| abstract_inverted_index.(WSD) | 16 |
| abstract_inverted_index.0.87) | 105 |
| abstract_inverted_index.0.88) | 114 |
| abstract_inverted_index.Sense | 14 |
| abstract_inverted_index.ViCon | 111 |
| abstract_inverted_index.code, | 134 |
| abstract_inverted_index.first | 77 |
| abstract_inverted_index.lacks | 35 |
| abstract_inverted_index.novel | 52 |
| abstract_inverted_index.still | 34 |
| abstract_inverted_index.tasks | 10 |
| abstract_inverted_index.0.60), | 120 |
| abstract_inverted_index.Recent | 0 |
| abstract_inverted_index.better | 68 |
| abstract_inverted_index.graded | 130 |
| abstract_inverted_index.models | 37 |
| abstract_inverted_index.paper, | 47 |
| abstract_inverted_index.robust | 36 |
| abstract_inverted_index.senses | 128 |
| abstract_inverted_index.strong | 99 |
| abstract_inverted_index.capture | 69 |
| abstract_inverted_index.dataset | 80 |
| abstract_inverted_index.greatly | 7 |
| abstract_inverted_index.limited | 25 |
| abstract_inverted_index.models, | 135 |
| abstract_inverted_index.present | 49 |
| abstract_inverted_index.results | 94 |
| abstract_inverted_index.(SimCLR) | 63 |
| abstract_inverted_index.English. | 30 |
| abstract_inverted_index.achieves | 107 |
| abstract_inverted_index.advances | 1 |
| abstract_inverted_index.covering | 87 |
| abstract_inverted_index.discrete | 127 |
| abstract_inverted_index.improved | 8 |
| abstract_inverted_index.learning | 55, 62 |
| abstract_inverted_index.meaning. | 71 |
| abstract_inverted_index.modeling | 125 |
| abstract_inverted_index.progress | 22 |
| abstract_inverted_index.semantic | 9, 43, 83, 131 |
| abstract_inverted_index.ViConBERT | 97 |
| abstract_inverted_index.ViConWSD, | 75 |
| abstract_inverted_index.ViSim-400 | 116 |
| abstract_inverted_index.available | 139 |
| abstract_inverted_index.baselines | 100 |
| abstract_inverted_index.contrast, | 33 |
| abstract_inverted_index.framework | 53 |
| abstract_inverted_index.introduce | 74 |
| abstract_inverted_index.languages | 28 |
| abstract_inverted_index.resources | 40 |
| abstract_inverted_index.synthetic | 79 |
| abstract_inverted_index.ViConBERT, | 50 |
| abstract_inverted_index.Vietnamese | 56 |
| abstract_inverted_index.contextual | 18, 91 |
| abstract_inverted_index.embeddings | 5, 58 |
| abstract_inverted_index.evaluating | 82 |
| abstract_inverted_index.evaluation | 39 |
| abstract_inverted_index.integrates | 60 |
| abstract_inverted_index.relations. | 132 |
| abstract_inverted_index.(Spearman's | 117 |
| abstract_inverted_index.Vietnamese, | 31, 86 |
| abstract_inverted_index.competitive | 108 |
| abstract_inverted_index.contrastive | 61 |
| abstract_inverted_index.gloss-based | 65 |
| abstract_inverted_index.large-scale | 78 |
| abstract_inverted_index.outperforms | 98 |
| abstract_inverted_index.performance | 109 |
| abstract_inverted_index.similarity, | 19 |
| abstract_inverted_index.similarity. | 92 |
| abstract_inverted_index.Experimental | 93 |
| abstract_inverted_index.distillation | 66 |
| abstract_inverted_index.fine-grained | 42 |
| abstract_inverted_index.demonstrating | 121 |
| abstract_inverted_index.effectiveness | 123 |
| abstract_inverted_index.high-resource | 27 |
| abstract_inverted_index.understanding | 84 |
| abstract_inverted_index.Disambiguation | 15 |
| abstract_inverted_index.contextualized | 3, 57 |
| abstract_inverted_index.understanding. | 44 |
| abstract_inverted_index.https://github.com/tkhangg0910/ViConBERT | 141 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile | |
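
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows. The function name `rebuild_abstract` and its `max_tokens` parameter are illustrative, and the sample dictionary copies only a handful of rows from the table, so the output is limited to the opening words.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Dict[str, List[int]],
                     max_tokens: Optional[int] = None) -> str:
    """Rebuild text from a token -> positions map (hypothetical helper)."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Order tokens by their recorded position in the abstract.
    ordered = [by_position[pos] for pos in sorted(by_position)]
    if max_tokens is not None:
        ordered = ordered[:max_tokens]
    return " ".join(ordered)


# A few rows copied from the payload table (positions 0 to 10 are complete).
sample = {
    "Recent": [0],
    "advances": [1],
    "in": [2, 32, 85, 124],
    "contextualized": [3, 57],
    "word": [4, 70],
    "embeddings": [5, 58],
    "have": [6],
    "greatly": [7],
    "improved": [8],
    "semantic": [9, 43, 83, 131],
    "tasks": [10],
}

print(rebuild_abstract(sample, max_tokens=11))
# prints: Recent advances in contextualized word embeddings have greatly improved semantic tasks
```

Passing the full set of rows (positions 0 through 141) with `max_tokens` left as `None` should reproduce the abstract shown at the top of this record, punctuation included, since punctuation is stored attached to its token.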