ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.12249

Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT
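The abstract names SimCLR-style contrastive learning as one ingredient of the framework. As a rough illustration of that family of objectives, the sketch below computes the NT-Xent loss that SimCLR uses over paired embeddings. It is a generic, dependency-free sketch and not the authors' implementation: the function name `nt_xent_loss`, the temperature of 0.5, and the toy vectors are assumptions made for this example, and the abstract does not say how ViConBERT forms its positive pairs.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    Every other embedding in the combined batch of 2N acts as a negative.
    The temperature value is an assumption for this sketch.
    """
    n = len(view_a)
    z = [_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of row i
        # Similarities to every other embedding (self-similarity excluded).
        logits = [
            sum(p * q for p, q in zip(z[i], z[k])) / temperature
            for k in range(2 * n)
            if k != i
        ]
        pos_logit = sum(p * q for p, q in zip(z[i], z[pos])) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Toy vectors (hypothetical): row i of view_b is a slightly perturbed
# copy of row i of view_a, so the pairs are well aligned.
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_b = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]

aligned = nt_xent_loss(view_a, view_b)
mismatched = nt_xent_loss(view_a, view_b[1:] + view_b[:1])
print(aligned < mismatched)  # aligned pairs give the lower loss
```

The loss falls when each embedding is closer to its partner than to the rest of the batch, which is the property a contrastive objective relies on to separate senses.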

Metadata

OpenAlex ID: https://openalex.org/W4416354441 (canonical identifier for this work in OpenAlex)
DOI: https://doi.org/10.48550/arxiv.2511.12249 (Digital Object Identifier)
Title: ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
Type: preprint (OpenAlex work type)
Publication year: 2025
Publication date: 2025-11-15
Authors: Khang T. Huynh, Binh Thanh Nguyen (in order)
Landing page: https://arxiv.org/abs/2511.12249
PDF URL: https://arxiv.org/pdf/2511.12249 (direct link to full-text PDF)
Open access: Yes (a free full text is available)
OA status: green (open access status per OpenAlex)
OA URL: https://arxiv.org/pdf/2511.12249
Cited by: 0 (total citation count in OpenAlex)

Full payload

id	https://openalex.org/W4416354441
doi	https://doi.org/10.48550/arxiv.2511.12249
ids.doi	https://doi.org/10.48550/arxiv.2511.12249
ids.openalex	https://openalex.org/W4416354441
fwci
type	preprint
title	ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language
locations[0].id	pmh:oai:arXiv.org:2511.12249
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2511.12249
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2511.12249
locations[1].id	doi:10.48550/arxiv.2511.12249
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license	cc-by
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id	https://openalex.org/licenses/cc-by
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2511.12249
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5026014342
authorships[0].author.orcid	https://orcid.org/0000-0002-2585-5756
authorships[0].author.display_name	Khang T. Huynh
authorships[0].author_position	first
authorships[0].raw_author_name	Huynh, Khang T.
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5073338061
authorships[1].author.orcid	https://orcid.org/0000-0002-5925-5273
authorships[1].author.display_name	Binh Thanh Nguyen
authorships[1].author_position	middle
authorships[1].raw_author_name	Nguyen, Binh T.
authorships[1].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2511.12249
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-11-19T00:00:00
display_name	ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations
has_fulltext	False
is_retracted	False
updated_date	2025-11-28T11:59:20.734326
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2511.12249
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2511.12249
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2511.12249
primary_location.id	pmh:oai:arXiv.org:2511.12249
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2511.12249
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2511.12249
publication_date	2025-11-15
publication_year	2025
referenced_works_count	0
abstract_inverted_index.=	104, 113, 119
abstract_inverted_index.a	51
abstract_inverted_index.In	45
abstract_inverted_index.We	72
abstract_inverted_index.as	12
abstract_inverted_index.at	140
abstract_inverted_index.in	2, 32, 85, 124
abstract_inverted_index.on	101, 110
abstract_inverted_index.to	26, 67
abstract_inverted_index.we	48
abstract_inverted_index.(AP	112
abstract_inverted_index.(F1	103
abstract_inverted_index.Our	133
abstract_inverted_index.WSD	89, 102
abstract_inverted_index.and	17, 38, 64, 90, 106, 115, 129, 136
abstract_inverted_index.are	138
abstract_inverted_index.but	20
abstract_inverted_index.for	41, 54, 81
abstract_inverted_index.has	23
abstract_inverted_index.its	122
abstract_inverted_index.rho	118
abstract_inverted_index.the	76
abstract_inverted_index.Word	13
abstract_inverted_index.also	73
abstract_inverted_index.been	24
abstract_inverted_index.both	88, 126
abstract_inverted_index.data	137
abstract_inverted_index.have	6
abstract_inverted_index.like	29
abstract_inverted_index.most	21
abstract_inverted_index.show	95
abstract_inverted_index.such	11
abstract_inverted_index.that	59, 96
abstract_inverted_index.this	46
abstract_inverted_index.word	4, 70
abstract_inverted_index.(WSD)	16
abstract_inverted_index.0.87)	105
abstract_inverted_index.0.88)	114
abstract_inverted_index.Sense	14
abstract_inverted_index.ViCon	111
abstract_inverted_index.code,	134
abstract_inverted_index.first	77
abstract_inverted_index.lacks	35
abstract_inverted_index.novel	52
abstract_inverted_index.still	34
abstract_inverted_index.tasks	10
abstract_inverted_index.0.60),	120
abstract_inverted_index.Recent	0
abstract_inverted_index.better	68
abstract_inverted_index.graded	130
abstract_inverted_index.models	37
abstract_inverted_index.paper,	47
abstract_inverted_index.robust	36
abstract_inverted_index.senses	128
abstract_inverted_index.strong	99
abstract_inverted_index.capture	69
abstract_inverted_index.dataset	80
abstract_inverted_index.greatly	7
abstract_inverted_index.limited	25
abstract_inverted_index.models,	135
abstract_inverted_index.present	49
abstract_inverted_index.results	94
abstract_inverted_index.(SimCLR)	63
abstract_inverted_index.English.	30
abstract_inverted_index.achieves	107
abstract_inverted_index.advances	1
abstract_inverted_index.covering	87
abstract_inverted_index.discrete	127
abstract_inverted_index.improved	8
abstract_inverted_index.learning	55, 62
abstract_inverted_index.meaning.	71
abstract_inverted_index.modeling	125
abstract_inverted_index.progress	22
abstract_inverted_index.semantic	9, 43, 83, 131
abstract_inverted_index.ViConBERT	97
abstract_inverted_index.ViConWSD,	75
abstract_inverted_index.ViSim-400	116
abstract_inverted_index.available	139
abstract_inverted_index.baselines	100
abstract_inverted_index.contrast,	33
abstract_inverted_index.framework	53
abstract_inverted_index.introduce	74
abstract_inverted_index.languages	28
abstract_inverted_index.resources	40
abstract_inverted_index.synthetic	79
abstract_inverted_index.ViConBERT,	50
abstract_inverted_index.Vietnamese	56
abstract_inverted_index.contextual	18, 91
abstract_inverted_index.embeddings	5, 58
abstract_inverted_index.evaluating	82
abstract_inverted_index.evaluation	39
abstract_inverted_index.integrates	60
abstract_inverted_index.relations.	132
abstract_inverted_index.(Spearman's	117
abstract_inverted_index.Vietnamese,	31, 86
abstract_inverted_index.competitive	108
abstract_inverted_index.contrastive	61
abstract_inverted_index.gloss-based	65
abstract_inverted_index.large-scale	78
abstract_inverted_index.outperforms	98
abstract_inverted_index.performance	109
abstract_inverted_index.similarity,	19
abstract_inverted_index.similarity.	92
abstract_inverted_index.Experimental	93
abstract_inverted_index.distillation	66
abstract_inverted_index.fine-grained	42
abstract_inverted_index.demonstrating	121
abstract_inverted_index.effectiveness	123
abstract_inverted_index.high-resource	27
abstract_inverted_index.understanding	84
abstract_inverted_index.Disambiguation	15
abstract_inverted_index.contextualized	3, 57
abstract_inverted_index.understanding.	44
abstract_inverted_index.https://github.com/tkhangg0910/ViConBERT	141
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	2
citation_normalized_percentile
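
The abstract_inverted_index rows in the payload above store the abstract as a map from each token to the word positions where it occurs. The sketch below shows one way to turn such rows back into running text. It assumes the rows are tab-separated as in this dump; the helper names `parse_rows` and `rebuild_abstract` are made up for this example, and the sample rows are a six-token excerpt copied from the payload.

```python
PREFIX = "abstract_inverted_index."


def parse_rows(rows):
    """Turn flattened 'key<TAB>positions' rows into {token: [positions]}."""
    index = {}
    for row in rows:
        key, _, value = row.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue  # skip unrelated fields and empty values
        # Strip the prefix instead of splitting on dots, because tokens
        # such as "English." or a URL contain dots themselves.
        token = key[len(PREFIX):]
        index[token] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(inverted_index):
    """Place each token at its positions and join them in order."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt of the payload covering word positions 45 to 50.
sample_rows = [
    "abstract_inverted_index.In\t45",
    "abstract_inverted_index.we\t48",
    "abstract_inverted_index.this\t46",
    "abstract_inverted_index.paper,\t47",
    "abstract_inverted_index.present\t49",
    "abstract_inverted_index.ViConBERT,\t50",
]

print(rebuild_abstract(parse_rows(sample_rows)))
# In this paper, we present ViConBERT,
```

Running the same two functions over all abstract_inverted_index rows should reproduce the full abstract, since every token position from 0 to 141 appears exactly once in the index.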