Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

Chang-Su Choi, Yongbin Jeong, Seoyoon Park, I. J. Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, Hyejin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim

2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2403.10882

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
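
The first strategy in the abstract, expanding the model vocabulary for the less-resourced language, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `expand_vocabulary`, the list-of-lists embedding table, and the choice to initialise each new row with the mean of the existing rows are all assumptions made for illustration only.

```python
from statistics import fmean


def expand_vocabulary(vocab, embeddings, new_tokens):
    """Return copies of vocab and embeddings extended with unseen tokens.

    vocab maps token -> row index; embeddings is a list of equal-length rows.
    Each new token gets a row set to the column-wise mean of the original
    rows (an assumed initialisation, not taken from the paper).
    """
    dim = len(embeddings[0])
    mean_row = [fmean(row[i] for row in embeddings) for i in range(dim)]

    vocab = dict(vocab)                          # do not mutate the inputs
    embeddings = [list(row) for row in embeddings]

    for token in new_tokens:
        if token in vocab:                       # already covered, skip
            continue
        vocab[token] = len(embeddings)           # next free row index
        embeddings.append(list(mean_row))
    return vocab, embeddings


# Hypothetical two-token vocabulary extended with two Korean tokens.
base_vocab = {"a": 0, "b": 1}
base_embeddings = [[0.0, 2.0], [2.0, 4.0]]
new_vocab, new_embeddings = expand_vocabulary(
    base_vocab, base_embeddings, ["한국", "a", "어"]
)
```

In this sketch a token that is already present is left untouched, so existing rows keep their indices and only the tail of the table grows.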

Related Topics

Computer Science

Artificial Intelligence

Philosophy

Concepts

Computer science, Natural language processing, Linguistics, Artificial intelligence, Philosophy

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2403.10882
PDF: https://arxiv.org/pdf/2403.10882
OA Status: green
Cited By: 1
Related Works: 10
OpenAlex ID: https://openalex.org/W4393023353

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4393023353
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2403.10882
Digital Object Identifier

Title: Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2024
Year of publication

Publication date: 2024-03-16
Full publication date if available

Authors: Chang-Su Choi, Yongbin Jeong, Seoyoon Park, I. J. Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, Hyejin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim
List of authors in order

Landing page: https://arxiv.org/abs/2403.10882
Publisher landing page

PDF URL: https://arxiv.org/pdf/2403.10882
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2403.10882
Direct OA link when available

Concepts: Computer science, Natural language processing, Linguistics, Artificial intelligence, Philosophy
Top concepts (fields/topics) attached by OpenAlex

Cited by: 1
Total citation count in OpenAlex

Citations by year (recent): 2025: 1
Per-year citation counts (last 5 years)

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4393023353
doi	https://doi.org/10.48550/arxiv.2403.10882
ids.doi	https://doi.org/10.48550/arxiv.2403.10882
ids.openalex	https://openalex.org/W4393023353
fwci
type	preprint
title	Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10028
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.8047999739646912
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Topic Modeling
topics[1].id	https://openalex.org/T10181
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.7616000175476074
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Natural Language Processing Techniques
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C41008148
concepts[0].level	0
concepts[0].score	0.6130470037460327
concepts[0].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[0].display_name	Computer science
concepts[1].id	https://openalex.org/C204321447
concepts[1].level	1
concepts[1].score	0.5331721901893616
concepts[1].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[1].display_name	Natural language processing
concepts[2].id	https://openalex.org/C41895202
concepts[2].level	1
concepts[2].score	0.5231803059577942
concepts[2].wikidata	https://www.wikidata.org/wiki/Q8162
concepts[2].display_name	Linguistics
concepts[3].id	https://openalex.org/C154945302
concepts[3].level	1
concepts[3].score	0.39673835039138794
concepts[3].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[3].display_name	Artificial intelligence
concepts[4].id	https://openalex.org/C138885662
concepts[4].level	0
concepts[4].score	0.04878994822502136
concepts[4].wikidata	https://www.wikidata.org/wiki/Q5891
concepts[4].display_name	Philosophy
keywords[0].id	https://openalex.org/keywords/computer-science
keywords[0].score	0.6130470037460327
keywords[0].display_name	Computer science
keywords[1].id	https://openalex.org/keywords/natural-language-processing
keywords[1].score	0.5331721901893616
keywords[1].display_name	Natural language processing
keywords[2].id	https://openalex.org/keywords/linguistics
keywords[2].score	0.5231803059577942
keywords[2].display_name	Linguistics
keywords[3].id	https://openalex.org/keywords/artificial-intelligence
keywords[3].score	0.39673835039138794
keywords[3].display_name	Artificial intelligence
keywords[4].id	https://openalex.org/keywords/philosophy
keywords[4].score	0.04878994822502136
keywords[4].display_name	Philosophy
language	en
locations[0].id	pmh:oai:arXiv.org:2403.10882
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2403.10882
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2403.10882
locations[1].id	doi:10.48550/arxiv.2403.10882
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2403.10882
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5026718067
authorships[0].author.orcid
authorships[0].author.display_name	Chang-Su Choi
authorships[0].author_position	first
authorships[0].raw_author_name	Choi, ChangSu
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5082589351
authorships[1].author.orcid	https://orcid.org/0000-0002-0311-6629
authorships[1].author.display_name	Yongbin Jeong
authorships[1].author_position	middle
authorships[1].raw_author_name	Jeong, Yongbin
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5080415861
authorships[2].author.orcid
authorships[2].author.display_name	Seoyoon Park
authorships[2].author_position	middle
authorships[2].raw_author_name	Park, Seoyoon
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5114137729
authorships[3].author.orcid
authorships[3].author.display_name	I. J. Won
authorships[3].author_position	middle
authorships[3].raw_author_name	Won, InHo
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5056276165
authorships[4].author.orcid
authorships[4].author.display_name	HyeonSeok Lim
authorships[4].author_position	middle
authorships[4].raw_author_name	Lim, HyeonSeok
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5109260466
authorships[5].author.orcid
authorships[5].author.display_name	SangMin Kim
authorships[5].author_position	middle
authorships[5].raw_author_name	Kim, SangMin
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5033003198
authorships[6].author.orcid
authorships[6].author.display_name	Yejee Kang
authorships[6].author_position	middle
authorships[6].raw_author_name	Kang, Yejee
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5105921933
authorships[7].author.orcid
authorships[7].author.display_name	Chanhyuk Yoon
authorships[7].author_position	middle
authorships[7].raw_author_name	Yoon, Chanhyuk
authorships[7].is_corresponding	False
authorships[8].author.id	https://openalex.org/A5072948361
authorships[8].author.orcid	https://orcid.org/0000-0003-1833-6621
authorships[8].author.display_name	Jaewan Park
authorships[8].author_position	middle
authorships[8].raw_author_name	Park, Jaewan
authorships[8].is_corresponding	False
authorships[9].author.id	https://openalex.org/A5028812274
authorships[9].author.orcid
authorships[9].author.display_name	Yiseul Lee
authorships[9].author_position	middle
authorships[9].raw_author_name	Lee, Yiseul
authorships[9].is_corresponding	False
authorships[10].author.id	https://openalex.org/A5100726930
authorships[10].author.orcid	https://orcid.org/0000-0003-4034-082X
authorships[10].author.display_name	Hyejin Lee
authorships[10].author_position	middle
authorships[10].raw_author_name	Lee, HyeJin
authorships[10].is_corresponding	False
authorships[11].author.id	https://openalex.org/A5019023497
authorships[11].author.orcid
authorships[11].author.display_name	Younggyun Hahm
authorships[11].author_position	middle
authorships[11].raw_author_name	Hahm, Younggyun
authorships[11].is_corresponding	False
authorships[12].author.id	https://openalex.org/A5074151814
authorships[12].author.orcid	https://orcid.org/0000-0003-1024-4052
authorships[12].author.display_name	Hansaem Kim
authorships[12].author_position	middle
authorships[12].raw_author_name	Kim, Hansaem
authorships[12].is_corresponding	False
authorships[13].author.id	https://openalex.org/A5003224328
authorships[13].author.orcid	https://orcid.org/0000-0002-5818-1161
authorships[13].author.display_name	KyungTae Lim
authorships[13].author_position	last
authorships[13].raw_author_name	Lim, KyungTae
authorships[13].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2403.10882
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2024-03-21T00:00:00
display_name	Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10028
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.8047999739646912
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Topic Modeling
related_works	https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W2478288626, https://openalex.org/W4391913857, https://openalex.org/W2350741829, https://openalex.org/W3204019825
cited_by_count	1
counts_by_year[0].year	2025
counts_by_year[0].cited_by_count	1
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2403.10882
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2403.10882
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2403.10882
primary_location.id	pmh:oai:arXiv.org:2403.10882
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2403.10882
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2403.10882
publication_date	2024-03-16
publication_year	2024
referenced_works_count	0
abstract_inverted_index.a	81, 121
abstract_inverted_index.as	106
abstract_inverted_index.in	143
abstract_inverted_index.of	47, 59
abstract_inverted_index.on	50, 127
abstract_inverted_index.to	6, 30, 43, 63, 73, 92, 147
abstract_inverted_index.The	96
abstract_inverted_index.and	22, 77, 88, 102, 130
abstract_inverted_index.big	19
abstract_inverted_index.for	71
abstract_inverted_index.our	136
abstract_inverted_index.the	8, 45, 51, 56, 75, 94, 99, 107
abstract_inverted_index.use	4
abstract_inverted_index.was	86, 90, 104, 110, 124
abstract_inverted_index.LLMs	28, 116
abstract_inverted_index.LRL,	108
abstract_inverted_index.LRL.	95
abstract_inverted_index.LRLs	48, 60
abstract_inverted_index.MLLM	57
abstract_inverted_index.This	38
abstract_inverted_index.data	68
abstract_inverted_index.have	25
abstract_inverted_index.meet	31
abstract_inverted_index.tech	20
abstract_inverted_index.that	135
abstract_inverted_index.used	70, 105
abstract_inverted_index.were	61, 69
abstract_inverted_index.GPT4.	131
abstract_inverted_index.Large	0
abstract_inverted_index.align	74
abstract_inverted_index.based	49, 126
abstract_inverted_index.eight	118
abstract_inverted_index.high-	76
abstract_inverted_index.human	128
abstract_inverted_index.model	101, 139
abstract_inverted_index.other	114
abstract_inverted_index.study	39
abstract_inverted_index.their	12
abstract_inverted_index.three	41
abstract_inverted_index.which	109
abstract_inverted_index.word;	10
abstract_inverted_index.(LLMs)	3
abstract_inverted_index.First,	55
abstract_inverted_index.Korean	103, 150
abstract_inverted_index.Llama2	100
abstract_inverted_index.MLLMs.	54
abstract_inverted_index.Third,	80
abstract_inverted_index.across	117
abstract_inverted_index.models	2
abstract_inverted_index.showed	134
abstract_inverted_index.tasks.	119
abstract_inverted_index.(LRLs).	37
abstract_inverted_index.(MLLMs)	29
abstract_inverted_index.Second,	66
abstract_inverted_index.against	113
abstract_inverted_index.augment	93
abstract_inverted_index.current	32
abstract_inverted_index.dataset	85
abstract_inverted_index.enhance	44, 64
abstract_inverted_index.models.	152
abstract_inverted_index.predict	7
abstract_inverted_index.results	133
abstract_inverted_index.Bllossom	138
abstract_inverted_index.Numerous	18
abstract_inverted_index.analyses	145
abstract_inverted_index.compared	146
abstract_inverted_index.demands,	33
abstract_inverted_index.employed	98
abstract_inverted_index.expanded	62
abstract_inverted_index.however,	11
abstract_inverted_index.language	1
abstract_inverted_index.proposed	40, 137, 149
abstract_inverted_index.publicly	52
abstract_inverted_index.requires	14
abstract_inverted_index.research	23
abstract_inverted_index.superior	141
abstract_inverted_index.available	53
abstract_inverted_index.bilingual	67
abstract_inverted_index.companies	21
abstract_inverted_index.computing	16
abstract_inverted_index.developed	26, 115
abstract_inverted_index.evaluated	112
abstract_inverted_index.exhibited	140
abstract_inverted_index.expansion	13
abstract_inverted_index.languages	36
abstract_inverted_index.performed	91, 125
abstract_inverted_index.assessment	123
abstract_inverted_index.evaluation	129
abstract_inverted_index.institutes	24
abstract_inverted_index.languages.	79
abstract_inverted_index.previously	148
abstract_inverted_index.resources.	17
abstract_inverted_index.strategies	42
abstract_inverted_index.subsequent	9
abstract_inverted_index.constructed	87
abstract_inverted_index.experiments	97
abstract_inverted_index.instruction	84
abstract_inverted_index.monolingual	151
abstract_inverted_index.overlooking	34
abstract_inverted_index.performance	46, 142
abstract_inverted_index.pretraining	5, 72
abstract_inverted_index.qualitative	122, 144
abstract_inverted_index.significant	15
abstract_inverted_index.small-scale	83
abstract_inverted_index.Experimental	132
abstract_inverted_index.Furthermore,	120
abstract_inverted_index.high-quality	82
abstract_inverted_index.multilingual	27
abstract_inverted_index.vocabularies	58
abstract_inverted_index.less-resourced	35, 78
abstract_inverted_index.quantitatively	111
abstract_inverted_index.expressiveness.	65
abstract_inverted_index.instruction-tuning	89
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	14
citation_normalized_percentile
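
The `abstract_inverted_index.*` entries in the payload map each word of the abstract to the zero-based positions where it occurs. A minimal sketch of turning such a mapping back into running text follows. The function name `rebuild_abstract` is hypothetical, and the sample dictionary is an excerpt limited to positions 0 through 10 of the payload, so words that also occur later (such as "to" and "the") carry only their first position here.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a word -> list-of-positions mapping."""
    by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            by_position[position] = word
    # Emit the words in position order, separated by single spaces.
    return " ".join(by_position[p] for p in sorted(by_position))


# Excerpt of the payload: only the first eleven positions are included.
sample_index = {
    "Large": [0],
    "language": [1],
    "models": [2],
    "(LLMs)": [3],
    "use": [4],
    "pretraining": [5],
    "to": [6],
    "predict": [7],
    "the": [8],
    "subsequent": [9],
    "word;": [10],
}

opening = rebuild_abstract(sample_index)
print(opening)
```

Passing the full set of payload entries instead of the excerpt would reproduce the whole abstract, since punctuation is stored attached to each word.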