Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2403.10882
Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, overlooking less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on the publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction-tuning was performed to augment the LRL. The experiments employed the Llama2 model and Korean was used as the LRL, which was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that our proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
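
The first strategy in the abstract, expanding the model vocabulary for the less-resourced language, can be illustrated with a toy sketch. This is not the authors' implementation: the function name `expand_vocabulary`, the toy vocabulary, and the choice to initialize new embedding rows with the mean of the existing rows are all assumptions made for illustration.

```python
def expand_vocabulary(vocab, embeddings, new_tokens):
    """Return copies of vocab and embeddings extended with unseen tokens.

    vocab:      dict mapping token -> row index in embeddings
    embeddings: list of equal-length float lists, one row per token
    new_tokens: candidate tokens to add (already-known tokens are skipped)
    """
    dim = len(embeddings[0])
    # Assumption: new rows start at the mean of the existing rows.
    mean_row = [sum(row[d] for row in embeddings) / len(embeddings) for d in range(dim)]

    vocab = dict(vocab)
    embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token in vocab:
            continue  # keep the existing id and embedding
        vocab[token] = len(embeddings)
        embeddings.append(list(mean_row))
    return vocab, embeddings


# Hypothetical toy data: three existing tokens with 2-dimensional embeddings.
toy_vocab = {"<unk>": 0, "hello": 1, "world": 2}
toy_embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]

new_vocab, new_embeddings = expand_vocabulary(
    toy_vocab, toy_embeddings, ["안녕", "세계", "hello"]
)
```

In this sketch the two Korean tokens receive ids 3 and 4, while the already-known `hello` keeps id 1. In a real model the new rows would then be trained, which is where the bilingual pretraining described in the abstract would come in.
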
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2403.10882 (PDF: https://arxiv.org/pdf/2403.10882)
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4393023353
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4393023353 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2403.10882 (Digital Object Identifier)
- Title: Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-03-16 (full publication date if available)
- Authors: Chang-Su Choi, Yongbin Jeong, Seoyoon Park, I. J. Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, Hyejin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim (list of authors in order)
- Landing page: https://arxiv.org/abs/2403.10882 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2403.10882 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2403.10882 (direct OA link when available)
- Concepts: Computer science, Natural language processing, Linguistics, Artificial intelligence, Philosophy (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| field | value |
|---|---|
| id | https://openalex.org/W4393023353 |
| doi | https://doi.org/10.48550/arxiv.2403.10882 |
| ids.doi | https://doi.org/10.48550/arxiv.2403.10882 |
| ids.openalex | https://openalex.org/W4393023353 |
| fwci | |
| type | preprint |
| title | Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.8047999739646912 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.7616000175476074 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.6130470037460327 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C204321447 |
| concepts[1].level | 1 |
| concepts[1].score | 0.5331721901893616 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[1].display_name | Natural language processing |
| concepts[2].id | https://openalex.org/C41895202 |
| concepts[2].level | 1 |
| concepts[2].score | 0.5231803059577942 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[2].display_name | Linguistics |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.39673835039138794 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C138885662 |
| concepts[4].level | 0 |
| concepts[4].score | 0.04878994822502136 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[4].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.6130470037460327 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/natural-language-processing |
| keywords[1].score | 0.5331721901893616 |
| keywords[1].display_name | Natural language processing |
| keywords[2].id | https://openalex.org/keywords/linguistics |
| keywords[2].score | 0.5231803059577942 |
| keywords[2].display_name | Linguistics |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.39673835039138794 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/philosophy |
| keywords[4].score | 0.04878994822502136 |
| keywords[4].display_name | Philosophy |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2403.10882 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2403.10882 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2403.10882 |
| locations[1].id | doi:10.48550/arxiv.2403.10882 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2403.10882 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5026718067 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Chang-Su Choi |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Choi, ChangSu |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5082589351 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0311-6629 |
| authorships[1].author.display_name | Yongbin Jeong |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Jeong, Yongbin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5080415861 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Seoyoon Park |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Park, Seoyoon |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5114137729 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | I. J. Won |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Won, InHo |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5056276165 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | HyeonSeok Lim |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Lim, HyeonSeok |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5109260466 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | SangMin Kim |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Kim, SangMin |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5033003198 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Yejee Kang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Kang, Yejee |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5105921933 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Chanhyuk Yoon |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Yoon, Chanhyuk |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5072948361 |
| authorships[8].author.orcid | https://orcid.org/0000-0003-1833-6621 |
| authorships[8].author.display_name | Jaewan Park |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Park, Jaewan |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5028812274 |
| authorships[9].author.orcid | |
| authorships[9].author.display_name | Yiseul Lee |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Lee, Yiseul |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5100726930 |
| authorships[10].author.orcid | https://orcid.org/0000-0003-4034-082X |
| authorships[10].author.display_name | Hyejin Lee |
| authorships[10].author_position | middle |
| authorships[10].raw_author_name | Lee, HyeJin |
| authorships[10].is_corresponding | False |
| authorships[11].author.id | https://openalex.org/A5019023497 |
| authorships[11].author.orcid | |
| authorships[11].author.display_name | Younggyun Hahm |
| authorships[11].author_position | middle |
| authorships[11].raw_author_name | Hahm, Younggyun |
| authorships[11].is_corresponding | False |
| authorships[12].author.id | https://openalex.org/A5074151814 |
| authorships[12].author.orcid | https://orcid.org/0000-0003-1024-4052 |
| authorships[12].author.display_name | Hansaem Kim |
| authorships[12].author_position | middle |
| authorships[12].raw_author_name | Kim, Hansaem |
| authorships[12].is_corresponding | False |
| authorships[13].author.id | https://openalex.org/A5003224328 |
| authorships[13].author.orcid | https://orcid.org/0000-0002-5818-1161 |
| authorships[13].author.display_name | KyungTae Lim |
| authorships[13].author_position | last |
| authorships[13].raw_author_name | Lim, KyungTae |
| authorships[13].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2403.10882 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-03-21T00:00:00 |
| display_name | Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.8047999739646912 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W2382290278, https://openalex.org/W2478288626, https://openalex.org/W4391913857, https://openalex.org/W2350741829, https://openalex.org/W3204019825 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2403.10882 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2403.10882 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2403.10882 |
| primary_location.id | pmh:oai:arXiv.org:2403.10882 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2403.10882 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2403.10882 |
| publication_date | 2024-03-16 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 81, 121 |
| abstract_inverted_index.as | 106 |
| abstract_inverted_index.in | 143 |
| abstract_inverted_index.of | 47, 59 |
| abstract_inverted_index.on | 50, 127 |
| abstract_inverted_index.to | 6, 30, 43, 63, 73, 92, 147 |
| abstract_inverted_index.The | 96 |
| abstract_inverted_index.and | 22, 77, 88, 102, 130 |
| abstract_inverted_index.big | 19 |
| abstract_inverted_index.for | 71 |
| abstract_inverted_index.our | 136 |
| abstract_inverted_index.the | 8, 45, 51, 56, 75, 94, 99, 107 |
| abstract_inverted_index.use | 4 |
| abstract_inverted_index.was | 86, 90, 104, 110, 124 |
| abstract_inverted_index.LLMs | 28, 116 |
| abstract_inverted_index.LRL, | 108 |
| abstract_inverted_index.LRL. | 95 |
| abstract_inverted_index.LRLs | 48, 60 |
| abstract_inverted_index.MLLM | 57 |
| abstract_inverted_index.This | 38 |
| abstract_inverted_index.data | 68 |
| abstract_inverted_index.have | 25 |
| abstract_inverted_index.meet | 31 |
| abstract_inverted_index.tech | 20 |
| abstract_inverted_index.that | 135 |
| abstract_inverted_index.used | 70, 105 |
| abstract_inverted_index.were | 61, 69 |
| abstract_inverted_index.GPT4. | 131 |
| abstract_inverted_index.Large | 0 |
| abstract_inverted_index.align | 74 |
| abstract_inverted_index.based | 49, 126 |
| abstract_inverted_index.eight | 118 |
| abstract_inverted_index.high- | 76 |
| abstract_inverted_index.human | 128 |
| abstract_inverted_index.model | 101, 139 |
| abstract_inverted_index.other | 114 |
| abstract_inverted_index.study | 39 |
| abstract_inverted_index.their | 12 |
| abstract_inverted_index.three | 41 |
| abstract_inverted_index.which | 109 |
| abstract_inverted_index.word; | 10 |
| abstract_inverted_index.(LLMs) | 3 |
| abstract_inverted_index.First, | 55 |
| abstract_inverted_index.Korean | 103, 150 |
| abstract_inverted_index.Llama2 | 100 |
| abstract_inverted_index.MLLMs. | 54 |
| abstract_inverted_index.Third, | 80 |
| abstract_inverted_index.across | 117 |
| abstract_inverted_index.models | 2 |
| abstract_inverted_index.showed | 134 |
| abstract_inverted_index.tasks. | 119 |
| abstract_inverted_index.(LRLs). | 37 |
| abstract_inverted_index.(MLLMs) | 29 |
| abstract_inverted_index.Second, | 66 |
| abstract_inverted_index.against | 113 |
| abstract_inverted_index.augment | 93 |
| abstract_inverted_index.current | 32 |
| abstract_inverted_index.dataset | 85 |
| abstract_inverted_index.enhance | 44, 64 |
| abstract_inverted_index.models. | 152 |
| abstract_inverted_index.predict | 7 |
| abstract_inverted_index.results | 133 |
| abstract_inverted_index.Bllossom | 138 |
| abstract_inverted_index.Numerous | 18 |
| abstract_inverted_index.analyses | 145 |
| abstract_inverted_index.compared | 146 |
| abstract_inverted_index.demands, | 33 |
| abstract_inverted_index.employed | 98 |
| abstract_inverted_index.expanded | 62 |
| abstract_inverted_index.however, | 11 |
| abstract_inverted_index.language | 1 |
| abstract_inverted_index.proposed | 40, 137, 149 |
| abstract_inverted_index.publicly | 52 |
| abstract_inverted_index.requires | 14 |
| abstract_inverted_index.research | 23 |
| abstract_inverted_index.superior | 141 |
| abstract_inverted_index.available | 53 |
| abstract_inverted_index.bilingual | 67 |
| abstract_inverted_index.companies | 21 |
| abstract_inverted_index.computing | 16 |
| abstract_inverted_index.developed | 26, 115 |
| abstract_inverted_index.evaluated | 112 |
| abstract_inverted_index.exhibited | 140 |
| abstract_inverted_index.expansion | 13 |
| abstract_inverted_index.languages | 36 |
| abstract_inverted_index.performed | 91, 125 |
| abstract_inverted_index.assessment | 123 |
| abstract_inverted_index.evaluation | 129 |
| abstract_inverted_index.institutes | 24 |
| abstract_inverted_index.languages. | 79 |
| abstract_inverted_index.previously | 148 |
| abstract_inverted_index.resources. | 17 |
| abstract_inverted_index.strategies | 42 |
| abstract_inverted_index.subsequent | 9 |
| abstract_inverted_index.constructed | 87 |
| abstract_inverted_index.experiments | 97 |
| abstract_inverted_index.instruction | 84 |
| abstract_inverted_index.monolingual | 151 |
| abstract_inverted_index.overlooking | 34 |
| abstract_inverted_index.performance | 46, 142 |
| abstract_inverted_index.pretraining | 5, 72 |
| abstract_inverted_index.qualitative | 122, 144 |
| abstract_inverted_index.significant | 15 |
| abstract_inverted_index.small-scale | 83 |
| abstract_inverted_index.Experimental | 132 |
| abstract_inverted_index.Furthermore, | 120 |
| abstract_inverted_index.high-quality | 82 |
| abstract_inverted_index.multilingual | 27 |
| abstract_inverted_index.vocabularies | 58 |
| abstract_inverted_index.less-resourced | 35, 78 |
| abstract_inverted_index.quantitatively | 111 |
| abstract_inverted_index.expressiveness. | 65 |
| abstract_inverted_index.instruction-tuning | 89 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 14 |
| citation_normalized_percentile | |
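
The `abstract_inverted_index` rows above store the abstract as a map from each word to the positions where it occurs, rather than as running text. A minimal sketch of turning such a map back into text follows. It assumes the payload has already been parsed into a Python dict of integer lists; the function name `rebuild_abstract` is hypothetical, and `sample_index` copies only the first eleven positions from the table.

```python
def rebuild_abstract(inverted_index):
    """Rebuild running text from a word -> list-of-positions mapping."""
    words_by_position = {}
    for word, positions in inverted_index.items():
        for position in positions:
            words_by_position[position] = word
    # Emit words in position order; positions missing from the index are skipped.
    return " ".join(words_by_position[p] for p in sorted(words_by_position))


# Positions 0-10, copied from the abstract_inverted_index rows in the table.
sample_index = {
    "Large": [0],
    "language": [1],
    "models": [2],
    "(LLMs)": [3],
    "use": [4],
    "pretraining": [5],
    "to": [6],
    "predict": [7],
    "the": [8],
    "subsequent": [9],
    "word;": [10],
}

opening = rebuild_abstract(sample_index)
print(opening)  # Large language models (LLMs) use pretraining to predict the subsequent word;
```

Applied to the full index, the same function should reproduce the abstract shown at the top of the record, since every position from 0 to 152 appears under exactly one word.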