Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language

Ronny Paul, Himanshu Buckchash, Shantipriya Parida, Dilip K. Prasad

2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2405.05777

Sámi, an indigenous language group comprising multiple languages, faces digital marginalization due to the limited availability of data and of sophisticated language models designed for its linguistic intricacies. This work focuses on increasing technological participation for the Sámi language. We draw the attention of the ML community to the language modeling problem of Ultra Low Resource (ULR) languages, that is, languages for which both the amount of available textual resources and the number of speakers are very low. ULR languages (ULRLs) are also not supported by mainstream Large Language Models (LLMs) like ChatGPT, which makes gathering artificial training data for them even more challenging. Mainstream AI foundational model development has given less attention to this category of languages. Generally, these languages have very few speakers, making it hard to find them. However, it is important to develop foundational models for these ULR languages to promote inclusion and the tangible abilities and impact of LLMs. To this end, we have compiled the available Sámi language resources from the web to create a clean dataset for training language models. To study the behavior of modern LLMs with ULR languages (Sámi), we have experimented with different kinds of LLMs, mainly on the order of $\sim$ seven billion parameters. We have also explored the effect of multilingual LLM training for ULRLs. We found that decoder-only models perform better under a sequential multilingual training scenario than under joint multilingual training, whereas multilingual training with high semantic overlap, in general, performs better than training from scratch. This is the first study on the Sámi language to adapt non-statistical language models that use the latest developments in the field of natural language processing (NLP).
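
The abstract contrasts sequential and joint multilingual training without defining either term. The sketch below illustrates one common reading only, and is not the authors' pipeline: joint training shuffles documents from all languages into one stream, while sequential training finishes one language before starting the next. The function names, the language labels `related_lang` and `sami`, and the placeholder documents are all assumptions made for illustration.

```python
import random


def joint_schedule(corpora, seed=0):
    """Mix documents from every language into one shuffled stream."""
    rng = random.Random(seed)
    mixed = [(lang, doc) for lang, docs in corpora.items() for doc in docs]
    rng.shuffle(mixed)
    return mixed


def sequential_schedule(corpora, order):
    """Visit languages one after another, in the given order."""
    return [(lang, doc) for lang in order for doc in corpora[lang]]


# Hypothetical toy corpora; the strings stand in for real documents.
corpora = {
    "related_lang": ["r1", "r2", "r3"],
    "sami": ["s1", "s2", "s3"],
}

print("joint:     ", [lang for lang, _ in joint_schedule(corpora)])
print("sequential:", [lang for lang, _ in sequential_schedule(corpora, ["related_lang", "sami"])])
```

Both schedules present the same documents; only the order in which the model would see them differs.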

Related Topics

Concepts

Training (meteorology), Linguistics, Computer science, Psychology, Philosophy, Geography, Meteorology

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2405.05777
PDF: https://arxiv.org/pdf/2405.05777
OA Status: green
Cited By: 1
Related Works: 10
OpenAlex ID: https://openalex.org/W4396822384

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4396822384
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2405.05777
Digital Object Identifier

Title: Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2024
Year of publication

Publication date: 2024-05-09
Full publication date if available

Authors: Ronny Paul, Himanshu Buckchash, Shantipriya Parida, Dilip K. Prasad
List of authors in order

Landing page: https://arxiv.org/abs/2405.05777
Publisher landing page

PDF URL: https://arxiv.org/pdf/2405.05777
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2405.05777
Direct OA link when available

Concepts: Training (meteorology), Linguistics, Computer science, Psychology, Philosophy, Geography, Meteorology
Top concepts (fields/topics) attached by OpenAlex

Cited by: 1
Total citation count in OpenAlex

Citations by year (recent): 2024: 1
Per-year citation counts (last 5 years)

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4396822384
doi	https://doi.org/10.48550/arxiv.2405.05777
ids.doi	https://doi.org/10.48550/arxiv.2405.05777
ids.openalex	https://openalex.org/W4396822384
fwci
type	preprint
title	Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10181
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.9523000121116638
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Natural Language Processing Techniques
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C2777211547
concepts[0].level	2
concepts[0].score	0.5811136960983276
concepts[0].wikidata	https://www.wikidata.org/wiki/Q17141490
concepts[0].display_name	Training (meteorology)
concepts[1].id	https://openalex.org/C41895202
concepts[1].level	1
concepts[1].score	0.44974616169929504
concepts[1].wikidata	https://www.wikidata.org/wiki/Q8162
concepts[1].display_name	Linguistics
concepts[2].id	https://openalex.org/C41008148
concepts[2].level	0
concepts[2].score	0.43026870489120483
concepts[2].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[2].display_name	Computer science
concepts[3].id	https://openalex.org/C15744967
concepts[3].level	0
concepts[3].score	0.3632943034172058
concepts[3].wikidata	https://www.wikidata.org/wiki/Q9418
concepts[3].display_name	Psychology
concepts[4].id	https://openalex.org/C138885662
concepts[4].level	0
concepts[4].score	0.08818888664245605
concepts[4].wikidata	https://www.wikidata.org/wiki/Q5891
concepts[4].display_name	Philosophy
concepts[5].id	https://openalex.org/C205649164
concepts[5].level	0
concepts[5].score	0.07212471961975098
concepts[5].wikidata	https://www.wikidata.org/wiki/Q1071
concepts[5].display_name	Geography
concepts[6].id	https://openalex.org/C153294291
concepts[6].level	1
concepts[6].score	0.0
concepts[6].wikidata	https://www.wikidata.org/wiki/Q25261
concepts[6].display_name	Meteorology
keywords[0].id	https://openalex.org/keywords/training
keywords[0].score	0.5811136960983276
keywords[0].display_name	Training (meteorology)
keywords[1].id	https://openalex.org/keywords/linguistics
keywords[1].score	0.44974616169929504
keywords[1].display_name	Linguistics
keywords[2].id	https://openalex.org/keywords/computer-science
keywords[2].score	0.43026870489120483
keywords[2].display_name	Computer science
keywords[3].id	https://openalex.org/keywords/psychology
keywords[3].score	0.3632943034172058
keywords[3].display_name	Psychology
keywords[4].id	https://openalex.org/keywords/philosophy
keywords[4].score	0.08818888664245605
keywords[4].display_name	Philosophy
keywords[5].id	https://openalex.org/keywords/geography
keywords[5].score	0.07212471961975098
keywords[5].display_name	Geography
language	en
locations[0].id	pmh:oai:arXiv.org:2405.05777
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2405.05777
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2405.05777
locations[1].id	doi:10.48550/arxiv.2405.05777
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2405.05777
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5104298524
authorships[0].author.orcid
authorships[0].author.display_name	Ronny Paul
authorships[0].author_position	first
authorships[0].raw_author_name	Paul, Ronny
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5018753704
authorships[1].author.orcid	https://orcid.org/0000-0003-3679-3498
authorships[1].author.display_name	Himanshu Buckchash
authorships[1].author_position	middle
authorships[1].raw_author_name	Buckchash, Himanshu
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5015497545
authorships[2].author.orcid	https://orcid.org/0000-0003-3387-6300
authorships[2].author.display_name	Shantipriya Parida
authorships[2].author_position	middle
authorships[2].raw_author_name	Parida, Shantipriya
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5064348917
authorships[3].author.orcid	https://orcid.org/0000-0002-3693-6973
authorships[3].author.display_name	Dilip K. Prasad
authorships[3].author_position	last
authorships[3].raw_author_name	Prasad, Dilip K.
authorships[3].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2405.05777
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2024-05-11T00:00:00
display_name	Towards a More Inclusive AI: Progress and Perspectives in Large Language Model Training for the Sámi Language
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10181
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.9523000121116638
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Natural Language Processing Techniques
related_works	https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W230091440, https://openalex.org/W2390279801, https://openalex.org/W2233261550, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2810751659, https://openalex.org/W258997015, https://openalex.org/W2376932109
cited_by_count	1
counts_by_year[0].year	2024
counts_by_year[0].cited_by_count	1
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2405.05777
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2405.05777
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2405.05777
primary_location.id	pmh:oai:arXiv.org:2405.05777
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2405.05777
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2405.05777
publication_date	2024-05-09
publication_year	2024
referenced_works_count	0
abstract_inverted_index.a	174, 231
abstract_inverted_index.AI	109
abstract_inverted_index.In	181
abstract_inverted_index.ML	44
abstract_inverted_index.To	158
abstract_inverted_index.We	38, 212, 224
abstract_inverted_index.an	1
abstract_inverted_index.at	204
abstract_inverted_index.by	87
abstract_inverted_index.in	249, 275
abstract_inverted_index.is	69, 78, 137, 257
abstract_inverted_index.it	130, 136
abstract_inverted_index.of	16, 42, 51, 65, 120, 156, 187, 201, 207, 218, 278
abstract_inverted_index.on	30, 261
abstract_inverted_index.to	12, 96, 117, 132, 139, 147, 172, 183
abstract_inverted_index.we	161, 195
abstract_inverted_index.LLM	189, 220
abstract_inverted_index.Low	53
abstract_inverted_index.ULR	57, 145, 192
abstract_inverted_index.and	18, 72, 150, 154
abstract_inverted_index.are	59, 83
abstract_inverted_index.due	11, 95
abstract_inverted_index.few	127
abstract_inverted_index.for	23, 34, 61, 76, 102, 143, 177, 222, 265
abstract_inverted_index.has	113
abstract_inverted_index.its	24
abstract_inverted_index.not	85
abstract_inverted_index.the	13, 35, 40, 43, 47, 63, 73, 151, 164, 170, 185, 205, 216, 227, 258, 262, 272, 276
abstract_inverted_index.use	271
abstract_inverted_index.web	171
abstract_inverted_index.This	27
abstract_inverted_index.also	79, 84, 214
abstract_inverted_index.data	17, 101
abstract_inverted_index.draw	39
abstract_inverted_index.end,	160
abstract_inverted_index.even	105
abstract_inverted_index.find	133
abstract_inverted_index.from	169, 255
abstract_inverted_index.hard	131
abstract_inverted_index.have	125, 162, 196, 213
abstract_inverted_index.high	246
abstract_inverted_index.less	115
abstract_inverted_index.like	93
abstract_inverted_index.low,	71
abstract_inverted_index.low.	81
abstract_inverted_index.more	106
abstract_inverted_index.than	238, 253
abstract_inverted_index.that	226, 270
abstract_inverted_index.them	77, 103
abstract_inverted_index.this	118, 159
abstract_inverted_index.very	70, 80, 126
abstract_inverted_index.with	191, 198, 245
abstract_inverted_index.work	28
abstract_inverted_index.(ULR)	55
abstract_inverted_index.LLMs,	202
abstract_inverted_index.LLMs.	157
abstract_inverted_index.Large	89
abstract_inverted_index.Sámi	36, 166, 263
abstract_inverted_index.ULRLs	82
abstract_inverted_index.Ultra	52
abstract_inverted_index.clean	175
abstract_inverted_index.count	75
abstract_inverted_index.faces	8
abstract_inverted_index.field	277
abstract_inverted_index.first	259
abstract_inverted_index.found	225
abstract_inverted_index.given	114
abstract_inverted_index.group	4
abstract_inverted_index.joint	239
abstract_inverted_index.kinds	200
abstract_inverted_index.model	111
abstract_inverted_index.order	182, 206
abstract_inverted_index.seven	209
abstract_inverted_index.study	184, 260
abstract_inverted_index.them.	134
abstract_inverted_index.these	123, 144
abstract_inverted_index.those	60
abstract_inverted_index.under	230
abstract_inverted_index.which	62, 97
abstract_inverted_index.$\sim$	208
abstract_inverted_index.(LLMs)	92
abstract_inverted_index.(NLP).	282
abstract_inverted_index.Models	91
abstract_inverted_index.Sámi,	0
abstract_inverted_index.ULRLs.	223
abstract_inverted_index.amount	64
abstract_inverted_index.better	237, 252
abstract_inverted_index.create	173
abstract_inverted_index.effect	217
abstract_inverted_index.impact	155
abstract_inverted_index.latest	273
abstract_inverted_index.mainly	203
abstract_inverted_index.making	129
abstract_inverted_index.models	21, 142, 190, 229, 269
abstract_inverted_index.modern	188
abstract_inverted_index.becomes	104
abstract_inverted_index.billion	210
abstract_inverted_index.dataset	176
abstract_inverted_index.develop	140
abstract_inverted_index.digital	9
abstract_inverted_index.focuses	29
abstract_inverted_index.limited	14
abstract_inverted_index.models.	180
abstract_inverted_index.natural	279
abstract_inverted_index.perform	236
abstract_inverted_index.problem	50
abstract_inverted_index.promote	148
abstract_inverted_index.speaker	74
abstract_inverted_index.textual	67
abstract_inverted_index.towards	46
abstract_inverted_index.whereas	242
abstract_inverted_index.(Sámi),	194
abstract_inverted_index.ChatGPT,	94
abstract_inverted_index.However,	135
abstract_inverted_index.Language	90
abstract_inverted_index.Resource	54
abstract_inverted_index.adapting	266
abstract_inverted_index.behavior	186
abstract_inverted_index.category	119
abstract_inverted_index.compiled	163
abstract_inverted_index.designed	22
abstract_inverted_index.explored	215
abstract_inverted_index.general,	250
abstract_inverted_index.language	3, 20, 48, 167, 179, 264, 268, 280
abstract_inverted_index.modeling	49
abstract_inverted_index.multiple	6
abstract_inverted_index.overlap,	248
abstract_inverted_index.performs	251
abstract_inverted_index.scenario	235
abstract_inverted_index.semantic	247
abstract_inverted_index.tangible	152
abstract_inverted_index.training	100, 178, 221, 234, 244, 254
abstract_inverted_index.abilities	153
abstract_inverted_index.attention	41, 116
abstract_inverted_index.available	66, 165
abstract_inverted_index.community	45
abstract_inverted_index.different	199
abstract_inverted_index.gathering	98
abstract_inverted_index.important	138
abstract_inverted_index.inclusion	149
abstract_inverted_index.language.	37
abstract_inverted_index.languages	58, 124, 146, 193
abstract_inverted_index.resources	68, 168
abstract_inverted_index.speakers,	128
abstract_inverted_index.supported	86
abstract_inverted_index.training,	241
abstract_inverted_index.Generally,	122
abstract_inverted_index.Mainstream	108
abstract_inverted_index.artificial	99
abstract_inverted_index.comprising	5
abstract_inverted_index.increasing	31
abstract_inverted_index.indigenous	2
abstract_inverted_index.languages,	7
abstract_inverted_index.languages.	56, 121
abstract_inverted_index.linguistic	25
abstract_inverted_index.mainstream	88
abstract_inverted_index.processing	281
abstract_inverted_index.sequential	232
abstract_inverted_index.development	112
abstract_inverted_index.parameters.	211
abstract_inverted_index.availability	15
abstract_inverted_index.challenging.	107
abstract_inverted_index.decoder-only	228
abstract_inverted_index.developments	274
abstract_inverted_index.experimented	197
abstract_inverted_index.foundational	110, 141
abstract_inverted_index.intricacies.	26
abstract_inverted_index.multilingual	219, 233, 240, 243
abstract_inverted_index.scratch.This	256
abstract_inverted_index.participation	33
abstract_inverted_index.sophisticated	19
abstract_inverted_index.technological	32
abstract_inverted_index.marginalization	10
abstract_inverted_index.non-statistical	267
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	4
citation_normalized_percentile
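
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such rows back into running text follows, assuming each row is a key and a comma-separated position list separated by a tab, as displayed here. The helper names `parse_flattened` and `rebuild_abstract` are hypothetical, and the sample rows cover only positions 0 to 10.

```python
PREFIX = "abstract_inverted_index."


def parse_flattened(lines):
    """Turn 'abstract_inverted_index.<token>\\t<positions>' rows into a dict."""
    index = {}
    for line in lines:
        key, _, value = line.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue  # skip unrelated keys and keys without a value
        token = key[len(PREFIX):]
        index[token] = [int(p) for p in value.split(",")]
    return index


def rebuild_abstract(index):
    """Place every token at each of its positions and join in order."""
    by_position = {}
    for token, positions in index.items():
        for pos in positions:
            by_position[pos] = token
    return " ".join(by_position[pos] for pos in sorted(by_position))


# First eleven tokens of the abstract, copied from the payload rows.
rows = [
    "abstract_inverted_index.an\t1",
    "abstract_inverted_index.faces\t8",
    "abstract_inverted_index.group\t4",
    "abstract_inverted_index.Sámi,\t0",
    "abstract_inverted_index.digital\t9",
    "abstract_inverted_index.language\t3",
    "abstract_inverted_index.multiple\t6",
    "abstract_inverted_index.comprising\t5",
    "abstract_inverted_index.indigenous\t2",
    "abstract_inverted_index.languages,\t7",
    "abstract_inverted_index.marginalization\t10",
]

print(rebuild_abstract(parse_flattened(rows)))
# Sámi, an indigenous language group comprising multiple languages, faces digital marginalization
```

Feeding in all of the `abstract_inverted_index.*` rows instead of this subset should reproduce the full abstract, including the fused token `scratch.This` exactly as it is indexed.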