S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2503.23007

Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.
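
To make the routing step described in the abstract concrete, here is a minimal sketch of generic Top-K gating in plain Python. It is not the S2MoE method, and the abstract does not specify the router. The function names (`softmax`, `top_k_route`), the dot-product scoring, and the toy embeddings are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(token, expert_embeddings, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Scores are dot products between the token and each expert embedding.
    This is an assumed, generic router, not one taken from the paper.
    """
    scores = [sum(t * e for t, e in zip(token, emb)) for emb in expert_embeddings]
    # Indices of the k largest scores, highest first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalise the gate weights over the selected experts only.
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))

# Toy example: 4 experts with 2-dimensional embeddings (made-up values).
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
routes = top_k_route([2.0, 1.0], experts, k=2)
for index, weight in routes:
    print(f"expert {index}: weight {weight:.3f}")
```

In this toy setup only two of the four experts would run for the token, which is the source of the compute savings in sparse routing. The second limitation named in the abstract also shows up here: the same few high-scoring experts can keep receiving similar inputs.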

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2503.23007
PDF: https://arxiv.org/pdf/2503.23007
OA Status: green
OpenAlex ID: https://openalex.org/W4417068622

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4417068622
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2503.23007
Digital Object Identifier

Title: S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2025
Year of publication

Publication date: 2025-03-29
Full publication date if available

Authors: Giang Do, Hung M. Le, Truyen Tran
List of authors in order

Landing page: https://arxiv.org/abs/2503.23007
Publisher landing page

PDF URL: https://arxiv.org/pdf/2503.23007
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2503.23007
Direct OA link when available

Cited by: 0
Total citation count in OpenAlex

Full payload

id	https://openalex.org/W4417068622
doi	https://doi.org/10.48550/arxiv.2503.23007
ids.doi	https://doi.org/10.48550/arxiv.2503.23007
ids.openalex	https://openalex.org/W4417068622
fwci
type	preprint
title	S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
is_xpac	False
apc_list
apc_paid
language	en
locations[0].id	pmh:oai:arXiv.org:2503.23007
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2503.23007
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2503.23007
locations[1].id	doi:10.48550/arxiv.2503.23007
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2503.23007
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5035809896
authorships[0].author.orcid
authorships[0].author.display_name	Giang Do
authorships[0].author_position	first
authorships[0].raw_author_name	Do, Giang
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5038017332
authorships[1].author.orcid	https://orcid.org/0000-0003-4060-9008
authorships[1].author.display_name	Hung M. Le
authorships[1].author_position	middle
authorships[1].raw_author_name	Le, Hung
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5085471517
authorships[2].author.orcid	https://orcid.org/0000-0001-6531-8907
authorships[2].author.display_name	Truyen Tran
authorships[2].author_position	last
authorships[2].raw_author_name	Tran, Truyen
authorships[2].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2503.23007
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2025-10-10T00:00:00
display_name	S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
has_fulltext	False
is_retracted	False
updated_date	2025-12-06T10:42:06.643673
primary_topic
cited_by_count	0
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2503.23007
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2503.23007
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2503.23007
primary_location.id	pmh:oai:arXiv.org:2503.23007
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2503.23007
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2503.23007
publication_date	2025-03-29
publication_year	2025
referenced_works_count	0
abstract_inverted_index.a	17, 89, 104
abstract_inverted_index.In	84
abstract_inverted_index.by	12, 141
abstract_inverted_index.is	103
abstract_inverted_index.of	2, 8, 20, 31, 96, 106
abstract_inverted_index.on	38
abstract_inverted_index.to	16, 28, 42, 64, 72, 79, 109, 132
abstract_inverted_index.we	87
abstract_inverted_index.(1)	53
abstract_inverted_index.(2)	68
abstract_inverted_index.and	67, 114
abstract_inverted_index.are	56
abstract_inverted_index.but	46
abstract_inverted_index.can	76
abstract_inverted_index.due	27
abstract_inverted_index.key	51
abstract_inverted_index.the	29, 40, 60, 73
abstract_inverted_index.two	50
abstract_inverted_index.via	98, 117
abstract_inverted_index.28%.	142
abstract_inverted_index.SMoE	24
abstract_inverted_index.both	112
abstract_inverted_index.each	70
abstract_inverted_index.face	49
abstract_inverted_index.from	111
abstract_inverted_index.have	36
abstract_inverted_index.than	59
abstract_inverted_index.that	127
abstract_inverted_index.them	78
abstract_inverted_index.this	44, 85
abstract_inverted_index.S2MoE	128
abstract_inverted_index.Top-K	74
abstract_inverted_index.cause	77
abstract_inverted_index.costs	140
abstract_inverted_index.input	14, 71
abstract_inverted_index.issue	30
abstract_inverted_index.large	9
abstract_inverted_index.learn	80, 110
abstract_inverted_index.novel	90
abstract_inverted_index.other	133
abstract_inverted_index.tasks	125
abstract_inverted_index.under	119
abstract_inverted_index.which	102
abstract_inverted_index.while	136
abstract_inverted_index.work,	86
abstract_inverted_index.(SMoE)	4
abstract_inverted_index.Recent	34
abstract_inverted_index.Robust	93
abstract_inverted_index.Sparse	0, 94
abstract_inverted_index.across	123
abstract_inverted_index.called	92
abstract_inverted_index.expert	54
abstract_inverted_index.inputs	116
abstract_inverted_index.models	11
abstract_inverted_index.number	19
abstract_inverted_index.overly	81
abstract_inverted_index.router	41
abstract_inverted_index.select	18
abstract_inverted_index.tokens	15
abstract_inverted_index.Experts	3, 97
abstract_inverted_index.Mixture	1, 95
abstract_inverted_index.enables	5
abstract_inverted_index.experts	75, 107
abstract_inverted_index.focused	37
abstract_inverted_index.methods	135
abstract_inverted_index.mixture	105
abstract_inverted_index.model's	61
abstract_inverted_index.propose	88
abstract_inverted_index.remains	25
abstract_inverted_index.routing	13, 69, 134
abstract_inverted_index.similar	82
abstract_inverted_index.smaller	58
abstract_inverted_index.studies	35
abstract_inverted_index.various	124
abstract_inverted_index.(S2MoE),	101
abstract_inverted_index.However,	22
abstract_inverted_index.Learning	100, 118
abstract_inverted_index.achieves	129
abstract_inverted_index.approach	91
abstract_inverted_index.designed	108
abstract_inverted_index.existing	47
abstract_inverted_index.experts.	21
abstract_inverted_index.language	10
abstract_inverted_index.mitigate	43
abstract_inverted_index.problem,	45
abstract_inverted_index.reducing	137
abstract_inverted_index.training	7, 23
abstract_inverted_index.Extensive	121
abstract_inverted_index.collapse,	66
abstract_inverted_index.collapse.	33
abstract_inverted_index.efficient	6
abstract_inverted_index.features.	83
abstract_inverted_index.improving	39
abstract_inverted_index.inference	139
abstract_inverted_index.Stochastic	99
abstract_inverted_index.approaches	48
abstract_inverted_index.comparable	131
abstract_inverted_index.dimension,	62
abstract_inverted_index.embeddings	55
abstract_inverted_index.challenging	26
abstract_inverted_index.demonstrate	126
abstract_inverted_index.experiments	122
abstract_inverted_index.performance	130
abstract_inverted_index.Uncertainty.	120
abstract_inverted_index.contributing	63
abstract_inverted_index.limitations:	52
abstract_inverted_index.computational	138
abstract_inverted_index.deterministic	113
abstract_inverted_index.significantly	57
abstract_inverted_index.representation	32, 65
abstract_inverted_index.non-deterministic	115
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	3
citation_normalized_percentile
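
The `abstract_inverted_index.*` rows in the payload above store the abstract as a mapping from each word to the positions where it occurs. The sketch below shows one way to turn such rows back into running text. The helper names (`parse_payload_line`, `rebuild_abstract`) are hypothetical, and it assumes each row is the key and the position list separated by a single tab, as the rows appear here.

```python
PREFIX = "abstract_inverted_index."

def parse_payload_line(line):
    """Split one payload row into (word, [positions]).

    Assumes the row looks like 'abstract_inverted_index.<word>\t<p1>, <p2>, ...'.
    """
    key, _, value = line.partition("\t")
    word = key[len(PREFIX):]
    positions = [int(p) for p in value.split(",")]
    return word, positions

def rebuild_abstract(inverted_index):
    """Rebuild text from a {word: [positions]} mapping by sorting on position."""
    positioned = []
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, word))
    positioned.sort()
    return " ".join(word for _, word in positioned)

# Rows copied from the payload, keeping only positions 0-7 for brevity.
rows = [
    "abstract_inverted_index.of\t2",
    "abstract_inverted_index.(SMoE)\t4",
    "abstract_inverted_index.Sparse\t0",
    "abstract_inverted_index.Experts\t3",
    "abstract_inverted_index.Mixture\t1",
    "abstract_inverted_index.enables\t5",
    "abstract_inverted_index.training\t7",
    "abstract_inverted_index.efficient\t6",
]
index = dict(parse_payload_line(row) for row in rows)
print(rebuild_abstract(index))
# prints: Sparse Mixture of Experts (SMoE) enables efficient training
```

Applied to the full set of rows, the same function should reproduce the abstract shown near the top of this record, since every position from 0 to 142 appears exactly once in the index.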