S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2503.23007
Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model's dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.
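To make the Top-K routing step mentioned in the abstract concrete, here is a minimal sketch in plain Python. It shows generic Top-K routing only, not the S2MoE method, whose details are not part of this record. The dot-product router, the softmax over the selected scores, and the names `top_k_route`, `smoe_forward`, `expert_embeddings`, and `experts` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(token, expert_embeddings, k=2):
    """Return [(expert_index, gate_weight), ...] for the k best-scoring experts.

    Assumption: the router scores each expert by a dot product between the
    token vector and that expert's embedding.
    """
    scores = [sum(t * e for t, e in zip(token, emb)) for emb in expert_embeddings]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])  # renormalise over the chosen experts
    return list(zip(top, gates))


def smoe_forward(token, expert_embeddings, experts, k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(token, expert_embeddings, k):
        expert_out = experts[idx](token)  # only k experts are evaluated
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out


# Hypothetical toy setup: four experts that each scale the input differently.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

routes = top_k_route([1.0, 0.0], toy_embeddings, k=2)
print([idx for idx, _ in routes])  # indices of the two selected experts
print(smoe_forward([1.0, 0.0], toy_embeddings, toy_experts, k=2))
```

Because only the `k` selected experts run for each token, compute per token stays roughly constant as the total number of experts grows. That is the efficiency argument the abstract refers to.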
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2503.23007, https://arxiv.org/pdf/2503.23007
- OA Status: green
- OpenAlex ID: https://openalex.org/W4417068622
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417068622 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2503.23007 (Digital Object Identifier)
- Title: S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-03-29 (full publication date if available)
- Authors: Giang Do, Hung M. Le, Truyen Tran (list of authors in order)
- Landing page: https://arxiv.org/abs/2503.23007 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2503.23007 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2503.23007 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4417068622 |
| doi | https://doi.org/10.48550/arxiv.2503.23007 |
| ids.doi | https://doi.org/10.48550/arxiv.2503.23007 |
| ids.openalex | https://openalex.org/W4417068622 |
| fwci | |
| type | preprint |
| title | S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2503.23007 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2503.23007 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2503.23007 |
| locations[1].id | doi:10.48550/arxiv.2503.23007 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2503.23007 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5035809896 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Giang Do |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Do, Giang |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5038017332 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-4060-9008 |
| authorships[1].author.display_name | Hung M. Le |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Le, Hung |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5085471517 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-6531-8907 |
| authorships[2].author.display_name | Truyen Tran |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Tran, Truyen |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2503.23007 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-06T10:42:06.643673 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2503.23007 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2503.23007 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2503.23007 |
| primary_location.id | pmh:oai:arXiv.org:2503.23007 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2503.23007 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2503.23007 |
| publication_date | 2025-03-29 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 17, 89, 104 |
| abstract_inverted_index.In | 84 |
| abstract_inverted_index.by | 12, 141 |
| abstract_inverted_index.is | 103 |
| abstract_inverted_index.of | 2, 8, 20, 31, 96, 106 |
| abstract_inverted_index.on | 38 |
| abstract_inverted_index.to | 16, 28, 42, 64, 72, 79, 109, 132 |
| abstract_inverted_index.we | 87 |
| abstract_inverted_index.(1) | 53 |
| abstract_inverted_index.(2) | 68 |
| abstract_inverted_index.and | 67, 114 |
| abstract_inverted_index.are | 56 |
| abstract_inverted_index.but | 46 |
| abstract_inverted_index.can | 76 |
| abstract_inverted_index.due | 27 |
| abstract_inverted_index.key | 51 |
| abstract_inverted_index.the | 29, 40, 60, 73 |
| abstract_inverted_index.two | 50 |
| abstract_inverted_index.via | 98, 117 |
| abstract_inverted_index.28%. | 142 |
| abstract_inverted_index.SMoE | 24 |
| abstract_inverted_index.both | 112 |
| abstract_inverted_index.each | 70 |
| abstract_inverted_index.face | 49 |
| abstract_inverted_index.from | 111 |
| abstract_inverted_index.have | 36 |
| abstract_inverted_index.than | 59 |
| abstract_inverted_index.that | 127 |
| abstract_inverted_index.them | 78 |
| abstract_inverted_index.this | 44, 85 |
| abstract_inverted_index.S2MoE | 128 |
| abstract_inverted_index.Top-K | 74 |
| abstract_inverted_index.cause | 77 |
| abstract_inverted_index.costs | 140 |
| abstract_inverted_index.input | 14, 71 |
| abstract_inverted_index.issue | 30 |
| abstract_inverted_index.large | 9 |
| abstract_inverted_index.learn | 80, 110 |
| abstract_inverted_index.novel | 90 |
| abstract_inverted_index.other | 133 |
| abstract_inverted_index.tasks | 125 |
| abstract_inverted_index.under | 119 |
| abstract_inverted_index.which | 102 |
| abstract_inverted_index.while | 136 |
| abstract_inverted_index.work, | 86 |
| abstract_inverted_index.(SMoE) | 4 |
| abstract_inverted_index.Recent | 34 |
| abstract_inverted_index.Robust | 93 |
| abstract_inverted_index.Sparse | 0, 94 |
| abstract_inverted_index.across | 123 |
| abstract_inverted_index.called | 92 |
| abstract_inverted_index.expert | 54 |
| abstract_inverted_index.inputs | 116 |
| abstract_inverted_index.models | 11 |
| abstract_inverted_index.number | 19 |
| abstract_inverted_index.overly | 81 |
| abstract_inverted_index.router | 41 |
| abstract_inverted_index.select | 18 |
| abstract_inverted_index.tokens | 15 |
| abstract_inverted_index.Experts | 3, 97 |
| abstract_inverted_index.Mixture | 1, 95 |
| abstract_inverted_index.enables | 5 |
| abstract_inverted_index.experts | 75, 107 |
| abstract_inverted_index.focused | 37 |
| abstract_inverted_index.methods | 135 |
| abstract_inverted_index.mixture | 105 |
| abstract_inverted_index.model's | 61 |
| abstract_inverted_index.propose | 88 |
| abstract_inverted_index.remains | 25 |
| abstract_inverted_index.routing | 13, 69, 134 |
| abstract_inverted_index.similar | 82 |
| abstract_inverted_index.smaller | 58 |
| abstract_inverted_index.studies | 35 |
| abstract_inverted_index.various | 124 |
| abstract_inverted_index.(S2MoE), | 101 |
| abstract_inverted_index.However, | 22 |
| abstract_inverted_index.Learning | 100, 118 |
| abstract_inverted_index.achieves | 129 |
| abstract_inverted_index.approach | 91 |
| abstract_inverted_index.designed | 108 |
| abstract_inverted_index.existing | 47 |
| abstract_inverted_index.experts. | 21 |
| abstract_inverted_index.language | 10 |
| abstract_inverted_index.mitigate | 43 |
| abstract_inverted_index.problem, | 45 |
| abstract_inverted_index.reducing | 137 |
| abstract_inverted_index.training | 7, 23 |
| abstract_inverted_index.Extensive | 121 |
| abstract_inverted_index.collapse, | 66 |
| abstract_inverted_index.collapse. | 33 |
| abstract_inverted_index.efficient | 6 |
| abstract_inverted_index.features. | 83 |
| abstract_inverted_index.improving | 39 |
| abstract_inverted_index.inference | 139 |
| abstract_inverted_index.Stochastic | 99 |
| abstract_inverted_index.approaches | 48 |
| abstract_inverted_index.comparable | 131 |
| abstract_inverted_index.dimension, | 62 |
| abstract_inverted_index.embeddings | 55 |
| abstract_inverted_index.challenging | 26 |
| abstract_inverted_index.demonstrate | 126 |
| abstract_inverted_index.experiments | 122 |
| abstract_inverted_index.performance | 130 |
| abstract_inverted_index.Uncertainty. | 120 |
| abstract_inverted_index.contributing | 63 |
| abstract_inverted_index.limitations: | 52 |
| abstract_inverted_index.computational | 138 |
| abstract_inverted_index.deterministic | 113 |
| abstract_inverted_index.significantly | 57 |
| abstract_inverted_index.representation | 32, 65 |
| abstract_inverted_index.non-deterministic | 115 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
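The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each word to the positions where it occurs. As a minimal sketch, the function below rebuilds readable text from such a map. The name `rebuild_abstract` is hypothetical. The sample copies a few rows from the table and keeps only positions below 10 so that the fragment is contiguous.

```python
def rebuild_abstract(inverted_index):
    """Turn a {word: [positions]} map back into text in original word order."""
    positioned = []
    for word, positions in inverted_index.items():
        for pos in positions:
            positioned.append((pos, word))
    positioned.sort()  # order by position in the abstract
    return " ".join(word for _, word in positioned)


# Rows taken from the payload table, truncated to positions 0-9.
sample = {
    "of": [2, 8],
    "large": [9],
    "(SMoE)": [4],
    "Sparse": [0],
    "Experts": [3],
    "Mixture": [1],
    "enables": [5],
    "training": [7],
    "efficient": [6],
}

print(rebuild_abstract(sample))
# → Sparse Mixture of Experts (SMoE) enables efficient training of large
```

A word that occurs more than once, such as `of`, simply appears once per listed position, so the same function works on the full index.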