Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2508.11953
Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
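To make the idea of "choosing mixture weights by minimizing a parametrized validation loss" concrete, here is a minimal sketch. It is not the authors' algorithm: this record does not give the paper's parametrization, so the power-law loss form, the linear `transfer` matrix for effective data, and every function name below are assumptions chosen for illustration. The search itself is a plain grid search over the weight simplex, the baseline the abstract compares against.

```python
import itertools


def domain_loss(effective_tokens, scale, exponent, floor):
    """Assumed power-law form: loss = scale * D^(-exponent) + floor."""
    return scale * effective_tokens ** (-exponent) + floor


def total_loss(weights, budget, params, transfer):
    """Sum of per-domain losses for one mixture.

    Hypothetical transfer model: the effective data seen by domain i is
    sum_j transfer[i][j] * weights[j] * budget.
    """
    loss = 0.0
    for i, (scale, exponent, floor) in enumerate(params):
        effective = sum(transfer[i][j] * w * budget for j, w in enumerate(weights))
        # Clamp to one token so a zero weight does not raise on 0 ** negative.
        loss += domain_loss(max(effective, 1.0), scale, exponent, floor)
    return loss


def simplex_grid(num_domains, steps):
    """Yield every weight vector with entries k/steps that sums to 1."""
    for combo in itertools.product(range(steps + 1), repeat=num_domains - 1):
        if sum(combo) <= steps:
            last = steps - sum(combo)
            yield tuple(k / steps for k in combo) + (last / steps,)


def best_mixture(budget, params, transfer, steps=20):
    """Return the grid point with the lowest assumed total loss."""
    return min(
        simplex_grid(len(params), steps),
        key=lambda w: total_loss(w, budget, params, transfer),
    )


# Two symmetric domains with no cross-domain transfer (illustrative numbers).
params = [(5.0, 0.3, 1.0), (5.0, 0.3, 1.0)]
transfer = [[1.0, 0.0], [0.0, 1.0]]
print(best_mixture(1_000_000, params, transfer))
```

With symmetric domains and no transfer, the sketch picks an even split. In practice the per-domain parameters would be fitted from small-scale runs, as the abstract describes, rather than set by hand.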
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2508.11953
- PDF: https://arxiv.org/pdf/2508.11953
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414459914
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414459914
- DOI: https://doi.org/10.48550/arxiv.2508.11953
- Title: Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models
- Type: preprint
- Language: en
- Publication year: 2025
- Publication date: 2025-08-16
- Authors: Yuan Li, Zhengzhong Liu, Eric P. Xing
- Landing page: https://arxiv.org/abs/2508.11953
- PDF URL: https://arxiv.org/pdf/2508.11953
- Open access: Yes
- OA status: green
- OA URL: https://arxiv.org/pdf/2508.11953
- Cited by: 0
Full payload
| id | https://openalex.org/W4414459914 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2508.11953 |
| ids.doi | https://doi.org/10.48550/arxiv.2508.11953 |
| ids.openalex | https://openalex.org/W4414459914 |
| fwci | |
| type | preprint |
| title | Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.7612000107765198 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.7595999836921692 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2508.11953 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2508.11953 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2508.11953 |
| locations[1].id | doi:10.48550/arxiv.2508.11953 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2508.11953 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100409804 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-3567-2101 |
| authorships[0].author.display_name | Yuan Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Yuan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5029648803 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Zhengzhong Liu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Liu, Zhengzhong |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5009547049 |
| authorships[2].author.orcid | https://orcid.org/0009-0005-9158-4201 |
| authorships[2].author.display_name | Eric P. Xing |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Xing, Eric |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2508.11953 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.7612000107765198 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2508.11953 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2508.11953 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2508.11953 |
| primary_location.id | pmh:oai:arXiv.org:2508.11953 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2508.11953 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2508.11953 |
| publication_date | 2025-08-16 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 36 |
| abstract_inverted_index.By | 60 |
| abstract_inverted_index.In | 23 |
| abstract_inverted_index.We | 76 |
| abstract_inverted_index.an | 31 |
| abstract_inverted_index.as | 30 |
| abstract_inverted_index.by | 49 |
| abstract_inverted_index.is | 12 |
| abstract_inverted_index.of | 7 |
| abstract_inverted_index.on | 110, 135 |
| abstract_inverted_index.to | 40, 163 |
| abstract_inverted_index.we | 26, 67, 100, 138, 156 |
| abstract_inverted_index.Our | 44 |
| abstract_inverted_index.SFT | 143 |
| abstract_inverted_index.all | 95 |
| abstract_inverted_index.and | 34, 54, 71, 81, 91, 152, 170 |
| abstract_inverted_index.can | 161 |
| abstract_inverted_index.fit | 68 |
| abstract_inverted_index.for | 3, 14, 58, 167 |
| abstract_inverted_index.how | 158 |
| abstract_inverted_index.our | 86, 106, 146, 159 |
| abstract_inverted_index.par | 111 |
| abstract_inverted_index.the | 47, 73, 128 |
| abstract_inverted_index.via | 118 |
| abstract_inverted_index.yet | 18 |
| abstract_inverted_index.SFT. | 174 |
| abstract_inverted_index.area | 20 |
| abstract_inverted_index.best | 129 |
| abstract_inverted_index.both | 78, 149 |
| abstract_inverted_index.data | 1, 28, 52, 65, 165 |
| abstract_inverted_index.from | 132 |
| abstract_inverted_index.grid | 119, 133 |
| abstract_inverted_index.into | 173 |
| abstract_inverted_index.laws | 57 |
| abstract_inverted_index.loss | 48, 123, 131, 151 |
| abstract_inverted_index.only | 124 |
| abstract_inverted_index.show | 101, 139 |
| abstract_inverted_index.than | 127 |
| abstract_inverted_index.that | 85, 102, 140 |
| abstract_inverted_index.this | 19, 24 |
| abstract_inverted_index.with | 62, 105, 112, 121 |
| abstract_inverted_index.(SFT) | 6 |
| abstract_inverted_index.0.66% | 125 |
| abstract_inverted_index.frame | 27 |
| abstract_inverted_index.guide | 164 |
| abstract_inverted_index.large | 8 |
| abstract_inverted_index.loss. | 43 |
| abstract_inverted_index.novel | 37 |
| abstract_inverted_index.these | 69 |
| abstract_inverted_index.those | 113 |
| abstract_inverted_index.using | 114, 145 |
| abstract_inverted_index.(LLMs) | 11 |
| abstract_inverted_index.across | 94 |
| abstract_inverted_index.derive | 72 |
| abstract_inverted_index.domain | 130 |
| abstract_inverted_index.higher | 126 |
| abstract_inverted_index.method | 38, 147, 160 |
| abstract_inverted_index.mixing | 29 |
| abstract_inverted_index.models | 10, 103, 169 |
| abstract_inverted_index.paper, | 25 |
| abstract_inverted_index.proofs | 80 |
| abstract_inverted_index.search | 134 |
| abstract_inverted_index.Through | 97 |
| abstract_inverted_index.discuss | 157 |
| abstract_inverted_index.models, | 17 |
| abstract_inverted_index.optimal | 74, 115 |
| abstract_inverted_index.overall | 90 |
| abstract_inverted_index.perform | 109 |
| abstract_inverted_index.popular | 142 |
| abstract_inverted_index.problem | 33 |
| abstract_inverted_index.provide | 77, 171 |
| abstract_inverted_index.remains | 21 |
| abstract_inverted_index.results | 83 |
| abstract_inverted_index.scaling | 56 |
| abstract_inverted_index.search, | 120 |
| abstract_inverted_index.trained | 104 |
| abstract_inverted_index.various | 63 |
| abstract_inverted_index.weights | 108, 116 |
| abstract_inverted_index.Finally, | 155 |
| abstract_inverted_index.achieves | 88 |
| abstract_inverted_index.approach | 45 |
| abstract_inverted_index.average. | 136 |
| abstract_inverted_index.critical | 13 |
| abstract_inverted_index.datasets | 144 |
| abstract_inverted_index.designed | 39 |
| abstract_inverted_index.domains. | 96 |
| abstract_inverted_index.improves | 148 |
| abstract_inverted_index.insights | 172 |
| abstract_inverted_index.language | 9 |
| abstract_inverted_index.minimize | 41 |
| abstract_inverted_index.mixtures | 2 |
| abstract_inverted_index.modeling | 50 |
| abstract_inverted_index.weights. | 75 |
| abstract_inverted_index.algorithm | 87 |
| abstract_inverted_index.effective | 51 |
| abstract_inverted_index.empirical | 82 |
| abstract_inverted_index.excellent | 89 |
| abstract_inverted_index.introduce | 35 |
| abstract_inverted_index.mixtures, | 66 |
| abstract_inverted_index.optimized | 107 |
| abstract_inverted_index.selection | 166 |
| abstract_inverted_index.Optimizing | 0 |
| abstract_inverted_index.controlled | 98 |
| abstract_inverted_index.determined | 117 |
| abstract_inverted_index.developing | 15 |
| abstract_inverted_index.downstream | 153 |
| abstract_inverted_index.generalize | 162 |
| abstract_inverted_index.individual | 92 |
| abstract_inverted_index.leveraging | 55 |
| abstract_inverted_index.parameters | 70 |
| abstract_inverted_index.per-domain | 122 |
| abstract_inverted_index.supervised | 4 |
| abstract_inverted_index.validation | 42, 150 |
| abstract_inverted_index.fine-tuning | 5 |
| abstract_inverted_index.performance | 93 |
| abstract_inverted_index.reweighting | 141 |
| abstract_inverted_index.small-scale | 64 |
| abstract_inverted_index.transferred | 53 |
| abstract_inverted_index.experiments, | 99 |
| abstract_inverted_index.fine-tuning. | 59 |
| abstract_inverted_index.mathematical | 79 |
| abstract_inverted_index.optimization | 32 |
| abstract_inverted_index.parametrizes | 46 |
| abstract_inverted_index.performance. | 154 |
| abstract_inverted_index.Additionally, | 137 |
| abstract_inverted_index.demonstrating | 84 |
| abstract_inverted_index.experimenting | 61 |
| abstract_inverted_index.underexplored. | 22 |
| abstract_inverted_index.domain-specific | 168 |
| abstract_inverted_index.general-purpose | 16 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
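
The `abstract_inverted_index.*` rows above map each word to the positions where it occurs in the abstract. As a minimal sketch, assuming the index has been loaded into a Python dict of word to position list, the text can be rebuilt by sorting words by position. The function name `rebuild_abstract` is illustrative, and the sample holds only positions 0 through 7 copied from the table.

```python
def rebuild_abstract(inverted_index):
    """Turn {word: [positions]} back into the ordered word sequence."""
    positions = {}
    for word, indexes in inverted_index.items():
        for index in indexes:
            positions[index] = word
    # Join the words in position order; gaps in the index are skipped.
    return " ".join(positions[i] for i in sorted(positions))


# Positions 0-7 taken from the payload table (later positions omitted).
sample = {
    "Optimizing": [0],
    "data": [1],
    "mixtures": [2],
    "for": [3],
    "supervised": [4],
    "fine-tuning": [5],
    "(SFT)": [6],
    "of": [7],
}
print(rebuild_abstract(sample))
# prints "Optimizing data mixtures for supervised fine-tuning (SFT) of"
```

A word that appears more than once simply lists several positions, for example `on` at 110 and 135, and the same loop places it at each one.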