MixMin: Finding Data Mixtures via Convex Minimization
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2502.10510
Modern machine learning pipelines increasingly combine and mix data from diverse and disparate sources, e.g., when pre-training large language models. Yet finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting in a 1-5% relative improvement in negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements in average precision scores of 0.03-0.15.
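The abstract describes a gradient-based optimizer for a convex objective over mixture weights but gives no formula. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes (this is an assumption, not stated in the abstract) that the objective is the negative log likelihood of a weighted combination of fixed per-source predictive probabilities, which is convex in the weights. It uses exponentiated gradient updates to stay on the probability simplex. All names (`mixture_nll`, `exponentiated_gradient`, `toy_probs`) and the toy numbers are hypothetical.

```python
import math


def mixture_nll(weights, probs):
    """Average negative log likelihood of the weighted mixture.

    probs[i][k] is the probability that a (hypothetical) model for source k
    assigns to the correct label of target example i.
    """
    total = 0.0
    for row in probs:
        mix = sum(w * p for w, p in zip(weights, row))
        total -= math.log(mix)
    return total / len(probs)


def nll_gradient(weights, probs):
    """Gradient of mixture_nll with respect to the mixture weights."""
    grad = [0.0] * len(weights)
    for row in probs:
        mix = sum(w * p for w, p in zip(weights, row))
        for k, p in enumerate(row):
            grad[k] -= p / mix
    return [g / len(probs) for g in grad]


def exponentiated_gradient(probs, steps=200, lr=0.5):
    """Minimize mixture_nll over the simplex, starting from uniform weights."""
    k = len(probs[0])
    weights = [1.0 / k] * k
    for _ in range(steps):
        grad = nll_gradient(weights, probs)
        # Multiplicative update keeps weights positive; renormalizing
        # keeps them summing to one.
        unnorm = [w * math.exp(-lr * g) for w, g in zip(weights, grad)]
        z = sum(unnorm)
        weights = [u / z for u in unnorm]
    return weights


# Hypothetical toy data: 4 target examples, 3 candidate sources.
toy_probs = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.1],
    [0.7, 0.1, 0.2],
    [0.2, 0.9, 0.1],
]
uniform = [1.0 / 3] * 3
learned = exponentiated_gradient(toy_probs)
print("uniform NLL:", mixture_nll(uniform, toy_probs))
print("learned NLL:", mixture_nll(learned, toy_probs))
print("learned weights:", learned)
```

Exponentiated gradient is chosen here only because it enforces the simplex constraint without a separate projection step; a projected gradient method would serve the same illustrative purpose.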
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2502.10510, https://arxiv.org/pdf/2502.10510
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4407683676
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4407683676 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2502.10510 (Digital Object Identifier)
- Title: MixMin: Finding Data Mixtures via Convex Minimization (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-02-14 (full publication date if available)
- Authors: Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison (list of authors in order)
- Landing page: https://arxiv.org/abs/2502.10510 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2502.10510 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2502.10510 (direct OA link when available)
- Concepts: Minification, Regular polygon, Convex optimization, Computer science, Mathematics, Mathematical optimization, Geometry (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4407683676 |
| doi | https://doi.org/10.48550/arxiv.2502.10510 |
| ids.doi | https://doi.org/10.48550/arxiv.2502.10510 |
| ids.openalex | https://openalex.org/W4407683676 |
| fwci | |
| type | preprint |
| title | MixMin: Finding Data Mixtures via Convex Minimization |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10824 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9866999983787537 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Image Retrieval and Classification Techniques |
| topics[1].id | https://openalex.org/T10637 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9681000113487244 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Advanced Clustering Algorithms Research |
| topics[2].id | https://openalex.org/T11901 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9538999795913696 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Bayesian Methods and Mixture Models |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C147764199 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6824743151664734 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q6865248 |
| concepts[0].display_name | Minification |
| concepts[1].id | https://openalex.org/C112680207 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5472038984298706 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q714886 |
| concepts[1].display_name | Regular polygon |
| concepts[2].id | https://openalex.org/C157972887 |
| concepts[2].level | 3 |
| concepts[2].score | 0.41907694935798645 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q463359 |
| concepts[2].display_name | Convex optimization |
| concepts[3].id | https://openalex.org/C41008148 |
| concepts[3].level | 0 |
| concepts[3].score | 0.3815561830997467 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[3].display_name | Computer science |
| concepts[4].id | https://openalex.org/C33923547 |
| concepts[4].level | 0 |
| concepts[4].score | 0.3589596152305603 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q395 |
| concepts[4].display_name | Mathematics |
| concepts[5].id | https://openalex.org/C126255220 |
| concepts[5].level | 1 |
| concepts[5].score | 0.32942184805870056 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q141495 |
| concepts[5].display_name | Mathematical optimization |
| concepts[6].id | https://openalex.org/C2524010 |
| concepts[6].level | 1 |
| concepts[6].score | 0.10076341032981873 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q8087 |
| concepts[6].display_name | Geometry |
| keywords[0].id | https://openalex.org/keywords/minification |
| keywords[0].score | 0.6824743151664734 |
| keywords[0].display_name | Minification |
| keywords[1].id | https://openalex.org/keywords/regular-polygon |
| keywords[1].score | 0.5472038984298706 |
| keywords[1].display_name | Regular polygon |
| keywords[2].id | https://openalex.org/keywords/convex-optimization |
| keywords[2].score | 0.41907694935798645 |
| keywords[2].display_name | Convex optimization |
| keywords[3].id | https://openalex.org/keywords/computer-science |
| keywords[3].score | 0.3815561830997467 |
| keywords[3].display_name | Computer science |
| keywords[4].id | https://openalex.org/keywords/mathematics |
| keywords[4].score | 0.3589596152305603 |
| keywords[4].display_name | Mathematics |
| keywords[5].id | https://openalex.org/keywords/mathematical-optimization |
| keywords[5].score | 0.32942184805870056 |
| keywords[5].display_name | Mathematical optimization |
| keywords[6].id | https://openalex.org/keywords/geometry |
| keywords[6].score | 0.10076341032981873 |
| keywords[6].display_name | Geometry |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2502.10510 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2502.10510 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2502.10510 |
| locations[1].id | doi:10.48550/arxiv.2502.10510 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2502.10510 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5099038757 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Anvith Thudi |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Thudi, Anvith |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5116305986 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Evianne Rovers |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Rovers, Evianne |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5074108427 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-0816-219X |
| authorships[2].author.display_name | Yangjun Ruan |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Ruan, Yangjun |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5116305987 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Tristan Thrush |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Thrush, Tristan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5054711904 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Chris J. Maddison |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Maddison, Chris J. |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2502.10510 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | MixMin: Finding Data Mixtures via Convex Minimization |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10824 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9866999983787537 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Image Retrieval and Classification Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W1979597421, https://openalex.org/W2007980826, https://openalex.org/W2061531152, https://openalex.org/W3002753104, https://openalex.org/W2077600819, https://openalex.org/W2142036596, https://openalex.org/W2072657027, https://openalex.org/W2962838298, https://openalex.org/W2600246793 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2502.10510 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2502.10510 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2502.10510 |
| primary_location.id | pmh:oai:arXiv.org:2502.10510 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2502.10510 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2502.10510 |
| publication_date | 2025-02-14 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 27, 39, 56, 90, 140 |
| abstract_inverted_index.In | 65 |
| abstract_inverted_index.We | 32, 86 |
| abstract_inverted_index.an | 190 |
| abstract_inverted_index.as | 38, 80 |
| abstract_inverted_index.be | 182 |
| abstract_inverted_index.in | 122 |
| abstract_inverted_index.is | 26, 45, 62 |
| abstract_inverted_index.it | 104 |
| abstract_inverted_index.of | 174, 200 |
| abstract_inverted_index.on | 105, 144, 156 |
| abstract_inverted_index.to | 51, 152, 188, 196 |
| abstract_inverted_index.we | 68, 99, 128, 164, 193 |
| abstract_inverted_index.ARC | 158 |
| abstract_inverted_index.all | 123 |
| abstract_inverted_index.and | 7, 12, 29, 88, 102, 108, 161 |
| abstract_inverted_index.are | 4 |
| abstract_inverted_index.for | 55, 93, 139, 169 |
| abstract_inverted_index.log | 154 |
| abstract_inverted_index.may | 181 |
| abstract_inverted_index.one | 47 |
| abstract_inverted_index.our | 81, 124 |
| abstract_inverted_index.saw | 194 |
| abstract_inverted_index.the | 22, 42, 46, 52, 70, 73, 113, 119, 130 |
| abstract_inverted_index.was | 112 |
| abstract_inverted_index.0.2% | 136 |
| abstract_inverted_index.1-5% | 149 |
| abstract_inverted_index.8.2B | 145 |
| abstract_inverted_index.When | 184 |
| abstract_inverted_index.With | 126 |
| abstract_inverted_index.Yet, | 20 |
| abstract_inverted_index.best | 43, 53 |
| abstract_inverted_index.call | 100 |
| abstract_inverted_index.data | 9, 24, 35, 75, 120, 131, 187 |
| abstract_inverted_index.from | 10 |
| abstract_inverted_index.lead | 50 |
| abstract_inverted_index.less | 134 |
| abstract_inverted_index.make | 69 |
| abstract_inverted_index.only | 114 |
| abstract_inverted_index.open | 30 |
| abstract_inverted_index.test | 103 |
| abstract_inverted_index.than | 135 |
| abstract_inverted_index.that | 48, 72, 116, 166, 178 |
| abstract_inverted_index.this | 34, 60, 66, 95 |
| abstract_inverted_index.Easy, | 159 |
| abstract_inverted_index.PIQA, | 157 |
| abstract_inverted_index.SciQ, | 160 |
| abstract_inverted_index.class | 83 |
| abstract_inverted_index.e.g., | 15 |
| abstract_inverted_index.found | 165 |
| abstract_inverted_index.large | 17 |
| abstract_inverted_index.model | 54, 82, 142 |
| abstract_inverted_index.study | 89 |
| abstract_inverted_index.train | 189 |
| abstract_inverted_index.using | 133 |
| abstract_inverted_index.which | 98 |
| abstract_inverted_index.would | 49 |
| abstract_inverted_index.MixMin | 111, 167, 179 |
| abstract_inverted_index.Modern | 0 |
| abstract_inverted_index.convex | 79, 96 |
| abstract_inverted_index.larger | 175 |
| abstract_inverted_index.method | 115 |
| abstract_inverted_index.mixing | 8, 36, 76, 185 |
| abstract_inverted_index.model, | 192 |
| abstract_inverted_index.models | 171 |
| abstract_inverted_index.paper, | 67 |
| abstract_inverted_index.scores | 199 |
| abstract_inverted_index.tasks. | 110 |
| abstract_inverted_index.MixMin, | 101, 127 |
| abstract_inverted_index.XGBoost | 191 |
| abstract_inverted_index.average | 197 |
| abstract_inverted_index.becomes | 78, 84 |
| abstract_inverted_index.between | 148 |
| abstract_inverted_index.compute | 138 |
| abstract_inverted_index.develop | 87 |
| abstract_inverted_index.diverse | 11 |
| abstract_inverted_index.finding | 21 |
| abstract_inverted_index.larger. | 85 |
| abstract_inverted_index.machine | 1 |
| abstract_inverted_index.mixture | 25, 44, 121, 132 |
| abstract_inverted_index.models, | 176 |
| abstract_inverted_index.models. | 19 |
| abstract_inverted_index.optimal | 23 |
| abstract_inverted_index.problem | 37 |
| abstract_inverted_index.smaller | 170 |
| abstract_inverted_index.tokens, | 146 |
| abstract_inverted_index.trained | 143 |
| abstract_inverted_index.approach | 92 |
| abstract_inverted_index.bi-level | 40, 74 |
| abstract_inverted_index.bioassay | 186 |
| abstract_inverted_index.improved | 118, 129, 172 |
| abstract_inverted_index.language | 18, 106 |
| abstract_inverted_index.learning | 2 |
| abstract_inverted_index.mixtures | 168, 180 |
| abstract_inverted_index.modeling | 107 |
| abstract_inverted_index.negative | 153 |
| abstract_inverted_index.problem. | 31 |
| abstract_inverted_index.relative | 150 |
| abstract_inverted_index.sources, | 14 |
| abstract_inverted_index.training | 173 |
| abstract_inverted_index.chemistry | 109 |
| abstract_inverted_index.combining | 6 |
| abstract_inverted_index.disparate | 13 |
| abstract_inverted_index.formalize | 33 |
| abstract_inverted_index.generally | 63 |
| abstract_inverted_index.objective | 61, 77 |
| abstract_inverted_index.pipelines | 3 |
| abstract_inverted_index.precision | 198 |
| abstract_inverted_index.resulting | 147 |
| abstract_inverted_index.uniformly | 117 |
| abstract_inverted_index.0.03-0.15. | 201 |
| abstract_inverted_index.Crucially, | 163 |
| abstract_inverted_index.additional | 137 |
| abstract_inverted_index.downstream | 57 |
| abstract_inverted_index.likelihood | 155 |
| abstract_inverted_index.objective, | 97 |
| abstract_inverted_index.objective. | 58 |
| abstract_inverted_index.objective: | 41 |
| abstract_inverted_index.optimizing | 94 |
| abstract_inverted_index.suggesting | 177 |
| abstract_inverted_index.challenging | 28 |
| abstract_inverted_index.improvement | 151 |
| abstract_inverted_index.observation | 71 |
| abstract_inverted_index.pythia-410M | 141 |
| abstract_inverted_index.OpenWebMath. | 162 |
| abstract_inverted_index.experiments. | 125 |
| abstract_inverted_index.improvements | 195 |
| abstract_inverted_index.increasingly | 5 |
| abstract_inverted_index.intractable. | 64 |
| abstract_inverted_index.pre-training | 16 |
| abstract_inverted_index.Unfortunately, | 59 |
| abstract_inverted_index.gradient-based | 91 |
| abstract_inverted_index.scale-invariant. | 183 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile |
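
The `abstract_inverted_index.*` rows in the table above store the abstract as a map from each word to the zero-based positions where it occurs. A minimal sketch of how such an index can be turned back into running text follows. The function name `rebuild_abstract` is hypothetical, and `sample_index` is a hand-copied excerpt limited to positions 0 through 14 of the rows above, not the full payload.

```python
def rebuild_abstract(inverted_index):
    """Place each word at every position listed for it, then join in order."""
    positions = {}
    for word, indexes in inverted_index.items():
        for i in indexes:
            positions[i] = word
    return " ".join(positions[i] for i in sorted(positions))


# Excerpt of the index covering only the first 15 word positions.
sample_index = {
    "Modern": [0],
    "machine": [1],
    "learning": [2],
    "pipelines": [3],
    "are": [4],
    "increasingly": [5],
    "combining": [6],
    "and": [7, 12],
    "mixing": [8],
    "data": [9],
    "from": [10],
    "diverse": [11],
    "disparate": [13],
    "sources,": [14],
}

print(rebuild_abstract(sample_index))
```

Because punctuation is stored as part of each token (for example `sources,`), joining on single spaces is enough to recover the original sentence.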