Scaling Generalist Data-Analytic Agents
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2509.25084
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to handle the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; and 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines, DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models, with a score of 68.10%. We also incorporate empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable guidance on agentic training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.
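
The abstract mentions a dynamically adjustable objective that combines SFT and RL losses but does not say how the weighting is adjusted. As a purely illustrative sketch, and not the authors' method, the snippet below blends two scalar losses with a hypothetical linear schedule; the function names `sft_weight` and `combined_loss` and the linear decay itself are assumptions made for this example.

```python
def sft_weight(step: int, total_steps: int,
               start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical schedule: move the SFT weight linearly from start to end.

    This schedule is an assumption for illustration only; the abstract
    does not describe how the objective is adjusted.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to the range [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def combined_loss(sft_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    """Blend the two losses with a step-dependent weight."""
    w = sft_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss


# Early steps lean on the SFT term, late steps on the RL term.
for step in (0, 50, 100):
    print(step, combined_loss(2.0, 4.0, step, 100))
```

Under this assumed schedule, training starts as pure imitation of the curated trajectories and ends as pure reinforcement learning; any other schedule could be swapped into `sft_weight` without changing `combined_loss`.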
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2509.25084, https://arxiv.org/pdf/2509.25084
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415337623
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415337623 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2509.25084 (Digital Object Identifier)
- Title: Scaling Generalist Data-Analytic Agents (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-09-29 (full publication date if available)
- Authors: Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Bin Zhao, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen (list of authors in order)
- Landing page: https://arxiv.org/abs/2509.25084 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2509.25084 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2509.25084 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415337623 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2509.25084 |
| ids.doi | https://doi.org/10.48550/arxiv.2509.25084 |
| ids.openalex | https://openalex.org/W4415337623 |
| fwci | |
| type | preprint |
| title | Scaling Generalist Data-Analytic Agents |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10320 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.5975000262260437 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Neural Networks and Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2509.25084 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2509.25084 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2509.25084 |
| locations[1].id | doi:10.48550/arxiv.2509.25084 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2509.25084 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5091748443 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Shuofei Qiao |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Qiao, Shuofei |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5043181400 |
| authorships[1].author.orcid | https://orcid.org/0009-0009-8085-2302 |
| authorships[1].author.display_name | Yanqiu Zhao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhao, Yanqiu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5113431592 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Zhisong Qiu |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Qiu, Zhisong |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5036254084 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-0606-4710 |
| authorships[3].author.display_name | Xiaobin Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Xiaobin |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5104119322 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Jintian Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Jintian |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5103181856 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-8731-2503 |
| authorships[5].author.display_name | Bin Zhao |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Bin, Zhao |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5089259739 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-1970-0678 |
| authorships[6].author.display_name | Ningyu Zhang |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Zhang, Ningyu |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5101656023 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-4482-1559 |
| authorships[7].author.display_name | Yong Jiang |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Jiang, Yong |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5005535444 |
| authorships[8].author.orcid | https://orcid.org/0009-0004-8412-359X |
| authorships[8].author.display_name | Pengjun Xie |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Xie, Pengjun |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5101488344 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-3709-5053 |
| authorships[9].author.display_name | Fei Huang |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Huang, Fei |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5102018239 |
| authorships[10].author.orcid | https://orcid.org/0000-0001-5496-7442 |
| authorships[10].author.display_name | Huajun Chen |
| authorships[10].author_position | last |
| authorships[10].raw_author_name | Chen, Huajun |
| authorships[10].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2509.25084 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-19T00:00:00 |
| display_name | Scaling Generalist Data-Analytic Agents |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10320 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.5975000262260437 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Neural Networks and Applications |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2509.25084 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2509.25084 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2509.25084 |
| primary_location.id | pmh:oai:arXiv.org:2509.25084 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2509.25084 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2509.25084 |
| publication_date | 2025-09-29 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 5, 52, 92, 97, 113, 125, 137, 151, 203 |
| abstract_inverted_index.1) | 91 |
| abstract_inverted_index.2) | 112 |
| abstract_inverted_index.3) | 124 |
| abstract_inverted_index.4) | 136 |
| abstract_inverted_index.RL | 134 |
| abstract_inverted_index.We | 207, 233 |
| abstract_inverted_index.an | 175 |
| abstract_inverted_index.as | 4 |
| abstract_inverted_index.by | 119 |
| abstract_inverted_index.in | 71 |
| abstract_inverted_index.of | 16, 109, 178, 205 |
| abstract_inverted_index.on | 24, 146, 168, 180 |
| abstract_inverted_index.to | 34, 61, 103, 223 |
| abstract_inverted_index.we | 148 |
| abstract_inverted_index.AI. | 18 |
| abstract_inverted_index.Our | 193 |
| abstract_inverted_index.SFT | 132 |
| abstract_inverted_index.all | 199 |
| abstract_inverted_index.and | 12, 40, 56, 83, 96, 107, 121, 133, 139, 160, 191, 237 |
| abstract_inverted_index.are | 2 |
| abstract_inverted_index.for | 8, 13, 164, 230, 239 |
| abstract_inverted_index.key | 6, 69 |
| abstract_inverted_index.our | 170, 215 |
| abstract_inverted_index.set | 154 |
| abstract_inverted_index.the | 14, 105, 186, 219, 231, 240 |
| abstract_inverted_index.This | 48 |
| abstract_inverted_index.also | 195, 208 |
| abstract_inverted_index.best | 197 |
| abstract_inverted_index.both | 131 |
| abstract_inverted_index.data | 38, 54, 78, 161, 182 |
| abstract_inverted_index.face | 35 |
| abstract_inverted_index.file | 162 |
| abstract_inverted_index.from | 214 |
| abstract_inverted_index.into | 218 |
| abstract_inverted_index.over | 27 |
| abstract_inverted_index.rely | 22 |
| abstract_inverted_index.some | 210 |
| abstract_inverted_index.task | 94, 100, 158 |
| abstract_inverted_index.that | 44 |
| abstract_inverted_index.will | 234 |
| abstract_inverted_index.with | 174, 202 |
| abstract_inverted_index.Built | 145 |
| abstract_inverted_index.about | 227 |
| abstract_inverted_index.agent | 57 |
| abstract_inverted_index.among | 198 |
| abstract_inverted_index.build | 62 |
| abstract_inverted_index.files | 39 |
| abstract_inverted_index.paper | 49 |
| abstract_inverted_index.score | 177, 204 |
| abstract_inverted_index.three | 68 |
| abstract_inverted_index.while | 30 |
| abstract_inverted_index.71.16% | 179 |
| abstract_inverted_index.GPT-5. | 192 |
| abstract_inverted_index.agents | 1 |
| abstract_inverted_index.aiming | 222 |
| abstract_inverted_index.curate | 149 |
| abstract_inverted_index.future | 242 |
| abstract_inverted_index.gained | 213 |
| abstract_inverted_index.models | 32, 201 |
| abstract_inverted_index.prompt | 25 |
| abstract_inverted_index.recipe | 59 |
| abstract_inverted_index.stable | 140 |
| abstract_inverted_index.tasks. | 166 |
| abstract_inverted_index.trials | 217 |
| abstract_inverted_index.vision | 15 |
| abstract_inverted_index.68.10%. | 206 |
| abstract_inverted_index.Current | 19 |
| abstract_inverted_index.Trained | 167 |
| abstract_inverted_index.agentic | 228 |
| abstract_inverted_index.agents, | 75 |
| abstract_inverted_index.agents. | 65 |
| abstract_inverted_index.applies | 90 |
| abstract_inverted_index.average | 176 |
| abstract_inverted_index.diverse | 156 |
| abstract_inverted_index.formats | 163 |
| abstract_inverted_index.heavily | 23 |
| abstract_inverted_index.losses; | 135 |
| abstract_inverted_index.models, | 29 |
| abstract_inverted_index.provide | 224 |
| abstract_inverted_index.release | 235 |
| abstract_inverted_index.rollout | 143 |
| abstract_inverted_index.tackles | 67 |
| abstract_inverted_index.DataMind | 66, 89 |
| abstract_inverted_index.achieves | 172 |
| abstract_inverted_index.analysis | 183, 220 |
| abstract_inverted_index.building | 72 |
| abstract_inverted_index.catalyst | 7 |
| abstract_inverted_index.demands. | 47 |
| abstract_inverted_index.designed | 60 |
| abstract_inverted_index.domains, | 157 |
| abstract_inverted_index.emerging | 3 |
| abstract_inverted_index.followed | 118 |
| abstract_inverted_index.however, | 21 |
| abstract_inverted_index.improper | 80 |
| abstract_inverted_index.increase | 104 |
| abstract_inverted_index.insights | 212, 226 |
| abstract_inverted_index.multiple | 181 |
| abstract_inverted_index.performs | 196 |
| abstract_inverted_index.queries; | 111 |
| abstract_inverted_index.rollout. | 87 |
| abstract_inverted_index.sampling | 116 |
| abstract_inverted_index.scalable | 53 |
| abstract_inverted_index.spanning | 155 |
| abstract_inverted_index.strategy | 117 |
| abstract_inverted_index.struggle | 33 |
| abstract_inverted_index.taxonomy | 95 |
| abstract_inverted_index.training | 58, 81, 128, 229 |
| abstract_inverted_index.unstable | 84 |
| abstract_inverted_index.DataMind, | 51, 147 |
| abstract_inverted_index.analytics | 46 |
| abstract_inverted_index.automated | 9 |
| abstract_inverted_index.baselines | 189 |
| abstract_inverted_index.combining | 130 |
| abstract_inverted_index.discovery | 11 |
| abstract_inverted_index.diversity | 106 |
| abstract_inverted_index.empirical | 211 |
| abstract_inverted_index.including | 76 |
| abstract_inverted_index.mechanism | 102 |
| abstract_inverted_index.objective | 129 |
| abstract_inverted_index.reasoning | 43 |
| abstract_inverted_index.recursive | 98 |
| abstract_inverted_index.research. | 243 |
| abstract_inverted_index.strategy, | 82 |
| abstract_inverted_index.strongest | 187 |
| abstract_inverted_index.synthesis | 55 |
| abstract_inverted_index.Innovating | 17 |
| abstract_inverted_index.actionable | 225 |
| abstract_inverted_index.adjustable | 127 |
| abstract_inverted_index.challenges | 70 |
| abstract_inverted_index.code-based | 85, 141 |
| abstract_inverted_index.community. | 232 |
| abstract_inverted_index.difficulty | 108 |
| abstract_inverted_index.filtering; | 123 |
| abstract_inverted_index.framework. | 144 |
| abstract_inverted_index.generalist | 63 |
| abstract_inverted_index.introduces | 50 |
| abstract_inverted_index.multi-step | 42 |
| abstract_inverted_index.multi-turn | 86, 142 |
| abstract_inverted_index.real-world | 45 |
| abstract_inverted_index.resources, | 79 |
| abstract_inverted_index.rule-based | 122 |
| abstract_inverted_index.scientific | 10 |
| abstract_inverted_index.trajectory | 115, 153 |
| abstract_inverted_index.Concretely, | 88 |
| abstract_inverted_index.DataMind-7B | 194 |
| abstract_inverted_index.approaches, | 20 |
| abstract_inverted_index.benchmarks, | 184 |
| abstract_inverted_index.categories, | 159 |
| abstract_inverted_index.community's | 241 |
| abstract_inverted_index.composition | 101 |
| abstract_inverted_index.dynamically | 126 |
| abstract_inverted_index.engineering | 26 |
| abstract_inverted_index.exploratory | 216 |
| abstract_inverted_index.incorporate | 209 |
| abstract_inverted_index.large-scale | 37 |
| abstract_inverted_index.model-based | 120 |
| abstract_inverted_index.open-source | 31, 73, 200 |
| abstract_inverted_index.proprietary | 28, 188 |
| abstract_inverted_index.synthesized | 110 |
| abstract_inverted_index.DataMind-12K | 236 |
| abstract_inverted_index.DataMind-14B | 171 |
| abstract_inverted_index.easy-to-hard | 99 |
| abstract_inverted_index.experiments, | 221 |
| abstract_inverted_index.fine-grained | 93 |
| abstract_inverted_index.high-quality | 152 |
| abstract_inverted_index.insufficient | 77 |
| abstract_inverted_index.Data-analytic | 0 |
| abstract_inverted_index.DataMind-12K, | 150, 169 |
| abstract_inverted_index.DeepSeek-V3.1 | 190 |
| abstract_inverted_index.data-analytic | 64, 74, 165 |
| abstract_inverted_index.long-horizon, | 41 |
| abstract_inverted_index.memory-frugal | 138 |
| abstract_inverted_index.outperforming | 185 |
| abstract_inverted_index.DataMind-7B,14B | 238 |
| abstract_inverted_index.diverse-format, | 36 |
| abstract_inverted_index.state-of-the-art | 173 |
| abstract_inverted_index.knowledge-augmented | 114 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile | |
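
The `abstract_inverted_index.*` rows in the payload above store the abstract as a mapping from each token to the word positions where it occurs. A minimal sketch of how the plain text can be rebuilt from such a mapping follows; the helper name `rebuild_abstract` is hypothetical, and the sample dictionary copies only the tokens at positions 0 to 7 from the table, with the other positions of `a` and `key` left out for brevity.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> list-of-positions mapping."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Tokens at positions 0..7, taken from the payload rows above.
sample = {
    "Data-analytic": [0],
    "agents": [1],
    "are": [2],
    "emerging": [3],
    "as": [4],
    "a": [5],
    "key": [6],
    "catalyst": [7],
}

print(rebuild_abstract(sample))
# prints: Data-analytic agents are emerging as a key catalyst
```

Because punctuation stays attached to its token (for example `agents,` and `agents.` are separate keys in the table), joining on single spaces recovers the sentence text without further processing.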