Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2505.12212
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup. The code is available at https://github.com/gszfwsb/Data-Whisperer.
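The abstract reports results for a 10% subset but does not spell out how examples are scored. As a rough illustration of the final selection step only, the sketch below assumes every training example already has a numeric score and keeps the highest-scoring fraction. The `scores` values and the `select_top_fraction` helper are hypothetical and are not taken from the paper or its repository.

```python
import math


def select_top_fraction(scores, fraction=0.1):
    """Return the indices of the highest-scoring examples.

    `scores` is a hypothetical list of per-example scores; how they are
    computed is outside the scope of this sketch.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one example, rounding the subset size up.
    keep = max(1, math.ceil(len(scores) * fraction))
    # Rank indices by score, highest first, then restore dataset order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Made-up scores for ten examples, for illustration only.
scores = [0.12, 0.87, 0.45, 0.91, 0.33, 0.05, 0.78, 0.66, 0.29, 0.50]
selected = select_top_fraction(scores, fraction=0.1)
print(selected)
```

Returning indices rather than the examples themselves keeps the helper independent of how the dataset is stored.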
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.12212, https://arxiv.org/pdf/2505.12212
- OA Status: green
- OpenAlex ID: https://openalex.org/W4416445292
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4416445292 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.12212
- Title: Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
- Type: preprint (OpenAlex work type)
- Language: en
- Publication year: 2025
- Publication date: 2025-05-18
- Authors (in order): Shaobo Wang, Xiangqi Jin, Ziming Wang, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, Conghui He, Xuming Hu, Linfeng Zhang
- Landing page: https://arxiv.org/abs/2505.12212
- PDF URL: https://arxiv.org/pdf/2505.12212
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.12212
- Cited by: 0 (total citation count in OpenAlex)
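The full payload that follows stores the abstract as `abstract_inverted_index.*` rows, each mapping a token to the word positions where it occurs. A minimal sketch of how such a mapping can be turned back into running text is given here. The `rebuild_abstract` helper is hypothetical, and the `excerpt` dictionary copies only the payload entries for positions 0 to 8.

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a token -> list-of-positions mapping."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt of the payload: only positions 0-8 are included here, so the
# position lists for "on", "data" and "is" are shortened on purpose.
excerpt = {
    "Fine-tuning": [0],
    "large": [1],
    "language": [2],
    "models": [3],
    "(LLMs)": [4],
    "on": [5],
    "task-specific": [6],
    "data": [7],
    "is": [8],
}

text = rebuild_abstract(excerpt)
print(text)
```

Because the index keeps punctuation attached to tokens (for example `data,` and `data` are separate keys), joining on spaces is enough to recover the original wording.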
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4416445292 |
| doi | https://doi.org/10.48550/arxiv.2505.12212 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.12212 |
| ids.openalex | https://openalex.org/W4416445292 |
| fwci | |
| type | preprint |
| title | Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.12212 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.12212 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.12212 |
| locations[1].id | doi:10.48550/arxiv.2505.12212 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.12212 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5107930049 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-8156-7081 |
| authorships[0].author.display_name | Shaobo Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wang, Shaobo |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5002557354 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-6143-8570 |
| authorships[1].author.display_name | Xiaopeng Jin |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Jin, Xiangqi |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100451885 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-5457-8882 |
| authorships[2].author.display_name | Ziming Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, Ziming |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5101828800 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-9704-3725 |
| authorships[3].author.display_name | Wang Jie |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Jize |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100319572 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5293-7434 |
| authorships[4].author.display_name | Jiajun Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Jiajun |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5101812386 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-8876-6371 |
| authorships[5].author.display_name | Kaixin Li |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Li, Kaixin |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5102635197 |
| authorships[6].author.orcid | https://orcid.org/0009-0002-6157-5898 |
| authorships[6].author.display_name | Zichen Wen |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Wen, Zichen |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5100664779 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-3849-3416 |
| authorships[7].author.display_name | Zhong Li |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Li, Zhong |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5072088593 |
| authorships[8].author.orcid | https://orcid.org/0009-0009-3186-6847 |
| authorships[8].author.display_name | Conghui He |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | He, Conghui |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5057914558 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-6973-8252 |
| authorships[9].author.display_name | Xuming Hu |
| authorships[9].author_position | middle |
| authorships[9].raw_author_name | Hu, Xuming |
| authorships[9].is_corresponding | False |
| authorships[10].author.id | https://openalex.org/A5100689117 |
| authorships[10].author.orcid | https://orcid.org/0000-0002-8470-5846 |
| authorships[10].author.display_name | Linfeng Zhang |
| authorships[10].author_position | last |
| authorships[10].raw_author_name | Zhang, Linfeng |
| authorships[10].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.12212 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-12-01T00:03:43.161839 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.12212 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.12212 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.12212 |
| primary_location.id | pmh:oai:arXiv.org:2505.12212 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.12212 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.12212 |
| publication_date | 2025-05-18 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 39, 130, 134 |
| abstract_inverted_index.As | 14 |
| abstract_inverted_index.To | 64 |
| abstract_inverted_index.an | 72 |
| abstract_inverted_index.at | 141 |
| abstract_inverted_index.be | 86 |
| abstract_inverted_index.is | 8, 47, 139 |
| abstract_inverted_index.of | 122 |
| abstract_inverted_index.on | 5, 42, 53, 92, 115 |
| abstract_inverted_index.or | 51 |
| abstract_inverted_index.to | 26, 57, 85, 110 |
| abstract_inverted_index.we | 68 |
| abstract_inverted_index.10% | 121 |
| abstract_inverted_index.The | 137 |
| abstract_inverted_index.and | 29, 49, 95, 101, 125, 133 |
| abstract_inverted_index.for | 10, 22 |
| abstract_inverted_index.raw | 94 |
| abstract_inverted_index.the | 43, 60, 83, 111, 116, 123 |
| abstract_inverted_index.Data | 70, 104 |
| abstract_inverted_index.both | 93 |
| abstract_inverted_index.code | 138 |
| abstract_inverted_index.data | 7, 33 |
| abstract_inverted_index.fail | 56 |
| abstract_inverted_index.full | 112 |
| abstract_inverted_index.just | 120 |
| abstract_inverted_index.rely | 52 |
| abstract_inverted_index.that | 55, 77 |
| abstract_inverted_index.were | 90 |
| abstract_inverted_index.with | 82, 129 |
| abstract_inverted_index.GSM8K | 113 |
| abstract_inverted_index.data, | 124 |
| abstract_inverted_index.fully | 58 |
| abstract_inverted_index.grow, | 17 |
| abstract_inverted_index.large | 1 |
| abstract_inverted_index.model | 41, 84 |
| abstract_inverted_index.often | 36 |
| abstract_inverted_index.sizes | 16 |
| abstract_inverted_index.tasks | 100 |
| abstract_inverted_index.their | 11 |
| abstract_inverted_index.these | 66 |
| abstract_inverted_index.using | 119 |
| abstract_inverted_index.which | 46 |
| abstract_inverted_index.(LLMs) | 4 |
| abstract_inverted_index.across | 98 |
| abstract_inverted_index.costs. | 31 |
| abstract_inverted_index.method | 76 |
| abstract_inverted_index.model, | 118 |
| abstract_inverted_index.models | 3 |
| abstract_inverted_index.target | 44 |
| abstract_inverted_index.address | 65 |
| abstract_inverted_index.becomes | 24 |
| abstract_inverted_index.crucial | 25 |
| abstract_inverted_index.dataset | 15, 114 |
| abstract_inverted_index.diverse | 99 |
| abstract_inverted_index.methods | 35, 128 |
| abstract_inverted_index.model's | 61 |
| abstract_inverted_index.models. | 102 |
| abstract_inverted_index.optimal | 20 |
| abstract_inverted_index.propose | 69 |
| abstract_inverted_index.require | 37 |
| abstract_inverted_index.scoring | 40 |
| abstract_inverted_index.subsets | 21 |
| abstract_inverted_index.Notably, | 103 |
| abstract_inverted_index.achieves | 106 |
| abstract_inverted_index.compared | 109 |
| abstract_inverted_index.dataset, | 45 |
| abstract_inverted_index.datasets | 97 |
| abstract_inverted_index.existing | 127 |
| abstract_inverted_index.few-shot | 79 |
| abstract_inverted_index.language | 2 |
| abstract_inverted_index.learning | 81 |
| abstract_inverted_index.leverage | 59 |
| abstract_inverted_index.speedup. | 136 |
| abstract_inverted_index.superior | 107 |
| abstract_inverted_index.training | 23 |
| abstract_inverted_index.3.1-point | 131 |
| abstract_inverted_index.Whisperer | 105 |
| abstract_inverted_index.available | 140 |
| abstract_inverted_index.balancing | 27 |
| abstract_inverted_index.conducted | 91 |
| abstract_inverted_index.effective | 12 |
| abstract_inverted_index.essential | 9 |
| abstract_inverted_index.leverages | 78 |
| abstract_inverted_index.selecting | 19 |
| abstract_inverted_index.selection | 34 |
| abstract_inverted_index.synthetic | 96 |
| abstract_inverted_index.Whisperer, | 71 |
| abstract_inverted_index.efficient, | 73 |
| abstract_inverted_index.heuristics | 54 |
| abstract_inverted_index.in-context | 80 |
| abstract_inverted_index.predictive | 62 |
| abstract_inverted_index.7.4$\times$ | 135 |
| abstract_inverted_index.Fine-tuning | 0 |
| abstract_inverted_index.Traditional | 32 |
| abstract_inverted_index.challenges, | 67 |
| abstract_inverted_index.deployment. | 13 |
| abstract_inverted_index.efficiently | 18 |
| abstract_inverted_index.evaluations | 89 |
| abstract_inverted_index.fine-tuned. | 87 |
| abstract_inverted_index.fine-tuning | 38 |
| abstract_inverted_index.improvement | 132 |
| abstract_inverted_index.outperforms | 126 |
| abstract_inverted_index.performance | 28, 108 |
| abstract_inverted_index.Comprehensive | 88 |
| abstract_inverted_index.capabilities. | 63 |
| abstract_inverted_index.computational | 30 |
| abstract_inverted_index.task-specific | 6 |
| abstract_inverted_index.time-consuming | 48 |
| abstract_inverted_index.training-free, | 74 |
| abstract_inverted_index.attention-based | 75 |
| abstract_inverted_index.Llama-3-8B-Instruct | 117 |
| abstract_inverted_index.resource-intensive, | 50 |
| abstract_inverted_index.https://github.com/gszfwsb/Data-Whisperer. | 142 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 11 |
| citation_normalized_percentile |