Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2511.07380
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet applying them in low-resource domains remains a significant challenge because of data scarcity and a high risk of overfitting. While in-domain data is limited, vast amounts of similar general-domain data exist, and our initial findings suggest that this data can serve as auxiliary supervision for domain enhancement. This observation leads to our central research question: how to effectively select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable because large in-domain data pools or validation sets are unavailable. To address this, we propose NTK-Selector, a principled and efficient framework that uses neural tangent kernels (NTK) to select general-domain auxiliary data for enhancing domain-specific performance. Our method tackles two challenges of directly applying NTK to LLMs, namely its theoretical assumptions and its prohibitive computational cost, by empirically demonstrating stable NTK-like behavior in LLMs during LoRA fine-tuning and by proposing a Jacobian-free approximation method. Extensive experiments across four low-resource domains (medical, financial, legal, and psychological) demonstrate that NTK-Selector consistently improves downstream performance. Specifically, fine-tuning on 1,000 in-domain samples alone yields only +0.8 points for Llama3-8B-Instruct and +0.9 points for Qwen3-8B. In contrast, enriching with 9,000 auxiliary samples selected by NTK-Selector leads to substantial gains of +8.7 and +5.1 points, which correspond to 10.9x and 5.7x improvements over the domain-only setting (8.7/0.8 and 5.1/0.9, respectively).
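To make the idea of kernel-based selection concrete, here is a minimal sketch of ranking auxiliary samples by an empirical neural tangent kernel, K(x, x') = ⟨∇f(x), ∇f(x')⟩, against a small in-domain set. It is not the NTK-Selector implementation: the abstract does not specify the scoring rule or the Jacobian-free approximation, so this sketch computes gradients explicitly on a toy one-layer model. The names `grad_features`, `empirical_ntk`, and `select_auxiliary`, the toy model, and the mean-similarity score are all assumptions made for illustration.

```python
import math
import random


def grad_features(w, x):
    """Gradient of the toy model f(x; w) = tanh(w . x) with respect to w."""
    pre = sum(wi * xi for wi, xi in zip(w, x))
    scale = 1.0 - math.tanh(pre) ** 2  # derivative of tanh at the pre-activation
    return [scale * xi for xi in x]


def empirical_ntk(w, x_a, x_b):
    """Empirical NTK entry: inner product of the two parameter gradients."""
    g_a = grad_features(w, x_a)
    g_b = grad_features(w, x_b)
    return sum(a * b for a, b in zip(g_a, g_b))


def select_auxiliary(w, domain, auxiliary, k):
    """Rank auxiliary samples by mean kernel similarity to the in-domain set.

    The mean-similarity score is a hypothetical choice for this sketch.
    """
    scores = [
        sum(empirical_ntk(w, aux, dom) for dom in domain) / len(domain)
        for aux in auxiliary
    ]
    ranked = sorted(range(len(auxiliary)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data: 5 in-domain samples and 20 general-domain candidates (synthetic).
random.seed(0)
w = [random.gauss(0, 1) for _ in range(3)]
domain = [[random.gauss(1, 1) for _ in range(3)] for _ in range(5)]
auxiliary = [[random.gauss(0, 1) for _ in range(3)] for _ in range(20)]

selected, scores = select_auxiliary(w, domain, auxiliary, k=5)
print(selected)
```

For an LLM the explicit per-sample gradients used here would be far too large to store, which is the computational cost the abstract refers to; the sketch only shows what quantity a kernel-based selector ranks by.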
- Type
- preprint
- Landing Page
- http://arxiv.org/abs/2511.07380
- https://arxiv.org/pdf/2511.07380
- OA Status
- green
- OpenAlex ID
- https://openalex.org/W4416159260
Raw OpenAlex JSON
- OpenAlex ID
- https://openalex.org/W4416159260 (canonical identifier for this work in OpenAlex)
- DOI
- https://doi.org/10.48550/arxiv.2511.07380 (Digital Object Identifier)
- Title
- Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains (work title)
- Type
- preprint (OpenAlex work type)
- Publication year
- 2025 (year of publication)
- Publication date
- 2025-11-10 (full publication date if available)
- Authors
- Ziqing Fan, Yaxin Du, Shuo Tang, Yu Wang (list of authors in order)
- Landing page
- https://arxiv.org/abs/2511.07380 (publisher landing page)
- PDF URL
- https://arxiv.org/pdf/2511.07380 (direct link to full text PDF)
- Open access
- Yes (whether a free full text is available)
- OA status
- green (open access status per OpenAlex)
- OA URL
- https://arxiv.org/pdf/2511.07380 (direct OA link when available)
- Cited by
- 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4416159260 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2511.07380 |
| ids.doi | https://doi.org/10.48550/arxiv.2511.07380 |
| ids.openalex | https://openalex.org/W4416159260 |
| fwci | |
| type | preprint |
| title | Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2511.07380 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2511.07380 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2511.07380 |
| locations[1].id | doi:10.48550/arxiv.2511.07380 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2511.07380 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5104315338 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Ziqing Fan |
| authorships[0].author_position | last |
| authorships[0].raw_author_name | Fan, Ziqing |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5078544994 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yaxin Du |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Du, Yaxin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5112873425 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Shuo Tang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Tang, Shuo |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100626680 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-8344-1586 |
| authorships[3].author.display_name | Yu Wang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Wang, Yu |
| authorships[3].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2511.07380 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-12T00:00:00 |
| display_name | Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-28T06:38:57.532577 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2511.07380 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2511.07380 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2511.07380 |
| primary_location.id | pmh:oai:arXiv.org:2511.07380 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2511.07380 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2511.07380 |
| publication_date | 2025-11-10 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 18, 90, 106, 145, 156, 218 |
| abstract_inverted_index.In | 196 |
| abstract_inverted_index.To | 100 |
| abstract_inverted_index.as | 54 |
| abstract_inverted_index.by | 142, 204 |
| abstract_inverted_index.in | 14, 149 |
| abstract_inverted_index.is | 34 |
| abstract_inverted_index.of | 29, 40, 92, 130, 210 |
| abstract_inverted_index.on | 180 |
| abstract_inverted_index.or | 97 |
| abstract_inverted_index.to | 22, 64, 70, 78, 89, 116, 134, 207, 217 |
| abstract_inverted_index.us | 63 |
| abstract_inverted_index.we | 103 |
| abstract_inverted_index.NTK | 133 |
| abstract_inverted_index.Our | 125 |
| abstract_inverted_index.and | 25, 44, 108, 138, 154, 169, 191, 212, 220 |
| abstract_inverted_index.are | 86 |
| abstract_inverted_index.due | 21, 88 |
| abstract_inverted_index.for | 57, 111, 189, 194 |
| abstract_inverted_index.led | 206 |
| abstract_inverted_index.our | 45, 65 |
| abstract_inverted_index.the | 26, 73, 224 |
| abstract_inverted_index.two | 128 |
| abstract_inverted_index.via | 120 |
| abstract_inverted_index.yet | 11 |
| abstract_inverted_index.+0.8 | 187 |
| abstract_inverted_index.+0.9 | 192 |
| abstract_inverted_index.+5.1 | 213 |
| abstract_inverted_index.+8.7 | 211 |
| abstract_inverted_index.5.7x | 221 |
| abstract_inverted_index.LLMs | 150 |
| abstract_inverted_index.LoRA | 152 |
| abstract_inverted_index.This | 60 |
| abstract_inverted_index.data | 23, 33, 77, 95, 115 |
| abstract_inverted_index.four | 163 |
| abstract_inverted_index.have | 4 |
| abstract_inverted_index.high | 27 |
| abstract_inverted_index.lack | 91 |
| abstract_inverted_index.most | 74 |
| abstract_inverted_index.only | 185 |
| abstract_inverted_index.over | 223 |
| abstract_inverted_index.risk | 28 |
| abstract_inverted_index.that | 49, 172 |
| abstract_inverted_index.they | 50 |
| abstract_inverted_index.vast | 38 |
| abstract_inverted_index.when | 83 |
| abstract_inverted_index.with | 199 |
| abstract_inverted_index.1,000 | 181 |
| abstract_inverted_index.9,000 | 200 |
| abstract_inverted_index.LLMs, | 135 |
| abstract_inverted_index.Large | 0 |
| abstract_inverted_index.While | 31 |
| abstract_inverted_index.alone | 184 |
| abstract_inverted_index.cost, | 141 |
| abstract_inverted_index.could | 51 |
| abstract_inverted_index.data, | 43 |
| abstract_inverted_index.exist | 37 |
| abstract_inverted_index.large | 93 |
| abstract_inverted_index.leads | 62 |
| abstract_inverted_index.pools | 96 |
| abstract_inverted_index.serve | 53 |
| abstract_inverted_index.sets. | 99 |
| abstract_inverted_index.their | 12 |
| abstract_inverted_index.there | 36 |
| abstract_inverted_index.this, | 102 |
| abstract_inverted_index.which | 215 |
| abstract_inverted_index.(LLMs) | 3 |
| abstract_inverted_index.(NTK). | 124 |
| abstract_inverted_index.across | 8, 162 |
| abstract_inverted_index.domain | 58 |
| abstract_inverted_index.during | 151 |
| abstract_inverted_index.legal, | 168 |
| abstract_inverted_index.method | 126 |
| abstract_inverted_index.models | 2 |
| abstract_inverted_index.neural | 121 |
| abstract_inverted_index.points | 188, 193 |
| abstract_inverted_index.reveal | 48 |
| abstract_inverted_index.select | 72 |
| abstract_inverted_index.stable | 146 |
| abstract_inverted_index.tasks, | 10 |
| abstract_inverted_index.address | 101 |
| abstract_inverted_index.amounts | 39 |
| abstract_inverted_index.central | 66 |
| abstract_inverted_index.domains | 16, 165 |
| abstract_inverted_index.enhance | 117 |
| abstract_inverted_index.initial | 46 |
| abstract_inverted_index.kernels | 123 |
| abstract_inverted_index.method. | 159 |
| abstract_inverted_index.methods | 85 |
| abstract_inverted_index.propose | 104 |
| abstract_inverted_index.remains | 17 |
| abstract_inverted_index.samples | 183, 202 |
| abstract_inverted_index.similar | 41 |
| abstract_inverted_index.success | 7 |
| abstract_inverted_index.tackles | 127 |
| abstract_inverted_index.tangent | 122 |
| abstract_inverted_index.yielded | 186 |
| abstract_inverted_index.NTK-like | 147 |
| abstract_inverted_index.achieved | 5 |
| abstract_inverted_index.applying | 132 |
| abstract_inverted_index.behavior | 148 |
| abstract_inverted_index.directly | 131 |
| abstract_inverted_index.findings | 47 |
| abstract_inverted_index.improves | 175 |
| abstract_inverted_index.language | 1 |
| abstract_inverted_index.limited, | 35 |
| abstract_inverted_index.maximize | 79 |
| abstract_inverted_index.points}, | 214 |
| abstract_inverted_index.research | 67 |
| abstract_inverted_index.scarcity | 24 |
| abstract_inverted_index.selected | 203 |
| abstract_inverted_index.setting. | 226 |
| abstract_inverted_index.valuable | 75 |
| abstract_inverted_index.(medical, | 166 |
| abstract_inverted_index.Extensive | 160 |
| abstract_inverted_index.Qwen3-8B. | 195 |
| abstract_inverted_index.auxiliary | 55, 76, 114, 201 |
| abstract_inverted_index.challenge | 20 |
| abstract_inverted_index.contrast, | 197 |
| abstract_inverted_index.efficient | 109 |
| abstract_inverted_index.enriching | 198 |
| abstract_inverted_index.framework | 110 |
| abstract_inverted_index.in-domain | 32, 94, 182 |
| abstract_inverted_index.proposing | 155 |
| abstract_inverted_index.question: | 68 |
| abstract_inverted_index.selecting | 112 |
| abstract_inverted_index.challenges | 129 |
| abstract_inverted_index.downstream | 176 |
| abstract_inverted_index.financial, | 167 |
| abstract_inverted_index.principled | 107 |
| abstract_inverted_index.remarkable | 6 |
| abstract_inverted_index.validation | 98 |
| abstract_inverted_index.widespread | 9 |
| abstract_inverted_index.application | 13 |
| abstract_inverted_index.assumptions | 137 |
| abstract_inverted_index.corresponds | 216 |
| abstract_inverted_index.demonstrate | 171 |
| abstract_inverted_index.domain-only | 225 |
| abstract_inverted_index.effectively | 71 |
| abstract_inverted_index.empirically | 143 |
| abstract_inverted_index.experiments | 161 |
| abstract_inverted_index.fine-tuning | 153, 179 |
| abstract_inverted_index.observation | 61 |
| abstract_inverted_index.performance | 119 |
| abstract_inverted_index.potentially | 52 |
| abstract_inverted_index.prohibitive | 139 |
| abstract_inverted_index.significant | 19 |
| abstract_inverted_index.substantial | 208 |
| abstract_inverted_index.supervision | 56 |
| abstract_inverted_index.theoretical | 136 |
| abstract_inverted_index.traditional | 84 |
| abstract_inverted_index.NTK-Selector | 173, 205 |
| abstract_inverted_index.consistently | 174 |
| abstract_inverted_index.enhancement. | 59 |
| abstract_inverted_index.improvement} | 222 |
| abstract_inverted_index.inapplicable | 87 |
| abstract_inverted_index.low-resource | 15, 164 |
| abstract_inverted_index.overfitting. | 30 |
| abstract_inverted_index.particularly | 82 |
| abstract_inverted_index.performance. | 177 |
| abstract_inverted_index.Jacobian-free | 157 |
| abstract_inverted_index.Specifically, | 178 |
| abstract_inverted_index.\textbf{10.9x | 219 |
| abstract_inverted_index.\textbf{gains | 209 |
| abstract_inverted_index.approximation | 158 |
| abstract_inverted_index.computational | 140 |
| abstract_inverted_index.demonstrating | 144 |
| abstract_inverted_index.general-domain | 42, 113 |
| abstract_inverted_index.performance}}, | 81 |
| abstract_inverted_index.psychological) | 170 |
| abstract_inverted_index.domain-specific | 80, 118 |
| abstract_inverted_index.Llama3-8B-Instruct | 190 |
| abstract_inverted_index.\textbf{\textit{how | 69 |
| abstract_inverted_index.\textbf{NTK-Selector}, | 105 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile |
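
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows, assuming the index is available as a dictionary of token to list of integer positions. The helper names `parse_positions` and `rebuild_abstract` are hypothetical, and the sample covers only the first eight tokens of the table.

```python
def parse_positions(cell):
    """Parse a table cell such as '18, 90, 106' into a list of integers."""
    return [int(part) for part in cell.split(",") if part.strip()]


def rebuild_abstract(inverted_index):
    """Place every token at each of its positions and join them in order."""
    by_position = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    return " ".join(by_position[p] for p in sorted(by_position))


# First eight tokens, with positions copied from the payload table above.
sample_index = {
    "Large": parse_positions("0"),
    "language": parse_positions("1"),
    "models": parse_positions("2"),
    "(LLMs)": parse_positions("3"),
    "have": parse_positions("4"),
    "achieved": parse_positions("5"),
    "remarkable": parse_positions("6"),
    "success": parse_positions("7"),
}

print(rebuild_abstract(sample_index))
# prints: Large language models (LLMs) have achieved remarkable success
```

Because tokens keep their original punctuation and markup, a full reconstruction from this table would also reproduce fragments such as `\textbf{NTK-Selector},` exactly as they are indexed.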