SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2506.00676
As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, this rapid growth has produced inconsistent evaluations, with varied datasets, metrics, and threat settings, making it difficult to compare safety, utility, and robustness fairly across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit that unifies fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of fine-tuning datasets spanning sentiment analysis, question answering, multi-step reasoning, and open-ended instruction tasks, and supports generating harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: https://github.com/criticalml-uw/SafeTuneBed
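The abstract mentions dataclass-driven configs and an attack success rate evaluator without showing either. The following is a minimal sketch of what that combination might look like, not SafeTuneBed's actual API: the class `FineTuneConfig`, its fields, the dataset and defense labels, and the keyword-based refusal check are all hypothetical and chosen only for illustration. Attack success rate is computed here as the fraction of responses to harmful prompts that are not refusals, which is one common proxy.

```python
from dataclasses import dataclass

# Hypothetical names throughout; none of this is SafeTuneBed's real API.
@dataclass(frozen=True)
class FineTuneConfig:
    dataset: str                    # name of the fine-tuning dataset
    harmful_fraction: float = 0.0   # share of harmful examples mixed into the split
    defense: str = "none"           # label for the defense under evaluation
    metrics: tuple = ("attack_success_rate",)

# Assumed refusal phrases; a real evaluator would likely use a stronger judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i am unable")

def is_refusal(response: str) -> bool:
    """Return True if the response contains any assumed refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(responses) -> float:
    """Fraction of responses to harmful prompts that are NOT refusals."""
    if not responses:
        raise ValueError("need at least one response")
    successes = sum(1 for r in responses if not is_refusal(r))
    return successes / len(responses)

config = FineTuneConfig(dataset="example_sentiment", harmful_fraction=0.1,
                        defense="example_defense")
responses = [
    "I cannot help with that request.",
    "Sure, here is how to do it...",
    "I won't assist with this.",
    "Here are the steps you asked for.",
]
asr = attack_success_rate(responses)
print(config.defense, asr)
```

Freezing the dataclass keeps a run's configuration immutable once created, which is one simple way to support the reproducibility goal the abstract describes.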
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2506.00676, https://arxiv.org/pdf/2506.00676
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414891791
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414891791 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2506.00676 (Digital Object Identifier)
- Title: SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-31 (full publication date if available)
- Authors: Saad Hossain, S Kant Vajpayee, Sirisha Rambhatla (list of authors in order)
- Landing page: https://arxiv.org/abs/2506.00676 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2506.00676 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2506.00676 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414891791 |
| doi | https://doi.org/10.48550/arxiv.2506.00676 |
| ids.doi | https://doi.org/10.48550/arxiv.2506.00676 |
| ids.openalex | https://openalex.org/W4414891791 |
| fwci | |
| type | preprint |
| title | SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T13295 |
| topics[0].field.id | https://openalex.org/fields/22 |
| topics[0].field.display_name | Engineering |
| topics[0].score | 0.8007000088691711 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/2213 |
| topics[0].subfield.display_name | Safety, Risk, Reliability and Quality |
| topics[0].display_name | Safety Systems Engineering in Autonomy |
| topics[1].id | https://openalex.org/T11357 |
| topics[1].field.id | https://openalex.org/fields/18 |
| topics[1].field.display_name | Decision Sciences |
| topics[1].score | 0.6832000017166138 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1804 |
| topics[1].subfield.display_name | Statistics, Probability and Uncertainty |
| topics[1].display_name | Risk and Safety Analysis |
| topics[2].id | https://openalex.org/T13999 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.6823999881744385 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1710 |
| topics[2].subfield.display_name | Information Systems |
| topics[2].display_name | Digital Rights Management and Security |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2506.00676 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2506.00676 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2506.00676 |
| locations[1].id | doi:10.48550/arxiv.2506.00676 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2506.00676 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5012642050 |
| authorships[0].author.orcid | https://orcid.org/0009-0006-9844-8437 |
| authorships[0].author.display_name | Saad Hossain |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Hossain, Saad |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5074111404 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | S Kant Vajpayee |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Vajpayee, Samanvay |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5018625427 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9389-727X |
| authorships[2].author.display_name | Sirisha Rambhatla |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Rambhatla, Sirisha |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2506.00676 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T13295 |
| primary_topic.field.id | https://openalex.org/fields/22 |
| primary_topic.field.display_name | Engineering |
| primary_topic.score | 0.8007000088691711 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/2213 |
| primary_topic.subfield.display_name | Safety, Risk, Reliability and Quality |
| primary_topic.display_name | Safety Systems Engineering in Autonomy |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2506.00676 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2506.00676 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2506.00676 |
| primary_location.id | pmh:oai:arXiv.org:2506.00676 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2506.00676 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2506.00676 |
| publication_date | 2025-05-31 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 53, 65 |
| abstract_inverted_index.As | 0 |
| abstract_inverted_index.By | 157 |
| abstract_inverted_index.To | 47 |
| abstract_inverted_index.We | 143 |
| abstract_inverted_index.by | 147 |
| abstract_inverted_index.in | 27, 178 |
| abstract_inverted_index.is | 164, 183 |
| abstract_inverted_index.it | 36 |
| abstract_inverted_index.of | 19, 68, 87, 93, 169 |
| abstract_inverted_index.on | 118 |
| abstract_inverted_index.to | 38, 129, 172 |
| abstract_inverted_index.we | 50 |
| abstract_inverted_index.(i) | 63 |
| abstract_inverted_index.LLM | 180 |
| abstract_inverted_index.and | 10, 21, 32, 43, 55, 59, 78, 82, 101, 104, 115, 122, 136, 155, 161, 175 |
| abstract_inverted_index.any | 131 |
| abstract_inverted_index.at: | 185 |
| abstract_inverted_index.for | 84, 108 |
| abstract_inverted_index.its | 145, 170 |
| abstract_inverted_index.the | 17, 85, 165 |
| abstract_inverted_index.(ii) | 90 |
| abstract_inverted_index.Code | 182 |
| abstract_inverted_index.code | 128 |
| abstract_inverted_index.have | 13, 25 |
| abstract_inverted_index.kind | 171 |
| abstract_inverted_index.safe | 179 |
| abstract_inverted_index.(iii) | 105 |
| abstract_inverted_index.Built | 117 |
| abstract_inverted_index.code, | 160 |
| abstract_inverted_index.data, | 159 |
| abstract_inverted_index.first | 166 |
| abstract_inverted_index.large | 1 |
| abstract_inverted_index.rate, | 112 |
| abstract_inverted_index.their | 22 |
| abstract_inverted_index.this, | 49 |
| abstract_inverted_index.value | 146 |
| abstract_inverted_index.while | 139 |
| abstract_inverted_index.(LLMs) | 4 |
| abstract_inverted_index.across | 45, 151 |
| abstract_inverted_index.allows | 83 |
| abstract_inverted_index.become | 5 |
| abstract_inverted_index.fairly | 39 |
| abstract_inverted_index.metric | 137 |
| abstract_inverted_index.models | 3 |
| abstract_inverted_index.number | 18 |
| abstract_inverted_index.recent | 23 |
| abstract_inverted_index.safety | 109 |
| abstract_inverted_index.suite, | 138 |
| abstract_inverted_index.tasks, | 81 |
| abstract_inverted_index.tasks. | 156 |
| abstract_inverted_index.threat | 34 |
| abstract_inverted_index.varied | 152 |
| abstract_inverted_index.(attack | 110 |
| abstract_inverted_index.address | 48 |
| abstract_inverted_index.compare | 40 |
| abstract_inverted_index.configs | 121 |
| abstract_inverted_index.curates | 64 |
| abstract_inverted_index.defense | 60, 134 |
| abstract_inverted_index.diverse | 28, 66 |
| abstract_inverted_index.enables | 91 |
| abstract_inverted_index.focused | 167 |
| abstract_inverted_index.method, | 135 |
| abstract_inverted_index.methods | 9 |
| abstract_inverted_index.minimal | 126 |
| abstract_inverted_index.refusal | 113 |
| abstract_inverted_index.regime, | 133 |
| abstract_inverted_index.repair; | 103 |
| abstract_inverted_index.safety, | 41 |
| abstract_inverted_index.specify | 130 |
| abstract_inverted_index.splits; | 89 |
| abstract_inverted_index.success | 111 |
| abstract_inverted_index.toolkit | 56, 168 |
| abstract_inverted_index.However, | 16 |
| abstract_inverted_index.datasets | 71 |
| abstract_inverted_index.defenses | 12, 150 |
| abstract_inverted_index.ensuring | 140 |
| abstract_inverted_index.increase | 24 |
| abstract_inverted_index.language | 2 |
| abstract_inverted_index.methods. | 46 |
| abstract_inverted_index.metrics, | 31, 162 |
| abstract_inverted_index.multiple | 69 |
| abstract_inverted_index.plugins, | 123 |
| abstract_inverted_index.provides | 106 |
| abstract_inverted_index.rapidly. | 15 |
| abstract_inverted_index.requires | 125 |
| abstract_inverted_index.research | 177 |
| abstract_inverted_index.resulted | 26 |
| abstract_inverted_index.rigorous | 174 |
| abstract_inverted_index.showcase | 144 |
| abstract_inverted_index.spanning | 72 |
| abstract_inverted_index.unifying | 57 |
| abstract_inverted_index.utility, | 42 |
| abstract_inverted_index.utility. | 116 |
| abstract_inverted_index.analysis, | 74 |
| abstract_inverted_index.available | 184 |
| abstract_inverted_index.benchmark | 54 |
| abstract_inverted_index.datasets, | 30 |
| abstract_inverted_index.defenses, | 95 |
| abstract_inverted_index.difficult | 37 |
| abstract_inverted_index.including | 96 |
| abstract_inverted_index.introduce | 51 |
| abstract_inverted_index.poisoning | 153 |
| abstract_inverted_index.scenarios | 154 |
| abstract_inverted_index.sentiment | 73 |
| abstract_inverted_index.accelerate | 173 |
| abstract_inverted_index.additional | 127 |
| abstract_inverted_index.approaches | 20 |
| abstract_inverted_index.comparable | 176 |
| abstract_inverted_index.end-to-end | 141 |
| abstract_inverted_index.evaluators | 107 |
| abstract_inverted_index.generation | 86 |
| abstract_inverted_index.multi-step | 76 |
| abstract_inverted_index.open-ended | 79 |
| abstract_inverted_index.reasoning, | 77 |
| abstract_inverted_index.repository | 67 |
| abstract_inverted_index.robustness | 44 |
| abstract_inverted_index.SafeTuneBed | 62, 124, 163 |
| abstract_inverted_index.evaluation. | 61 |
| abstract_inverted_index.fine-tuning | 8, 58, 70, 132 |
| abstract_inverted_index.in-training | 99 |
| abstract_inverted_index.instruction | 80 |
| abstract_inverted_index.integration | 92 |
| abstract_inverted_index.post-tuning | 102 |
| abstract_inverted_index.safeguards, | 100 |
| abstract_inverted_index.ubiquitous, | 6 |
| abstract_inverted_index.SafeTuneBed, | 52 |
| abstract_inverted_index.benchmarking | 148 |
| abstract_inverted_index.consistency) | 114 |
| abstract_inverted_index.fine-tuning. | 181 |
| abstract_inverted_index.inconsistent | 33 |
| abstract_inverted_index.proliferated | 14 |
| abstract_inverted_index.safety-first | 11 |
| abstract_inverted_index.Python-first, | 119 |
| abstract_inverted_index.immunization, | 98 |
| abstract_inverted_index.standardizing | 158 |
| abstract_inverted_index.representative | 149 |
| abstract_inverted_index.alignment-stage | 97 |
| abstract_inverted_index.harmful-variant | 88 |
| abstract_inverted_index.settings-making | 35 |
| abstract_inverted_index.dataclass-driven | 120 |
| abstract_inverted_index.reproducibility. | 142 |
| abstract_inverted_index.state-of-the-art | 94 |
| abstract_inverted_index.evaluations-varied | 29 |
| abstract_inverted_index.parameter-efficient | 7 |
| abstract_inverted_index.question-answering, | 75 |
| abstract_inverted_index.https://github.com/criticalml-uw/SafeTuneBed | 186 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
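
The `abstract_inverted_index.*` rows in the table above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows; the function name `rebuild_abstract` is an assumption for illustration, and the sample dictionary contains only the first seven tokens from the table.

```python
def rebuild_abstract(inverted_index: dict) -> str:
    """Rebuild text from a {token: [positions]} inverted index."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))

# First seven tokens, copied from the abstract_inverted_index rows.
sample = {
    "As": [0],
    "large": [1],
    "language": [2],
    "models": [3],
    "(LLMs)": [4],
    "become": [5],
    "ubiquitous,": [6],
}
text = rebuild_abstract(sample)
print(text)
```

Tokens that occur more than once, such as `and` in the table, simply list several positions, and the same loop places them at each one.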