Language Models That Walk the Talk: A Framework for Formal Fairness Certificates
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2505.12767
As large language models become integral to high-stakes applications, ensuring their robustness and fairness is critical. Despite their success, large language models remain vulnerable to adversarial attacks, where small perturbations, such as synonym substitutions, can alter model predictions, posing risks in fairness-critical areas, such as gender bias mitigation, and safety-critical areas, such as toxicity detection. While formal verification has been explored for neural networks, its application to large language models remains limited. This work presents a holistic verification framework to certify the robustness of transformer-based language models, with a focus on ensuring gender fairness and consistent outputs across different gender-related terms. Furthermore, we extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic inputs are consistently detected and appropriately censored, thereby ensuring the reliability of moderation systems. By formalizing robustness within the embedding space, this work strengthens the reliability of language models in ethical AI deployment and content moderation.
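To make the idea of certifying consistent outputs across gender-related terms in the embedding space more concrete, here is a minimal sketch. It is not the authors' framework: it swaps in plain interval arithmetic over a single linear scoring layer, and the embeddings, weights, bias, and function names are all hypothetical values chosen for illustration. The sketch encloses the embeddings of the substituted terms in an axis-aligned box and checks whether the sign of the score can change anywhere inside that box.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def enclosing_box(vectors: Sequence[Vector]) -> Tuple[Vector, Vector]:
    """Smallest axis-aligned box containing all given embedding vectors."""
    lower = [min(column) for column in zip(*vectors)]
    upper = [max(column) for column in zip(*vectors)]
    return lower, upper


def linear_bounds(
    weights: Vector, bias: float, lower: Vector, upper: Vector
) -> Tuple[float, float]:
    """Sound bounds on w.x + b for every x inside the box [lower, upper]."""
    lo = hi = bias
    for w, l, u in zip(weights, lower, upper):
        if w >= 0:
            lo += w * l
            hi += w * u
        else:
            # A negative weight flips which end of the interval is extreme.
            lo += w * u
            hi += w * l
    return lo, hi


def certify_same_label(
    weights: Vector, bias: float, vectors: Sequence[Vector]
) -> bool:
    """True if the score provably keeps one sign over the whole box."""
    lower, upper = enclosing_box(vectors)
    lo, hi = linear_bounds(weights, bias, lower, upper)
    return lo > 0 or hi < 0


# Hypothetical 3-dimensional embeddings of two gender-related terms.
he = [0.20, 0.90, -0.10]
she = [0.25, 0.85, 0.10]

# Hypothetical linear classifier (assumption: a positive score means class A).
weights = [1.0, 2.0, -0.5]

print(certify_same_label(weights, -1.0, [he, she]))
print(certify_same_label(weights, -2.0, [he, she]))
```

A `True` result is a certificate for this toy model only: no point in the box, including the two term embeddings, can flip the predicted class. A `False` result is inconclusive, because the box over-approximates the set of actual embeddings.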
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.12767, https://arxiv.org/pdf/2505.12767
- OA Status: green
- OpenAlex ID: https://openalex.org/W4417302136
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4417302136 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.12767 (Digital Object Identifier)
- Title: Language Models That Walk the Talk: A Framework for Formal Fairness Certificates (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-19 (full publication date if available)
- Authors: D.-J. Chen, Tobias Ladner, Ahmed Rayen Mhadhbi, Matthias Althoff (list of authors in order)
- Landing page: https://arxiv.org/abs/2505.12767 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.12767 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.12767 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4417302136 |
| doi | https://doi.org/10.48550/arxiv.2505.12767 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.12767 |
| ids.openalex | https://openalex.org/W4417302136 |
| fwci | |
| type | preprint |
| title | Language Models That Walk the Talk: A Framework for Formal Fairness Certificates |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.12767 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.12767 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.12767 |
| locations[1].id | pmh:oai:mediatum.ub.tum.de:node/1781914 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400453 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | False |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | mediaTUM – the media and publications repository of the Technical University Munich (Technical University Munich) |
| locations[1].source.host_organization | https://openalex.org/I62916508 |
| locations[1].source.host_organization_name | Technical University of Munich |
| locations[1].source.host_organization_lineage | https://openalex.org/I62916508 |
| locations[1].license | other-oa |
| locations[1].pdf_url | |
| locations[1].version | submittedVersion |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/other-oa |
| locations[1].is_accepted | False |
| locations[1].is_published | False |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://mediatum.ub.tum.de/1781914 |
| locations[2].id | doi:10.48550/arxiv.2505.12767 |
| locations[2].is_oa | True |
| locations[2].source.id | https://openalex.org/S4306400194 |
| locations[2].source.issn | |
| locations[2].source.type | repository |
| locations[2].source.is_oa | True |
| locations[2].source.issn_l | |
| locations[2].source.is_core | False |
| locations[2].source.is_in_doaj | False |
| locations[2].source.display_name | arXiv (Cornell University) |
| locations[2].source.host_organization | https://openalex.org/I205783295 |
| locations[2].source.host_organization_name | Cornell University |
| locations[2].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[2].license | cc-by |
| locations[2].pdf_url | |
| locations[2].version | |
| locations[2].raw_type | article |
| locations[2].license_id | https://openalex.org/licenses/cc-by |
| locations[2].is_accepted | False |
| locations[2].is_published | |
| locations[2].raw_source_name | |
| locations[2].landing_page_url | https://doi.org/10.48550/arxiv.2505.12767 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5031489802 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | D.-J. Chen |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Chen, Danqing |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5102710294 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-4556-8308 |
| authorships[1].author.display_name | Tobias Ladner |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Ladner, Tobias |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5120793759 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Ahmed Rayen Mhadhbi |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Mhadhbi, Ahmed Rayen |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5005383495 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-3733-842X |
| authorships[3].author.display_name | Matthias Althoff |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Althoff, Matthias |
| authorships[3].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.12767 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Language Models That Walk the Talk: A Framework for Formal Fairness Certificates |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-12-13T21:47:13.963165 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 3 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.12767 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.12767 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.12767 |
| primary_location.id | pmh:oai:arXiv.org:2505.12767 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.12767 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.12767 |
| publication_date | 2025-05-19 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 75, 88 |
| abstract_inverted_index.AI | 147 |
| abstract_inverted_index.As | 0 |
| abstract_inverted_index.By | 130 |
| abstract_inverted_index.as | 31, 44, 52 |
| abstract_inverted_index.in | 40, 145 |
| abstract_inverted_index.is | 14 |
| abstract_inverted_index.of | 83, 127, 142 |
| abstract_inverted_index.on | 90 |
| abstract_inverted_index.to | 6, 24, 66, 79, 106 |
| abstract_inverted_index.we | 102 |
| abstract_inverted_index.and | 12, 48, 94, 120, 149 |
| abstract_inverted_index.are | 117 |
| abstract_inverted_index.can | 34 |
| abstract_inverted_index.for | 61 |
| abstract_inverted_index.has | 58 |
| abstract_inverted_index.its | 64 |
| abstract_inverted_index.the | 81, 125, 134, 140 |
| abstract_inverted_index.This | 72 |
| abstract_inverted_index.been | 59 |
| abstract_inverted_index.bias | 46 |
| abstract_inverted_index.such | 30, 43, 51 |
| abstract_inverted_index.that | 112 |
| abstract_inverted_index.this | 104, 137 |
| abstract_inverted_index.with | 87 |
| abstract_inverted_index.work | 73, 138 |
| abstract_inverted_index.While | 55 |
| abstract_inverted_index.alter | 35 |
| abstract_inverted_index.focus | 89 |
| abstract_inverted_index.large | 1, 19, 67 |
| abstract_inverted_index.model | 36 |
| abstract_inverted_index.risks | 39 |
| abstract_inverted_index.small | 28 |
| abstract_inverted_index.their | 10, 17 |
| abstract_inverted_index.toxic | 115 |
| abstract_inverted_index.where | 27 |
| abstract_inverted_index.across | 97 |
| abstract_inverted_index.areas, | 42, 50 |
| abstract_inverted_index.become | 4 |
| abstract_inverted_index.extend | 103 |
| abstract_inverted_index.formal | 56, 110 |
| abstract_inverted_index.gender | 45, 92 |
| abstract_inverted_index.inputs | 116 |
| abstract_inverted_index.models | 3, 21, 69, 144 |
| abstract_inverted_index.neural | 62 |
| abstract_inverted_index.posing | 38 |
| abstract_inverted_index.remain | 22 |
| abstract_inverted_index.space, | 136 |
| abstract_inverted_index.terms. | 100 |
| abstract_inverted_index.within | 133 |
| abstract_inverted_index.Despite | 16 |
| abstract_inverted_index.certify | 80 |
| abstract_inverted_index.content | 150 |
| abstract_inverted_index.ethical | 146 |
| abstract_inverted_index.models, | 86 |
| abstract_inverted_index.outputs | 96 |
| abstract_inverted_index.remains | 70 |
| abstract_inverted_index.synonym | 32 |
| abstract_inverted_index.thereby | 123 |
| abstract_inverted_index.attacks, | 26 |
| abstract_inverted_index.detected | 119 |
| abstract_inverted_index.ensuring | 9, 91, 124 |
| abstract_inverted_index.explored | 60 |
| abstract_inverted_index.fairness | 13, 93 |
| abstract_inverted_index.holistic | 76 |
| abstract_inverted_index.integral | 5 |
| abstract_inverted_index.language | 2, 20, 68, 85, 143 |
| abstract_inverted_index.limited. | 71 |
| abstract_inverted_index.offering | 109 |
| abstract_inverted_index.presents | 74 |
| abstract_inverted_index.success, | 18 |
| abstract_inverted_index.systems. | 129 |
| abstract_inverted_index.toxicity | 53, 107 |
| abstract_inverted_index.censored, | 122 |
| abstract_inverted_index.critical. | 15 |
| abstract_inverted_index.different | 98 |
| abstract_inverted_index.embedding | 135 |
| abstract_inverted_index.framework | 78 |
| abstract_inverted_index.networks, | 63 |
| abstract_inverted_index.consistent | 95 |
| abstract_inverted_index.deployment | 148 |
| abstract_inverted_index.detection, | 108 |
| abstract_inverted_index.detection. | 54 |
| abstract_inverted_index.guarantees | 111 |
| abstract_inverted_index.moderation | 128 |
| abstract_inverted_index.robustness | 11, 82, 132 |
| abstract_inverted_index.vulnerable | 23 |
| abstract_inverted_index.adversarial | 25 |
| abstract_inverted_index.application | 65 |
| abstract_inverted_index.formalizing | 131 |
| abstract_inverted_index.high-stakes | 7 |
| abstract_inverted_index.manipulated | 114 |
| abstract_inverted_index.methodology | 105 |
| abstract_inverted_index.mitigation, | 47 |
| abstract_inverted_index.moderation. | 151 |
| abstract_inverted_index.reliability | 126, 141 |
| abstract_inverted_index.strengthens | 139 |
| abstract_inverted_index.Furthermore, | 101 |
| abstract_inverted_index.consistently | 118 |
| abstract_inverted_index.predictions, | 37 |
| abstract_inverted_index.verification | 57, 77 |
| abstract_inverted_index.adversarially | 113 |
| abstract_inverted_index.applications, | 8 |
| abstract_inverted_index.appropriately | 121 |
| abstract_inverted_index.gender-related | 99 |
| abstract_inverted_index.perturbations, | 29 |
| abstract_inverted_index.substitutions, | 33 |
| abstract_inverted_index.safety-critical | 49 |
| abstract_inverted_index.fairness-critical | 41 |
| abstract_inverted_index.transformer-based | 84 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile | |
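
The `abstract_inverted_index.*` rows in the payload above map each word of the abstract to the zero-based positions where it occurs. The following minimal sketch shows how such a mapping can be turned back into running text. The function name `rebuild_abstract` is hypothetical, and the sample dictionary is only an excerpt: it uses the first listed position of seven words from the table, not the full index.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions mapping."""
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Join the words in position order; missing positions are simply skipped.
    return " ".join(word_at[position] for position in sorted(word_at))


# Excerpt of the payload rows above (first positions only).
sample = {
    "As": [0],
    "large": [1],
    "language": [2],
    "models": [3],
    "become": [4],
    "integral": [5],
    "to": [6],
}

print(rebuild_abstract(sample))  # prints "As large language models become integral to"
```

Passing the complete index (all 152 positions, 0 through 151) would yield the full abstract shown near the top of this record.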