Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2401.08491
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective. Central to our method is the construction of hard negatives: toxic outputs generated through adversarial paraphrasing to be semantically similar, and close in model probability, to their non-toxic counterparts. By training on these challenging and realistic pairs, our approach ensures robust and stable contrastive optimization. Experimental results in the domain of detoxification demonstrate that our method significantly reduces toxic generation while maintaining strong performance on downstream tasks such as commonsense reasoning and reading comprehension. Our findings highlight the effectiveness of exploiting hard negatives for attribute-aware fine-tuning.
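The abstract describes the objective only at a high level. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes an InfoNCE-style loss in which each sequence's score is its negative perplexity divided by a temperature `tau`. It also assumes hard negatives are chosen by thresholding a precomputed semantic-similarity score and a relative perplexity gap against the non-toxic reference. The names `Candidate`, `select_hard_negatives`, `contrastive_perplexity_loss` and the threshold values are hypothetical. The prototype component of the method is omitted, and a real fine-tuning loop would compute the same quantities on differentiable tensors rather than Python floats.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(mean NLL)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


@dataclass
class Candidate:
    """Hypothetical container for one adversarial paraphrase candidate."""
    text: str
    is_toxic: bool
    similarity: float  # assumed semantic similarity to the non-toxic reference, in [0, 1]
    ppl: float         # model perplexity of the candidate


def select_hard_negatives(reference_ppl: float, candidates: List[Candidate],
                          min_similarity: float = 0.8,
                          max_rel_ppl_gap: float = 0.2) -> List[Candidate]:
    """Keep toxic candidates that are close in meaning to the reference and
    whose perplexity lies within a relative gap of the reference perplexity.
    Both thresholds are illustrative assumptions."""
    hard = []
    for c in candidates:
        close_in_meaning = c.similarity >= min_similarity
        close_in_ppl = abs(c.ppl - reference_ppl) / reference_ppl <= max_rel_ppl_gap
        if c.is_toxic and close_in_meaning and close_in_ppl:
            hard.append(c)
    return hard


def _logsumexp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def contrastive_perplexity_loss(pos_ppl: Sequence[float],
                                neg_ppl: Sequence[float],
                                tau: float = 1.0) -> float:
    """InfoNCE-style loss using -PPL / tau as each sequence's score.
    It is always >= 0 and shrinks as positives (non-toxic) reach lower
    perplexity than negatives (toxic)."""
    if not pos_ppl or not neg_ppl:
        raise ValueError("need at least one positive and one negative")
    pos = [-p / tau for p in pos_ppl]
    neg = [-p / tau for p in neg_ppl]
    # -log( sum_pos exp(s) / sum_all exp(s) )
    return _logsumexp(pos + neg) - _logsumexp(pos)
```

With `tau = 1`, a positive at perplexity 2 against a negative at perplexity 10 gives a loss of log(1 + e^-8), close to zero. Swapping the two gives roughly 8. Lowering `tau` sharpens this contrast.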
- Title: Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
- Authors: Tassilo Klein, Moin Nabi
- Type: preprint
- Language: en
- Publication date: 2024-01-16
- DOI: https://doi.org/10.48550/arxiv.2401.08491
- OpenAlex ID: https://openalex.org/W4390963085
- Landing page: https://arxiv.org/abs/2401.08491
- PDF: https://arxiv.org/pdf/2401.08491
- Open access: yes (OA status: green; OA URL: https://arxiv.org/pdf/2401.08491)
- Concepts (OpenAlex): Perplexity, Computer science, Leverage (statistics), Artificial intelligence, Comprehension, Natural language processing, Language model, Programming language
- Cited by: 1 (2024: 1)
- Related works (OpenAlex): 10
Full payload
| id | https://openalex.org/W4390963085 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2401.08491 |
| ids.doi | https://doi.org/10.48550/arxiv.2401.08491 |
| ids.openalex | https://openalex.org/W4390963085 |
| fwci | |
| type | preprint |
| title | Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.998199999332428 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T10028 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9975000023841858 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Topic Modeling |
| topics[2].id | https://openalex.org/T13629 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9850000143051147 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Text Readability and Simplification |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C100279451 |
| concepts[0].level | 3 |
| concepts[0].score | 0.9122321009635925 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q372193 |
| concepts[0].display_name | Perplexity |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.7518483400344849 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C153083717 |
| concepts[2].level | 2 |
| concepts[2].score | 0.7410998344421387 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q6535263 |
| concepts[2].display_name | Leverage (statistics) |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5748544335365295 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C511192102 |
| concepts[4].level | 2 |
| concepts[4].score | 0.500415563583374 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q5156948 |
| concepts[4].display_name | Comprehension |
| concepts[5].id | https://openalex.org/C204321447 |
| concepts[5].level | 1 |
| concepts[5].score | 0.49357879161834717 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[5].display_name | Natural language processing |
| concepts[6].id | https://openalex.org/C137293760 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4545939266681671 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[6].display_name | Language model |
| concepts[7].id | https://openalex.org/C199360897 |
| concepts[7].level | 1 |
| concepts[7].score | 0.11714458465576172 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[7].display_name | Programming language |
| keywords[0].id | https://openalex.org/keywords/perplexity |
| keywords[0].score | 0.9122321009635925 |
| keywords[0].display_name | Perplexity |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.7518483400344849 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/leverage |
| keywords[2].score | 0.7410998344421387 |
| keywords[2].display_name | Leverage (statistics) |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.5748544335365295 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/comprehension |
| keywords[4].score | 0.500415563583374 |
| keywords[4].display_name | Comprehension |
| keywords[5].id | https://openalex.org/keywords/natural-language-processing |
| keywords[5].score | 0.49357879161834717 |
| keywords[5].display_name | Natural language processing |
| keywords[6].id | https://openalex.org/keywords/language-model |
| keywords[6].score | 0.4545939266681671 |
| keywords[6].display_name | Language model |
| keywords[7].id | https://openalex.org/keywords/programming-language |
| keywords[7].score | 0.11714458465576172 |
| keywords[7].display_name | Programming language |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2401.08491 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2401.08491 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2401.08491 |
| locations[1].id | doi:10.48550/arxiv.2401.08491 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2401.08491 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5023876634 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-0631-2940 |
| authorships[0].author.display_name | Tassilo Klein |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Klein, Tassilo |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5001459748 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-7559-9888 |
| authorships[1].author.display_name | Moin Nabi |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Nabi, Moin |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2401.08491 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.998199999332428 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W2169518243, https://openalex.org/W2252095989, https://openalex.org/W4322096525, https://openalex.org/W2551914602, https://openalex.org/W4281893144, https://openalex.org/W2105076537, https://openalex.org/W2787311093, https://openalex.org/W2084531783, https://openalex.org/W2902731467, https://openalex.org/W2020757772 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2024 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2401.08491 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2401.08491 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2401.08491 |
| primary_location.id | pmh:oai:arXiv.org:2401.08491 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2401.08491 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2401.08491 |
| publication_date | 2024-01-16 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.8799999952316284 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |