Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

Qiming Bao, G. Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neşet Tan, Yang Chen, Michael Witbrock, Jiamou Liu

2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2310.09430

Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning have not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of LLMs' reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve the models' performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.
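The three subsets described in the abstract can be illustrated with a short sketch. This is not the authors' implementation (their code is in the repository linked above); the `Item` record, the function names, and the choice to shuffle before substituting in the combined variant are all assumptions made for illustration. Only the replacement string is taken from the abstract.

```python
import random
from dataclasses import dataclass

# Replacement wording quoted in the abstract.
NONE_OPTION = "none of the other options is correct"


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice record: context, question, options, gold index."""
    context: str
    question: str
    options: tuple[str, ...]
    label: int


def shuffle_options(item: Item, rng: random.Random) -> Item:
    """Subset 1: randomly reorder the options and follow the gold answer to its new slot."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    options = tuple(item.options[i] for i in order)
    return Item(item.context, item.question, options, order.index(item.label))


def replace_correct(item: Item) -> Item:
    """Subset 2: overwrite the gold option's text; the gold index is unchanged."""
    options = tuple(
        NONE_OPTION if i == item.label else text
        for i, text in enumerate(item.options)
    )
    return Item(item.context, item.question, options, item.label)


def shuffle_and_replace(item: Item, rng: random.Random) -> Item:
    """Subset 3: both perturbations (the order of application is an assumption)."""
    return replace_correct(shuffle_options(item, rng))
```

Passing a seeded `random.Random` instance keeps the perturbed sets reproducible. In each variant the gold index still points at the right answer, so the same accuracy metric applies to the original and perturbed items.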

Related Topics

Computer Science

Artificial Intelligence

Concepts

Computer science, Robustness (evolution), Discriminative model, Artificial intelligence, Logical reasoning, Machine learning, Logical consequence, Generative grammar, Natural language processing, Biochemistry, Gene, Chemistry

Metadata

Type: preprint
Language: en
Landing Page: http://arxiv.org/abs/2310.09430
PDF: https://arxiv.org/pdf/2310.09430
OA Status: green
Cited By: 4
Related Works: 10
OpenAlex ID: https://openalex.org/W4387764395

All OpenAlex metadata

OpenAlex ID: https://openalex.org/W4387764395
Canonical identifier for this work in OpenAlex

DOI: https://doi.org/10.48550/arxiv.2310.09430
Digital Object Identifier

Title: Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
Work title

Type: preprint
OpenAlex work type

Language: en
Primary language

Publication year: 2023
Year of publication

Publication date: 2023-10-13
Full publication date if available

Authors: Qiming Bao, G. Gendron, Alex Yuxuan Peng, Wanjun Zhong, Neşet Tan, Yang Chen, Michael Witbrock, Jiamou Liu
List of authors in order

Landing page: https://arxiv.org/abs/2310.09430
Publisher landing page

PDF URL: https://arxiv.org/pdf/2310.09430
Direct link to full text PDF

Open access: Yes
Whether a free full text is available

OA status: green
Open access status per OpenAlex

OA URL: https://arxiv.org/pdf/2310.09430
Direct OA link when available

Concepts: Computer science, Robustness (evolution), Discriminative model, Artificial intelligence, Logical reasoning, Machine learning, Logical consequence, Generative grammar, Natural language processing, Biochemistry, Gene, Chemistry
Top concepts (fields/topics) attached by OpenAlex

Cited by: 4
Total citation count in OpenAlex

Citations by year (recent): 2024: 3, 2023: 1
Per-year citation counts (last 5 years)

Related works (count): 10
Other works algorithmically related by OpenAlex

Full payload

id	https://openalex.org/W4387764395
doi	https://doi.org/10.48550/arxiv.2310.09430
ids.doi	https://doi.org/10.48550/arxiv.2310.09430
ids.openalex	https://openalex.org/W4387764395
fwci
type	preprint
title	Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
biblio.issue
biblio.volume
biblio.last_page
biblio.first_page
topics[0].id	https://openalex.org/T10028
topics[0].field.id	https://openalex.org/fields/17
topics[0].field.display_name	Computer Science
topics[0].score	0.993399977684021
topics[0].domain.id	https://openalex.org/domains/3
topics[0].domain.display_name	Physical Sciences
topics[0].subfield.id	https://openalex.org/subfields/1702
topics[0].subfield.display_name	Artificial Intelligence
topics[0].display_name	Topic Modeling
topics[1].id	https://openalex.org/T12535
topics[1].field.id	https://openalex.org/fields/17
topics[1].field.display_name	Computer Science
topics[1].score	0.9757999777793884
topics[1].domain.id	https://openalex.org/domains/3
topics[1].domain.display_name	Physical Sciences
topics[1].subfield.id	https://openalex.org/subfields/1702
topics[1].subfield.display_name	Artificial Intelligence
topics[1].display_name	Machine Learning and Data Classification
topics[2].id	https://openalex.org/T10181
topics[2].field.id	https://openalex.org/fields/17
topics[2].field.display_name	Computer Science
topics[2].score	0.9757999777793884
topics[2].domain.id	https://openalex.org/domains/3
topics[2].domain.display_name	Physical Sciences
topics[2].subfield.id	https://openalex.org/subfields/1702
topics[2].subfield.display_name	Artificial Intelligence
topics[2].display_name	Natural Language Processing Techniques
is_xpac	False
apc_list
apc_paid
concepts[0].id	https://openalex.org/C41008148
concepts[0].level	0
concepts[0].score	0.7824336290359497
concepts[0].wikidata	https://www.wikidata.org/wiki/Q21198
concepts[0].display_name	Computer science
concepts[1].id	https://openalex.org/C63479239
concepts[1].level	3
concepts[1].score	0.71518474817276
concepts[1].wikidata	https://www.wikidata.org/wiki/Q7353546
concepts[1].display_name	Robustness (evolution)
concepts[2].id	https://openalex.org/C97931131
concepts[2].level	2
concepts[2].score	0.5839521288871765
concepts[2].wikidata	https://www.wikidata.org/wiki/Q5282087
concepts[2].display_name	Discriminative model
concepts[3].id	https://openalex.org/C154945302
concepts[3].level	1
concepts[3].score	0.5599006414413452
concepts[3].wikidata	https://www.wikidata.org/wiki/Q11660
concepts[3].display_name	Artificial intelligence
concepts[4].id	https://openalex.org/C43971567
concepts[4].level	2
concepts[4].score	0.53841233253479
concepts[4].wikidata	https://www.wikidata.org/wiki/Q3142865
concepts[4].display_name	Logical reasoning
concepts[5].id	https://openalex.org/C119857082
concepts[5].level	1
concepts[5].score	0.5053778290748596
concepts[5].wikidata	https://www.wikidata.org/wiki/Q2539
concepts[5].display_name	Machine learning
concepts[6].id	https://openalex.org/C134752490
concepts[6].level	2
concepts[6].score	0.4952966272830963
concepts[6].wikidata	https://www.wikidata.org/wiki/Q374182
concepts[6].display_name	Logical consequence
concepts[7].id	https://openalex.org/C39890363
concepts[7].level	2
concepts[7].score	0.42953336238861084
concepts[7].wikidata	https://www.wikidata.org/wiki/Q36108
concepts[7].display_name	Generative grammar
concepts[8].id	https://openalex.org/C204321447
concepts[8].level	1
concepts[8].score	0.36409854888916016
concepts[8].wikidata	https://www.wikidata.org/wiki/Q30642
concepts[8].display_name	Natural language processing
concepts[9].id	https://openalex.org/C55493867
concepts[9].level	1
concepts[9].score	0.0
concepts[9].wikidata	https://www.wikidata.org/wiki/Q7094
concepts[9].display_name	Biochemistry
concepts[10].id	https://openalex.org/C104317684
concepts[10].level	2
concepts[10].score	0.0
concepts[10].wikidata	https://www.wikidata.org/wiki/Q7187
concepts[10].display_name	Gene
concepts[11].id	https://openalex.org/C185592680
concepts[11].level	0
concepts[11].score	0.0
concepts[11].wikidata	https://www.wikidata.org/wiki/Q2329
concepts[11].display_name	Chemistry
keywords[0].id	https://openalex.org/keywords/computer-science
keywords[0].score	0.7824336290359497
keywords[0].display_name	Computer science
keywords[1].id	https://openalex.org/keywords/robustness
keywords[1].score	0.71518474817276
keywords[1].display_name	Robustness (evolution)
keywords[2].id	https://openalex.org/keywords/discriminative-model
keywords[2].score	0.5839521288871765
keywords[2].display_name	Discriminative model
keywords[3].id	https://openalex.org/keywords/artificial-intelligence
keywords[3].score	0.5599006414413452
keywords[3].display_name	Artificial intelligence
keywords[4].id	https://openalex.org/keywords/logical-reasoning
keywords[4].score	0.53841233253479
keywords[4].display_name	Logical reasoning
keywords[5].id	https://openalex.org/keywords/machine-learning
keywords[5].score	0.5053778290748596
keywords[5].display_name	Machine learning
keywords[6].id	https://openalex.org/keywords/logical-consequence
keywords[6].score	0.4952966272830963
keywords[6].display_name	Logical consequence
keywords[7].id	https://openalex.org/keywords/generative-grammar
keywords[7].score	0.42953336238861084
keywords[7].display_name	Generative grammar
keywords[8].id	https://openalex.org/keywords/natural-language-processing
keywords[8].score	0.36409854888916016
keywords[8].display_name	Natural language processing
language	en
locations[0].id	pmh:oai:arXiv.org:2310.09430
locations[0].is_oa	True
locations[0].source.id	https://openalex.org/S4306400194
locations[0].source.issn
locations[0].source.type	repository
locations[0].source.is_oa	True
locations[0].source.issn_l
locations[0].source.is_core	False
locations[0].source.is_in_doaj	False
locations[0].source.display_name	arXiv (Cornell University)
locations[0].source.host_organization	https://openalex.org/I205783295
locations[0].source.host_organization_name	Cornell University
locations[0].source.host_organization_lineage	https://openalex.org/I205783295
locations[0].license
locations[0].pdf_url	https://arxiv.org/pdf/2310.09430
locations[0].version	submittedVersion
locations[0].raw_type	text
locations[0].license_id
locations[0].is_accepted	False
locations[0].is_published	False
locations[0].raw_source_name
locations[0].landing_page_url	http://arxiv.org/abs/2310.09430
locations[1].id	doi:10.48550/arxiv.2310.09430
locations[1].is_oa	True
locations[1].source.id	https://openalex.org/S4306400194
locations[1].source.issn
locations[1].source.type	repository
locations[1].source.is_oa	True
locations[1].source.issn_l
locations[1].source.is_core	False
locations[1].source.is_in_doaj	False
locations[1].source.display_name	arXiv (Cornell University)
locations[1].source.host_organization	https://openalex.org/I205783295
locations[1].source.host_organization_name	Cornell University
locations[1].source.host_organization_lineage	https://openalex.org/I205783295
locations[1].license
locations[1].pdf_url
locations[1].version
locations[1].raw_type	article
locations[1].license_id
locations[1].is_accepted	False
locations[1].is_published
locations[1].raw_source_name
locations[1].landing_page_url	https://doi.org/10.48550/arxiv.2310.09430
indexed_in	arxiv, datacite
authorships[0].author.id	https://openalex.org/A5009635346
authorships[0].author.orcid	https://orcid.org/0000-0002-1000-7383
authorships[0].author.display_name	Qiming Bao
authorships[0].author_position	first
authorships[0].raw_author_name	Bao, Qiming
authorships[0].is_corresponding	False
authorships[1].author.id	https://openalex.org/A5074844252
authorships[1].author.orcid	https://orcid.org/0000-0002-2457-934X
authorships[1].author.display_name	G. Gendron
authorships[1].author_position	middle
authorships[1].raw_author_name	Gendron, Gael
authorships[1].is_corresponding	False
authorships[2].author.id	https://openalex.org/A5049443230
authorships[2].author.orcid	https://orcid.org/0000-0002-9922-5781
authorships[2].author.display_name	Alex Yuxuan Peng
authorships[2].author_position	middle
authorships[2].raw_author_name	Peng, Alex Yuxuan
authorships[2].is_corresponding	False
authorships[3].author.id	https://openalex.org/A5019101763
authorships[3].author.orcid	https://orcid.org/0009-0007-2236-228X
authorships[3].author.display_name	Wanjun Zhong
authorships[3].author_position	middle
authorships[3].raw_author_name	Zhong, Wanjun
authorships[3].is_corresponding	False
authorships[4].author.id	https://openalex.org/A5062975742
authorships[4].author.orcid	https://orcid.org/0000-0001-6201-7295
authorships[4].author.display_name	Neşet Tan
authorships[4].author_position	middle
authorships[4].raw_author_name	Tan, Neset
authorships[4].is_corresponding	False
authorships[5].author.id	https://openalex.org/A5100350503
authorships[5].author.orcid	https://orcid.org/0000-0003-4749-3060
authorships[5].author.display_name	Yang Chen
authorships[5].author_position	middle
authorships[5].raw_author_name	Chen, Yang
authorships[5].is_corresponding	False
authorships[6].author.id	https://openalex.org/A5057995059
authorships[6].author.orcid	https://orcid.org/0000-0002-7554-0971
authorships[6].author.display_name	Michael Witbrock
authorships[6].author_position	middle
authorships[6].raw_author_name	Witbrock, Michael
authorships[6].is_corresponding	False
authorships[7].author.id	https://openalex.org/A5083914998
authorships[7].author.orcid	https://orcid.org/0000-0002-0824-0899
authorships[7].author.display_name	Jiamou Liu
authorships[7].author_position	last
authorships[7].raw_author_name	Liu, Jiamou
authorships[7].is_corresponding	False
has_content.pdf	False
has_content.grobid_xml	False
is_paratext	False
open_access.is_oa	True
open_access.oa_url	https://arxiv.org/pdf/2310.09430
open_access.oa_status	green
open_access.any_repository_has_fulltext	False
created_date	2023-10-20T00:00:00
display_name	Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
has_fulltext	False
is_retracted	False
updated_date	2025-11-06T06:51:31.235846
primary_topic.id	https://openalex.org/T10028
primary_topic.field.id	https://openalex.org/fields/17
primary_topic.field.display_name	Computer Science
primary_topic.score	0.993399977684021
primary_topic.domain.id	https://openalex.org/domains/3
primary_topic.domain.display_name	Physical Sciences
primary_topic.subfield.id	https://openalex.org/subfields/1702
primary_topic.subfield.display_name	Artificial Intelligence
primary_topic.display_name	Topic Modeling
related_works	https://openalex.org/W4389116644, https://openalex.org/W2153315159, https://openalex.org/W3103844505, https://openalex.org/W259157601, https://openalex.org/W4205463238, https://openalex.org/W2761785940, https://openalex.org/W2987280934, https://openalex.org/W4241564561, https://openalex.org/W1510214531, https://openalex.org/W2351976579
cited_by_count	4
counts_by_year[0].year	2024
counts_by_year[0].cited_by_count	3
counts_by_year[1].year	2023
counts_by_year[1].cited_by_count	1
locations_count	2
best_oa_location.id	pmh:oai:arXiv.org:2310.09430
best_oa_location.is_oa	True
best_oa_location.source.id	https://openalex.org/S4306400194
best_oa_location.source.issn
best_oa_location.source.type	repository
best_oa_location.source.is_oa	True
best_oa_location.source.issn_l
best_oa_location.source.is_core	False
best_oa_location.source.is_in_doaj	False
best_oa_location.source.display_name	arXiv (Cornell University)
best_oa_location.source.host_organization	https://openalex.org/I205783295
best_oa_location.source.host_organization_name	Cornell University
best_oa_location.source.host_organization_lineage	https://openalex.org/I205783295
best_oa_location.license
best_oa_location.pdf_url	https://arxiv.org/pdf/2310.09430
best_oa_location.version	submittedVersion
best_oa_location.raw_type	text
best_oa_location.license_id
best_oa_location.is_accepted	False
best_oa_location.is_published	False
best_oa_location.raw_source_name
best_oa_location.landing_page_url	http://arxiv.org/abs/2310.09430
primary_location.id	pmh:oai:arXiv.org:2310.09430
primary_location.is_oa	True
primary_location.source.id	https://openalex.org/S4306400194
primary_location.source.issn
primary_location.source.type	repository
primary_location.source.is_oa	True
primary_location.source.issn_l
primary_location.source.is_core	False
primary_location.source.is_in_doaj	False
primary_location.source.display_name	arXiv (Cornell University)
primary_location.source.host_organization	https://openalex.org/I205783295
primary_location.source.host_organization_name	Cornell University
primary_location.source.host_organization_lineage	https://openalex.org/I205783295
primary_location.license
primary_location.pdf_url	https://arxiv.org/pdf/2310.09430
primary_location.version	submittedVersion
primary_location.raw_type	text
primary_location.license_id
primary_location.is_accepted	False
primary_location.is_published	False
primary_location.raw_source_name
primary_location.landing_page_url	http://arxiv.org/abs/2310.09430
publication_date	2023-10-13
publication_year	2023
referenced_works_count	0
abstract_inverted_index.a	104, 193
abstract_inverted_index.AI	17
abstract_inverted_index.To	42
abstract_inverted_index.We	146
abstract_inverted_index.as	5
abstract_inverted_index.at	212
abstract_inverted_index.by	92
abstract_inverted_index.in	186
abstract_inverted_index.is	98
abstract_inverted_index.of	16, 69, 94, 106
abstract_inverted_index.on	19, 111, 128, 141, 163
abstract_inverted_index.to	25, 65, 195
abstract_inverted_index.we	47, 75, 134, 172
abstract_inverted_index.For	73
abstract_inverted_index.all	137
abstract_inverted_index.and	10, 31, 57, 100, 108, 167, 181, 189, 206
abstract_inverted_index.are	208
abstract_inverted_index.can	157, 183
abstract_inverted_index.for	179, 199
abstract_inverted_index.has	37
abstract_inverted_index.new	50
abstract_inverted_index.not	38
abstract_inverted_index.our	168
abstract_inverted_index.set	156
abstract_inverted_index.the	14, 67, 70, 79, 85, 88, 95, 101, 121, 129, 154, 160, 165
abstract_inverted_index.also	147
abstract_inverted_index.been	39
abstract_inverted_index.both	164, 187
abstract_inverted_index.code	205
abstract_inverted_index.data	177, 207
abstract_inverted_index.find	135
abstract_inverted_index.have	12
abstract_inverted_index.high	126
abstract_inverted_index.into	153
abstract_inverted_index.made	209
abstract_inverted_index.path	194
abstract_inverted_index.show	114, 173
abstract_inverted_index.such	4
abstract_inverted_index.task	151
abstract_inverted_index.that	59, 115, 136, 149, 174
abstract_inverted_index.this	45
abstract_inverted_index.when	33
abstract_inverted_index.with	81, 87, 103
abstract_inverted_index."none	93
abstract_inverted_index.LLM's	71
abstract_inverted_index.Large	0
abstract_inverted_index.each,	74
abstract_inverted_index.first	80
abstract_inverted_index.named	54
abstract_inverted_index.newly	143
abstract_inverted_index.other	96
abstract_inverted_index.tasks	24, 200
abstract_inverted_index.their	29, 125, 197
abstract_inverted_index.these	112, 116, 142
abstract_inverted_index.third	102
abstract_inverted_index.three	49, 77
abstract_inverted_index.GPT-4,	11
abstract_inverted_index.LLaMA,	6
abstract_inverted_index.Source	204
abstract_inverted_index.create	76
abstract_inverted_index.extend	60
abstract_inverted_index.hinder	120
abstract_inverted_index.models	2, 138
abstract_inverted_index.poorly	140
abstract_inverted_index.second	86
abstract_inverted_index.simple	117
abstract_inverted_index.(LLMs),	3
abstract_inverted_index.Alpaca,	7
abstract_inverted_index.Despite	124
abstract_inverted_index.GPT-3.5	9
abstract_inverted_index.Vicuna,	8
abstract_inverted_index.choices	90
abstract_inverted_index.correct	89
abstract_inverted_index.develop	48
abstract_inverted_index.enhance	184
abstract_inverted_index.greatly	119
abstract_inverted_index.improve	159
abstract_inverted_index.levels.	27
abstract_inverted_index.logical	35, 51, 62, 202
abstract_inverted_index.model's	161
abstract_inverted_index.models'	122
abstract_inverted_index.models,	191
abstract_inverted_index.natural	21
abstract_inverted_index.options	97
abstract_inverted_index.perform	139
abstract_inverted_index.systems	18
abstract_inverted_index.various	20
abstract_inverted_index.Finally,	171
abstract_inverted_index.However,	28
abstract_inverted_index.ability,	46
abstract_inverted_index.advanced	13
abstract_inverted_index.applying	175
abstract_inverted_index.datasets	53, 64, 113
abstract_inverted_index.evaluate	44, 66
abstract_inverted_index.language	1, 22
abstract_inverted_index.markedly	158
abstract_inverted_index.offering	192
abstract_inverted_index.options,	84
abstract_inverted_index.original	130, 166
abstract_inverted_index.publicly	131, 210
abstract_inverted_index.randomly	82
abstract_inverted_index.replaced	91
abstract_inverted_index.shuffled	83
abstract_inverted_index.standard	61
abstract_inverted_index.subsets:	78
abstract_inverted_index.training	155
abstract_inverted_index.assessed.	41
abstract_inverted_index.available	132, 211
abstract_inverted_index.correct",	99
abstract_inverted_index.datasets,	133
abstract_inverted_index.datasets.	145, 170
abstract_inverted_index.developed	169
abstract_inverted_index.improving	196
abstract_inverted_index.involving	201
abstract_inverted_index.prompting	182
abstract_inverted_index.reasoning	36, 52, 63
abstract_inverted_index.shuffling	107
abstract_inverted_index.generative	190
abstract_inverted_index.human-like	26
abstract_inverted_index.performing	34
abstract_inverted_index.processing	23
abstract_inverted_index.reasoning.	72, 203
abstract_inverted_index.robustness	32, 68, 198
abstract_inverted_index.variations	152
abstract_inverted_index.Experiments	110
abstract_inverted_index.combination	105
abstract_inverted_index.constructed	144
abstract_inverted_index.demonstrate	148
abstract_inverted_index.fine-tuning	180
abstract_inverted_index.introducing	150
abstract_inverted_index.performance	15, 127, 162
abstract_inverted_index.augmentation	178
abstract_inverted_index.logic-driven	176
abstract_inverted_index.performance.	123
abstract_inverted_index.sufficiently	40
abstract_inverted_index."LogiQA-plus"	56
abstract_inverted_index.augmentations	118
abstract_inverted_index.substitution.	109
abstract_inverted_index."ReClor-plus",	55
abstract_inverted_index.discriminative	188
abstract_inverted_index.generalisation	30, 185
abstract_inverted_index."LogiQAv2-plus"	58
abstract_inverted_index.comprehensively	43
abstract_inverted_index.https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.	213
cited_by_percentile_year
countries_distinct_count	0
institutions_distinct_count	8
sustainable_development_goals[0].id	https://metadata.un.org/sdg/10
sustainable_development_goals[0].score	0.7099999785423279
sustainable_development_goals[0].display_name	Reduced inequalities
citation_normalized_percentile
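
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of rebuilding the running text from those rows follows, assuming each row is a tab-separated `key<TAB>positions` pair as displayed here; the function name `rebuild_abstract` is hypothetical and not part of any OpenAlex client.

```python
PREFIX = "abstract_inverted_index."


def rebuild_abstract(rows: list[str]) -> str:
    """Rebuild text from flattened 'abstract_inverted_index.<token><TAB><positions>' rows."""
    by_position: dict[int, str] = {}
    for row in rows:
        key, _, value = row.partition("\t")
        if not key.startswith(PREFIX) or not value.strip():
            continue  # ignore unrelated payload rows and rows without positions
        # Slice off the prefix rather than splitting on ".", since tokens
        # such as "levels." or URLs contain dots themselves.
        token = key[len(PREFIX):]
        for pos in value.split(","):
            by_position[int(pos)] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))
```

Applied to all rows of this payload, the result should match the abstract shown at the top of the page, up to whitespace.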