Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2411.19710
Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system's use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.19710, https://arxiv.org/pdf/2411.19710
- OA Status: green
- Cited By: 2
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4405031432
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4405031432 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2411.19710 (Digital Object Identifier)
- Title: Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2024 (Year of publication)
- Publication date: 2024-11-29 (Full publication date if available)
- Authors: R. Teixeira De Lima, Shubham Gupta, Cèsar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas (List of authors in order)
- Landing page: https://arxiv.org/abs/2411.19710 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2411.19710 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2411.19710 (Direct OA link when available)
- Concepts: Taxonomy (biology), Computer science, Biology, Zoology (Top concepts (fields/topics) attached by OpenAlex)
- Cited by: 2 (Total citation count in OpenAlex)
- Citations by year (recent): 2025: 2 (Per-year citation counts (last 5 years))
- Related works (count): 10 (Other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4405031432 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2411.19710 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.19710 |
| ids.openalex | https://openalex.org/W4405031432 |
| fwci | |
| type | preprint |
| title | Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10444 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.10440000146627426 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Context-Aware Activity Recognition Systems |
| topics[1].id | https://openalex.org/T13690 |
| topics[1].field.id | https://openalex.org/fields/36 |
| topics[1].field.display_name | Health Professions |
| topics[1].score | 0.09049999713897705 |
| topics[1].domain.id | https://openalex.org/domains/4 |
| topics[1].domain.display_name | Health Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/3607 |
| topics[1].subfield.display_name | Medical Laboratory Technology |
| topics[1].display_name | Quality and Safety in Healthcare |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C58642233 |
| concepts[0].level | 2 |
| concepts[0].score | 0.5652869939804077 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q8269924 |
| concepts[0].display_name | Taxonomy (biology) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.44117307662963867 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C86803240 |
| concepts[2].level | 0 |
| concepts[2].score | 0.27998074889183044 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[2].display_name | Biology |
| concepts[3].id | https://openalex.org/C90856448 |
| concepts[3].level | 1 |
| concepts[3].score | 0.17761573195457458 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q431 |
| concepts[3].display_name | Zoology |
| keywords[0].id | https://openalex.org/keywords/taxonomy |
| keywords[0].score | 0.5652869939804077 |
| keywords[0].display_name | Taxonomy (biology) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.44117307662963867 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/biology |
| keywords[2].score | 0.27998074889183044 |
| keywords[2].display_name | Biology |
| keywords[3].id | https://openalex.org/keywords/zoology |
| keywords[3].score | 0.17761573195457458 |
| keywords[3].display_name | Zoology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.19710 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.19710 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.19710 |
| locations[1].id | doi:10.48550/arxiv.2411.19710 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.19710 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5114375423 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-5545-6513 |
| authorships[0].author.display_name | R. Teixeira De Lima |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | de Lima, Rafael Teixeira |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5067485451 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Shubham Gupta |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Gupta, Shubham |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5039997281 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-6435-4586 |
| authorships[2].author.display_name | Cèsar Berrospi |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Berrospi, Cesar |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5031402623 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-1256-7261 |
| authorships[3].author.display_name | Lokesh Mishra |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Mishra, Lokesh |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5038675415 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-7216-8505 |
| authorships[4].author.display_name | Michele Dolfi |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Dolfi, Michele |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5024778597 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-8088-0823 |
| authorships[5].author.display_name | Peter Staar |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Staar, Peter |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5078566296 |
| authorships[6].author.orcid | |
| authorships[6].author.display_name | Panagiotis Vagenas |
| authorships[6].author_position | last |
| authorships[6].raw_author_name | Vagenas, Panagiotis |
| authorships[6].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.19710 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-12-05T00:00:00 |
| display_name | Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10444 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.10440000146627426 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Context-Aware Activity Recognition Systems |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 2 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.19710 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.19710 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.19710 |
| primary_location.id | pmh:oai:arXiv.org:2411.19710 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.19710 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.19710 |
| publication_date | 2024-11-29 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 6, 41 |
| abstract_inverted_index.In | 65 |
| abstract_inverted_index.We | 101, 133 |
| abstract_inverted_index.in | 14 |
| abstract_inverted_index.is | 40 |
| abstract_inverted_index.of | 9, 35, 111, 144 |
| abstract_inverted_index.on | 108 |
| abstract_inverted_index.to | 23, 45, 56, 78, 84, 98, 104, 140 |
| abstract_inverted_index.we | 68, 122 |
| abstract_inverted_index.RAG | 93, 112, 145 |
| abstract_inverted_index.and | 51, 58, 74, 88, 116 |
| abstract_inverted_index.are | 5, 138 |
| abstract_inverted_index.can | 82, 96, 128 |
| abstract_inverted_index.for | 92 |
| abstract_inverted_index.own | 26 |
| abstract_inverted_index.the | 15, 36, 109, 141 |
| abstract_inverted_index.use | 38 |
| abstract_inverted_index.LLMs | 127 |
| abstract_inverted_index.data | 61, 119 |
| abstract_inverted_index.from | 49, 62 |
| abstract_inverted_index.lead | 83, 97 |
| abstract_inverted_index.many | 18 |
| abstract_inverted_index.show | 69, 123 |
| abstract_inverted_index.step | 143 |
| abstract_inverted_index.that | 70, 89, 124, 135 |
| abstract_inverted_index.this | 46, 66 |
| abstract_inverted_index.with | 32 |
| abstract_inverted_index.(RAG) | 3 |
| abstract_inverted_index.(most | 53 |
| abstract_inverted_index.Large | 10 |
| abstract_inverted_index.While | 17 |
| abstract_inverted_index.based | 107 |
| abstract_inverted_index.build | 24 |
| abstract_inverted_index.cheap | 52 |
| abstract_inverted_index.data. | 100 |
| abstract_inverted_index.exist | 20 |
| abstract_inverted_index.local | 63 |
| abstract_inverted_index.range | 48 |
| abstract_inverted_index.small | 126 |
| abstract_inverted_index.their | 25, 29 |
| abstract_inverted_index.these | 105, 136 |
| abstract_inverted_index.tools | 19, 91 |
| abstract_inverted_index.using | 71 |
| abstract_inverted_index.(LLMs) | 13 |
| abstract_inverted_index.Models | 12 |
| abstract_inverted_index.answer | 75 |
| abstract_inverted_index.assess | 79 |
| abstract_inverted_index.cases, | 39 |
| abstract_inverted_index.common | 90 |
| abstract_inverted_index.costly | 59 |
| abstract_inverted_index.issues | 106 |
| abstract_inverted_index.labels | 115 |
| abstract_inverted_index.paper, | 67 |
| abstract_inverted_index.public | 54, 72 |
| abstract_inverted_index.Q&A | 131 |
| abstract_inverted_index.believe | 134 |
| abstract_inverted_index.dataset | 94 |
| abstract_inverted_index.design, | 87 |
| abstract_inverted_index.problem | 47 |
| abstract_inverted_index.propose | 102 |
| abstract_inverted_index.systems | 4, 86, 146 |
| abstract_inverted_index.through | 114, 117 |
| abstract_inverted_index.Finally, | 121 |
| abstract_inverted_index.Language | 11 |
| abstract_inverted_index.datasets | 33, 77, 113 |
| abstract_inverted_index.generate | 130 |
| abstract_inverted_index.locally, | 31 |
| abstract_inverted_index.question | 73 |
| abstract_inverted_index.specific | 57 |
| abstract_inverted_index.system's | 37 |
| abstract_inverted_index.systems, | 27 |
| abstract_inverted_index.(Q&A) | 76 |
| abstract_inverted_index.Augmented | 1 |
| abstract_inverted_index.Retrieval | 0 |
| abstract_inverted_index.Solutions | 44 |
| abstract_inverted_index.datasets) | 55 |
| abstract_inverted_index.datasets. | 132 |
| abstract_inverted_index.industry. | 16 |
| abstract_inverted_index.measuring | 28 |
| abstract_inverted_index.retrieval | 80 |
| abstract_inverted_index.solutions | 103 |
| abstract_inverted_index.Generation | 2 |
| abstract_inverted_index.challenge. | 43 |
| abstract_inverted_index.developers | 22 |
| abstract_inverted_index.empowering | 21 |
| abstract_inverted_index.fine-tuned | 125 |
| abstract_inverted_index.generation | 95 |
| abstract_inverted_index.invaluable | 139 |
| abstract_inverted_index.reflective | 34 |
| abstract_inverted_index.unbalanced | 99 |
| abstract_inverted_index.widespread | 7 |
| abstract_inverted_index.(generating | 60 |
| abstract_inverted_index.application | 8 |
| abstract_inverted_index.documents). | 64 |
| abstract_inverted_index.efficiently | 129 |
| abstract_inverted_index.generation. | 120 |
| abstract_inverted_index.non-optimal | 85 |
| abstract_inverted_index.performance | 30, 81 |
| abstract_inverted_index.development. | 147 |
| abstract_inverted_index.non-specific | 50 |
| abstract_inverted_index.observations | 137 |
| abstract_inverted_index.technological | 42 |
| abstract_inverted_index.know-your-data | 142 |
| abstract_inverted_index.label-targeted | 118 |
| abstract_inverted_index.characterization | 110 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 7 |
| citation_normalized_percentile |
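
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the zero-based positions where it occurs, rather than as running text. As a minimal sketch of how such a mapping can be turned back into a readable string, the snippet below places every token at each of its positions and joins the result in order. The function name `rebuild_abstract` and the `sample` dictionary are hypothetical names chosen for illustration; `sample` copies only the tokens at positions 0 to 9 from the payload, with later positions of repeated words left out.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild running text from a token -> positions mapping."""
    # Map every position back to the token that occupies it.
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))


# Illustrative subset of the payload: only positions 0..9 are included.
sample = {
    "Retrieval": [0],
    "Augmented": [1],
    "Generation": [2],
    "(RAG)": [3],
    "systems": [4],
    "are": [5],
    "a": [6],
    "widespread": [7],
    "application": [8],
    "of": [9],
}

print(rebuild_abstract(sample))
# prints: Retrieval Augmented Generation (RAG) systems are a widespread application of
```

Applied to the full set of `abstract_inverted_index` rows, the same function should yield the abstract shown at the top of this record, since punctuation is stored as part of each token (for example `cases,` and `data.`).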