Large Language Models for Data Annotation and Synthesis: A Survey
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2402.13446
Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation and synthesis. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2402.13446, https://arxiv.org/pdf/2402.13446
- OA Status: green
- Cited By: 34
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4392085924
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4392085924 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2402.13446 (Digital Object Identifier)
- Title: Large Language Models for Data Annotation and Synthesis: A Survey (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-02-21 (full publication date if available)
- Authors: Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Cheng Lu, Huan Liu (list of authors in order)
- Landing page: https://arxiv.org/abs/2402.13446 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2402.13446 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2402.13446 (direct OA link when available)
- Concepts: Annotation, Computer science, Natural language processing, Artificial intelligence, Information retrieval (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 34 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 18, 2024: 16 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4392085924 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2402.13446 |
| ids.doi | https://doi.org/10.48550/arxiv.2402.13446 |
| ids.openalex | https://openalex.org/W4392085924 |
| fwci | |
| type | preprint |
| title | Large Language Models for Data Annotation and Synthesis: A Survey |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9837999939918518 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T11719 |
| topics[1].field.id | https://openalex.org/fields/18 |
| topics[1].field.display_name | Decision Sciences |
| topics[1].score | 0.9708999991416931 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1803 |
| topics[1].subfield.display_name | Management Science and Operations Research |
| topics[1].display_name | Data Quality and Management |
| topics[2].id | https://openalex.org/T10181 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9642000198364258 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Natural Language Processing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2776321320 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7277224659919739 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q857525 |
| concepts[0].display_name | Annotation |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.6211884021759033 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.472780704498291 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.35147494077682495 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C23123220 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3335564434528351 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[4].display_name | Information retrieval |
| keywords[0].id | https://openalex.org/keywords/annotation |
| keywords[0].score | 0.7277224659919739 |
| keywords[0].display_name | Annotation |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.6211884021759033 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.472780704498291 |
| keywords[2].display_name | Natural language processing |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.35147494077682495 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/information-retrieval |
| keywords[4].score | 0.3335564434528351 |
| keywords[4].display_name | Information retrieval |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2402.13446 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2402.13446 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2402.13446 |
| locations[1].id | doi:10.48550/arxiv.2402.13446 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2402.13446 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101997705 |
| authorships[0].author.orcid | https://orcid.org/0009-0004-2082-8566 |
| authorships[0].author.display_name | Zhen Tan |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Tan, Zhen |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5114129610 |
| authorships[1].author.orcid | https://orcid.org/0009-0009-6637-0761 |
| authorships[1].author.display_name | Alimohammad Beigi |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Beigi, Alimohammad |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5048781959 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-5205-4359 |
| authorships[2].author.display_name | Song Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, Song |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5054719216 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-8522-6142 |
| authorships[3].author.display_name | Ruocheng Guo |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Guo, Ruocheng |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5103250573 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-6117-6382 |
| authorships[4].author.display_name | Amrita Bhattacharjee |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Bhattacharjee, Amrita |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5035988035 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Bohan Jiang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Jiang, Bohan |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5015142767 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-8168-8075 |
| authorships[6].author.display_name | Mansooreh Karami |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Karami, Mansooreh |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5029588473 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-1878-817X |
| authorships[7].author.display_name | Jundong Li |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Li, Jundong |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5024288211 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-7651-3924 |
| authorships[8].author.display_name | Cheng Lu |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Cheng, Lu |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5100338946 |
| authorships[9].author.orcid | https://orcid.org/0000-0002-3264-7904 |
| authorships[9].author.display_name | Huan Liu |
| authorships[9].author_position | last |
| authorships[9].raw_author_name | Liu, Huan |
| authorships[9].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2402.13446 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-02-23T00:00:00 |
| display_name | Large Language Models for Data Annotation and Synthesis: A Survey |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9837999939918518 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W2748952813, https://openalex.org/W2361861616, https://openalex.org/W2263699433, https://openalex.org/W2377979023, https://openalex.org/W2218034408, https://openalex.org/W2392921965, https://openalex.org/W2358755282, https://openalex.org/W2625833328, https://openalex.org/W1533177136, https://openalex.org/W3204019825 |
| cited_by_count | 34 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 18 |
| counts_by_year[1].year | 2024 |
| counts_by_year[1].cited_by_count | 16 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2402.13446 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2402.13446 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2402.13446 |
| primary_location.id | pmh:oai:arXiv.org:2402.13446 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2402.13446 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2402.13446 |
| publication_date | 2024-02-21 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 114, 126, 146 |
| abstract_inverted_index.an | 48, 104 |
| abstract_inverted_index.as | 145 |
| abstract_inverted_index.be | 19 |
| abstract_inverted_index.by | 45 |
| abstract_inverted_index.in | 157, 172 |
| abstract_inverted_index.is | 32 |
| abstract_inverted_index.of | 11, 25, 38, 56, 107, 117, 129, 161 |
| abstract_inverted_index.on | 76 |
| abstract_inverted_index.or | 9 |
| abstract_inverted_index.to | 6, 51, 86, 152 |
| abstract_inverted_index.we | 73 |
| abstract_inverted_index.LLM | 67 |
| abstract_inverted_index.The | 29, 36 |
| abstract_inverted_index.and | 2, 34, 59, 70, 96, 125, 133, 142, 155 |
| abstract_inverted_index.can | 112 |
| abstract_inverted_index.for | 21, 80, 120, 139, 165 |
| abstract_inverted_index.key | 147 |
| abstract_inverted_index.raw | 12 |
| abstract_inverted_index.the | 7, 23, 53, 130, 159, 162 |
| abstract_inverted_index.Data | 0 |
| abstract_inverted_index.LLMs | 111, 138, 164 |
| abstract_inverted_index.This | 83 |
| abstract_inverted_index.aims | 151 |
| abstract_inverted_index.core | 88 |
| abstract_inverted_index.data | 13, 57, 81, 108, 140, 166 |
| abstract_inverted_index.have | 64 |
| abstract_inverted_index.that | 110 |
| abstract_inverted_index.this | 101, 149, 173 |
| abstract_inverted_index.used | 20 |
| abstract_inverted_index.with | 14, 136 |
| abstract_inverted_index.Large | 40 |
| abstract_inverted_index.While | 61 |
| abstract_inverted_index.could | 18 |
| abstract_inverted_index.focus | 75 |
| abstract_inverted_index.their | 77 |
| abstract_inverted_index.three | 87 |
| abstract_inverted_index.types | 109 |
| abstract_inverted_index.using | 137 |
| abstract_inverted_index.which | 17 |
| abstract_inverted_index.GPT-4, | 46 |
| abstract_inverted_index.Models | 42 |
| abstract_inverted_index.assist | 153 |
| abstract_inverted_index.field. | 175 |
| abstract_inverted_index.future | 170 |
| abstract_inverted_index.guide, | 148 |
| abstract_inverted_index.latest | 163 |
| abstract_inverted_index.models | 121 |
| abstract_inverted_index.refers | 5 |
| abstract_inverted_index.review | 116 |
| abstract_inverted_index.survey | 84, 102, 150 |
| abstract_inverted_index.(LLMs), | 43 |
| abstract_inverted_index.Serving | 144 |
| abstract_inverted_index.costly. | 35 |
| abstract_inverted_index.covered | 66 |
| abstract_inverted_index.general | 71 |
| abstract_inverted_index.machine | 26 |
| abstract_inverted_index.models. | 28 |
| abstract_inverted_index.primary | 131 |
| abstract_inverted_index.process | 55 |
| abstract_inverted_index.surveys | 63 |
| abstract_inverted_index.thereby | 168 |
| abstract_inverted_index.utility | 79 |
| abstract_inverted_index.Language | 41 |
| abstract_inverted_index.advanced | 39 |
| abstract_inverted_index.aspects: | 89 |
| abstract_inverted_index.automate | 52 |
| abstract_inverted_index.critical | 174 |
| abstract_inverted_index.detailed | 127 |
| abstract_inverted_index.efficacy | 24 |
| abstract_inverted_index.existing | 62 |
| abstract_inverted_index.however, | 31 |
| abstract_inverted_index.in-depth | 105 |
| abstract_inverted_index.includes | 103 |
| abstract_inverted_index.labeling | 8 |
| abstract_inverted_index.learning | 27, 118 |
| abstract_inverted_index.presents | 47 |
| abstract_inverted_index.process, | 30 |
| abstract_inverted_index.relevant | 15 |
| abstract_inverted_index.specific | 78 |
| abstract_inverted_index.taxonomy | 106 |
| abstract_inverted_index.uniquely | 74 |
| abstract_inverted_index.LLM-Based | 90 |
| abstract_inverted_index.annotate, | 113 |
| abstract_inverted_index.emergence | 37 |
| abstract_inverted_index.exploring | 158 |
| abstract_inverted_index.fostering | 169 |
| abstract_inverted_index.generally | 4 |
| abstract_inverted_index.improving | 22 |
| abstract_inverted_index.potential | 160 |
| abstract_inverted_index.synthesis | 3 |
| abstract_inverted_index.training, | 69 |
| abstract_inverted_index.utilizing | 122 |
| abstract_inverted_index.Annotation | 91 |
| abstract_inverted_index.annotation | 1, 58, 141 |
| abstract_inverted_index.associated | 135 |
| abstract_inverted_index.challenges | 132 |
| abstract_inverted_index.discussion | 128 |
| abstract_inverted_index.generating | 10 |
| abstract_inverted_index.strategies | 119 |
| abstract_inverted_index.synthesis. | 60, 143 |
| abstract_inverted_index.Annotations | 94, 98 |
| abstract_inverted_index.Assessment, | 95 |
| abstract_inverted_index.Generation, | 92 |
| abstract_inverted_index.annotation, | 167 |
| abstract_inverted_index.annotation. | 82 |
| abstract_inverted_index.complicated | 54 |
| abstract_inverted_index.contributes | 85 |
| abstract_inverted_index.exemplified | 44 |
| abstract_inverted_index.extensively | 65 |
| abstract_inverted_index.limitations | 134 |
| abstract_inverted_index.opportunity | 50 |
| abstract_inverted_index.researchers | 154 |
| abstract_inverted_index.Furthermore, | 100 |
| abstract_inverted_index.Utilization. | 99 |
| abstract_inverted_index.advancements | 171 |
| abstract_inverted_index.annotations, | 124 |
| abstract_inverted_index.information, | 16 |
| abstract_inverted_index.LLM-Generated | 93, 97 |
| abstract_inverted_index.LLM-generated | 123 |
| abstract_inverted_index.applications, | 72 |
| abstract_inverted_index.architecture, | 68 |
| abstract_inverted_index.comprehensive | 115 |
| abstract_inverted_index.practitioners | 156 |
| abstract_inverted_index.unprecedented | 49 |
| abstract_inverted_index.labor-intensive | 33 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 10 |
| citation_normalized_percentile | |
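
The `abstract_inverted_index.*` rows in the payload store the abstract as a mapping from each token to the word positions where it occurs. The following is a minimal sketch, assuming that layout, of how the running text could be rebuilt from such a mapping. The function name `rebuild_abstract` and the `sample` dictionary (a hand-copied subset of the rows, covering positions 0 to 8 only) are illustrative and are not part of OpenAlex.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plain text from a token -> word-positions mapping."""
    # Invert the mapping: each position holds exactly one token.
    token_at = {}
    for token, positions in inverted_index.items():
        for position in positions:
            token_at[position] = token
    # Emit tokens in position order, separated by single spaces.
    return " ".join(token_at[i] for i in sorted(token_at))


# Illustrative subset of the payload rows (positions 0-8 only).
sample = {
    "Data": [0],
    "annotation": [1],
    "and": [2],
    "synthesis": [3],
    "generally": [4],
    "refers": [5],
    "to": [6],
    "the": [7],
    "labeling": [8],
}

print(rebuild_abstract(sample))
# prints "Data annotation and synthesis generally refers to the labeling"
```

Because punctuation is stored as part of each token (for example `GPT-4,` and `field.`), joining on single spaces should be enough to recover the original sentence text under this assumption.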