Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent
2025 · Open Access · DOI: https://doi.org/10.1145/3706598.3714319
Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable -- only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the needs, as well as the risks, of automated support in human prompt engineering, providing insights for future tool design.
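The abstract's central measurement, whether labeling accuracy improves across prompt iterations, can only be computed when gold labels exist. The following is a minimal sketch of that check, not the authors' evaluation code. The function names, the toy labels, and the rule "last iteration beats the first" are illustrative assumptions; the abstract does not define how improvement was scored.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("gold labels must not be empty")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def improved(per_iteration: List[Sequence[str]], gold: Sequence[str]) -> bool:
    """Assumed criterion: the final iteration scores higher than the first."""
    scores = [accuracy(labels, gold) for labels in per_iteration]
    return scores[-1] > scores[0]


# Hypothetical toy data: four items, four prompt iterations.
GOLD = ["pos", "neg", "neg", "pos"]
ITERATIONS = [
    ["pos", "pos", "neg", "neg"],  # iteration 1
    ["pos", "neg", "neg", "neg"],  # iteration 2
    ["pos", "pos", "pos", "pos"],  # iteration 3
    ["pos", "neg", "neg", "pos"],  # iteration 4
]

SCORES = [accuracy(labels, GOLD) for labels in ITERATIONS]
print(SCORES)
print(improved(ITERATIONS, GOLD))
```

Without `GOLD`, neither function can be evaluated, which is the "prompting in the dark" situation the paper studies: the user revises prompts with no way to tell whether a revision helped or hurt.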
- Type: preprint
- Language: en
- Landing Page: https://doi.org/10.1145/3706598.3714319
- OA Status: gold
- Cited By: 2
- References: 85
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4407695830
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4407695830 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.1145/3706598.3714319 (Digital Object Identifier)
- Title: Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-04-24 (full publication date if available)
- Authors: Zeyu He, Saniya Naphade, Ting-Hao Huang (list of authors in order)
- Landing page: https://doi.org/10.1145/3706598.3714319 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: gold (open access status per OpenAlex)
- OA URL: https://doi.org/10.1145/3706598.3714319 (direct OA link when available)
- Concepts: Computer science, Gold standard (test), Data science, Artificial intelligence, Machine learning, Medicine, Internal medicine (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 2 (total citation count in OpenAlex)
- Citations by year (recent): 2025: 2 (per-year citation counts, last 5 years)
- References (count): 85 (number of works referenced by this work)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4407695830 |
| doi | https://doi.org/10.1145/3706598.3714319 |
| ids.doi | https://doi.org/10.1145/3706598.3714319 |
| ids.openalex | https://openalex.org/W4407695830 |
| fwci | 19.32734649 |
| type | preprint |
| title | Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | 33 |
| biblio.first_page | 1 |
| topics[0].id | https://openalex.org/T10260 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9980999827384949 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1710 |
| topics[0].subfield.display_name | Information Systems |
| topics[0].display_name | Software Engineering Research |
| topics[1].id | https://openalex.org/T11704 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9911999702453613 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1706 |
| topics[1].subfield.display_name | Computer Science Applications |
| topics[1].display_name | Mobile Crowdsensing and Crowdsourcing |
| topics[2].id | https://openalex.org/T10028 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9876999855041504 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Topic Modeling |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7158586978912354 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C40993552 |
| concepts[1].level | 2 |
| concepts[1].score | 0.5494347810745239 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q514654 |
| concepts[1].display_name | Gold standard (test) |
| concepts[2].id | https://openalex.org/C2522767166 |
| concepts[2].level | 1 |
| concepts[2].score | 0.42226940393447876 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q2374463 |
| concepts[2].display_name | Data science |
| concepts[3].id | https://openalex.org/C154945302 |
| concepts[3].level | 1 |
| concepts[3].score | 0.34773868322372437 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[3].display_name | Artificial intelligence |
| concepts[4].id | https://openalex.org/C119857082 |
| concepts[4].level | 1 |
| concepts[4].score | 0.3347756862640381 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q2539 |
| concepts[4].display_name | Machine learning |
| concepts[5].id | https://openalex.org/C71924100 |
| concepts[5].level | 0 |
| concepts[5].score | 0.07730111479759216 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11190 |
| concepts[5].display_name | Medicine |
| concepts[6].id | https://openalex.org/C126322002 |
| concepts[6].level | 1 |
| concepts[6].score | 0.0 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11180 |
| concepts[6].display_name | Internal medicine |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7158586978912354 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/gold-standard |
| keywords[1].score | 0.5494347810745239 |
| keywords[1].display_name | Gold standard (test) |
| keywords[2].id | https://openalex.org/keywords/data-science |
| keywords[2].score | 0.42226940393447876 |
| keywords[2].display_name | Data science |
| keywords[3].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[3].score | 0.34773868322372437 |
| keywords[3].display_name | Artificial intelligence |
| keywords[4].id | https://openalex.org/keywords/machine-learning |
| keywords[4].score | 0.3347756862640381 |
| keywords[4].display_name | Machine learning |
| keywords[5].id | https://openalex.org/keywords/medicine |
| keywords[5].score | 0.07730111479759216 |
| keywords[5].display_name | Medicine |
| language | en |
| locations[0].id | doi:10.1145/3706598.3714319 |
| locations[0].is_oa | True |
| locations[0].source | |
| locations[0].license | cc-by |
| locations[0].pdf_url | |
| locations[0].version | publishedVersion |
| locations[0].raw_type | proceedings-article |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | True |
| locations[0].is_published | True |
| locations[0].raw_source_name | Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems |
| locations[0].landing_page_url | https://doi.org/10.1145/3706598.3714319 |
| locations[1].id | pmh:oai:arXiv.org:2502.11267 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | https://arxiv.org/pdf/2502.11267 |
| locations[1].version | submittedVersion |
| locations[1].raw_type | text |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | False |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | http://arxiv.org/abs/2502.11267 |
| indexed_in | arxiv, crossref |
| authorships[0].author.id | https://openalex.org/A5101817568 |
| authorships[0].author.orcid | https://orcid.org/0009-0007-7115-2692 |
| authorships[0].author.display_name | Zeyu He |
| authorships[0].countries | US |
| authorships[0].affiliations[0].institution_ids | https://openalex.org/I130769515 |
| authorships[0].affiliations[0].raw_affiliation_string | College of Information Sciences and Technology, Pennsylvania State University, University Park, Pennsylvania, USA |
| authorships[0].institutions[0].id | https://openalex.org/I130769515 |
| authorships[0].institutions[0].ror | https://ror.org/04p491231 |
| authorships[0].institutions[0].type | education |
| authorships[0].institutions[0].lineage | https://openalex.org/I130769515 |
| authorships[0].institutions[0].country_code | US |
| authorships[0].institutions[0].display_name | Pennsylvania State University |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Zeyu He |
| authorships[0].is_corresponding | False |
| authorships[0].raw_affiliation_strings | College of Information Sciences and Technology, Pennsylvania State University, University Park, Pennsylvania, USA |
| authorships[1].author.id | https://openalex.org/A5036481959 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Saniya Naphade |
| authorships[1].affiliations[0].raw_affiliation_string | GumGum, Tempe, Arizona, USA |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Saniya Naphade |
| authorships[1].is_corresponding | False |
| authorships[1].raw_affiliation_strings | GumGum, Tempe, Arizona, USA |
| authorships[2].author.id | https://openalex.org/A5083675499 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-7021-4627 |
| authorships[2].author.display_name | Ting-Hao Huang |
| authorships[2].countries | US |
| authorships[2].affiliations[0].institution_ids | https://openalex.org/I130769515 |
| authorships[2].affiliations[0].raw_affiliation_string | College of Information Sciences and Technology, Pennsylvania State University, University Park, Pennsylvania, USA |
| authorships[2].institutions[0].id | https://openalex.org/I130769515 |
| authorships[2].institutions[0].ror | https://ror.org/04p491231 |
| authorships[2].institutions[0].type | education |
| authorships[2].institutions[0].lineage | https://openalex.org/I130769515 |
| authorships[2].institutions[0].country_code | US |
| authorships[2].institutions[0].display_name | Pennsylvania State University |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Ting-Hao Kenneth Huang |
| authorships[2].is_corresponding | False |
| authorships[2].raw_affiliation_strings | College of Information Sciences and Technology, Pennsylvania State University, University Park, Pennsylvania, USA |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.1145/3706598.3714319 |
| open_access.oa_status | gold |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T03:46:38.306776 |
| primary_topic.id | https://openalex.org/T10260 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9980999827384949 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1710 |
| primary_topic.subfield.display_name | Information Systems |
| primary_topic.display_name | Software Engineering Research |
| related_works | https://openalex.org/W2961085424, https://openalex.org/W4306674287, https://openalex.org/W4387369504, https://openalex.org/W4394896187, https://openalex.org/W3170094116, https://openalex.org/W4386462264, https://openalex.org/W3107602296, https://openalex.org/W4364306694, https://openalex.org/W4312192474, https://openalex.org/W4283697347 |
| cited_by_count | 2 |
| counts_by_year[0].year | 2025 |
| counts_by_year[0].cited_by_count | 2 |
| locations_count | 2 |
| best_oa_location.id | doi:10.1145/3706598.3714319 |
| best_oa_location.is_oa | True |
| best_oa_location.source | |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | |
| best_oa_location.version | publishedVersion |
| best_oa_location.raw_type | proceedings-article |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | True |
| best_oa_location.is_published | True |
| best_oa_location.raw_source_name | Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems |
| best_oa_location.landing_page_url | https://doi.org/10.1145/3706598.3714319 |
| primary_location.id | doi:10.1145/3706598.3714319 |
| primary_location.is_oa | True |
| primary_location.source | |
| primary_location.license | cc-by |
| primary_location.pdf_url | |
| primary_location.version | publishedVersion |
| primary_location.raw_type | proceedings-article |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | True |
| primary_location.is_published | True |
| primary_location.raw_source_name | Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems |
| primary_location.landing_page_url | https://doi.org/10.1145/3706598.3714319 |
| publication_date | 2025-04-24 |
| publication_year | 2025 |
| referenced_works | https://openalex.org/W2621045072, https://openalex.org/W2807737088, https://openalex.org/W4315646142, https://openalex.org/W1815682670, https://openalex.org/W2916904544, https://openalex.org/W4396832834, https://openalex.org/W2753079413, https://openalex.org/W3159250634, https://openalex.org/W3109443741, https://openalex.org/W4385571232, https://openalex.org/W4389520295, https://openalex.org/W2002753905, https://openalex.org/W2400269077, https://openalex.org/W2104749423, https://openalex.org/W3029299514, https://openalex.org/W3001315432, https://openalex.org/W1524805786, https://openalex.org/W2116705992, https://openalex.org/W4293777629, https://openalex.org/W1838907055, https://openalex.org/W4396833424, https://openalex.org/W4379598302, https://openalex.org/W2998679789, https://openalex.org/W3172340245, https://openalex.org/W4401042461, https://openalex.org/W4392223038, https://openalex.org/W3212305774, https://openalex.org/W3205934028, https://openalex.org/W4411112894, https://openalex.org/W2135504089, https://openalex.org/W1990892398, https://openalex.org/W4411630039, https://openalex.org/W4387801187, https://openalex.org/W2151332190, https://openalex.org/W2053075547, https://openalex.org/W2168747479, https://openalex.org/W4391631702, https://openalex.org/W4392903699, https://openalex.org/W4400222414, https://openalex.org/W2983697263, https://openalex.org/W4389523621, https://openalex.org/W4393248165, https://openalex.org/W4406858513, https://openalex.org/W3129132077, https://openalex.org/W1506754461, https://openalex.org/W4401109844, https://openalex.org/W3041011740, https://openalex.org/W4389520756, https://openalex.org/W1996878936, https://openalex.org/W2970641574, https://openalex.org/W3195589064, https://openalex.org/W3091962337, https://openalex.org/W3171896931, https://openalex.org/W2149489787, https://openalex.org/W4301393026, https://openalex.org/W4404783774, https://openalex.org/W4321598326, https://openalex.org/W2137244916, https://openalex.org/W4396780476, https://openalex.org/W4366003124, https://openalex.org/W1506491340, https://openalex.org/W2056584528, https://openalex.org/W2519338267, https://openalex.org/W4390023570, https://openalex.org/W4396831993, https://openalex.org/W4393100761, https://openalex.org/W2150098611, https://openalex.org/W2938213904, https://openalex.org/W2156279557, https://openalex.org/W4383679928, https://openalex.org/W4366548330, https://openalex.org/W4389518684, https://openalex.org/W4404783213, https://openalex.org/W3154288086, https://openalex.org/W3183000984, https://openalex.org/W4293067800, https://openalex.org/W4385571344, https://openalex.org/W4404331635, https://openalex.org/W4403334714, https://openalex.org/W4232845215, https://openalex.org/W4386566840, https://openalex.org/W4389519153, https://openalex.org/W4387872955, https://openalex.org/W2792932412, https://openalex.org/W1490960179 |
| referenced_works_count | 85 |
| abstract_inverted_index.9 | 109 |
| abstract_inverted_index.a | 50, 75, 92 |
| abstract_inverted_index.-- | 107 |
| abstract_inverted_index.20 | 95 |
| abstract_inverted_index.Do | 19 |
| abstract_inverted_index.We | 72 |
| abstract_inverted_index.as | 144, 146 |
| abstract_inverted_index.at | 16 |
| abstract_inverted_index.in | 52, 57, 101, 152 |
| abstract_inverted_index.no | 39 |
| abstract_inverted_index.of | 1, 31, 138, 149 |
| abstract_inverted_index.or | 116 |
| abstract_inverted_index.to | 24, 44, 65, 82 |
| abstract_inverted_index.we | 97 |
| abstract_inverted_index.Our | 133 |
| abstract_inverted_index.and | 85, 141 |
| abstract_inverted_index.are | 14, 36, 42 |
| abstract_inverted_index.but | 11 |
| abstract_inverted_index.few | 128 |
| abstract_inverted_index.for | 8, 158 |
| abstract_inverted_index.get | 22 |
| abstract_inverted_index.how | 12 |
| abstract_inverted_index.the | 58, 102, 136, 142, 147 |
| abstract_inverted_index.was | 104 |
| abstract_inverted_index.DSPy | 124 |
| abstract_inverted_index.LLMs | 64 |
| abstract_inverted_index.This | 47 |
| abstract_inverted_index.also | 125 |
| abstract_inverted_index.dark | 103 |
| abstract_inverted_index.data | 54, 67, 88 |
| abstract_inverted_index.four | 115 |
| abstract_inverted_index.gold | 129, 139 |
| abstract_inverted_index.good | 13 |
| abstract_inverted_index.like | 123 |
| abstract_inverted_index.more | 117 |
| abstract_inverted_index.only | 108 |
| abstract_inverted_index.over | 28 |
| abstract_inverted_index.that | 79, 99 |
| abstract_inverted_index.tool | 160 |
| abstract_inverted_index.well | 145 |
| abstract_inverted_index.were | 131 |
| abstract_inverted_index.when | 38, 127 |
| abstract_inverted_index.with | 94 |
| abstract_inverted_index.These | 34 |
| abstract_inverted_index.after | 114 |
| abstract_inverted_index.found | 98 |
| abstract_inverted_index.human | 153 |
| abstract_inverted_index.label | 66, 87 |
| abstract_inverted_index.large | 4 |
| abstract_inverted_index.paper | 48 |
| abstract_inverted_index.study | 93 |
| abstract_inverted_index.their | 25, 32 |
| abstract_inverted_index.tools | 122 |
| abstract_inverted_index.users | 2, 20, 61, 81 |
| abstract_inverted_index.using | 69 |
| abstract_inverted_index.where | 60 |
| abstract_inverted_index.(LLMs) | 7 |
| abstract_inverted_index.Google | 76 |
| abstract_inverted_index.Sheets | 77 |
| abstract_inverted_index.add-on | 78 |
| abstract_inverted_index.closer | 23 |
| abstract_inverted_index.dark," | 59 |
| abstract_inverted_index.future | 159 |
| abstract_inverted_index.highly | 105 |
| abstract_inverted_index.labels | 41, 130, 140 |
| abstract_inverted_index.models | 6 |
| abstract_inverted_index.needs, | 143 |
| abstract_inverted_index.people | 15 |
| abstract_inverted_index.prompt | 3, 17, 63, 120, 154 |
| abstract_inverted_index.risks, | 148 |
| abstract_inverted_index.tasks, | 10 |
| abstract_inverted_index.Through | 91 |
| abstract_inverted_index.crucial | 37 |
| abstract_inverted_index.design. | 161 |
| abstract_inverted_index.desired | 26 |
| abstract_inverted_index.enables | 80 |
| abstract_inverted_index.measure | 45 |
| abstract_inverted_index.outcome | 27 |
| abstract_inverted_index.revise, | 84 |
| abstract_inverted_index.support | 151 |
| abstract_inverted_index.through | 89 |
| abstract_inverted_index.various | 9 |
| abstract_inverted_index.without | 68 |
| abstract_inverted_index.Millions | 0 |
| abstract_inverted_index.accuracy | 113 |
| abstract_inverted_index.actually | 21 |
| abstract_inverted_index.compose, | 83 |
| abstract_inverted_index.findings | 134 |
| abstract_inverted_index.improved | 111 |
| abstract_inverted_index.insights | 157 |
| abstract_inverted_index.labeling | 112 |
| abstract_inverted_index.language | 5 |
| abstract_inverted_index.multiple | 29 |
| abstract_inverted_index.prompts? | 33 |
| abstract_inverted_index.scenario | 51 |
| abstract_inverted_index.Automated | 119 |
| abstract_inverted_index.automated | 150 |
| abstract_inverted_index.available | 43 |
| abstract_inverted_index.developed | 73 |
| abstract_inverted_index.highlight | 135 |
| abstract_inverted_index.labeling, | 55 |
| abstract_inverted_index.progress. | 46 |
| abstract_inverted_index.prompting | 100 |
| abstract_inverted_index.providing | 156 |
| abstract_inverted_index.questions | 35 |
| abstract_inverted_index.struggled | 126 |
| abstract_inverted_index."prompting | 56 |
| abstract_inverted_index.available. | 132 |
| abstract_inverted_index.importance | 137 |
| abstract_inverted_index.iterations | 30 |
| abstract_inverted_index.unreliable | 106 |
| abstract_inverted_index.LLM-powered | 53 |
| abstract_inverted_index.benchmarks. | 71 |
| abstract_inverted_index.iterations. | 118 |
| abstract_inverted_index.iteratively | 62, 86 |
| abstract_inverted_index.engineering, | 155 |
| abstract_inverted_index.engineering? | 18 |
| abstract_inverted_index.investigates | 49 |
| abstract_inverted_index.optimization | 121 |
| abstract_inverted_index.participants | 110 |
| abstract_inverted_index.gold-standard | 40 |
| abstract_inverted_index.participants, | 96 |
| abstract_inverted_index.spreadsheets. | 90 |
| abstract_inverted_index.PromptingSheet, | 74 |
| abstract_inverted_index.manually-labeled | 70 |
| cited_by_percentile_year.max | 97 |
| cited_by_percentile_year.min | 95 |
| countries_distinct_count | 1 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile.value | 0.96997873 |
| citation_normalized_percentile.is_in_top_1_percent | True |
| citation_normalized_percentile.is_in_top_10_percent | True |
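
The `abstract_inverted_index.*` rows in the payload store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text might look like the following. The function name `rebuild_abstract` is a hypothetical helper, and `EXCERPT` is a hand-copied subset of the rows above, truncated to positions 0 through 7, not the full index.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Place every token at each of its positions, then join in order."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    return " ".join(by_position[i] for i in sorted(by_position))


# Excerpt of the payload rows, keeping only positions 0-7 for brevity.
EXCERPT = {
    "(LLMs)": [7],
    "Millions": [0],
    "language": [5],
    "large": [4],
    "models": [6],
    "of": [1],
    "prompt": [3],
    "users": [2],
}

RESULT = rebuild_abstract(EXCERPT)
print(RESULT)  # Millions of users prompt large language models (LLMs)
```

Sorting by position rather than by key matters here, because the payload lists tokens by length and alphabetically, not in reading order.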