Lingua Manga: A Generic Large Language Model Centric System for Data Curation
2023 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2306.11702
Data curation is a wide-ranging area which contains many critical but time-consuming data processing tasks. However, the diversity of such tasks makes it challenging to develop a general-purpose data curation system. To address this issue, we present Lingua Manga, a user-friendly and versatile system that utilizes pre-trained large language models. Lingua Manga offers automatic optimization for achieving high performance and label efficiency while facilitating flexible and rapid development. Through three example applications with distinct objectives and users of varying levels of technical proficiency, we demonstrate that Lingua Manga can effectively assist both skilled programmers and low-code or even no-code users in addressing data curation challenges.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2306.11702 (PDF: https://arxiv.org/pdf/2306.11702)
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4381586905
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4381586905 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2306.11702 (Digital Object Identifier)
- Title: Lingua Manga: A Generic Large Language Model Centric System for Data Curation (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2023 (Year of publication)
- Publication date: 2023-06-20 (Full publication date if available)
- Authors: Zui Chen, Lei Cao, Samuel Madden (List of authors in order)
- Landing page: https://arxiv.org/abs/2306.11702 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2306.11702 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2306.11702 (Direct OA link when available)
- Concepts: Computer science, Lingua franca, Data curation, Code (set theory), Ranging, Database, Data science, Programming language, Set (abstract data type), Humanities, Telecommunications, Philosophy (Top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (Total citation count in OpenAlex)
- Related works (count): 10 (Other works algorithmically related by OpenAlex)
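
The payload table in this record lists every field under a dotted key with bracketed list indices (for example `topics[0].field.display_name`). The sketch below shows one way a nested JSON record could be flattened into such rows. The `flatten` helper, the nested shape of the sample `record`, and the rules for joining scalar lists with commas and rendering nulls as empty cells are assumptions inferred from the table, not the page's actual code.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, str]:
    """Flatten nested dicts/lists into {dotted_key: string_value} rows.

    Assumed conventions (inferred from the payload table, not confirmed):
    - dict keys are joined with "."
    - lists holding dicts or lists are indexed as key[0], key[1], ...
    - lists of scalars are joined into one comma-separated cell
    - None becomes an empty cell
    """
    if isinstance(value, dict):
        rows: dict[str, str] = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            rows.update(flatten(child, path))
        return rows
    if isinstance(value, list):
        if any(isinstance(item, (dict, list)) for item in value):
            rows = {}
            for i, item in enumerate(value):
                rows.update(flatten(item, f"{prefix}[{i}]"))
            return rows
        return {prefix: ", ".join(str(item) for item in value)}
    return {prefix: "" if value is None else str(value)}


# Hypothetical excerpt of the record; the nesting is an assumption.
record = {
    "id": "https://openalex.org/W4381586905",
    "fwci": None,
    "topics": [
        {
            "id": "https://openalex.org/T11719",
            "field": {"display_name": "Decision Sciences"},
        }
    ],
    "indexed_in": ["arxiv", "datacite"],
    "is_retracted": False,
}

flat = flatten(record)
for key, cell in flat.items():
    print(f"| {key} | {cell} |")
```

Keeping every cell a string mirrors how the table prints booleans as `True`/`False` and leaves missing values blank.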
Full payload
| id | https://openalex.org/W4381586905 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2306.11702 |
| ids.doi | https://doi.org/10.48550/arxiv.2306.11702 |
| ids.openalex | https://openalex.org/W4381586905 |
| fwci | |
| type | preprint |
| title | Lingua Manga: A Generic Large Language Model Centric System for Data Curation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11719 |
| topics[0].field.id | https://openalex.org/fields/18 |
| topics[0].field.display_name | Decision Sciences |
| topics[0].score | 0.9991000294685364 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1803 |
| topics[0].subfield.display_name | Management Science and Operations Research |
| topics[0].display_name | Data Quality and Management |
| topics[1].id | https://openalex.org/T11937 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9908999800682068 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1710 |
| topics[1].subfield.display_name | Information Systems |
| topics[1].display_name | Research Data Management Practices |
| topics[2].id | https://openalex.org/T11986 |
| topics[2].field.id | https://openalex.org/fields/18 |
| topics[2].field.display_name | Decision Sciences |
| topics[2].score | 0.9897000193595886 |
| topics[2].domain.id | https://openalex.org/domains/2 |
| topics[2].domain.display_name | Social Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1802 |
| topics[2].subfield.display_name | Information Systems and Management |
| topics[2].display_name | Scientific Computing and Data Management |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7624804973602295 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C159789966 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6844815611839294 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q80839 |
| concepts[1].display_name | Lingua franca |
| concepts[2].id | https://openalex.org/C91632574 |
| concepts[2].level | 2 |
| concepts[2].score | 0.5553174614906311 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q15088675 |
| concepts[2].display_name | Data curation |
| concepts[3].id | https://openalex.org/C2776760102 |
| concepts[3].level | 3 |
| concepts[3].score | 0.5549814105033875 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q5139990 |
| concepts[3].display_name | Code (set theory) |
| concepts[4].id | https://openalex.org/C115051666 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4140494465827942 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q6522493 |
| concepts[4].display_name | Ranging |
| concepts[5].id | https://openalex.org/C77088390 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3333205282688141 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q8513 |
| concepts[5].display_name | Database |
| concepts[6].id | https://openalex.org/C2522767166 |
| concepts[6].level | 1 |
| concepts[6].score | 0.3246012032032013 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q2374463 |
| concepts[6].display_name | Data science |
| concepts[7].id | https://openalex.org/C199360897 |
| concepts[7].level | 1 |
| concepts[7].score | 0.23432588577270508 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q9143 |
| concepts[7].display_name | Programming language |
| concepts[8].id | https://openalex.org/C177264268 |
| concepts[8].level | 2 |
| concepts[8].score | 0.22475388646125793 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1514741 |
| concepts[8].display_name | Set (abstract data type) |
| concepts[9].id | https://openalex.org/C15708023 |
| concepts[9].level | 1 |
| concepts[9].score | 0.09151339530944824 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q80083 |
| concepts[9].display_name | Humanities |
| concepts[10].id | https://openalex.org/C76155785 |
| concepts[10].level | 1 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q418 |
| concepts[10].display_name | Telecommunications |
| concepts[11].id | https://openalex.org/C138885662 |
| concepts[11].level | 0 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[11].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7624804973602295 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/lingua-franca |
| keywords[1].score | 0.6844815611839294 |
| keywords[1].display_name | Lingua franca |
| keywords[2].id | https://openalex.org/keywords/data-curation |
| keywords[2].score | 0.5553174614906311 |
| keywords[2].display_name | Data curation |
| keywords[3].id | https://openalex.org/keywords/code |
| keywords[3].score | 0.5549814105033875 |
| keywords[3].display_name | Code (set theory) |
| keywords[4].id | https://openalex.org/keywords/ranging |
| keywords[4].score | 0.4140494465827942 |
| keywords[4].display_name | Ranging |
| keywords[5].id | https://openalex.org/keywords/database |
| keywords[5].score | 0.3333205282688141 |
| keywords[5].display_name | Database |
| keywords[6].id | https://openalex.org/keywords/data-science |
| keywords[6].score | 0.3246012032032013 |
| keywords[6].display_name | Data science |
| keywords[7].id | https://openalex.org/keywords/programming-language |
| keywords[7].score | 0.23432588577270508 |
| keywords[7].display_name | Programming language |
| keywords[8].id | https://openalex.org/keywords/set |
| keywords[8].score | 0.22475388646125793 |
| keywords[8].display_name | Set (abstract data type) |
| keywords[9].id | https://openalex.org/keywords/humanities |
| keywords[9].score | 0.09151339530944824 |
| keywords[9].display_name | Humanities |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2306.11702 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2306.11702 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2306.11702 |
| locations[1].id | doi:10.48550/arxiv.2306.11702 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2306.11702 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5101718220 |
| authorships[0].author.orcid | https://orcid.org/0009-0008-3488-7312 |
| authorships[0].author.display_name | Zui Chen |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Chen, Zui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5049926126 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-9909-8607 |
| authorships[1].author.display_name | Lei Cao |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Cao, Lei |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5037742794 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Samuel Madden |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Madden, Sam |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2306.11702 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-06-22T00:00:00 |
| display_name | Lingua Manga: A Generic Large Language Model Centric System for Data Curation |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11719 |
| primary_topic.field.id | https://openalex.org/fields/18 |
| primary_topic.field.display_name | Decision Sciences |
| primary_topic.score | 0.9991000294685364 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1803 |
| primary_topic.subfield.display_name | Management Science and Operations Research |
| primary_topic.display_name | Data Quality and Management |
| related_works | https://openalex.org/W2783354812, https://openalex.org/W4384112194, https://openalex.org/W4312958259, https://openalex.org/W2103009189, https://openalex.org/W4308259661, https://openalex.org/W4390813131, https://openalex.org/W2349383066, https://openalex.org/W4328132048, https://openalex.org/W1969901537, https://openalex.org/W2376202349 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2306.11702 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2306.11702 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2306.11702 |
| primary_location.id | pmh:oai:arXiv.org:2306.11702 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2306.11702 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2306.11702 |
| publication_date | 2023-06-20 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 3, 26, 39 |
| abstract_inverted_index.To | 31 |
| abstract_inverted_index.in | 100 |
| abstract_inverted_index.is | 2 |
| abstract_inverted_index.it | 22 |
| abstract_inverted_index.of | 18, 77, 80 |
| abstract_inverted_index.or | 96 |
| abstract_inverted_index.to | 24 |
| abstract_inverted_index.we | 35, 83 |
| abstract_inverted_index.and | 41, 59, 65, 75, 94 |
| abstract_inverted_index.but | 10 |
| abstract_inverted_index.can | 88 |
| abstract_inverted_index.for | 55 |
| abstract_inverted_index.the | 16 |
| abstract_inverted_index.Data | 0 |
| abstract_inverted_index.area | 5 |
| abstract_inverted_index.both | 91 |
| abstract_inverted_index.data | 12, 28, 102 |
| abstract_inverted_index.even | 97 |
| abstract_inverted_index.high | 57 |
| abstract_inverted_index.many | 8 |
| abstract_inverted_index.such | 19 |
| abstract_inverted_index.that | 44, 85 |
| abstract_inverted_index.this | 33 |
| abstract_inverted_index.with | 72 |
| abstract_inverted_index.Manga | 51, 87 |
| abstract_inverted_index.label | 60 |
| abstract_inverted_index.large | 47 |
| abstract_inverted_index.makes | 21 |
| abstract_inverted_index.rapid | 66 |
| abstract_inverted_index.tasks | 20 |
| abstract_inverted_index.three | 69 |
| abstract_inverted_index.users | 76, 99 |
| abstract_inverted_index.which | 6 |
| abstract_inverted_index.while | 62 |
| abstract_inverted_index.Lingua | 37, 50, 86 |
| abstract_inverted_index.Manga, | 38 |
| abstract_inverted_index.assist | 90 |
| abstract_inverted_index.issue, | 34 |
| abstract_inverted_index.levels | 79 |
| abstract_inverted_index.offers | 52 |
| abstract_inverted_index.system | 43 |
| abstract_inverted_index.tasks. | 14 |
| abstract_inverted_index.Through | 68 |
| abstract_inverted_index.address | 32 |
| abstract_inverted_index.develop | 25 |
| abstract_inverted_index.example | 70 |
| abstract_inverted_index.models. | 49 |
| abstract_inverted_index.no-code | 98 |
| abstract_inverted_index.present | 36 |
| abstract_inverted_index.skilled | 92 |
| abstract_inverted_index.system. | 30 |
| abstract_inverted_index.varying | 78 |
| abstract_inverted_index.However, | 15 |
| abstract_inverted_index.contains | 7 |
| abstract_inverted_index.critical | 9 |
| abstract_inverted_index.curation | 1, 29, 103 |
| abstract_inverted_index.distinct | 73 |
| abstract_inverted_index.flexible | 64 |
| abstract_inverted_index.language | 48 |
| abstract_inverted_index.low-code | 95 |
| abstract_inverted_index.utilizes | 45 |
| abstract_inverted_index.achieving | 56 |
| abstract_inverted_index.automatic | 53 |
| abstract_inverted_index.diversity | 17 |
| abstract_inverted_index.technical | 81 |
| abstract_inverted_index.versatile | 42 |
| abstract_inverted_index.addressing | 101 |
| abstract_inverted_index.efficiency | 61 |
| abstract_inverted_index.objectives | 74 |
| abstract_inverted_index.processing | 13 |
| abstract_inverted_index.challenges. | 104 |
| abstract_inverted_index.challenging | 23 |
| abstract_inverted_index.demonstrate | 84 |
| abstract_inverted_index.effectively | 89 |
| abstract_inverted_index.performance | 58 |
| abstract_inverted_index.pre-trained | 46 |
| abstract_inverted_index.programmers | 93 |
| abstract_inverted_index.applications | 71 |
| abstract_inverted_index.development. | 67 |
| abstract_inverted_index.facilitating | 63 |
| abstract_inverted_index.optimization | 54 |
| abstract_inverted_index.proficiency, | 82 |
| abstract_inverted_index.wide-ranging | 4 |
| abstract_inverted_index.user-friendly | 40 |
| abstract_inverted_index.time-consuming | 11 |
| abstract_inverted_index.general-purpose | 27 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.75 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile | |
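
The `abstract_inverted_index.*` rows above store the abstract as a map from each word to the positions where it occurs (for example, `curation` at 1, 29, 103). A minimal sketch of how the plain-text abstract could be rebuilt from such a map follows. The `rebuild_abstract` name is hypothetical, and `sample` is a truncated subset of the rows above covering positions 0 to 5 only.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild running text from a word -> positions mapping."""
    by_position: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in ascending position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Truncated subset of the table rows (positions 0-5 only).
sample = {
    "Data": [0],
    "curation": [1],
    "is": [2],
    "a": [3],
    "wide-ranging": [4],
    "area": [5],
}

print(rebuild_abstract(sample))  # prints: Data curation is a wide-ranging area
```

Because punctuation is stored as part of each token (`tasks.`, `Manga,`), joining on spaces is enough to recover the original sentence boundaries.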