KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs
Building high-quality knowledge graphs (KGs) from diverse sources requires combining methods for information extraction, data transformation, ontology mapping, entity matching, and data fusion. Numerous methods and tools exist for each of these tasks, but support for combining them into reproducible and effective end-to-end pipelines is still lacking. We present a new framework, KGpipe, for defining and executing integration pipelines that can combine existing tools or large language model (LLM) functionality. To evaluate different pipelines and the resulting KGs, we propose a benchmark that integrates heterogeneous data in different formats (RDF, JSON, text) into a seed KG. We demonstrate the flexibility of KGpipe by running and comparatively evaluating several pipelines that integrate sources of the same or different formats, using selected performance and quality metrics.
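
To make the idea of an integration pipeline concrete, the following is a minimal sketch of chaining integration tasks over a source and merging the result into a seed KG. It does not reproduce KGpipe's API, which this record does not describe: the `Pipeline` class, the step functions `map_ontology` and `match_entities`, and all identifiers such as `src:42` and `kg:Leipzig` are hypothetical, and the final merge is a plain set union standing in for data fusion.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# A triple is (subject, predicate, object); a step maps triples to triples.
Triple = tuple[str, str, str]
Step = Callable[[set[Triple]], set[Triple]]


@dataclass
class Pipeline:
    """Hypothetical pipeline: an ordered list of integration steps."""
    name: str
    steps: list[Step]

    def run(self, seed: set[Triple], source: Iterable[Triple]) -> set[Triple]:
        data = set(source)
        for step in self.steps:
            data = step(data)
        # Simplistic stand-in for data fusion: union with the seed KG.
        return seed | data


def map_ontology(triples: set[Triple]) -> set[Triple]:
    """Rename source predicates to target-ontology predicates (toy mapping)."""
    mapping = {"name": "rdfs:label"}
    return {(s, mapping.get(p, p), o) for s, p, o in triples}


def match_entities(triples: set[Triple]) -> set[Triple]:
    """Replace source subjects with matched seed-KG entities (toy lookup)."""
    same_as = {"src:42": "kg:Leipzig"}
    return {(same_as.get(s, s), p, o) for s, p, o in triples}


seed = {("kg:Leipzig", "rdf:type", "kg:City")}
source = [("src:42", "name", "Leipzig")]
result = Pipeline("demo", [map_ontology, match_entities]).run(seed, source)
for triple in sorted(result):
    print(triple)
```

Modeling each task as a function from triples to triples keeps the steps interchangeable, so two pipelines that differ in a single step can be run on the same input and compared.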
- Type: article
- Landing Page: http://arxiv.org/abs/2511.18364, https://arxiv.org/pdf/2511.18364
- OA Status: green
- OpenAlex ID: https://openalex.org/W7106782941
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W7106782941 (canonical identifier for this work in OpenAlex)
- Title: KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs (work title)
- Type: article (OpenAlex work type)
- Publication year: 2025 (year of publication)
- Publication date: 2025-11-23 (full publication date if available)
- Authors: Hofer, Marvin; Rahm, Erhard (list of authors in order)
- Landing page: https://arxiv.org/abs/2511.18364 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2511.18364 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2511.18364 (direct OA link when available)
- Concepts: Computer science, Pipeline transport, Data integration, Flexibility (engineering), Ontology, Benchmark (surveying), Data mining, Pipeline (software), Database, Information integration, Quality (philosophy), Knowledge graph, Data quality, Data science, Software engineering, Data modeling, Information retrieval, Ontology-based data integration, System integration, Data model (GIS), Data access, Data source, Knowledge integration (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| field | value |
|---|---|
| id | https://openalex.org/W7106782941 |
| doi | |
| ids.openalex | https://openalex.org/W7106782941 |
| fwci | 0.0 |
| type | article |
| title | KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11273 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.763526201248169 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Advanced Graph Neural Networks |
| topics[1].id | https://openalex.org/T11719 |
| topics[1].field.id | https://openalex.org/fields/18 |
| topics[1].field.display_name | Decision Sciences |
| topics[1].score | 0.11983056366443634 |
| topics[1].domain.id | https://openalex.org/domains/2 |
| topics[1].domain.display_name | Social Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1803 |
| topics[1].subfield.display_name | Management Science and Operations Research |
| topics[1].display_name | Data Quality and Management |
| topics[2].id | https://openalex.org/T10215 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.06443244963884354 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Semantic Web and Ontologies |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7721224427223206 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C175309249 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7344516515731812 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q725864 |
| concepts[1].display_name | Pipeline transport |
| concepts[2].id | https://openalex.org/C72634772 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6052640676498413 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q386824 |
| concepts[2].display_name | Data integration |
| concepts[3].id | https://openalex.org/C2780598303 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5988371968269348 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q65921492 |
| concepts[3].display_name | Flexibility (engineering) |
| concepts[4].id | https://openalex.org/C25810664 |
| concepts[4].level | 2 |
| concepts[4].score | 0.5841217637062073 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q44325 |
| concepts[4].display_name | Ontology |
| concepts[5].id | https://openalex.org/C185798385 |
| concepts[5].level | 2 |
| concepts[5].score | 0.5716614127159119 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q1161707 |
| concepts[5].display_name | Benchmark (surveying) |
| concepts[6].id | https://openalex.org/C124101348 |
| concepts[6].level | 1 |
| concepts[6].score | 0.5433017611503601 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q172491 |
| concepts[6].display_name | Data mining |
| concepts[7].id | https://openalex.org/C43521106 |
| concepts[7].level | 2 |
| concepts[7].score | 0.4987257122993469 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q2165493 |
| concepts[7].display_name | Pipeline (software) |
| concepts[8].id | https://openalex.org/C77088390 |
| concepts[8].level | 1 |
| concepts[8].score | 0.3826621174812317 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q8513 |
| concepts[8].display_name | Database |
| concepts[9].id | https://openalex.org/C33326189 |
| concepts[9].level | 2 |
| concepts[9].score | 0.36895057559013367 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q17092450 |
| concepts[9].display_name | Information integration |
| concepts[10].id | https://openalex.org/C2779530757 |
| concepts[10].level | 2 |
| concepts[10].score | 0.3638152778148651 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q1207505 |
| concepts[10].display_name | Quality (philosophy) |
| concepts[11].id | https://openalex.org/C2987255567 |
| concepts[11].level | 2 |
| concepts[11].score | 0.3625951409339905 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q33002955 |
| concepts[11].display_name | Knowledge graph |
| concepts[12].id | https://openalex.org/C24756922 |
| concepts[12].level | 3 |
| concepts[12].score | 0.3314189016819 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q1757694 |
| concepts[12].display_name | Data quality |
| concepts[13].id | https://openalex.org/C2522767166 |
| concepts[13].level | 1 |
| concepts[13].score | 0.3308601677417755 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q2374463 |
| concepts[13].display_name | Data science |
| concepts[14].id | https://openalex.org/C115903868 |
| concepts[14].level | 1 |
| concepts[14].score | 0.32673922181129456 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q80993 |
| concepts[14].display_name | Software engineering |
| concepts[15].id | https://openalex.org/C67186912 |
| concepts[15].level | 2 |
| concepts[15].score | 0.30801451206207275 |
| concepts[15].wikidata | https://www.wikidata.org/wiki/Q367664 |
| concepts[15].display_name | Data modeling |
| concepts[16].id | https://openalex.org/C23123220 |
| concepts[16].level | 1 |
| concepts[16].score | 0.30312424898147583 |
| concepts[16].wikidata | https://www.wikidata.org/wiki/Q816826 |
| concepts[16].display_name | Information retrieval |
| concepts[17].id | https://openalex.org/C22550185 |
| concepts[17].level | 3 |
| concepts[17].score | 0.2868225574493408 |
| concepts[17].wikidata | https://www.wikidata.org/wiki/Q7095047 |
| concepts[17].display_name | Ontology-based data integration |
| concepts[18].id | https://openalex.org/C19527686 |
| concepts[18].level | 2 |
| concepts[18].score | 0.27596157789230347 |
| concepts[18].wikidata | https://www.wikidata.org/wiki/Q1665453 |
| concepts[18].display_name | System integration |
| concepts[19].id | https://openalex.org/C100463513 |
| concepts[19].level | 2 |
| concepts[19].score | 0.2747898995876312 |
| concepts[19].wikidata | https://www.wikidata.org/wiki/Q5227322 |
| concepts[19].display_name | Data model (GIS) |
| concepts[20].id | https://openalex.org/C47487241 |
| concepts[20].level | 2 |
| concepts[20].score | 0.27121010422706604 |
| concepts[20].wikidata | https://www.wikidata.org/wiki/Q5227230 |
| concepts[20].display_name | Data access |
| concepts[21].id | https://openalex.org/C2983685735 |
| concepts[21].level | 2 |
| concepts[21].score | 0.26151013374328613 |
| concepts[21].wikidata | https://www.wikidata.org/wiki/Q5227355 |
| concepts[21].display_name | Data source |
| concepts[22].id | https://openalex.org/C56289545 |
| concepts[22].level | 3 |
| concepts[22].score | 0.25037357211112976 |
| concepts[22].wikidata | https://www.wikidata.org/wiki/Q6423376 |
| concepts[22].display_name | Knowledge integration |
| keywords[0].id | https://openalex.org/keywords/pipeline-transport |
| keywords[0].score | 0.7344516515731812 |
| keywords[0].display_name | Pipeline transport |
| keywords[1].id | https://openalex.org/keywords/data-integration |
| keywords[1].score | 0.6052640676498413 |
| keywords[1].display_name | Data integration |
| keywords[2].id | https://openalex.org/keywords/flexibility |
| keywords[2].score | 0.5988371968269348 |
| keywords[2].display_name | Flexibility (engineering) |
| keywords[3].id | https://openalex.org/keywords/ontology |
| keywords[3].score | 0.5841217637062073 |
| keywords[3].display_name | Ontology |
| keywords[4].id | https://openalex.org/keywords/benchmark |
| keywords[4].score | 0.5716614127159119 |
| keywords[4].display_name | Benchmark (surveying) |
| keywords[5].id | https://openalex.org/keywords/pipeline |
| keywords[5].score | 0.4987257122993469 |
| keywords[5].display_name | Pipeline (software) |
| keywords[6].id | https://openalex.org/keywords/information-integration |
| keywords[6].score | 0.36895057559013367 |
| keywords[6].display_name | Information integration |
| keywords[7].id | https://openalex.org/keywords/quality |
| keywords[7].score | 0.3638152778148651 |
| keywords[7].display_name | Quality (philosophy) |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2511.18364 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2511.18364 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2511.18364 |
| indexed_in | arxiv |
| authorships[0].author.id | https://openalex.org/A4322147174 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Hofer, Marvin |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Hofer, Marvin |
| authorships[0].is_corresponding | True |
| authorships[1].author.id | https://openalex.org/A2742500765 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Rahm Erhard |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Rahm, Erhard |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2511.18364 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-27T00:00:00 |
| display_name | KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-27T01:16:37.896743 |
| primary_topic.id | https://openalex.org/T11273 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.763526201248169 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Advanced Graph Neural Networks |
| cited_by_count | 0 |
| locations_count | 1 |
| best_oa_location.id | pmh:oai:arXiv.org:2511.18364 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2511.18364 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2511.18364 |
| primary_location.id | pmh:oai:arXiv.org:2511.18364 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2511.18364 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2511.18364 |
| publication_date | 2025-11-23 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 49, 80, 93 |
| abstract_inverted_index.To | 70 |
| abstract_inverted_index.We | 47, 96 |
| abstract_inverted_index.by | 102 |
| abstract_inverted_index.is | 44 |
| abstract_inverted_index.of | 30, 86, 100, 111 |
| abstract_inverted_index.or | 64, 114 |
| abstract_inverted_index.to | 82 |
| abstract_inverted_index.we | 78 |
| abstract_inverted_index.KG. | 95 |
| abstract_inverted_index.LLM | 65 |
| abstract_inverted_index.and | 20, 25, 40, 55, 74, 104, 120 |
| abstract_inverted_index.but | 33 |
| abstract_inverted_index.can | 60 |
| abstract_inverted_index.for | 11, 28, 35, 53 |
| abstract_inverted_index.new | 50 |
| abstract_inverted_index.the | 75, 98, 112 |
| abstract_inverted_index.KGs, | 77 |
| abstract_inverted_index.data | 14, 21, 85 |
| abstract_inverted_index.each | 29 |
| abstract_inverted_index.from | 5 |
| abstract_inverted_index.into | 38, 92 |
| abstract_inverted_index.same | 113 |
| abstract_inverted_index.seed | 94 |
| abstract_inverted_index.that | 59 |
| abstract_inverted_index.them | 37 |
| abstract_inverted_index.(KGs) | 4 |
| abstract_inverted_index.(RDF, | 89 |
| abstract_inverted_index.JSON, | 90 |
| abstract_inverted_index.exist | 27 |
| abstract_inverted_index.still | 45 |
| abstract_inverted_index.text) | 91 |
| abstract_inverted_index.these | 31 |
| abstract_inverted_index.tools | 26, 63 |
| abstract_inverted_index.using | 117 |
| abstract_inverted_index.(Large | 66 |
| abstract_inverted_index.KGpipe | 52, 101 |
| abstract_inverted_index.Model) | 68 |
| abstract_inverted_index.entity | 18 |
| abstract_inverted_index.graphs | 3 |
| abstract_inverted_index.tasks, | 32 |
| abstract_inverted_index.combine | 61 |
| abstract_inverted_index.diverse | 6 |
| abstract_inverted_index.formats | 88, 116 |
| abstract_inverted_index.fusion. | 22 |
| abstract_inverted_index.methods | 10, 24 |
| abstract_inverted_index.present | 48 |
| abstract_inverted_index.propose | 79 |
| abstract_inverted_index.quality | 121 |
| abstract_inverted_index.running | 103 |
| abstract_inverted_index.several | 107 |
| abstract_inverted_index.sources | 7, 110 |
| abstract_inverted_index.support | 34 |
| abstract_inverted_index.Building | 0 |
| abstract_inverted_index.Language | 67 |
| abstract_inverted_index.Numerous | 23 |
| abstract_inverted_index.defining | 54 |
| abstract_inverted_index.evaluate | 71 |
| abstract_inverted_index.existing | 62 |
| abstract_inverted_index.lacking. | 46 |
| abstract_inverted_index.mapping, | 17 |
| abstract_inverted_index.metrics. | 122 |
| abstract_inverted_index.ontology | 16 |
| abstract_inverted_index.requires | 8 |
| abstract_inverted_index.selected | 118 |
| abstract_inverted_index.benchmark | 81 |
| abstract_inverted_index.combining | 9, 36 |
| abstract_inverted_index.different | 72, 87, 115 |
| abstract_inverted_index.effective | 41 |
| abstract_inverted_index.executing | 56 |
| abstract_inverted_index.integrate | 83 |
| abstract_inverted_index.knowledge | 2 |
| abstract_inverted_index.matching, | 19 |
| abstract_inverted_index.pipelines | 43, 58, 73, 108 |
| abstract_inverted_index.resulting | 76 |
| abstract_inverted_index.end-to-end | 42 |
| abstract_inverted_index.evaluating | 106 |
| abstract_inverted_index.framework, | 51 |
| abstract_inverted_index.demonstrate | 97 |
| abstract_inverted_index.extraction, | 13 |
| abstract_inverted_index.flexibility | 99 |
| abstract_inverted_index.information | 12 |
| abstract_inverted_index.integrating | 109 |
| abstract_inverted_index.integration | 57 |
| abstract_inverted_index.performance | 119 |
| abstract_inverted_index.high-quality | 1 |
| abstract_inverted_index.reproducible | 39 |
| abstract_inverted_index.comparatively | 105 |
| abstract_inverted_index.heterogeneous | 84 |
| abstract_inverted_index.functionality. | 69 |
| abstract_inverted_index.transformation, | 15 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile.value | 0.91800783 |
| citation_normalized_percentile.is_in_top_1_percent | False |
| citation_normalized_percentile.is_in_top_10_percent | True |
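
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows. The function name `rebuild_abstract` is an assumption made for illustration, and `excerpt` contains only the tokens at positions 0 to 8 from the table (the second position listed for `sources`, 110, is left out so the excerpt stays contiguous).

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Reconstruct text from a token -> positions inverted index."""
    by_position: dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Emit tokens in ascending position order, separated by single spaces.
    return " ".join(by_position[i] for i in sorted(by_position))


# Excerpt of the payload above: tokens at positions 0..8 only.
excerpt = {
    "Building": [0],
    "high-quality": [1],
    "knowledge": [2],
    "graphs": [3],
    "(KGs)": [4],
    "from": [5],
    "diverse": [6],
    "sources": [7],
    "requires": [8],
}

print(rebuild_abstract(excerpt))
# prints "Building high-quality knowledge graphs (KGs) from diverse sources requires"
```

Applied to the full set of `abstract_inverted_index.*` rows, the same function should reproduce the abstract shown at the top of this record, since punctuation is stored as part of each token.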