Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
2018 · Open Access · DOI: https://doi.org/10.48550/arxiv.1809.05501
In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcriptions. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide range of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. Because the GT takes the form of line image/transcription pairs, it is directly usable for training state-of-the-art recognition models for OCR software that employs recurrent neural networks in an LSTM architecture, such as Tesseract 4 or OCRopus. We also provide pretrained OCRopus models for subcorpora of our dataset, which yield character accuracy rates between 95% (early printings) and 98% (19th-century Fraktur printings) on unseen test cases, and a Perl script to harmonize GT produced under different transcription rules. Finally, we give hints on how to construct GT for OCR purposes, whose requirements may differ from those of linguistically motivated transcriptions.
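The abstract reports character accuracy rates but does not say how they are computed. A common convention, assumed here and not confirmed by the source, is one minus the total edit distance between OCR output and ground truth divided by the total number of ground-truth characters. The function names `levenshtein` and `character_accuracy` are illustrative only and are not part of the GT4HistOCR tooling.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Accuracy over (ground_truth, ocr_output) line pairs.

    Assumed definition: 1 - total_edit_distance / total_ground_truth_chars.
    """
    errors = 0
    total = 0
    for ground_truth, ocr_output in pairs:
        errors += levenshtein(ground_truth, ocr_output)
        total += len(ground_truth)
    if total == 0:
        raise ValueError("no ground-truth characters to evaluate")
    return 1.0 - errors / total
```

Under this definition, a single wrong character in a 20-character line corresponds to 95% accuracy for that line. Aggregating errors over all lines before dividing weights long lines more heavily than averaging per-line rates would.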
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/1809.05501, https://arxiv.org/pdf/1809.05501
- OA Status: green
- Cited By: 1
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4311252874
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4311252874 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.1809.05501 (Digital Object Identifier)
- Title: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2018 (year of publication)
- Publication date: 2018-09-14 (full publication date if available)
- Authors: Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter (list of authors in order)
- Landing page: https://arxiv.org/abs/1809.05501 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/1809.05501 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/1809.05501 (direct OA link when available)
- Concepts: German, Transcription (linguistics), Computer science, Natural language processing, Perl, Artificial intelligence, Optical character recognition, USable, Ground truth, Image (mathematics), Linguistics, World Wide Web, Philosophy (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 1 (total citation count in OpenAlex)
- Citations by year (recent): 2023: 1 (per-year citation counts, last 5 years)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4311252874 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.1809.05501 |
| ids.doi | https://doi.org/10.48550/arxiv.1809.05501 |
| ids.openalex | https://openalex.org/W4311252874 |
| fwci | 0.0 |
| type | preprint |
| title | Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10601 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9965000152587891 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Handwritten Text Recognition Techniques |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9941999912261963 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T13523 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.953499972820282 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1703 |
| topics[2].subfield.display_name | Computational Theory and Mathematics |
| topics[2].display_name | Mathematics, Computing, and Information Processing |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C154775046 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6601150035858154 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q188 |
| concepts[0].display_name | German |
| concepts[1].id | https://openalex.org/C179926584 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6528842449188232 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q207714 |
| concepts[1].display_name | Transcription (linguistics) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.6308586001396179 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C204321447 |
| concepts[3].level | 1 |
| concepts[3].score | 0.6160627603530884 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[3].display_name | Natural language processing |
| concepts[4].id | https://openalex.org/C2777002779 |
| concepts[4].level | 2 |
| concepts[4].score | 0.57771897315979 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q42478 |
| concepts[4].display_name | Perl |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.5400875210762024 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C546480517 |
| concepts[6].level | 3 |
| concepts[6].score | 0.4698326289653778 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q167555 |
| concepts[6].display_name | Optical character recognition |
| concepts[7].id | https://openalex.org/C2780615836 |
| concepts[7].level | 2 |
| concepts[7].score | 0.45979639887809753 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q2471869 |
| concepts[7].display_name | USable |
| concepts[8].id | https://openalex.org/C146849305 |
| concepts[8].level | 2 |
| concepts[8].score | 0.43004778027534485 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q370766 |
| concepts[8].display_name | Ground truth |
| concepts[9].id | https://openalex.org/C115961682 |
| concepts[9].level | 2 |
| concepts[9].score | 0.27223992347717285 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[9].display_name | Image (mathematics) |
| concepts[10].id | https://openalex.org/C41895202 |
| concepts[10].level | 1 |
| concepts[10].score | 0.22713297605514526 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[10].display_name | Linguistics |
| concepts[11].id | https://openalex.org/C136764020 |
| concepts[11].level | 1 |
| concepts[11].score | 0.20538988709449768 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q466 |
| concepts[11].display_name | World Wide Web |
| concepts[12].id | https://openalex.org/C138885662 |
| concepts[12].level | 0 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[12].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/german |
| keywords[0].score | 0.6601150035858154 |
| keywords[0].display_name | German |
| keywords[1].id | https://openalex.org/keywords/transcription |
| keywords[1].score | 0.6528842449188232 |
| keywords[1].display_name | Transcription (linguistics) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.6308586001396179 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/natural-language-processing |
| keywords[3].score | 0.6160627603530884 |
| keywords[3].display_name | Natural language processing |
| keywords[4].id | https://openalex.org/keywords/perl |
| keywords[4].score | 0.57771897315979 |
| keywords[4].display_name | Perl |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.5400875210762024 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/optical-character-recognition |
| keywords[6].score | 0.4698326289653778 |
| keywords[6].display_name | Optical character recognition |
| keywords[7].id | https://openalex.org/keywords/usable |
| keywords[7].score | 0.45979639887809753 |
| keywords[7].display_name | USable |
| keywords[8].id | https://openalex.org/keywords/ground-truth |
| keywords[8].score | 0.43004778027534485 |
| keywords[8].display_name | Ground truth |
| keywords[9].id | https://openalex.org/keywords/image |
| keywords[9].score | 0.27223992347717285 |
| keywords[9].display_name | Image (mathematics) |
| keywords[10].id | https://openalex.org/keywords/linguistics |
| keywords[10].score | 0.22713297605514526 |
| keywords[10].display_name | Linguistics |
| keywords[11].id | https://openalex.org/keywords/world-wide-web |
| keywords[11].score | 0.20538988709449768 |
| keywords[11].display_name | World Wide Web |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:1809.05501 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/1809.05501 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/1809.05501 |
| locations[1].id | doi:10.48550/arxiv.1809.05501 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.1809.05501 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5086884211 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Uwe Springmann |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Springmann, Uwe |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5063938010 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1776-1469 |
| authorships[1].author.display_name | Christian Reul |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Reul, Christian |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5063444213 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-4357-9078 |
| authorships[2].author.display_name | Stefanie Dipper |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Dipper, Stefanie |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5033897631 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Johannes Baiter |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Baiter, Johannes |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/1809.05501 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2022-12-25T00:00:00 |
| display_name | Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10601 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9965000152587891 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Handwritten Text Recognition Techniques |
| related_works | https://openalex.org/W2480056510, https://openalex.org/W2412623563, https://openalex.org/W190299232, https://openalex.org/W1521481095, https://openalex.org/W594361471, https://openalex.org/W2604964690, https://openalex.org/W2417897055, https://openalex.org/W2613359208, https://openalex.org/W2291334382, https://openalex.org/W1484205720 |
| cited_by_count | 1 |
| counts_by_year[0].year | 2023 |
| counts_by_year[0].cited_by_count | 1 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:1809.05501 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/1809.05501 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/1809.05501 |
| primary_location.id | pmh:oai:arXiv.org:1809.05501 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/1809.05501 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/1809.05501 |
| publication_date | 2018-09-14 |
| publication_year | 2018 |
| referenced_works_count | 0 |
| abstract_inverted_index.4 | 99 |
| abstract_inverted_index.a | 5, 39, 64, 132 |
| abstract_inverted_index.GT | 72, 137, 150 |
| abstract_inverted_index.In | 0 |
| abstract_inverted_index.We | 102 |
| abstract_inverted_index.as | 73, 97 |
| abstract_inverted_index.by | 139 |
| abstract_inverted_index.in | 17, 56, 93 |
| abstract_inverted_index.is | 60 |
| abstract_inverted_index.it | 78 |
| abstract_inverted_index.of | 7, 20, 34, 42, 71, 111 |
| abstract_inverted_index.on | 128, 146 |
| abstract_inverted_index.or | 100 |
| abstract_inverted_index.to | 51, 81, 135, 148 |
| abstract_inverted_index.we | 3 |
| abstract_inverted_index.4.0 | 66 |
| abstract_inverted_index.OCR | 16, 87, 152 |
| abstract_inverted_index.The | 68 |
| abstract_inverted_index.and | 9, 59, 119, 143 |
| abstract_inverted_index.for | 14, 86, 109, 151 |
| abstract_inverted_index.has | 155 |
| abstract_inverted_index.how | 147 |
| abstract_inverted_index.may | 158 |
| abstract_inverted_index.our | 112 |
| abstract_inverted_index.the | 18, 48 |
| abstract_inverted_index.(GT) | 13 |
| abstract_inverted_index.15th | 49 |
| abstract_inverted_index.19th | 52 |
| abstract_inverted_index.95\% | 116 |
| abstract_inverted_index.98\% | 120 |
| abstract_inverted_index.LSTM | 94 |
| abstract_inverted_index.Perl | 133 |
| abstract_inverted_index.This | 29 |
| abstract_inverted_index.also | 103 |
| abstract_inverted_index.form | 19, 70 |
| abstract_inverted_index.from | 45, 47, 160 |
| abstract_inverted_index.give | 144 |
| abstract_inverted_index.line | 23, 36, 74 |
| abstract_inverted_index.some | 105 |
| abstract_inverted_index.such | 96 |
| abstract_inverted_index.test | 130 |
| abstract_inverted_index.text | 22 |
| abstract_inverted_index.that | 157 |
| abstract_inverted_index.this | 1 |
| abstract_inverted_index.wide | 40 |
| abstract_inverted_index.with | 26 |
| abstract_inverted_index.(19th | 121 |
| abstract_inverted_index.CC-BY | 65 |
| abstract_inverted_index.Latin | 10 |
| abstract_inverted_index.books | 54 |
| abstract_inverted_index.dates | 44 |
| abstract_inverted_index.hints | 145 |
| abstract_inverted_index.makes | 77 |
| abstract_inverted_index.pairs | 37, 76 |
| abstract_inverted_index.paper | 2 |
| abstract_inverted_index.rates | 127 |
| abstract_inverted_index.their | 27 |
| abstract_inverted_index.train | 82 |
| abstract_inverted_index.types | 58 |
| abstract_inverted_index.under | 63 |
| abstract_inverted_index.which | 154 |
| abstract_inverted_index.(early | 117 |
| abstract_inverted_index.German | 8 |
| abstract_inverted_index.called | 31 |
| abstract_inverted_index.cases, | 131 |
| abstract_inverted_index.differ | 159 |
| abstract_inverted_index.images | 24 |
| abstract_inverted_index.models | 85, 108 |
| abstract_inverted_index.neural | 91 |
| abstract_inverted_index.openly | 61 |
| abstract_inverted_index.paired | 25 |
| abstract_inverted_index.period | 41 |
| abstract_inverted_index.rules, | 142 |
| abstract_inverted_index.script | 134 |
| abstract_inverted_index.truth} | 12 |
| abstract_inverted_index.unseen | 129 |
| abstract_inverted_index.usable | 80 |
| abstract_inverted_index.313,173 | 35 |
| abstract_inverted_index.Fraktur | 57, 123 |
| abstract_inverted_index.OCRopus | 107 |
| abstract_inverted_index.between | 115 |
| abstract_inverted_index.century | 50, 53, 122 |
| abstract_inverted_index.dataset | 6, 113 |
| abstract_inverted_index.printed | 21, 55 |
| abstract_inverted_index.provide | 104 |
| abstract_inverted_index.special | 69 |
| abstract_inverted_index.OCRopus. | 101 |
| abstract_inverted_index.accuracy | 126 |
| abstract_inverted_index.consists | 33 |
| abstract_inverted_index.covering | 38 |
| abstract_inverted_index.dataset, | 30 |
| abstract_inverted_index.describe | 4 |
| abstract_inverted_index.directly | 79 |
| abstract_inverted_index.license. | 67 |
| abstract_inverted_index.networks | 92 |
| abstract_inverted_index.printing | 43 |
| abstract_inverted_index.produced | 138 |
| abstract_inverted_index.purposes | 153 |
| abstract_inverted_index.software | 88 |
| abstract_inverted_index.yielding | 114 |
| abstract_inverted_index.Tesseract | 98 |
| abstract_inverted_index.available | 62 |
| abstract_inverted_index.character | 125 |
| abstract_inverted_index.construct | 149 |
| abstract_inverted_index.different | 140 |
| abstract_inverted_index.employing | 89 |
| abstract_inverted_index.harmonize | 136 |
| abstract_inverted_index.motivated | 162 |
| abstract_inverted_index.recurring | 90 |
| abstract_inverted_index.historical | 15 |
| abstract_inverted_index.incunabula | 46 |
| abstract_inverted_index.pretrained | 106 |
| abstract_inverted_index.printings) | 118, 124 |
| abstract_inverted_index.subcorpora | 110 |
| abstract_inverted_index.recognition | 84 |
| abstract_inverted_index.architecture | 95 |
| abstract_inverted_index.requirements | 156 |
| abstract_inverted_index.transcription | 141 |
| abstract_inverted_index.\textit{ground | 11 |
| abstract_inverted_index.linguistically | 161 |
| abstract_inverted_index.transcription. | 28 |
| abstract_inverted_index.transcriptions. | 163 |
| abstract_inverted_index.state-of-the-art | 83 |
| abstract_inverted_index.image/transcription | 75 |
| abstract_inverted_index.\textit{GT4HistOCR}, | 32 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| sustainable_development_goals[0].id | https://metadata.un.org/sdg/4 |
| sustainable_development_goals[0].score | 0.7599999904632568 |
| sustainable_development_goals[0].display_name | Quality Education |
| citation_normalized_percentile |
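
The `abstract_inverted_index` rows in the payload above store the abstract as a map from each token to the word positions where it occurs. The following is a minimal sketch of how such an index can be turned back into running text, assuming the rows have already been parsed into a Python dictionary. The function name `rebuild_abstract` and the `sample` dictionary are illustrative; the sample is a small subset of the rows above, restricted to positions 0 through 8.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Reassemble text from a token -> positions mapping."""
    # Invert the mapping: every position points at exactly one token.
    token_at: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            token_at[position] = token
    # Join tokens in position order; missing positions are simply skipped.
    return " ".join(token_at[position] for position in sorted(token_at))


# Subset of the payload rows (hypothetical parsed form).
sample = {
    "In": [0],
    "this": [1],
    "paper": [2],
    "we": [3],
    "describe": [4],
    "a": [5],
    "dataset": [6],
    "of": [7],
    "German": [8],
}

print(rebuild_abstract(sample))
```

Punctuation stays attached to its token (for example `dataset,` and `OCRopus.` in the rows above), so joining on single spaces is enough to recover readable text.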