AutoMeta-ETD500

Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox

2023 · Open Access · DOI: https://doi.org/10.7910/dvn/18d6az · OA: W4398604133

AutoMeta-ETD500 contains 500 scanned Electronic Theses and Dissertations (ETDs). The dataset is used to develop a framework called AutoMeta, which automatically extracts seven key metadata fields (e.g., title, author, advisor, university, department, and year) that are ubiquitous in ETDs. For this task, the dataset is split into the following seven intermediate datasets:

a) PDF.zip: 500 ETD samples from different US and non-US universities.
b) XML_JSON.zip: metadata for 100 ETDs, downloaded from the MIT and Virginia Tech ETD library repositories.
c) HTML.zip: metadata for the remaining 400 ETDs, downloaded from ProQuest.
d) Tiff.zip: TIFF images of the cover pages of the 500 scanned ETDs.
e) noisy.zip: the noisy text for the 500 ETD samples, generated by Tesseract OCR.
f) clean.zip: the clean text for the 500 ETD samples, manually corrected from the noisy data.
g) annotated.zip: all annotated data in XML. Annotation was done using the GATE annotation tool.
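As a quick sanity check after downloading, the seven archives listed above can be inspected with a short script. The following is a minimal Python sketch, not part of the dataset's own tooling. The local folder name `AutoMeta-ETD500` and the helper names `count_members` and `summarize` are assumptions made for illustration. The description does not document the internal layout of the archives, so the sketch only counts the files in each one.

```python
import zipfile
from pathlib import Path

# Archive names and contents as listed in the dataset description.
ARCHIVES = {
    "PDF.zip": "500 scanned ETD samples",
    "XML_JSON.zip": "metadata for 100 ETDs (MIT, Virginia Tech)",
    "HTML.zip": "metadata for 400 ETDs (ProQuest)",
    "Tiff.zip": "TIFF cover-page images",
    "noisy.zip": "raw Tesseract OCR output",
    "clean.zip": "manually corrected OCR text",
    "annotated.zip": "GATE annotations in XML",
}


def count_members(zip_path):
    """Return the number of regular files (not directories) in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())


def summarize(data_dir):
    """Map each archive name to its file count, or None if it is not present."""
    data_dir = Path(data_dir)
    summary = {}
    for name in ARCHIVES:
        path = data_dir / name
        summary[name] = count_members(path) if path.is_file() else None
    return summary


if __name__ == "__main__":
    # "AutoMeta-ETD500" is a hypothetical local download folder.
    for name, count in summarize("AutoMeta-ETD500").items():
        status = "missing" if count is None else f"{count} files"
        print(f"{name:15s} {status:>12s}  {ARCHIVES[name]}")
```

Comparing the reported counts against the numbers in the description (for example, 100 metadata records in XML_JSON.zip and 400 in HTML.zip) can flag an incomplete download, although an archive may contain more than one file per ETD.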