Daniel van Strien
Documenting Geographically and Contextually Diverse Language Data Sources Open
Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data …
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Open
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, w…
Datasheets for Digital Cultural Heritage Datasets Open
Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide co…
Metadata Might Make Language Models Better Open
This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al., 2022 and compare …
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Open
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich…
Computer Vision for the Humanities: An Introduction to Deep Learning for Image Classification (Part 2) Open
This is the second of a two-part lesson introducing deep learning based computer vision methods for humanities research. This lesson digs deeper into the details of training a deep learning based computer vision model. It covers some chall…
Computer Vision for the Humanities: An Introduction to Deep Learning for Image Classification (Part 1) Open
This is the first of a two-part lesson introducing deep learning based computer vision methods for humanities research. Using a dataset of historical newspaper advertisements and the fastai Python library, the lesson walks through the pipe…
Historic manuscript page images with noisy labels Open
Images of digitised manuscript pages sourced from https://iiif.biblissima.fr/collections/. This dataset aims to facilitate experiments using existing data/metadata to train computer vision models. In particular, using 'noisy' labels in som…
Historic manuscript page images with noisy labels Open
Images of manuscript pages sourced from https://iiif.biblissima.fr/collections/. This dataset includes the labels included in the IIIF manifests for these images.
BigScience AI4LAM presentation Open
Presentation on BigScience project given at an AI4LAM community call on the 21st June 2022
AI training resources for GLAM: a snapshot Open
We take a snapshot of current resources available for teaching and learning AI with a focus on the Galleries, Libraries, Archives and Museums (GLAM) community. The review was carried out in 2021 and 2022. The review provides an overview of…
Sample of Digitised Books - Images identified as Embellishments. c. 1510 - c. 1900. JPG Open
A subsample of Digitised Books - Images identified as Embellishments. c. 1510 - c. 1900. JPG https://bl.iro.bl.uk/concern/datasets/59d1aa35-c2d7-46e5-9475-9d0cd8df721e?locale=en
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources Open
In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect …
19th Century United States Newspaper images predicted as Photographs with labels for "human", "animal", "human-structure" and "landscape" Open
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/). [The Newspaper Navi…
Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0 Open
Francesco De Toni, Christopher Akiki, Javier De La Rosa, Clémentine Fourrier, Enrique Manjavacas, Stefan Schweter, Daniel Van Strien. Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language …
A Dataset for Toponym Resolution in Nineteenth-Century English Newspapers Open
We present a new dataset for the task of toponym resolution in digitized historical newspapers in English. It consists of 343 annotated articles from newspapers based in four different locations in England (Manchester, Ashton-under-Lyne, P…
19th Century United States Newspaper Advert images with 'illustrated' or 'non illustrated' labels Open
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/). [The Newspaper Navigator …
British Library Books genre detection model Open
Model description This model is intended to predict, from the title of a book, whether it is 'fiction' or 'non-fiction'. This model was trained on data created from the Digitised printed books (18th-19th Century) book collection. The datas…
Maps of a Nation? The Digitized Ordnance Survey for New Historical Research Open
Although the Ordnance Survey has itself been the subject of historical research, scholars have not systematically used its maps as primary sources of information. This is partly for disciplinary reasons and partly for the technical reason …
A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching Open
Recognizing toponyms and resolving them to their real-world referents is required for providing advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is th…
Images from Newspaper Navigator predicted as maps, with human corrected labels Open
The Dataset contains images derived from the Newspaper Navigator (news-navigator.labs.loc.gov/), a dataset of images drawn from the Library of Congress Chronicling America collection (chroniclingamerica.loc.gov/). [The Newspaper Navigator …