Constantine Lignos
Beyond statistical significance: Quantifying uncertainty and statistical variability in multilingual and multitask NLP evaluation
We introduce a set of resampling-based methods for quantifying uncertainty and statistical precision of evaluation metrics in multilingual and/or multitask NLP benchmarks. We show how experimental variation in performance scores arises fro…
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotatio…
CoNLL#: Fine-grained Error Analysis and a Corrected Test Set for CoNLL-03 English
Modern named entity recognition systems have steadily improved performance in the age of larger and more powerful neural models. However, over the past several years, the state-of-the-art has seemingly hit another plateau on the benchmark …
QueryNER: Segmentation of E-commerce Queries
We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction which focuses on extracting portions o…
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata
We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a…
What changes when you randomly choose BPE merge operations? Not much
We introduce three simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into morpholog…
Improving NER Research Workflows with SeqScore
We describe the features of SeqScore, an MIT-licensed Python toolkit for working with named entity recognition (NER) data. While SeqScore began as a tool for NER scoring, it has been expanded to help with the full lifecycle of working with …
What changes when you randomly choose BPE merge operations? Not much.
We introduce two simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into morphologic…
LR-Sum: Summarization for Less-Resourced Languages
We introduce LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are…
LR-Sum: Summarization for Less-Resourced Languages
This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for…
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of…
Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing
We present a new corpus of Twitter data annotated for codeswitching and borrowing between Spanish and English. The corpus contains 9,500 tweets annotated at the token level with codeswitches, borrowings, and named entities. This corpus dif…
ParaNames: A Massively Multilingual Entity Name Corpus
We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning across 400 languages. Names are provided for 13.6 million entities which are mapped to standardized entity types (PER/LOC/ORG). Using Wi…
Toward More Meaningful Resources for Lower-resourced Languages
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. We first examine two massively multilingual resources in …
ParaNames: A Massively Multilingual Entity Name Corpus
We present ParaNames, a Wikidata-derived multilingual parallel name resource consisting of names for approximately 14 million entities spanning over 400 languages. ParaNames is useful for multilingual language processing, both in defining …
Toward More Meaningful Resources for Lower-resourced Languages
In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two mass…
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonav…
Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling
This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings—words f…
Overview of ADoBo 2021: Automatic Detection of Unassimilated Borrowings in the Spanish Press
This paper summarizes the main findings of the ADoBo 2021 shared task, proposed in the context of IberLef 2021. In this task, we invited participants to detect lexical borrowings (coming mostly from English) in Spanish newswire texts. This…
SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center a…
Addressing Barriers to Reproducible Named Entity Recognition Evaluation
To address what we believe is a looming crisis of unreproducible evaluation for named entity recognition tasks, we present guidelines for reproducible evaluation. The guidelines we propose are extremely simple, focusing on transparency reg…
MasakhaNER: Named entity recognition for African languages
Mining Wikidata for Name Resources for African Languages
This work supports further development of language technology for the languages of Africa by providing a Wikidata-derived resource of name lists corresponding to common entity types (person, location, and organization). While we are not th…
TMR: Evaluating NER Recall on Tough Mentions
We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of "tough" mentions: unseen mentions, those whose tokens or token/type combinati…