Explanipedia

Improving Informally Romanized Language Identification Open

Adrian Benton, Alexander Gutkin, Christo Kirov, Brian Roark · 2025

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such rom…

Language-Agnostic Multilingual Modeling Open

Arindrima Datta, Bhuvana Ramabhadran, Jesse Emond, Anjuli Kannan, Brian Roark · 2024

Computer science Philosophy

Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the…

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages Open

Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, et al. · 2023

Computer science Geography Physics

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible…

Spelling convention sensitivity in neural language models Open

Elizabeth Nielsen, Christo Kirov, Brian Roark · 2023

Computer science Philosophy

We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistent…

Beyond Arabic: Software for Perso-Arabic Script Manipulation Open

Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat · 2023

Computer science Philosophy Sociology

This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operati…

Distinguishing Romanized Hindi from Romanized Urdu Open

Elizabeth Nielsen, Christo Kirov, Brian Roark · 2023

Computer science Philosophy

We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin scr…

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages Open

Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, et al. · 2023

Art Philosophy Computer science

Sebastian Ruder, Jonathan Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana Dickinson, Brian Roark, Bid…

Context-aware Transliteration of Romanized South Asian Languages Open

Christo Kirov, Cibu Johny, Anna Katanova, Alexander Gutkin, Brian Roark · 2023

Computer science History Philosophy

While most transliteration research is focused on single tokens such as named entities, for example transliteration of અમદાવાદ from the Gujarati script to the Latin script “Ahmedabad” (the most populous city in the Indian state of Gujarat), …

Graphemic Normalization of the Perso-Arabic Script Open

Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, Richard Sproat · 2022

Computer science Sociology Philosophy

Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctua…

Beyond Arabic: Software for Perso-Arabic Script Manipulation Open

Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat · 2022

Computer science Philosophy Sociology

This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operati…

Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022) Open

Sarah Ebling, Emily Prud’hommeaux, Preethi Vaidyanathan, Sara Candeias, Cecilia Ovesdotter Alm, et al. · 2022

Computer science Physics

We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of …

Design principles of an open-source language modeling microservice package for AAC text-entry applications Open

Brian Roark, Alexander Gutkin · 2022

Computer science Physics

We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of…

Approximating Probabilistic Models as Weighted Finite Automata Open

Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol · 2021

Computer science Mathematics Philosophy

Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because among other things, they are efficient for recognition tasks in time and space. The probabilistic source to be represe…

Finding Concept-specific Biases in Form–Meaning Associations Open

Tiago Pimentel, Brian Roark, Søren Wichmann, Ryan Cotterell, Damián E. Blasí · 2021

Computer science Psychology Philosophy

This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has be…

Disambiguatory Signals are Stronger in Word-initial Positions Open

Tiago Pimentel, Ryan Cotterell, Brian Roark · 2021

Computer science Psychology Mathematics

Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of re…

Finite-state script normalization and processing utilities: The Nisaba Brahmic library Open

Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark · 2021

Computer science Sociology Philosophy

Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 2021.

Structured abbreviation expansion in context Open

Kyle Gorman, Christo Kirov, Brian Roark, Richard Sproat · 2021

Computer science Medicine Philosophy

Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The …

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Open

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, et al. · 2020

Computer science Philosophy

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization…

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset Open

Brian Roark, Lawrence Wolf-Sonkin, Christo Kirov, Sabrina J. Mielke, Cibu Johny, et al. · 2020

Computer science Philosophy

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lex…

Phonotactic Complexity and Its Trade-offs Open

Tiago Pimentel, Brian Roark, Ryan Cotterell · 2020

Computer science Physics Philosophy

We present methods for calculating a measure of phonotactic complexity (bits per phoneme) that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the i…

Rethinking Phonotactic Complexity Open

Tiago Pimentel, Brian Roark, Ryan Cotterell · 2019

Computer science Philosophy Physics

In this work, we propose the use of phone-level language models to estimate phonotactic complexity—measured in bits per phoneme—which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this sim…

Meaning to Form: Measuring Systematicity as Information Open

Tiago Pimentel, Arya D. McCarthy, Damián E. Blasí, Brian Roark, Ryan Cotterell · 2019

Computer science Mathematics Philosophy

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? …

Neural Models of Text Normalization for Speech Applications Open

Hao Zhang, Richard Sproat, Axel H. Ng, Felix Stahlberg, Xiaochang Peng, et al. · 2019

Computer science Sociology

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization f…

Are All Languages Equally Hard to Language Model? Open

Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, Brian Roark · 2019

Computer science Philosophy

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cros…

Distilling weighted finite automata from arbitrary probabilistic models Open

Ananda Theertha Suresh, Brian Roark, Michael Riley, Vlad Schogol · 2019

Computer science

Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however…

What Kind of Language Is Hard to Language-Model? Open

Sebastian J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner · 2019

Computer science Philosophy Chemistry

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modelin…

Rethinking Phonotactic Complexity Open

Tiago Pimentel, Brian Roark, Ryan Cotterell · 2019

Computer science Mathematics Philosophy

In this work, we propose the use of phone-level language models to estimate phonotactic complexity—measured in bits per phoneme—which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this sim…
