Brian Roark
Improving Informally Romanized Language Identification
The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such rom…
Language-Agnostic Multilingual Modeling
Multilingual Automated Speech Recognition (ASR) systems allow for the joint training of data-rich and data-scarce languages in a single model. This enables data and parameter sharing across languages, which is especially beneficial for the…
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible…
Spelling convention sensitivity in neural language models
We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistent…
Beyond Arabic: Software for Perso-Arabic Script Manipulation
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operati…
Distinguishing Romanized Hindi from Romanized Urdu
We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin scr…
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder, Jonathan Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana Dickinson, Brian Roark, Bid…
Context-aware Transliteration of Romanized South Asian Languages
While most transliteration research is focused on single tokens such as named entities, for example, transliteration of a name from the Gujarati script to the Latin script as “Ahmedabad” (the most populous city in the Indian state of Gujarat) …
Graphemic Normalization of the Perso-Arabic Script
Since its original appearance in 1991, the Perso-Arabic script representation in Unicode has grown from 169 to over 440 atomic isolated characters spread over several code pages representing standard letters, various diacritics and punctua…
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of …
Design principles of an open-source language modeling microservice package for AAC text-entry applications
We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of…
Approximating Probabilistic Models as Weighted Finite Automata
Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because, among other things, they are efficient for recognition tasks in time and space. The probabilistic source to be represe…
Finding Concept-specific Biases in Form–Meaning Associations
This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has be…
Disambiguatory Signals are Stronger in Word-initial Positions
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of re…
Finite-state script normalization and processing utilities: The Nisaba Brahmic library
Cibu Johny, Lawrence Wolf-Sonkin, Alexander Gutkin, Brian Roark. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 2021.
Structured abbreviation expansion in context
Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The …
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lex…
Phonotactic Complexity and Its Trade-offs
We present methods for calculating a measure of phonotactic complexity (bits per phoneme) that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the i…
Rethinking Phonotactic Complexity
In this work, we propose the use of phone-level language models to estimate phonotactic complexity—measured in bits per phoneme—which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this sim…
Meaning to Form: Measuring Systematicity as Information
A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? …
Neural Models of Text Normalization for Speech Applications
Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization f…
Are All Languages Equally Hard to Language Model?
For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cros…
Distilling weighted finite automata from arbitrary probabilistic models
Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however…
What Kind of Language Is Hard to Language-Model?
How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modelin…