Down-sampling from hierarchically structured corpus data
2023 · Open Access
DOI: https://doi.org/10.31234/osf.io/4vtja
OpenAlex: W4382699765
Resource constraints often require researchers to restrict their attention to a subset of the tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. The most prevalent approach, drawing a random sample from the list of corpus hits, has been shown to be inefficient if tokens are clustered by text file. We extend the evaluation of down-sampling designs to settings where tokens are also clustered by lexical item. Our case study, which deals with the replacement of third-person present-tense verb inflection -(e)th by -(e)s in Early Modern English, focuses on five predictors: time, gender, genre, frequency, and phonological context. Assuming we are able to analyze only 2,000 (out of 12,244) tokens, we compare two strategies for selecting a sub-sample of this size: simple down-sampling, where each hit has the same probability of being selected; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare estimates based on mixed-effects logistic regression to a reference model fit to the full set of cases. We observe that structured down-sampling outperforms simple down-sampling on several evaluation criteria.
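The two sampling schemes contrasted above can be sketched in a few lines. The snippet below is an illustration on synthetic data, not the paper's actual corpus or code: each hit is a hypothetical (author, verb) pair, simple down-sampling draws hits uniformly, and structured down-sampling weights each hit inversely by its author- and verb-specific token count (weighted sampling without replacement is done here via the Efraimidis–Spirakis key trick).

```python
import random
from collections import Counter

# Synthetic stand-in for a corpus query result: one (author, verb) pair
# per hit. These names and counts are illustrative only.
random.seed(42)
authors = [f"author{i}" for i in range(20)]
verbs = ["do", "have", "say", "know", "come"]
hits = [(random.choice(authors), random.choice(verbs)) for _ in range(5000)]

def structured_downsample(hits, n):
    """Draw n hits with selection weight inversely proportional to the
    token count of each hit's (author, verb) cell."""
    counts = Counter(hits)                      # tokens per (author, verb) cell
    weights = [1.0 / counts[h] for h in hits]   # rarer cells get larger weights
    # Efraimidis-Spirakis weighted sampling without replacement:
    # assign each item the key u**(1/w) for u ~ Uniform(0, 1)
    # and keep the n items with the largest keys.
    keys = [random.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(hits)), key=keys.__getitem__, reverse=True)
    return [hits[i] for i in order[:n]]

structured = structured_downsample(hits, 200)   # structured down-sampling
simple = random.sample(hits, 200)               # simple down-sampling
```

Relative to the simple scheme, the structured scheme spreads the sub-sample more evenly across author-verb cells, which is the intuition behind its better performance when tokens cluster by text and by lexical item.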