Explaining Offensive Language Detection
· 2020
· Open Access
· DOI: https://doi.org/10.21248/jlcl.34.2020.223
· OpenAlex: W4312760094
Machine learning approaches have been shown to reach or even exceed human-level accuracy on the task of offensive language detection. In contrast to human experts, however, they often lack the capability of giving explanations for their decisions. This article compares four different approaches to making offensive language detection explainable: an interpretable machine learning model (naive Bayes), a model-agnostic explainability method (LIME), a model-based explainability method (LRP), and a self-explanatory model (an LSTM with an attention mechanism). Three different classification methods (SVM, naive Bayes, and LSTM) are paired with appropriate explanation methods. To this end, we investigate the trade-off between classification performance and explainability of the respective classifiers. We conclude that, with the appropriate explanation methods, the superior classification performance of more complex models is worth the initial lack of explainability.

Explainability and Interpretability

Automatic classification of text happens in many different application scenarios. One area where explanations are particularly important is online discussion moderation, since the users who participate in a discussion usually want to know why a certain post was not published or was deleted. On the one hand, comment platforms need to consider automatic methods due to the large volume of comments they process every day. On the other hand, these platforms do not want to lose comment readers and writers by seemingly censoring opinions. If humans moderate online discussions, it is desirable to get an explanation of why they classify a user comment as offensive and decide to remove it from the platform. In this way, moderators can, to some extent, be held accountable for their decisions. They cannot remove comments arbitrarily but need to give reasons; otherwise, users would not comprehend the platform's rules and could not act in accordance with them.

Machine learning approaches have been shown to reach or even exceed human-level accuracy on the task of offensive language detection (Wulczyn et al., 2017). A variety of shared tasks fosters further improvements in classification accuracy, e.g., with a focus on hate speech against immigrants and women (Basile et al., 2019), offensive language (Zampieri et al., 2019; Struß et al., 2019), and aggression (Bhattacharya et al., 2020). As automated text classification applications find their way into our society and their decisions affect our lives (Risch & Krestel, 2018), it also becomes crucial that we can trust those systems in the same way that we can trust other humans.
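As a minimal illustration of one such pairing, the sketch below combines a bag-of-words naive Bayes classifier with LIME as a model-agnostic explanation method. This is not the article's experimental setup: the training comments, labels, and the comment to be explained are hypothetical placeholders, and the scikit-learn and lime libraries are assumed to be available.

```python
# Sketch: pair an interpretable classifier (naive Bayes) with a
# model-agnostic explanation method (LIME). Toy data only; not the
# dataset or configuration used in the article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Hypothetical training comments and labels (1 = offensive, 0 = benign).
train_texts = [
    "you are an idiot",
    "thanks for the insightful comment",
    "nobody wants to read your garbage",
    "I respectfully disagree with your point",
]
train_labels = [1, 0, 1, 0]

# TF-IDF features plus multinomial naive Bayes: a simple, interpretable baseline.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

# LIME perturbs the input text and fits a local surrogate model to estimate
# how much each word contributed to the classifier's prediction.
explainer = LimeTextExplainer(class_names=["benign", "offensive"])
explanation = explainer.explain_instance(
    "your garbage comment is useless",  # hypothetical comment to explain
    classifier.predict_proba,
    num_features=5,
)
print(explanation.as_list())  # (word, weight) pairs for the offensive class
```

In a moderation setting, the resulting word weights could be shown alongside a removal decision, giving users an indication of which terms drove the classification.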