arXiv (Cornell University)
Exploring the Plausibility of Hate and Counter Speech Detectors with Explainable AI
July 2024 • Adrian Jaques Böck, Djordje Slijepčević, Matthias Zeppelzauer
In this paper, we investigate the explainability of transformer models and their plausibility for hate speech and counter speech detection. We compare representatives of four explainability approaches (gradient-based, perturbation-based, attention-based, and prototype-based) and analyze them quantitatively with an ablation study and qualitatively in a user study. Results show that perturbation-based explainability performs best, followed by gradient-based and attention-based explainabilit…