Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
2025 · Preprint · Open Access
DOI: https://doi.org/10.48550/arxiv.2511.00402
Emotion recognition from speech plays a vital role in developing empathetic human-computer interaction systems. This paper presents a comparative analysis of two lightweight transformer-based models, DistilHuBERT and PaSST, on classifying six core emotions from the CREMA-D dataset, benchmarked against a traditional CNN-LSTM baseline using MFCC features. DistilHuBERT achieves the best accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. We further conduct an ablation study on three PaSST variants with Linear, MLP, and Attentive Pooling classification heads to assess the effect of classification-head architecture on performance. Our results indicate that PaSST with an MLP head performs best among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently detected most accurately, while disgust remains the most challenging. These findings suggest that lightweight transformers such as DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices. The code is available at: https://github.com/luckymaduabuchi/Emotion-detection-.
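The paper itself defines the exact head architectures; as an illustration only, under assumed dimensions (50 frames, 768-dim embeddings, a 128-unit hidden layer), the three classification-head types compared in the ablation could be sketched in NumPy like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_head(frames, w, b):
    """Linear head: mean-pool frame embeddings, then a single affine layer."""
    pooled = frames.mean(axis=0)             # (d,)
    return pooled @ w + b                    # (n_classes,)

def mlp_head(frames, w1, b1, w2, b2):
    """MLP head: mean-pool, one ReLU hidden layer, then an output layer."""
    pooled = frames.mean(axis=0)
    hidden = np.maximum(0.0, pooled @ w1 + b1)
    return hidden @ w2 + b2

def attentive_pooling_head(frames, u, w, b):
    """Attentive pooling: learned per-frame weights replace the uniform mean."""
    scores = frames @ u                      # (T,) one score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax over time
    pooled = alpha @ frames                  # weighted sum over frames, (d,)
    return pooled @ w + b

# Toy dimensions: T=50 frames, d=768 embedding, 6 emotion classes.
T, d, h, n_classes = 50, 768, 128, 6
frames = rng.standard_normal((T, d))
logits = linear_head(frames, rng.standard_normal((d, n_classes)), np.zeros(n_classes))
print(logits.shape)  # (6,)
```

All three heads reduce a variable-length sequence of frame embeddings to one logit per emotion class; they differ only in how pooling and the final projection are parameterized, which is exactly the axis the ablation study varies.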
- Type: preprint
- Landing Page: http://arxiv.org/abs/2511.00402
- PDF: https://arxiv.org/pdf/2511.00402
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415937535
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415937535 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2511.00402 (Digital Object Identifier)
- Title: Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
- Type: preprint (OpenAlex work type)
- Publication year: 2025
- Publication date: 2025-11-01
- Authors: Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Mahbub Hasan (in order)
- Landing page: https://arxiv.org/abs/2511.00402
- PDF URL: https://arxiv.org/pdf/2511.00402
- Open access: Yes (a free full text is available)
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2511.00402
- Cited by: 0 (total citation count in OpenAlex)
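The fields above come from the OpenAlex works API, which serves each record as JSON at `https://api.openalex.org/works/{id}`. A minimal sketch of fetching and summarizing this record (the `summarize` helper and its field selection are illustrative, not part of any library):

```python
import json
from urllib.request import urlopen

OPENALEX_WORK = "W4415937535"

def openalex_url(work_id, mailto=None):
    """Build the OpenAlex works-endpoint URL; adding mailto joins the polite pool."""
    url = f"https://api.openalex.org/works/{work_id}"
    if mailto:
        url += f"?mailto={mailto}"
    return url

def summarize(work):
    """Pull a few of the fields shown above out of a works-API JSON record."""
    return {
        "title": work.get("title"),
        "doi": work.get("doi"),
        "oa_status": work.get("open_access", {}).get("oa_status"),
        "cited_by": work.get("cited_by_count"),
    }

# Live call (requires network):
# with urlopen(openalex_url(OPENALEX_WORK)) as resp:
#     print(summarize(json.load(resp)))
```

No API key is required; OpenAlex asks only that heavy users include a `mailto` parameter so requests can be routed to the faster polite pool.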
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4415937535 |
| doi | https://doi.org/10.48550/arxiv.2511.00402 |
| ids.doi | https://doi.org/10.48550/arxiv.2511.00402 |
| ids.openalex | https://openalex.org/W4415937535 |
| fwci | |
| type | preprint |
| title | Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | |
| locations[0].id | pmh:oai:arXiv.org:2511.00402 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2511.00402 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2511.00402 |
| locations[1].id | doi:10.48550/arxiv.2511.00402 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2511.00402 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5117410574 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Lucky Onyekwelu-Udoka |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Onyekwelu-Udoka, Lucky |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5083456021 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-1162-7023 |
| authorships[1].author.display_name | Md Shafiqul Islam |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Islam, Md Shafiqul |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5091328918 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-3954-8500 |
| authorships[2].author.display_name | Mahbub Hasan |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Hasan, Md Shahedul |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2511.00402 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-11-05T00:00:00 |
| display_name | Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic | |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2511.00402 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2511.00402 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2511.00402 |
| primary_location.id | pmh:oai:arXiv.org:2511.00402 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2511.00402 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2511.00402 |
| publication_date | 2025-11-01 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile |
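OpenAlex does not store abstracts as plain text; the payload carries an `abstract_inverted_index` mapping each word to its positions in the abstract. A minimal sketch of reconstructing the readable text from that structure:

```python
def rebuild_abstract(inverted_index):
    """Invert OpenAlex's abstract_inverted_index (word -> list of positions)
    back into the plain-text abstract."""
    positions = []
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, word))
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positions))

sample = {"Emotion": [0], "recognition": [1], "from": [2], "speech": [3]}
print(rebuild_abstract(sample))  # Emotion recognition from speech
```

Applied to this record's full index, the function recovers the abstract quoted at the top of the page.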