Hierarchical Control of Emotion Rendering in Speech Synthesis
2024 · Open Access
DOI: https://doi.org/10.48550/arxiv.2412.12498
Emotional text-to-speech synthesis (TTS) aims to generate realistic emotional speech from input text. However, quantitatively controlling multi-level emotion rendering remains challenging. In this paper, we propose a flow-matching based emotional TTS framework with a novel approach for emotion intensity modeling to facilitate fine-grained control over emotion rendering at the phoneme, word, and utterance levels. We introduce a hierarchical emotion distribution (ED) extractor that captures a quantifiable ED embedding across different speech segment levels. Additionally, we explore various acoustic features and assess their impact on emotion intensity modeling. During TTS training, the hierarchical ED embedding effectively captures the variance in emotion intensity from the reference audio and correlates it with linguistic and speaker information. The TTS model not only generates emotional speech during inference, but also quantitatively controls the emotion rendering over the speech constituents. Both objective and subjective evaluations demonstrate the effectiveness of our framework in terms of speech quality, emotional expressiveness, and hierarchical emotion control.
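The abstract describes pooling a quantifiable emotion distribution (ED) across phoneme, word, and utterance levels. The paper's implementation is not reproduced here, but the hierarchical-pooling idea can be illustrated with a toy sketch; the function names, the emotion categories, and the averaging scheme are all our illustrative assumptions, not the authors' method:

```python
# Toy sketch of a hierarchical emotion distribution (ED): per-phoneme
# emotion-intensity vectors are pooled up to word and utterance level.
# Mean pooling is an illustrative assumption, not the paper's extractor.

def average(vectors):
    """Element-wise mean of equal-length intensity vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def hierarchical_ed(phoneme_ed, word_spans):
    """phoneme_ed: per-phoneme intensity vectors (e.g. over a hypothetical
    {neutral, happy, sad} set); word_spans: (start, end) phoneme indices
    for each word. Returns one ED per segment level."""
    word_ed = [average(phoneme_ed[s:e]) for s, e in word_spans]
    utterance_ed = average(word_ed)
    return {"phoneme": phoneme_ed, "word": word_ed, "utterance": utterance_ed}

# Two words covering three phonemes.
ed = hierarchical_ed(
    [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.3, 0.4, 0.3]],
    [(0, 2), (2, 3)],
)
```

At inference time, a control scheme like the one in the abstract could then scale any of the three levels independently to adjust emotion rendering per phoneme, word, or utterance.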
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2412.12498, https://arxiv.org/pdf/2412.12498
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4405562097
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4405562097 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2412.12498 (Digital Object Identifier)
- Title: Hierarchical Control of Emotion Rendering in Speech Synthesis
- Type: preprint (OpenAlex work type)
- Language: en
- Publication year: 2024
- Publication date: 2024-12-17
- Authors: Sho Inoue, Kun Zhou, Shuai Wang, Haizhou Li (in order)
- Landing page: https://arxiv.org/abs/2412.12498
- PDF URL: https://arxiv.org/pdf/2412.12498
- Open access: Yes
- OA status: green (per OpenAlex)
- OA URL: https://arxiv.org/pdf/2412.12498
- Concepts: Rendering (computer graphics), Computer science, Control (management), Speech recognition, Speech synthesis, Psychology, Artificial intelligence (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10
Full payload
| id | https://openalex.org/W4405562097 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2412.12498 |
| ids.doi | https://doi.org/10.48550/arxiv.2412.12498 |
| ids.openalex | https://openalex.org/W4405562097 |
| fwci | |
| type | preprint |
| title | Hierarchical Control of Emotion Rendering in Speech Synthesis |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11448 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.7978000044822693 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Face recognition and analysis |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C205711294 |
| concepts[0].level | 2 |
| concepts[0].score | 0.8002371788024902 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q176953 |
| concepts[0].display_name | Rendering (computer graphics) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5059546828269958 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C2775924081 |
| concepts[2].level | 2 |
| concepts[2].score | 0.41879093647003174 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q55608371 |
| concepts[2].display_name | Control (management) |
| concepts[3].id | https://openalex.org/C28490314 |
| concepts[3].level | 1 |
| concepts[3].score | 0.4170069098472595 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[3].display_name | Speech recognition |
| concepts[4].id | https://openalex.org/C14999030 |
| concepts[4].level | 2 |
| concepts[4].score | 0.4136335253715515 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q16346 |
| concepts[4].display_name | Speech synthesis |
| concepts[5].id | https://openalex.org/C15744967 |
| concepts[5].level | 0 |
| concepts[5].score | 0.34474629163742065 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[5].display_name | Psychology |
| concepts[6].id | https://openalex.org/C154945302 |
| concepts[6].level | 1 |
| concepts[6].score | 0.23402506113052368 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[6].display_name | Artificial intelligence |
| keywords[0].id | https://openalex.org/keywords/rendering |
| keywords[0].score | 0.8002371788024902 |
| keywords[0].display_name | Rendering (computer graphics) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5059546828269958 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/control |
| keywords[2].score | 0.41879093647003174 |
| keywords[2].display_name | Control (management) |
| keywords[3].id | https://openalex.org/keywords/speech-recognition |
| keywords[3].score | 0.4170069098472595 |
| keywords[3].display_name | Speech recognition |
| keywords[4].id | https://openalex.org/keywords/speech-synthesis |
| keywords[4].score | 0.4136335253715515 |
| keywords[4].display_name | Speech synthesis |
| keywords[5].id | https://openalex.org/keywords/psychology |
| keywords[5].score | 0.34474629163742065 |
| keywords[5].display_name | Psychology |
| keywords[6].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[6].score | 0.23402506113052368 |
| keywords[6].display_name | Artificial intelligence |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2412.12498 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2412.12498 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2412.12498 |
| locations[1].id | doi:10.48550/arxiv.2412.12498 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2412.12498 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5108413182 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Sho Inoue |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Inoue, Sho |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5101654458 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-7869-4474 |
| authorships[1].author.display_name | Kun Zhou |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhou, Kun |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100328312 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-7897-2024 |
| authorships[2].author.display_name | Shuai Wang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Wang, Shuai |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5032690182 |
| authorships[3].author.orcid | https://orcid.org/0000-0001-9158-9401 |
| authorships[3].author.display_name | Haizhou Li |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Li, Haizhou |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2412.12498 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Hierarchical Control of Emotion Rendering in Speech Synthesis |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11448 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.7978000044822693 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Face recognition and analysis |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2931662336, https://openalex.org/W2077865380, https://openalex.org/W3006817050, https://openalex.org/W4401768695, https://openalex.org/W2765597752, https://openalex.org/W2134894512, https://openalex.org/W2083375246, https://openalex.org/W2067108088 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2412.12498 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2412.12498 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2412.12498 |
| primary_location.id | pmh:oai:arXiv.org:2412.12498 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2412.12498 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2412.12498 |
| publication_date | 2024-12-17 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile | |
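OpenAlex serves abstracts in its raw payload not as plain text but as an `abstract_inverted_index`, a mapping from each token to the list of word positions it occupies. A minimal sketch to reconstruct the readable abstract from that structure (the function name is ours):

```python
def uninvert_abstract(inverted_index):
    """Rebuild abstract text from an OpenAlex abstract_inverted_index,
    which maps each token to its list of word positions."""
    positions = []
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions.append((i, token))
    # Sort by position and join tokens back into running text.
    return " ".join(token for _, token in sorted(positions))

# Tiny example mirroring the first words of this record's abstract.
sample = {"Emotional": [0], "text-to-speech": [1], "synthesis": [2]}
print(uninvert_abstract(sample))  # Emotional text-to-speech synthesis
```

This is how the plain-text abstract shown above can be recovered from the raw record when only the inverted index is returned by the API.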