EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation
2024 · Open Access
DOI: https://doi.org/10.48550/arxiv.2410.12028
Recent progress in audio-language modeling, such as automated audio captioning, has benefited from training on synthetic data generated with the aid of large language models. However, such approaches for environmental sound captioning have primarily focused on audio event tags and have not explored leveraging emotional information that may be present in recordings. In this work, we explore the benefit of generating emotion-augmented synthetic audio caption data by instructing ChatGPT with additional acoustic information in the form of estimated soundscape emotion. To do so, we introduce EmotionCaps, an audio captioning dataset comprising approximately 120,000 audio clips paired with synthetic descriptions enriched with soundscape emotion recognition (SER) information. We hypothesize that this additional information will result in higher-quality captions that match the emotional tone of the audio recording, which will, in turn, improve the performance of captioning models trained with this data. We test this hypothesis through both objective and subjective evaluation, comparing models trained with the EmotionCaps dataset to multiple baseline models. Our findings challenge current approaches to captioning and suggest new directions for developing and assessing captioning models.
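The data-generation recipe the abstract describes (instructing ChatGPT with event tags plus an estimated soundscape emotion) can be sketched roughly as below. The function name, the prompt wording, and the valence/arousal encoding are all hypothetical stand-ins; the paper's actual prompts and SER model are not reproduced here.

```python
def build_caption_prompt(event_tags, valence, arousal):
    """Compose a hypothetical LLM instruction that pairs audio event tags
    with an estimated soundscape emotion (valence/arousal in [-1, 1])."""
    mood = "pleasant" if valence > 0 else "unpleasant"
    energy = "eventful" if arousal > 0 else "calm"
    return (
        "Write a one-sentence caption for an audio recording.\n"
        f"Sound events: {', '.join(event_tags)}.\n"
        f"Estimated soundscape emotion: {mood} and {energy} "
        f"(valence={valence:+.2f}, arousal={arousal:+.2f}).\n"
        "The caption should match this emotional tone."
    )

print(build_caption_prompt(["birdsong", "light rain"], valence=0.6, arousal=-0.3))
```

The point of the extra lines is that the generator sees not only *what* sounds occur but *how the scene feels*, which is what the paper hypothesizes yields captions matching the recording's emotional tone.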
Metadata
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2410.12028
- PDF: https://arxiv.org/pdf/2410.12028
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403577269
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403577269 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2410.12028 (Digital Object Identifier)
- Title: EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-10-15 (full publication date if available)
- Authors: M. Manivannan, Vignesh Nethrapalli, Mark Cartwright (list of authors in order)
- Landing page: https://arxiv.org/abs/2410.12028 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2410.12028 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2410.12028 (direct OA link when available)
- Concepts: Closed captioning, Computer science, Speech recognition, Multimedia, Artificial intelligence, Image (mathematics) (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4403577269 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2410.12028 |
| ids.doi | https://doi.org/10.48550/arxiv.2410.12028 |
| ids.openalex | https://openalex.org/W4403577269 |
| fwci | |
| type | preprint |
| title | EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11439 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9850000143051147 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Video Analysis and Summarization |
| topics[1].id | https://openalex.org/T12290 |
| topics[1].field.id | https://openalex.org/fields/22 |
| topics[1].field.display_name | Engineering |
| topics[1].score | 0.96670001745224 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/2207 |
| topics[1].subfield.display_name | Control and Systems Engineering |
| topics[1].display_name | Human Motion and Animation |
| topics[2].id | https://openalex.org/T12031 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9656999707221985 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1702 |
| topics[2].subfield.display_name | Artificial Intelligence |
| topics[2].display_name | Speech and dialogue systems |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C157657479 |
| concepts[0].level | 3 |
| concepts[0].score | 0.9543792009353638 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q2367247 |
| concepts[0].display_name | Closed captioning |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5663846731185913 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C28490314 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3921354115009308 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[2].display_name | Speech recognition |
| concepts[3].id | https://openalex.org/C49774154 |
| concepts[3].level | 1 |
| concepts[3].score | 0.38980531692504883 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q131765 |
| concepts[3].display_name | Multimedia |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.19842782616615295 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C115961682 |
| concepts[5].level | 2 |
| concepts[5].score | 0.13142678141593933 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[5].display_name | Image (mathematics) |
| keywords[0].id | https://openalex.org/keywords/closed-captioning |
| keywords[0].score | 0.9543792009353638 |
| keywords[0].display_name | Closed captioning |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5663846731185913 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/speech-recognition |
| keywords[2].score | 0.3921354115009308 |
| keywords[2].display_name | Speech recognition |
| keywords[3].id | https://openalex.org/keywords/multimedia |
| keywords[3].score | 0.38980531692504883 |
| keywords[3].display_name | Multimedia |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.19842782616615295 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/image |
| keywords[5].score | 0.13142678141593933 |
| keywords[5].display_name | Image (mathematics) |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2410.12028 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2410.12028 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2410.12028 |
| locations[1].id | doi:10.48550/arxiv.2410.12028 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2410.12028 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5078222987 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-1162-1550 |
| authorships[0].author.display_name | M. Manivannan |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Manivannan, Mithun |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5114337200 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Vignesh Nethrapalli |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Nethrapalli, Vignesh |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5056532548 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-5908-390X |
| authorships[2].author.display_name | Mark Cartwright |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Cartwright, Mark |
| authorships[2].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2410.12028 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-10-20T00:00:00 |
| display_name | EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11439 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9850000143051147 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Video Analysis and Summarization |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W4210416330, https://openalex.org/W2775506363, https://openalex.org/W3088136942, https://openalex.org/W4290852288, https://openalex.org/W2949362007, https://openalex.org/W4283207562, https://openalex.org/W2963177403 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2410.12028 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2410.12028 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2410.12028 |
| primary_location.id | pmh:oai:arXiv.org:2410.12028 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2410.12028 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2410.12028 |
| publication_date | 2024-10-15 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-position index of the abstract text shown above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
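OpenAlex delivers abstracts as an `abstract_inverted_index`, a mapping from each word to the list of token positions where it occurs. The plain text can be rebuilt by flattening the mapping and sorting on position; a minimal sketch, using the first few words of this work's abstract as sample input:

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plain abstract text from an OpenAlex abstract_inverted_index,
    which maps each word to the positions where it appears."""
    # Flatten to (position, word) pairs, then emit words in position order.
    positions = [
        (pos, word)
        for word, occurrences in inverted_index.items()
        for pos in occurrences
    ]
    return " ".join(word for _, word in sorted(positions))


sample = {"Recent": [0], "progress": [1], "in": [2], "audio-language": [3], "modeling,": [4]}
print(reconstruct_abstract(sample))  # → Recent progress in audio-language modeling,
```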