CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2412.08918
Singing Voice Synthesis (SVS) aims to generate singing voices of high fidelity and expressiveness. Conventional SVS systems usually utilize an acoustic model to transform a music score into acoustic features, followed by a vocoder to reconstruct the singing voice. It was recently shown that end-to-end modeling is effective in the fields of SVS and Text to Speech (TTS). In this work, we thus present a fully end-to-end SVS method together with a chunkwise streaming inference to address the latency issue for practical usages. Note that this is the first attempt to fully implement end-to-end streaming audio synthesis using latent representations in VAE. We have made specific improvements to enhance the performance of streaming SVS using latent representations. Experimental results demonstrate that the proposed method achieves synthesized audio with high expressiveness and pitch accuracy in both streaming SVS and TTS tasks.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2412.08918, https://arxiv.org/pdf/2412.08918
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4405354845
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4405354845 (Canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2412.08918 (Digital Object Identifier)
- Title: CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder (Work title)
- Type: preprint (OpenAlex work type)
- Language: en (Primary language)
- Publication year: 2024 (Year of publication)
- Publication date: 2024-12-12 (Full publication date if available)
- Authors: Jianwei Cui, Yu Gu, Shihao Chen, Jie Zhang, Liping Chen, Li-Rong Dai (List of authors in order)
- Landing page: https://arxiv.org/abs/2412.08918 (Publisher landing page)
- PDF URL: https://arxiv.org/pdf/2412.08918 (Direct link to full text PDF)
- Open access: Yes (Whether a free full text is available)
- OA status: green (Open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2412.08918 (Direct OA link when available)
- Concepts: Autoencoder, Singing, End-to-end principle, Speech recognition, Computer science, Artificial intelligence, Acoustics, Deep learning, Physics (Top concepts (fields/topics) attached by OpenAlex)
- Cited by: 0 (Total citation count in OpenAlex)
- Related works (count): 10 (Other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4405354845 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2412.08918 |
| ids.doi | https://doi.org/10.48550/arxiv.2412.08918 |
| ids.openalex | https://openalex.org/W4405354845 |
| fwci | |
| type | preprint |
| title | CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10201 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9925000071525574 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Speech Recognition and Synthesis |
| topics[1].id | https://openalex.org/T10860 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9769999980926514 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Speech and Audio Processing |
| topics[2].id | https://openalex.org/T11309 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9757000207901001 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1711 |
| topics[2].subfield.display_name | Signal Processing |
| topics[2].display_name | Music and Audio Processing |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C101738243 |
| concepts[0].level | 3 |
| concepts[0].score | 0.8707081079483032 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q786435 |
| concepts[0].display_name | Autoencoder |
| concepts[1].id | https://openalex.org/C44819458 |
| concepts[1].level | 2 |
| concepts[1].score | 0.7316684722900391 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q27939 |
| concepts[1].display_name | Singing |
| concepts[2].id | https://openalex.org/C74296488 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6640288233757019 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q2527392 |
| concepts[2].display_name | End-to-end principle |
| concepts[3].id | https://openalex.org/C28490314 |
| concepts[3].level | 1 |
| concepts[3].score | 0.5385170578956604 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[3].display_name | Speech recognition |
| concepts[4].id | https://openalex.org/C41008148 |
| concepts[4].level | 0 |
| concepts[4].score | 0.5323144793510437 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[4].display_name | Computer science |
| concepts[5].id | https://openalex.org/C154945302 |
| concepts[5].level | 1 |
| concepts[5].score | 0.29117393493652344 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[5].display_name | Artificial intelligence |
| concepts[6].id | https://openalex.org/C24890656 |
| concepts[6].level | 1 |
| concepts[6].score | 0.16324996948242188 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q82811 |
| concepts[6].display_name | Acoustics |
| concepts[7].id | https://openalex.org/C108583219 |
| concepts[7].level | 2 |
| concepts[7].score | 0.1160927414894104 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q197536 |
| concepts[7].display_name | Deep learning |
| concepts[8].id | https://openalex.org/C121332964 |
| concepts[8].level | 0 |
| concepts[8].score | 0.06429484486579895 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[8].display_name | Physics |
| keywords[0].id | https://openalex.org/keywords/autoencoder |
| keywords[0].score | 0.8707081079483032 |
| keywords[0].display_name | Autoencoder |
| keywords[1].id | https://openalex.org/keywords/singing |
| keywords[1].score | 0.7316684722900391 |
| keywords[1].display_name | Singing |
| keywords[2].id | https://openalex.org/keywords/end-to-end-principle |
| keywords[2].score | 0.6640288233757019 |
| keywords[2].display_name | End-to-end principle |
| keywords[3].id | https://openalex.org/keywords/speech-recognition |
| keywords[3].score | 0.5385170578956604 |
| keywords[3].display_name | Speech recognition |
| keywords[4].id | https://openalex.org/keywords/computer-science |
| keywords[4].score | 0.5323144793510437 |
| keywords[4].display_name | Computer science |
| keywords[5].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[5].score | 0.29117393493652344 |
| keywords[5].display_name | Artificial intelligence |
| keywords[6].id | https://openalex.org/keywords/acoustics |
| keywords[6].score | 0.16324996948242188 |
| keywords[6].display_name | Acoustics |
| keywords[7].id | https://openalex.org/keywords/deep-learning |
| keywords[7].score | 0.1160927414894104 |
| keywords[7].display_name | Deep learning |
| keywords[8].id | https://openalex.org/keywords/physics |
| keywords[8].score | 0.06429484486579895 |
| keywords[8].display_name | Physics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2412.08918 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2412.08918 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2412.08918 |
| locations[1].id | doi:10.48550/arxiv.2412.08918 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2412.08918 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5007606170 |
| authorships[0].author.orcid | https://orcid.org/0009-0007-3634-2781 |
| authorships[0].author.display_name | Jianwei Cui |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Cui, Jianwei |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100352646 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-6939-0850 |
| authorships[1].author.display_name | Yu Gu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Gu, Yu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5055000437 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-7646-8003 |
| authorships[2].author.display_name | Shihao Chen |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Chen, Shihao |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5100436848 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-1124-0854 |
| authorships[3].author.display_name | Jie Zhang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zhang, Jie |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100430201 |
| authorships[4].author.orcid | https://orcid.org/0009-0003-6453-5645 |
| authorships[4].author.display_name | Liping Chen |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Chen, Liping |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5057227915 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-0859-2827 |
| authorships[5].author.display_name | Li-Rong Dai |
| authorships[5].author_position | last |
| authorships[5].raw_author_name | Dai, Lirong |
| authorships[5].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2412.08918 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10201 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9925000071525574 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Speech Recognition and Synthesis |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W3013693939, https://openalex.org/W2566616303, https://openalex.org/W2159052453, https://openalex.org/W3131327266, https://openalex.org/W2734887215, https://openalex.org/W2390529913, https://openalex.org/W4404782863 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2412.08918 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2412.08918 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2412.08918 |
| primary_location.id | pmh:oai:arXiv.org:2412.08918 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2412.08918 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2412.08918 |
| publication_date | 2024-12-12 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 24, 32, 64, 71 |
| abstract_inverted_index.In | 58 |
| abstract_inverted_index.It | 39 |
| abstract_inverted_index.We | 102 |
| abstract_inverted_index.an | 19 |
| abstract_inverted_index.by | 31 |
| abstract_inverted_index.in | 48, 100, 133 |
| abstract_inverted_index.is | 46, 86 |
| abstract_inverted_index.of | 51, 111 |
| abstract_inverted_index.to | 5, 22, 34, 55, 75, 90, 107 |
| abstract_inverted_index.we | 61 |
| abstract_inverted_index.SVS | 15, 52, 67, 113, 136 |
| abstract_inverted_index.TTS | 138 |
| abstract_inverted_index.and | 12, 53, 130, 137 |
| abstract_inverted_index.for | 80 |
| abstract_inverted_index.the | 49, 77, 87, 109, 121 |
| abstract_inverted_index.was | 40 |
| abstract_inverted_index.{of | 9 |
| abstract_inverted_index.Note | 83 |
| abstract_inverted_index.Text | 54 |
| abstract_inverted_index.VAE. | 101 |
| abstract_inverted_index.both | 134 |
| abstract_inverted_index.have | 103 |
| abstract_inverted_index.high | 128 |
| abstract_inverted_index.into | 27 |
| abstract_inverted_index.made | 104 |
| abstract_inverted_index.that | 43, 84, 120 |
| abstract_inverted_index.the} | 36 |
| abstract_inverted_index.this | 59, 85 |
| abstract_inverted_index.thus | 62 |
| abstract_inverted_index.with | 70, 127 |
| abstract_inverted_index.(SVS) | 3 |
| abstract_inverted_index.Voice | 1 |
| abstract_inverted_index.audio | 95, 126 |
| abstract_inverted_index.first | 88 |
| abstract_inverted_index.fully | 65, 91 |
| abstract_inverted_index.high} | 10 |
| abstract_inverted_index.issue | 79 |
| abstract_inverted_index.model | 21 |
| abstract_inverted_index.music | 25 |
| abstract_inverted_index.pitch | 131 |
| abstract_inverted_index.score | 26 |
| abstract_inverted_index.shown | 42 |
| abstract_inverted_index.using | 97, 114 |
| abstract_inverted_index.work, | 60 |
| abstract_inverted_index.(TTS). | 57 |
| abstract_inverted_index.Speech | 56 |
| abstract_inverted_index.fields | 50 |
| abstract_inverted_index.latent | 98, 115 |
| abstract_inverted_index.method | 68, 123 |
| abstract_inverted_index.tasks. | 139 |
| abstract_inverted_index.voice. | 38 |
| abstract_inverted_index.voices | 8 |
| abstract_inverted_index.{aims} | 4 |
| abstract_inverted_index.Singing | 0 |
| abstract_inverted_index.address | 76 |
| abstract_inverted_index.attempt | 89 |
| abstract_inverted_index.enhance | 108 |
| abstract_inverted_index.latency | 78 |
| abstract_inverted_index.present | 63 |
| abstract_inverted_index.results | 118 |
| abstract_inverted_index.singing | 7, 37 |
| abstract_inverted_index.systems | 16 |
| abstract_inverted_index.usages. | 82 |
| abstract_inverted_index.usually | 17 |
| abstract_inverted_index.vocoder | 33 |
| abstract_inverted_index.accuracy | 132 |
| abstract_inverted_index.achieves | 124 |
| abstract_inverted_index.acoustic | 20, 28 |
| abstract_inverted_index.fidelity | 11 |
| abstract_inverted_index.generate | 6 |
| abstract_inverted_index.modeling | 45 |
| abstract_inverted_index.proposed | 122 |
| abstract_inverted_index.recently | 41 |
| abstract_inverted_index.specific | 105 |
| abstract_inverted_index.together | 69 |
| abstract_inverted_index.utilize} | 18 |
| abstract_inverted_index.Synthesis | 2 |
| abstract_inverted_index.chunkwise | 72 |
| abstract_inverted_index.effective | 47 |
| abstract_inverted_index.features, | 29 |
| abstract_inverted_index.implement | 92 |
| abstract_inverted_index.inference | 74 |
| abstract_inverted_index.practical | 81 |
| abstract_inverted_index.streaming | 73, 94, 112, 135 |
| abstract_inverted_index.synthesis | 96 |
| abstract_inverted_index.transform | 23 |
| abstract_inverted_index.{followed | 30 |
| abstract_inverted_index.end-to-end | 44, 66, 93 |
| abstract_inverted_index.demonstrate | 119 |
| abstract_inverted_index.performance | 110 |
| abstract_inverted_index.reconstruct | 35 |
| abstract_inverted_index.synthesized | 125 |
| abstract_inverted_index.Experimental | 117 |
| abstract_inverted_index.improvements | 106 |
| abstract_inverted_index.{Conventional | 14 |
| abstract_inverted_index.expressiveness | 129 |
| abstract_inverted_index.expressiveness. | 13 |
| abstract_inverted_index.representations | 99 |
| abstract_inverted_index.representations. | 116 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 6 |
| citation_normalized_percentile | |
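
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the word positions where it occurs. The sketch below shows one way such a mapping could be turned back into running text. The function name `rebuild_abstract` and the `sample` dictionary are illustrative assumptions, not part of the record. The sample covers only positions 0 to 8 from the table, so the position lists for `to` and `singing` are deliberately truncated.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a token -> positions mapping (hypothetical helper)."""
    by_position: Dict[int, str] = {}
    for token, positions in inverted_index.items():
        for position in positions:
            by_position[position] = token
    # Join tokens in ascending position order.
    return " ".join(by_position[i] for i in sorted(by_position))


# Subset of the rows in the table above (positions 0-8 only; assumption:
# later occurrences of "to" and "singing" are omitted for brevity).
sample = {
    "Singing": [0],
    "Voice": [1],
    "Synthesis": [2],
    "(SVS)": [3],
    "{aims}": [4],
    "to": [5],
    "generate": [6],
    "singing": [7],
    "voices": [8],
}

text = rebuild_abstract(sample)
# The record carries stray braces on some tokens; strip them for display.
clean_text = text.replace("{", "").replace("}", "")
print(clean_text)
```

Stripping the braces only at display time keeps the raw index keys unchanged, so the table still matches what the payload contains.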