LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models
2023 · Open Access · DOI: https://doi.org/10.48550/arxiv.2306.10521
Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens into target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech and lower the target speaker similarity; 3) the generation diversity in the sampling of the LM can produce unexpected outcomes during inference, leading to unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that first generates coarse acoustic tokens to recover the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic tokens for acoustic details, yielding the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and to generate target speech based on the target speaker's utterance and the corrupted semantic tokens. Besides, to further alleviate sampling error during generation, an external LM, which employs window attention to capture local acoustic relations, is introduced to participate in the coarse acoustic modeling.
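The mask prediction strategy described above corrupts the semantic token sequence so the LM must recover the masked content from surrounding context. The paper's exact masking ratio and span lengths are not given in this abstract; the sketch below illustrates the general span-masking idea with hypothetical parameters (`mask_ratio`, `max_span`).

```python
import random

def mask_semantic_tokens(tokens, mask_id, mask_ratio=0.3, max_span=5, seed=None):
    """Randomly mask contiguous spans of semantic tokens.

    Returns (corrupted, target_positions): the corrupted sequence and the
    indices the model would be trained to recover from surrounding context.
    Parameters here are illustrative, not the ones used by LM-VC.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    masked = set()
    budget = int(len(tokens) * mask_ratio)  # total positions to mask
    while len(masked) < budget:
        span = rng.randint(1, max_span)          # random span length
        start = rng.randint(0, len(tokens) - 1)  # random span start
        for i in range(start, min(start + span, len(tokens))):
            if len(masked) >= budget:
                break
            corrupted[i] = mask_id
            masked.add(i)
    return corrupted, sorted(masked)
```

During training, the coarse acoustic LM would condition on such a corrupted sequence (plus a target-speaker acoustic prompt) and be penalized for failing to recover the content at the masked positions.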
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2306.10521
- PDF: https://arxiv.org/pdf/2306.10521
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4381558479
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4381558479 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2306.10521 (Digital Object Identifier)
- Title: LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2023
- Publication date: 2023-06-18
- Authors: Zhichao Wang, Yuanzhe Chen, Lei Xie, Qiao Tian, Yu‐Ping Wang (in order)
- Landing page: https://arxiv.org/abs/2306.10521
- PDF URL: https://arxiv.org/pdf/2306.10521 (direct link to full text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2306.10521
- Concepts: Computer science, Speech recognition, Utterance, Language model, Speaker diarisation, Context (archaeology), Pronunciation, Inference, Speaker recognition, Natural language processing, Artificial intelligence, Linguistics, Paleontology, Biology, Philosophy (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
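The fields above come from the OpenAlex Works API, which serves each record as JSON at `https://api.openalex.org/works/<id>`. A minimal sketch of building the request URL and extracting a few of these fields; `record` below is a trimmed stand-in for the real payload, not a live API response.

```python
def openalex_work_url(work_id: str) -> str:
    """Build the API URL from either a bare ID ("W4381558479")
    or the full https://openalex.org/... identifier."""
    return "https://api.openalex.org/works/" + work_id.rsplit("/", 1)[-1]

def summarize(record: dict) -> dict:
    """Pull out the headline fields shown in the metadata list above."""
    oa = record.get("open_access", {})
    return {
        "title": record.get("display_name"),
        "doi": record.get("doi"),
        "is_oa": oa.get("is_oa"),
        "oa_url": oa.get("oa_url"),
    }

# Trimmed stand-in for the payload of this work:
record = {
    "display_name": "LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models",
    "doi": "https://doi.org/10.48550/arxiv.2306.10521",
    "open_access": {"is_oa": True, "oa_url": "https://arxiv.org/pdf/2306.10521"},
}
```

In practice one would fetch the JSON with any HTTP client and pass the parsed body to `summarize`.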
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4381558479 |
| doi | https://doi.org/10.48550/arxiv.2306.10521 |
| ids.doi | https://doi.org/10.48550/arxiv.2306.10521 |
| ids.openalex | https://openalex.org/W4381558479 |
| fwci | |
| type | preprint |
| title | LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10201 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9968000054359436 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Speech Recognition and Synthesis |
| topics[1].id | https://openalex.org/T11309 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9437000155448914 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1711 |
| topics[1].subfield.display_name | Signal Processing |
| topics[1].display_name | Music and Audio Processing |
| topics[2].id | https://openalex.org/T10863 |
| topics[2].field.id | https://openalex.org/fields/27 |
| topics[2].field.display_name | Medicine |
| topics[2].score | 0.9379000067710876 |
| topics[2].domain.id | https://openalex.org/domains/4 |
| topics[2].domain.display_name | Health Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/2737 |
| topics[2].subfield.display_name | Physiology |
| topics[2].display_name | Voice and Speech Disorders |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.8194737434387207 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C28490314 |
| concepts[1].level | 1 |
| concepts[1].score | 0.7030028104782104 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q189436 |
| concepts[1].display_name | Speech recognition |
| concepts[2].id | https://openalex.org/C2775852435 |
| concepts[2].level | 2 |
| concepts[2].score | 0.6389868259429932 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q258403 |
| concepts[2].display_name | Utterance |
| concepts[3].id | https://openalex.org/C137293760 |
| concepts[3].level | 2 |
| concepts[3].score | 0.4646688401699066 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q3621696 |
| concepts[3].display_name | Language model |
| concepts[4].id | https://openalex.org/C149838564 |
| concepts[4].level | 3 |
| concepts[4].score | 0.4401414096355438 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q7574248 |
| concepts[4].display_name | Speaker diarisation |
| concepts[5].id | https://openalex.org/C2779343474 |
| concepts[5].level | 2 |
| concepts[5].score | 0.4204687476158142 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[5].display_name | Context (archaeology) |
| concepts[6].id | https://openalex.org/C2780844864 |
| concepts[6].level | 2 |
| concepts[6].score | 0.4180606007575989 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q184377 |
| concepts[6].display_name | Pronunciation |
| concepts[7].id | https://openalex.org/C2776214188 |
| concepts[7].level | 2 |
| concepts[7].score | 0.4173869788646698 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q408386 |
| concepts[7].display_name | Inference |
| concepts[8].id | https://openalex.org/C133892786 |
| concepts[8].level | 2 |
| concepts[8].score | 0.36805427074432373 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q1145189 |
| concepts[8].display_name | Speaker recognition |
| concepts[9].id | https://openalex.org/C204321447 |
| concepts[9].level | 1 |
| concepts[9].score | 0.35941553115844727 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[9].display_name | Natural language processing |
| concepts[10].id | https://openalex.org/C154945302 |
| concepts[10].level | 1 |
| concepts[10].score | 0.33203601837158203 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[10].display_name | Artificial intelligence |
| concepts[11].id | https://openalex.org/C41895202 |
| concepts[11].level | 1 |
| concepts[11].score | 0.15146178007125854 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[11].display_name | Linguistics |
| concepts[12].id | https://openalex.org/C151730666 |
| concepts[12].level | 1 |
| concepts[12].score | 0.0 |
| concepts[12].wikidata | https://www.wikidata.org/wiki/Q7205 |
| concepts[12].display_name | Paleontology |
| concepts[13].id | https://openalex.org/C86803240 |
| concepts[13].level | 0 |
| concepts[13].score | 0.0 |
| concepts[13].wikidata | https://www.wikidata.org/wiki/Q420 |
| concepts[13].display_name | Biology |
| concepts[14].id | https://openalex.org/C138885662 |
| concepts[14].level | 0 |
| concepts[14].score | 0.0 |
| concepts[14].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[14].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.8194737434387207 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/speech-recognition |
| keywords[1].score | 0.7030028104782104 |
| keywords[1].display_name | Speech recognition |
| keywords[2].id | https://openalex.org/keywords/utterance |
| keywords[2].score | 0.6389868259429932 |
| keywords[2].display_name | Utterance |
| keywords[3].id | https://openalex.org/keywords/language-model |
| keywords[3].score | 0.4646688401699066 |
| keywords[3].display_name | Language model |
| keywords[4].id | https://openalex.org/keywords/speaker-diarisation |
| keywords[4].score | 0.4401414096355438 |
| keywords[4].display_name | Speaker diarisation |
| keywords[5].id | https://openalex.org/keywords/context |
| keywords[5].score | 0.4204687476158142 |
| keywords[5].display_name | Context (archaeology) |
| keywords[6].id | https://openalex.org/keywords/pronunciation |
| keywords[6].score | 0.4180606007575989 |
| keywords[6].display_name | Pronunciation |
| keywords[7].id | https://openalex.org/keywords/inference |
| keywords[7].score | 0.4173869788646698 |
| keywords[7].display_name | Inference |
| keywords[8].id | https://openalex.org/keywords/speaker-recognition |
| keywords[8].score | 0.36805427074432373 |
| keywords[8].display_name | Speaker recognition |
| keywords[9].id | https://openalex.org/keywords/natural-language-processing |
| keywords[9].score | 0.35941553115844727 |
| keywords[9].display_name | Natural language processing |
| keywords[10].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[10].score | 0.33203601837158203 |
| keywords[10].display_name | Artificial intelligence |
| keywords[11].id | https://openalex.org/keywords/linguistics |
| keywords[11].score | 0.15146178007125854 |
| keywords[11].display_name | Linguistics |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2306.10521 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | public-domain |
| locations[0].pdf_url | https://arxiv.org/pdf/2306.10521 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | https://openalex.org/licenses/public-domain |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2306.10521 |
| locations[1].id | doi:10.48550/arxiv.2306.10521 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2306.10521 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5106698700 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-8075-1784 |
| authorships[0].author.display_name | Zhichao Wang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Wang, Zhichao |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5055175414 |
| authorships[1].author.orcid | |
| authorships[1].author.display_name | Yuanzhe Chen |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Chen, Yuanzhe |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100668966 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-8234-0823 |
| authorships[2].author.display_name | Lei Xie |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Xie, Lei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5103162279 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-4078-1273 |
| authorships[3].author.display_name | Qiao Tian |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Tian, Qiao |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100339106 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-9340-5864 |
| authorships[4].author.display_name | Yu‐Ping Wang |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Wang, Yuping |
| authorships[4].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2306.10521 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2023-06-22T00:00:00 |
| display_name | LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models |
| has_fulltext | True |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10201 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9968000054359436 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Speech Recognition and Synthesis |
| related_works | https://openalex.org/W2206035908, https://openalex.org/W2162158162, https://openalex.org/W4247736853, https://openalex.org/W1493012537, https://openalex.org/W1999004162, https://openalex.org/W2175373321, https://openalex.org/W2125642021, https://openalex.org/W1521049138, https://openalex.org/W2938358845, https://openalex.org/W2997340161 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2306.10521 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | public-domain |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2306.10521 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | https://openalex.org/licenses/public-domain |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2306.10521 |
| primary_location.id | pmh:oai:arXiv.org:2306.10521 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | public-domain |
| primary_location.pdf_url | https://arxiv.org/pdf/2306.10521 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | https://openalex.org/licenses/public-domain |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2306.10521 |
| publication_date | 2023-06-18 |
| publication_year | 2023 |
| referenced_works_count | 0 |
| abstract_inverted_index | (word-to-positions inverted index of the abstract; redundant with the abstract text above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile |
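OpenAlex does not store abstracts as plain text: the `abstract_inverted_index` field maps each word to the list of 0-based positions where it occurs. A small sketch of rebuilding the readable abstract from that structure; `sample` is a tiny hand-made index covering only the first words of this work's abstract.

```python
def rebuild_abstract(inverted_index: dict) -> str:
    """Rebuild abstract text from an OpenAlex abstract_inverted_index
    (word -> list of 0-based positions)."""
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word  # place each word at every position it occupies
    # Join words in position order to recover the original text.
    return " ".join(positions[i] for i in sorted(positions))

# Tiny illustrative index for the opening words of the abstract above:
sample = {"Language": [0], "model": [1], "(LM)": [2], "based": [3], "audio": [4]}
```

Applied to the full `abstract_inverted_index` of this record, the function reproduces the abstract shown at the top of the page.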