ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2411.15436
Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: https://njust-yang.github.io/ConsistentAvatar.github.io/
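The abstract describes the TSD map only at a high level: high-frequency features and contours that vary significantly along the time axis. As a purely illustrative reading of that one sentence, and not the paper's actual definition, the NumPy sketch below scores each pixel by its high-frequency residual multiplied by its change from the previous frame. The names `box_blur` and `toy_tsd`, the box-filter high-pass, and the frame-difference weighting are all assumptions introduced here.

```python
import numpy as np


def box_blur(frames, k=5):
    """Blur each frame of a (T, H, W) stack with a k x k mean filter (edge-padded)."""
    pad = k // 2
    padded = np.pad(frames, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros(frames.shape, dtype=np.float64)
    h, w = frames.shape[1:]
    # Sum the k*k shifted copies of the padded stack, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)


def toy_tsd(frames, k=5):
    """Hypothetical stand-in for a temporally-sensitive detail map.

    ASSUMPTION: detail = |frame - blurred frame|, temporal change =
    |frame_t - frame_{t-1}|, and the map is their product. The paper may
    define its TSD map differently.
    """
    frames = np.asarray(frames, dtype=np.float64)
    high_freq = np.abs(frames - box_blur(frames, k))
    temporal = np.zeros_like(frames)
    temporal[1:] = np.abs(frames[1:] - frames[:-1])  # first frame has no predecessor
    return high_freq * temporal


# A static clip has no temporal change, so the toy map is zero everywhere.
rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 8, 8)), 3, axis=0)

# A clip where one pixel lights up in the middle frame.
moving = np.zeros((3, 8, 8))
moving[1, 4, 4] = 1.0
```

Under these assumptions, fine detail that stays fixed across frames scores zero, while detail that appears or moves between adjacent frames scores high, which is the intuition the abstract gives for why such a map can guide temporal stability.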
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2411.15436 (PDF: https://arxiv.org/pdf/2411.15436)
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4404986442
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4404986442 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2411.15436 (Digital Object Identifier)
- Title: ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-11-23 (full publication date if available)
- Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang (list of authors in order)
- Landing page: https://arxiv.org/abs/2411.15436 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2411.15436 (direct link to full-text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2411.15436 (direct OA link when available)
- Concepts: Avatar, Head (geology), Computer science, Psychology, Human–computer interaction, Cognitive psychology, Geology, Geomorphology (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4404986442 |
| doi | https://doi.org/10.48550/arxiv.2411.15436 |
| ids.doi | https://doi.org/10.48550/arxiv.2411.15436 |
| ids.openalex | https://openalex.org/W4404986442 |
| fwci | |
| type | preprint |
| title | ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.994700014591217 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T12031 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9922000169754028 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Speech and dialogue systems |
| topics[2].id | https://openalex.org/T10709 |
| topics[2].field.id | https://openalex.org/fields/32 |
| topics[2].field.display_name | Psychology |
| topics[2].score | 0.9879000186920166 |
| topics[2].domain.id | https://openalex.org/domains/2 |
| topics[2].domain.display_name | Social Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/3207 |
| topics[2].subfield.display_name | Social Psychology |
| topics[2].display_name | Social Robot Interaction and HRI |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2777365542 |
| concepts[0].level | 2 |
| concepts[0].score | 0.9048485159873962 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q83090 |
| concepts[0].display_name | Avatar |
| concepts[1].id | https://openalex.org/C2780312720 |
| concepts[1].level | 2 |
| concepts[1].score | 0.6733502745628357 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q5689100 |
| concepts[1].display_name | Head (geology) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.4479665756225586 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C15744967 |
| concepts[3].level | 0 |
| concepts[3].score | 0.38501715660095215 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q9418 |
| concepts[3].display_name | Psychology |
| concepts[4].id | https://openalex.org/C107457646 |
| concepts[4].level | 1 |
| concepts[4].score | 0.36510002613067627 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q207434 |
| concepts[4].display_name | Human–computer interaction |
| concepts[5].id | https://openalex.org/C180747234 |
| concepts[5].level | 1 |
| concepts[5].score | 0.3241029381752014 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q23373 |
| concepts[5].display_name | Cognitive psychology |
| concepts[6].id | https://openalex.org/C127313418 |
| concepts[6].level | 0 |
| concepts[6].score | 0.12713778018951416 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q1069 |
| concepts[6].display_name | Geology |
| concepts[7].id | https://openalex.org/C114793014 |
| concepts[7].level | 1 |
| concepts[7].score | 0.0 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q52109 |
| concepts[7].display_name | Geomorphology |
| keywords[0].id | https://openalex.org/keywords/avatar |
| keywords[0].score | 0.9048485159873962 |
| keywords[0].display_name | Avatar |
| keywords[1].id | https://openalex.org/keywords/head |
| keywords[1].score | 0.6733502745628357 |
| keywords[1].display_name | Head (geology) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.4479665756225586 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/psychology |
| keywords[3].score | 0.38501715660095215 |
| keywords[3].display_name | Psychology |
| keywords[4].id | https://openalex.org/keywords/human–computer-interaction |
| keywords[4].score | 0.36510002613067627 |
| keywords[4].display_name | Human–computer interaction |
| keywords[5].id | https://openalex.org/keywords/cognitive-psychology |
| keywords[5].score | 0.3241029381752014 |
| keywords[5].display_name | Cognitive psychology |
| keywords[6].id | https://openalex.org/keywords/geology |
| keywords[6].score | 0.12713778018951416 |
| keywords[6].display_name | Geology |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2411.15436 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2411.15436 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2411.15436 |
| locations[1].id | doi:10.48550/arxiv.2411.15436 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2411.15436 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5072993313 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-3489-0438 |
| authorships[0].author.display_name | Haijie Yang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Yang, Haijie |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100389499 |
| authorships[1].author.orcid | https://orcid.org/0000-0001-5727-9450 |
| authorships[1].author.display_name | Zhenyu Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Zhenyu |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5050748634 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-2077-1246 |
| authorships[2].author.display_name | Hao Tang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Tang, Hao |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5064363522 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-0968-8556 |
| authorships[3].author.display_name | Jianjun Qian |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Qian, Jianjun |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100726984 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-4800-832X |
| authorships[4].author.display_name | Jian Yang |
| authorships[4].author_position | last |
| authorships[4].raw_author_name | Yang, Jian |
| authorships[4].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2411.15436 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.994700014591217 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W3138471234, https://openalex.org/W4247958311, https://openalex.org/W4396832849, https://openalex.org/W1584662471, https://openalex.org/W2785089443, https://openalex.org/W2265117524, https://openalex.org/W1467576422, https://openalex.org/W3196465490 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2411.15436 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2411.15436 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2411.15436 |
| primary_location.id | pmh:oai:arXiv.org:2411.15436 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2411.15436 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2411.15436 |
| publication_date | 2024-11-23 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 46, 84, 102, 130 |
| abstract_inverted_index.3D | 24 |
| abstract_inverted_index.In | 40 |
| abstract_inverted_index.We | 147 |
| abstract_inverted_index.by | 129 |
| abstract_inverted_index.is | 127 |
| abstract_inverted_index.of | 36, 58, 112, 118, 175 |
| abstract_inverted_index.on | 6, 136, 186, 198 |
| abstract_inverted_index.or | 25 |
| abstract_inverted_index.to | 29, 63, 70, 109, 116, 162 |
| abstract_inverted_index.we | 43, 82, 107 |
| abstract_inverted_index.3D, | 202 |
| abstract_inverted_index.TSD | 111 |
| abstract_inverted_index.The | 124 |
| abstract_inverted_index.and | 13, 33, 52, 92, 143, 204 |
| abstract_inverted_index.are | 16 |
| abstract_inverted_index.due | 28 |
| abstract_inverted_index.for | 49, 76 |
| abstract_inverted_index.its | 169 |
| abstract_inverted_index.map | 88 |
| abstract_inverted_index.our | 67 |
| abstract_inverted_index.the | 30, 64, 73, 98, 113, 119, 137, 150, 155, 159, 173, 179, 184, 195, 199 |
| abstract_inverted_index.TSD, | 139, 152 |
| abstract_inverted_index.find | 148 |
| abstract_inverted_index.from | 22 |
| abstract_inverted_index.have | 2 |
| abstract_inverted_index.head | 8, 141 |
| abstract_inverted_index.that | 94, 117, 149, 192 |
| abstract_inverted_index.this | 41 |
| abstract_inverted_index.time | 99 |
| abstract_inverted_index.vary | 95 |
| abstract_inverted_index.(TSD) | 87 |
| abstract_inverted_index.Using | 101 |
| abstract_inverted_index.While | 10 |
| abstract_inverted_index.align | 110 |
| abstract_inverted_index.along | 97 |
| abstract_inverted_index.axis. | 100 |
| abstract_inverted_index.error | 31, 181 |
| abstract_inverted_index.final | 125 |
| abstract_inverted_index.first | 71 |
| abstract_inverted_index.frame | 121 |
| abstract_inverted_index.fully | 50, 131 |
| abstract_inverted_index.head. | 167 |
| abstract_inverted_index.learn | 108 |
| abstract_inverted_index.model | 72 |
| abstract_inverted_index.novel | 47 |
| abstract_inverted_index.other | 176 |
| abstract_inverted_index.page: | 208 |
| abstract_inverted_index.rough | 140 |
| abstract_inverted_index.shown | 3 |
| abstract_inverted_index.still | 20 |
| abstract_inverted_index.these | 18 |
| abstract_inverted_index.video | 120 |
| abstract_inverted_index.which | 153 |
| abstract_inverted_index.while | 182 |
| abstract_inverted_index.Detail | 86 |
| abstract_inverted_index.avatar | 55, 126 |
| abstract_inverted_index.effect | 15 |
| abstract_inverted_index.ground | 122 |
| abstract_inverted_index.learns | 69 |
| abstract_inverted_index.method | 68 |
| abstract_inverted_index.models | 1 |
| abstract_inverted_index.paper, | 42 |
| abstract_inverted_index.prompt | 145 |
| abstract_inverted_index.result | 115 |
| abstract_inverted_index.stable | 165 |
| abstract_inverted_index.suffer | 21 |
| abstract_inverted_index.truth. | 123 |
| abstract_inverted_index.Instead | 57 |
| abstract_inverted_index.Project | 207 |
| abstract_inverted_index.aligned | 138, 151 |
| abstract_inverted_index.between | 78 |
| abstract_inverted_index.emotion | 144 |
| abstract_inverted_index.feature | 91 |
| abstract_inverted_index.frames. | 80 |
| abstract_inverted_index.initial | 114 |
| abstract_inverted_index.methods | 19, 197 |
| abstract_inverted_index.module, | 106, 134 |
| abstract_inverted_index.normal, | 142 |
| abstract_inverted_index.process | 161 |
| abstract_inverted_index.propose | 44, 83 |
| abstract_inverted_index.talking | 7, 14, 54, 166 |
| abstract_inverted_index.various | 187 |
| abstract_inverted_index.Further, | 168 |
| abstract_inverted_index.ability. | 39 |
| abstract_inverted_index.adjacent | 79 |
| abstract_inverted_index.aspects. | 188 |
| abstract_inverted_index.contours | 93 |
| abstract_inverted_index.directly | 59 |
| abstract_inverted_index.generate | 163 |
| abstract_inverted_index.guidance | 171 |
| abstract_inverted_index.inherent | 34 |
| abstract_inverted_index.process, | 66 |
| abstract_inverted_index.reliable | 170 |
| abstract_inverted_index.temporal | 74, 103, 156, 205 |
| abstract_inverted_index.Diffusion | 0 |
| abstract_inverted_index.Extensive | 189 |
| abstract_inverted_index.achieved, | 17 |
| abstract_inverted_index.diffusion | 65, 105, 133, 160 |
| abstract_inverted_index.employing | 60 |
| abstract_inverted_index.framework | 48 |
| abstract_inverted_index.generated | 128, 200 |
| abstract_inverted_index.improving | 183 |
| abstract_inverted_index.patterns, | 157 |
| abstract_inverted_index.plausible | 11 |
| abstract_inverted_index.potential | 5 |
| abstract_inverted_index.stability | 77 |
| abstract_inverted_index.temporal, | 23 |
| abstract_inverted_index.appearance | 12 |
| abstract_inverted_index.conditions | 62 |
| abstract_inverted_index.consistent | 51, 104, 132 |
| abstract_inverted_index.constrains | 158 |
| abstract_inverted_index.containing | 89 |
| abstract_inverted_index.embedding. | 146 |
| abstract_inverted_index.expression | 26, 203 |
| abstract_inverted_index.generation | 38 |
| abstract_inverted_index.impressive | 4 |
| abstract_inverted_index.inaccuracy | 174 |
| abstract_inverted_index.limitation | 35 |
| abstract_inverted_index.represents | 154 |
| abstract_inverted_index.temporally | 164 |
| abstract_inverted_index.accumulated | 180 |
| abstract_inverted_index.appearance, | 201 |
| abstract_inverted_index.complements | 172 |
| abstract_inverted_index.conditioned | 135 |
| abstract_inverted_index.conditions, | 177 |
| abstract_inverted_index.consistency | 185 |
| abstract_inverted_index.demonstrate | 191 |
| abstract_inverted_index.experiments | 190 |
| abstract_inverted_index.generation. | 9, 56 |
| abstract_inverted_index.multi-modal | 61 |
| abstract_inverted_index.outperforms | 194 |
| abstract_inverted_index.suppressing | 178 |
| abstract_inverted_index.accumulation | 32 |
| abstract_inverted_index.consistency. | 206 |
| abstract_inverted_index.single-image | 37 |
| abstract_inverted_index.Specifically, | 81 |
| abstract_inverted_index.high-fidelity | 53 |
| abstract_inverted_index.inconsistency | 27 |
| abstract_inverted_index.significantly | 96 |
| abstract_inverted_index.high-frequency | 90 |
| abstract_inverted_index.representation | 75 |
| abstract_inverted_index.ConsistentAvatar | 193 |
| abstract_inverted_index.state-of-the-art | 196 |
| abstract_inverted_index.ConsistentAvatar, | 45 |
| abstract_inverted_index.Temporally-Sensitive | 85 |
| abstract_inverted_index.https://njust-yang.github.io/ConsistentAvatar.github.io/ | 209 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 5 |
| citation_normalized_percentile |
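
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. A minimal sketch of turning such a map back into running text follows; the function name `rebuild_abstract` and the `sample` dictionary are illustrative, and `sample` carries only positions 0 through 9 copied from the table rather than the full payload.

```python
def rebuild_abstract(inverted_index):
    """Rebuild text from a mapping of token -> list of word positions."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Emit the tokens in position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# First ten positions only, taken from the payload rows above.
sample = {
    "Diffusion": [0],
    "models": [1],
    "have": [2],
    "shown": [3],
    "impressive": [4],
    "potential": [5],
    "on": [6],
    "talking": [7],
    "head": [8],
    "generation.": [9],
}

print(rebuild_abstract(sample))
# prints: Diffusion models have shown impressive potential on talking head generation.
```

Tokens keep their original punctuation and capitalisation, so `TSD` and `TSD,` are separate keys in the index; joining by position restores the sentence without any extra handling.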