Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2401.14111
Advancements in generative models have sparked significant interest in generating images that adhere to specific structural guidelines. Scene-graph-to-image generation is one such task: generating images that are consistent with a given scene graph. However, the complexity of visual scenes makes it challenging to accurately align objects according to the relations specified in the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from these layouts using adversarial training. In this work, we introduce a novel approach to generating images from scene graphs that eliminates the need to predict intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. To this end, we first pre-train our graph encoder to align graph features with the CLIP features of the corresponding images using GAN-based training. We then fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In this conditioning input, the object embeddings provide the coarse structure of the image, and the graph features provide structural alignment based on the relationships among objects. Finally, we fine-tune a pre-trained diffusion model on the graph-consistent conditioning signal with reconstruction and CLIP alignment losses. Extensive experiments show that our method outperforms existing methods on the standard COCO-Stuff and Visual Genome benchmarks.
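
The conditioning and training objective described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, since the abstract does not specify these details: the fusion is modeled as appending the graph feature as one extra token after the object-label embeddings, the alignment term is taken to be one minus the cosine similarity between the graph feature and a CLIP image feature, and the names `build_conditioning`, `combined_loss`, and `align_weight` are hypothetical. It is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def build_conditioning(object_label_embeddings, graph_feature):
    """Fuse object-label embeddings with a graph feature.

    ASSUMPTION: fusion is modeled as token-wise concatenation, i.e. one
    token per object label followed by a single graph token.
    """
    dim = len(graph_feature)
    for emb in object_label_embeddings:
        if len(emb) != dim:
            raise ValueError("all embeddings must share the graph feature size")
    return [list(emb) for emb in object_label_embeddings] + [list(graph_feature)]


def combined_loss(noise_pred, noise_true, graph_feature, image_feature,
                  align_weight=0.1):
    """Reconstruction term plus a weighted alignment term.

    ASSUMPTION: reconstruction is the mean squared error on predicted noise,
    alignment is 1 - cosine(graph_feature, image_feature), and
    align_weight is a hypothetical balancing hyperparameter.
    """
    recon = sum((p - t) ** 2 for p, t in zip(noise_pred, noise_true)) / len(noise_true)
    align = 1.0 - cosine(graph_feature, image_feature)
    return recon + align_weight * align


# Toy usage with 2-dimensional stand-ins for CLIP-sized vectors.
conditioning = build_conditioning([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
loss = combined_loss([0.1, -0.2], [0.0, 0.0], [0.5, 0.5], [1.0, 1.0])
```

In this sketch the object tokens carry what is in the scene and the graph token carries how the objects relate, which mirrors the division of roles stated in the abstract.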
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2401.14111, https://arxiv.org/pdf/2401.14111
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4391272671
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4391272671 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2401.14111 (Digital Object Identifier)
- Title: Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-01-25 (full publication date if available)
- Authors: Rameshwar Mishra, A V Subramanyam (list of authors in order)
- Landing page: https://arxiv.org/abs/2401.14111 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2401.14111 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2401.14111 (direct OA link when available)
- Concepts: Computer science, Scene graph, Artificial intelligence, Graph, Leverage (statistics), Embedding, Computer vision, Encoder, Pattern recognition (psychology), Theoretical computer science, Rendering (computer graphics), Operating system (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4391272671 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2401.14111 |
| ids.doi | https://doi.org/10.48550/arxiv.2401.14111 |
| ids.openalex | https://openalex.org/W4391272671 |
| fwci | |
| type | preprint |
| title | Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11714 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9810000061988831 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Multimodal Machine Learning Applications |
| topics[1].id | https://openalex.org/T10775 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9646999835968018 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Generative Adversarial Networks and Image Synthesis |
| topics[2].id | https://openalex.org/T10627 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9096999764442444 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Advanced Image and Video Retrieval Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.7406987547874451 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C179372163 |
| concepts[1].level | 3 |
| concepts[1].score | 0.6784405708312988 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q1406181 |
| concepts[1].display_name | Scene graph |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.5889142751693726 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C132525143 |
| concepts[3].level | 2 |
| concepts[3].score | 0.5739718079566956 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q141488 |
| concepts[3].display_name | Graph |
| concepts[4].id | https://openalex.org/C153083717 |
| concepts[4].level | 2 |
| concepts[4].score | 0.49153217673301697 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q6535263 |
| concepts[4].display_name | Leverage (statistics) |
| concepts[5].id | https://openalex.org/C41608201 |
| concepts[5].level | 2 |
| concepts[5].score | 0.46660134196281433 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q980509 |
| concepts[5].display_name | Embedding |
| concepts[6].id | https://openalex.org/C31972630 |
| concepts[6].level | 1 |
| concepts[6].score | 0.46479979157447815 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q844240 |
| concepts[6].display_name | Computer vision |
| concepts[7].id | https://openalex.org/C118505674 |
| concepts[7].level | 2 |
| concepts[7].score | 0.45785391330718994 |
| concepts[7].wikidata | https://www.wikidata.org/wiki/Q42586063 |
| concepts[7].display_name | Encoder |
| concepts[8].id | https://openalex.org/C153180895 |
| concepts[8].level | 2 |
| concepts[8].score | 0.39497533440589905 |
| concepts[8].wikidata | https://www.wikidata.org/wiki/Q7148389 |
| concepts[8].display_name | Pattern recognition (psychology) |
| concepts[9].id | https://openalex.org/C80444323 |
| concepts[9].level | 1 |
| concepts[9].score | 0.29162073135375977 |
| concepts[9].wikidata | https://www.wikidata.org/wiki/Q2878974 |
| concepts[9].display_name | Theoretical computer science |
| concepts[10].id | https://openalex.org/C205711294 |
| concepts[10].level | 2 |
| concepts[10].score | 0.0 |
| concepts[10].wikidata | https://www.wikidata.org/wiki/Q176953 |
| concepts[10].display_name | Rendering (computer graphics) |
| concepts[11].id | https://openalex.org/C111919701 |
| concepts[11].level | 1 |
| concepts[11].score | 0.0 |
| concepts[11].wikidata | https://www.wikidata.org/wiki/Q9135 |
| concepts[11].display_name | Operating system |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.7406987547874451 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/scene-graph |
| keywords[1].score | 0.6784405708312988 |
| keywords[1].display_name | Scene graph |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.5889142751693726 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/graph |
| keywords[3].score | 0.5739718079566956 |
| keywords[3].display_name | Graph |
| keywords[4].id | https://openalex.org/keywords/leverage |
| keywords[4].score | 0.49153217673301697 |
| keywords[4].display_name | Leverage (statistics) |
| keywords[5].id | https://openalex.org/keywords/embedding |
| keywords[5].score | 0.46660134196281433 |
| keywords[5].display_name | Embedding |
| keywords[6].id | https://openalex.org/keywords/computer-vision |
| keywords[6].score | 0.46479979157447815 |
| keywords[6].display_name | Computer vision |
| keywords[7].id | https://openalex.org/keywords/encoder |
| keywords[7].score | 0.45785391330718994 |
| keywords[7].display_name | Encoder |
| keywords[8].id | https://openalex.org/keywords/pattern-recognition |
| keywords[8].score | 0.39497533440589905 |
| keywords[8].display_name | Pattern recognition (psychology) |
| keywords[9].id | https://openalex.org/keywords/theoretical-computer-science |
| keywords[9].score | 0.29162073135375977 |
| keywords[9].display_name | Theoretical computer science |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2401.14111 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2401.14111 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2401.14111 |
| locations[1].id | doi:10.48550/arxiv.2401.14111 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2401.14111 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5102609896 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Rameshwar Mishra |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Mishra, Rameshwar |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5085785393 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-8873-4644 |
| authorships[1].author.display_name | A V Subramanyam |
| authorships[1].author_position | last |
| authorships[1].raw_author_name | Subramanyam, A V |
| authorships[1].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2401.14111 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2024-01-27T00:00:00 |
| display_name | Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11714 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9810000061988831 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Multimodal Machine Learning Applications |
| related_works | https://openalex.org/W2081900870, https://openalex.org/W4390516098, https://openalex.org/W2181948922, https://openalex.org/W2384362569, https://openalex.org/W2183306018, https://openalex.org/W2549990292, https://openalex.org/W2345479200, https://openalex.org/W2142795561, https://openalex.org/W2951819827, https://openalex.org/W4387129494 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2401.14111 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2401.14111 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2401.14111 |
| primary_location.id | pmh:oai:arXiv.org:2401.14111 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2401.14111 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2401.14111 |
| publication_date | 2024-01-25 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 44, 66, 83, 134, 158, 192 |
| abstract_inverted_index.In | 78, 165 |
| abstract_inverted_index.We | 100 |
| abstract_inverted_index.by | 63 |
| abstract_inverted_index.in | 1, 8, 46, 151 |
| abstract_inverted_index.is | 22 |
| abstract_inverted_index.of | 26, 40, 96, 130, 147, 174, 220 |
| abstract_inverted_index.on | 51, 184, 217 |
| abstract_inverted_index.to | 13, 19, 86, 109, 123, 156 |
| abstract_inverted_index.we | 81, 117, 139, 189 |
| abstract_inverted_index.GAN | 135 |
| abstract_inverted_index.and | 69, 106, 177, 204, 222 |
| abstract_inverted_index.are | 30 |
| abstract_inverted_index.one | 23 |
| abstract_inverted_index.our | 120, 212 |
| abstract_inverted_index.the | 33, 38, 55, 94, 141, 152, 166, 175, 197 |
| abstract_inverted_index.CLIP | 107, 128, 145, 161, 205 |
| abstract_inverted_index.fine | 190 |
| abstract_inverted_index.from | 72, 89 |
| abstract_inverted_index.fuse | 140 |
| abstract_inverted_index.have | 4 |
| abstract_inverted_index.into | 113 |
| abstract_inverted_index.need | 95 |
| abstract_inverted_index.such | 24 |
| abstract_inverted_index.task | 25, 62 |
| abstract_inverted_index.that | 211 |
| abstract_inverted_index.this | 61, 79 |
| abstract_inverted_index.tune | 191 |
| abstract_inverted_index.with | 32, 127, 144, 196, 202 |
| abstract_inverted_index.Scene | 17 |
| abstract_inverted_index.align | 124 |
| abstract_inverted_index.among | 186 |
| abstract_inverted_index.based | 50, 136, 183 |
| abstract_inverted_index.first | 64, 118 |
| abstract_inverted_index.given | 34, 153 |
| abstract_inverted_index.graph | 18, 111, 121, 125, 142, 155, 159, 178, 198 |
| abstract_inverted_index.image | 20, 176 |
| abstract_inverted_index.loss. | 207 |
| abstract_inverted_index.model | 195 |
| abstract_inverted_index.novel | 84 |
| abstract_inverted_index.poses | 43 |
| abstract_inverted_index.scene | 35, 56, 67, 90, 154 |
| abstract_inverted_index.these | 73 |
| abstract_inverted_index.this, | 116 |
| abstract_inverted_index.using | 75, 133 |
| abstract_inverted_index.which | 29, 92 |
| abstract_inverted_index.while | 11 |
| abstract_inverted_index.work, | 80 |
| abstract_inverted_index.Genome | 224 |
| abstract_inverted_index.Visual | 223 |
| abstract_inverted_index.coarse | 172 |
| abstract_inverted_index.create | 157 |
| abstract_inverted_index.graph. | 36, 57 |
| abstract_inverted_index.graphs | 91 |
| abstract_inverted_index.guided | 162 |
| abstract_inverted_index.images | 10, 28, 71, 88, 132 |
| abstract_inverted_index.input, | 168 |
| abstract_inverted_index.labels | 149 |
| abstract_inverted_index.layout | 68 |
| abstract_inverted_index.method | 213 |
| abstract_inverted_index.models | 3, 105 |
| abstract_inverted_index.object | 148, 169 |
| abstract_inverted_index.reveal | 210 |
| abstract_inverted_index.scenes | 42 |
| abstract_inverted_index.signal | 201 |
| abstract_inverted_index.visual | 41 |
| abstract_inverted_index.within | 54 |
| abstract_inverted_index.Towards | 115 |
| abstract_inverted_index.encoder | 122 |
| abstract_inverted_index.images. | 114 |
| abstract_inverted_index.layouts | 74 |
| abstract_inverted_index.methods | 59, 216 |
| abstract_inverted_index.objects | 49 |
| abstract_inverted_index.present | 150 |
| abstract_inverted_index.provide | 171, 180 |
| abstract_inverted_index.signal. | 164 |
| abstract_inverted_index.sparked | 5 |
| abstract_inverted_index.Existing | 58 |
| abstract_inverted_index.Finally, | 188 |
| abstract_inverted_index.Further, | 138 |
| abstract_inverted_index.However, | 37 |
| abstract_inverted_index.adhering | 12 |
| abstract_inverted_index.aligning | 48 |
| abstract_inverted_index.approach | 60, 85 |
| abstract_inverted_index.dataset. | 225 |
| abstract_inverted_index.existing | 215 |
| abstract_inverted_index.features | 126, 129, 143, 179 |
| abstract_inverted_index.generate | 87 |
| abstract_inverted_index.guidance | 108 |
| abstract_inverted_index.interest | 7 |
| abstract_inverted_index.layouts. | 99 |
| abstract_inverted_index.leverage | 101 |
| abstract_inverted_index.objects. | 187 |
| abstract_inverted_index.specific | 14 |
| abstract_inverted_index.standard | 218 |
| abstract_inverted_index.Elaborate | 208 |
| abstract_inverted_index.alignment | 182, 206 |
| abstract_inverted_index.challenge | 45 |
| abstract_inverted_index.diffusion | 104, 194 |
| abstract_inverted_index.embedding | 146 |
| abstract_inverted_index.introduce | 82 |
| abstract_inverted_index.knowledge | 112 |
| abstract_inverted_index.pre-train | 119 |
| abstract_inverted_index.relations | 53 |
| abstract_inverted_index.specified | 52 |
| abstract_inverted_index.structure | 173 |
| abstract_inverted_index.training. | 77, 137 |
| abstract_inverted_index.translate | 110 |
| abstract_inverted_index.COCO-stuff | 221 |
| abstract_inverted_index.accurately | 47 |
| abstract_inverted_index.benchmarks | 219 |
| abstract_inverted_index.complexity | 39 |
| abstract_inverted_index.consistent | 31, 160, 199 |
| abstract_inverted_index.eliminates | 93 |
| abstract_inverted_index.embeddings | 170 |
| abstract_inverted_index.generating | 9, 27, 70 |
| abstract_inverted_index.generation | 21 |
| abstract_inverted_index.generative | 2 |
| abstract_inverted_index.predicting | 65, 97 |
| abstract_inverted_index.structural | 15, 181 |
| abstract_inverted_index.adversarial | 76 |
| abstract_inverted_index.experiments | 209 |
| abstract_inverted_index.guidelines. | 16 |
| abstract_inverted_index.outperforms | 214 |
| abstract_inverted_index.pre-trained | 102, 193 |
| abstract_inverted_index.significant | 6 |
| abstract_inverted_index.Advancements | 0 |
| abstract_inverted_index.conditioning | 163, 167, 200 |
| abstract_inverted_index.intermediate | 98 |
| abstract_inverted_index.corresponding | 131 |
| abstract_inverted_index.relationships | 185 |
| abstract_inverted_index.text-to-image | 103 |
| abstract_inverted_index.reconstruction | 203 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 2 |
| citation_normalized_percentile |
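
The `abstract_inverted_index.*` rows in the payload above store the abstract as a map from each word to the zero-based positions where it occurs. A minimal sketch of how such rows could be turned back into running text follows; the helper names `parse_index_rows` and `rebuild_abstract` are hypothetical, and the sample rows are trimmed to the first nine word positions for brevity.

```python
PREFIX = "abstract_inverted_index."


def parse_index_rows(rows):
    """Parse table rows such as '| abstract_inverted_index.in | 1, 8 |'
    into a dict mapping each word to its list of positions."""
    index = {}
    for row in rows:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if len(cells) < 2 or not cells[0].startswith(PREFIX):
            continue  # skip rows that are not part of the inverted index
        word = cells[0][len(PREFIX):]
        index[word] = [int(pos) for pos in cells[1].split(",")]
    return index


def rebuild_abstract(index):
    """Place every word at each of its positions and join them in order."""
    by_position = {}
    for word, positions in index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Sample rows copied from the payload, with position lists trimmed to 0..8.
rows = [
    "| abstract_inverted_index.in | 1, 8 |",
    "| abstract_inverted_index.have | 4 |",
    "| abstract_inverted_index.models | 3 |",
    "| abstract_inverted_index.sparked | 5 |",
    "| abstract_inverted_index.interest | 7 |",
    "| abstract_inverted_index.generative | 2 |",
    "| abstract_inverted_index.significant | 6 |",
    "| abstract_inverted_index.Advancements | 0 |",
]
text = rebuild_abstract(parse_index_rows(rows))
# text == "Advancements in generative models have sparked significant interest in"
```

Because punctuation is stored as part of each word (for example `graph.` and `this,`), joining on single spaces is enough to recover the original sentence text.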