SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2510.04961
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual, and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses and incurs a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction and trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE and for building higher-quality and faster generative models.
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2510.04961, https://arxiv.org/pdf/2510.04961
- OA Status: green
- OpenAlex ID: https://openalex.org/W4414972763
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4414972763 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2510.04961 (Digital Object Identifier)
- Title: SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-10-06 (full publication date if available)
- Authors: Théophane Vallaeys, Jakob Verbeek, Matthieu Cord (list of authors in order)
- Landing page: https://arxiv.org/abs/2510.04961 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2510.04961 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2510.04961 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4414972763 |
| doi | https://doi.org/10.48550/arxiv.2510.04961 |
| ids.doi | https://doi.org/10.48550/arxiv.2510.04961 |
| ids.openalex | https://openalex.org/W4414972763 |
| fwci | |
| type | preprint |
| title | SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10388 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9986000061035156 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Advanced Steganography and Watermarking Techniques |
| topics[1].id | https://openalex.org/T13579 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9772999882698059 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Image and Video Stabilization |
| topics[2].id | https://openalex.org/T10901 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9771000146865845 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Advanced Data Compression Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2510.04961 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | cc-by |
| locations[0].pdf_url | https://arxiv.org/pdf/2510.04961 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | https://openalex.org/licenses/cc-by |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2510.04961 |
| locations[1].id | doi:10.48550/arxiv.2510.04961 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2510.04961 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5081869992 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Théophane Vallaeys |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Vallaeys, Théophane |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5040312210 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-1419-1816 |
| authorships[1].author.display_name | Jakob Verbeek |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Verbeek, Jakob |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5108118084 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-0627-5844 |
| authorships[2].author.display_name | Matthieu Cord |
| authorships[2].author_position | last |
| authorships[2].raw_author_name | Cord, Matthieu |
| authorships[2].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2510.04961 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-09T00:00:00 |
| display_name | SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10388 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9986000061035156 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Advanced Steganography and Watermarking Techniques |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2510.04961 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | cc-by |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2510.04961 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | https://openalex.org/licenses/cc-by |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2510.04961 |
| primary_location.id | pmh:oai:arXiv.org:2510.04961 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | cc-by |
| primary_location.pdf_url | https://arxiv.org/pdf/2510.04961 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | https://openalex.org/licenses/cc-by |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2510.04961 |
| publication_date | 2025-10-06 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 2, 47, 74, 88, 178 |
| abstract_inverted_index.As | 171 |
| abstract_inverted_index.In | 147 |
| abstract_inverted_index.To | 82 |
| abstract_inverted_index.We | 107 |
| abstract_inverted_index.an | 119 |
| abstract_inverted_index.as | 46, 71, 73, 177 |
| abstract_inverted_index.be | 175 |
| abstract_inverted_index.in | 118 |
| abstract_inverted_index.of | 5, 65, 114, 165 |
| abstract_inverted_index.on | 29, 58 |
| abstract_inverted_index.to | 51, 79, 110, 155 |
| abstract_inverted_index.we | 86 |
| abstract_inverted_index.FID | 152 |
| abstract_inverted_index.and | 22, 38, 97, 104, 142, 161, 183, 187 |
| abstract_inverted_index.are | 1, 27 |
| abstract_inverted_index.can | 174 |
| abstract_inverted_index.due | 78 |
| abstract_inverted_index.for | 94, 131, 181, 184 |
| abstract_inverted_index.key | 3 |
| abstract_inverted_index.new | 89 |
| abstract_inverted_index.the | 11, 16, 53, 59, 63, 112, 115, 126 |
| abstract_inverted_index.use | 108 |
| abstract_inverted_index.DiTs | 166 |
| abstract_inverted_index.Most | 24 |
| abstract_inverted_index.SSDD | 125, 149, 173 |
| abstract_inverted_index.This | 123 |
| abstract_inverted_index.been | 44 |
| abstract_inverted_index.data | 20 |
| abstract_inverted_index.from | 15, 101, 153 |
| abstract_inverted_index.have | 43 |
| abstract_inverted_index.more | 48 |
| abstract_inverted_index.most | 12 |
| abstract_inverted_index.over | 55 |
| abstract_inverted_index.than | 145 |
| abstract_inverted_index.time | 77 |
| abstract_inverted_index.used | 176 |
| abstract_inverted_index.well | 72 |
| abstract_inverted_index.with | 35, 157, 167 |
| abstract_inverted_index.based | 28 |
| abstract_inverted_index.first | 127 |
| abstract_inverted_index.image | 8 |
| abstract_inverted_index.makes | 124 |
| abstract_inverted_index.model | 52 |
| abstract_inverted_index.pixel | 90 |
| abstract_inverted_index.still | 67 |
| abstract_inverted_index.such, | 172 |
| abstract_inverted_index.these | 84 |
| abstract_inverted_index.while | 18 |
| abstract_inverted_index.$0.50$ | 156 |
| abstract_inverted_index.$0.87$ | 154 |
| abstract_inverted_index.KL-VAE | 66 |
| abstract_inverted_index.faster | 143, 169, 188 |
| abstract_inverted_index.higher | 75, 139, 159 |
| abstract_inverted_index.images | 56 |
| abstract_inverted_index.signal | 17 |
| abstract_inverted_index.KL-VAE, | 182 |
| abstract_inverted_index.KL-VAE. | 146 |
| abstract_inverted_index.address | 83 |
| abstract_inverted_index.current | 25 |
| abstract_inverted_index.decoder | 92, 117, 129 |
| abstract_inverted_index.drop-in | 179 |
| abstract_inverted_index.latent. | 60 |
| abstract_inverted_index.losses, | 70, 137 |
| abstract_inverted_index.losses. | 40 |
| abstract_inverted_index.models, | 9 |
| abstract_inverted_index.models. | 190 |
| abstract_inverted_index.quality | 141, 164 |
| abstract_inverted_index.scaling | 96 |
| abstract_inverted_index.trained | 34, 134 |
| abstract_inverted_index.without | 135 |
| abstract_inverted_index.GAN-free | 105 |
| abstract_inverted_index.However, | 61 |
| abstract_inverted_index.building | 185 |
| abstract_inverted_index.decoder. | 122 |
| abstract_inverted_index.decoders | 42 |
| abstract_inverted_index.decoding | 76 |
| abstract_inverted_index.features | 14 |
| abstract_inverted_index.improved | 95 |
| abstract_inverted_index.improves | 150 |
| abstract_inverted_index.matching | 62 |
| abstract_inverted_index.preserve | 162 |
| abstract_inverted_index.proposed | 45 |
| abstract_inverted_index.reaching | 138 |
| abstract_inverted_index.reducing | 19 |
| abstract_inverted_index.requires | 68 |
| abstract_inverted_index.sampling | 144 |
| abstract_inverted_index.training | 98 |
| abstract_inverted_index.(KL-VAE), | 33 |
| abstract_inverted_index.Diffusion | 41 |
| abstract_inverted_index.component | 4 |
| abstract_inverted_index.diffusion | 91, 116, 128 |
| abstract_inverted_index.dimension | 21 |
| abstract_inverted_index.efficient | 120 |
| abstract_inverted_index.important | 13 |
| abstract_inverted_index.introduce | 87 |
| abstract_inverted_index.iterative | 80 |
| abstract_inverted_index.optimized | 130 |
| abstract_inverted_index.replicate | 111 |
| abstract_inverted_index.sampling. | 81, 170 |
| abstract_inverted_index.training. | 106 |
| abstract_inverted_index.Tokenizers | 0 |
| abstract_inverted_index.benefiting | 100 |
| abstract_inverted_index.components | 103 |
| abstract_inverted_index.extracting | 10 |
| abstract_inverted_index.generation | 163 |
| abstract_inverted_index.generative | 7, 189 |
| abstract_inverted_index.perceptual | 37 |
| abstract_inverted_index.principled | 49 |
| abstract_inverted_index.stability, | 99 |
| abstract_inverted_index.throughput | 160 |
| abstract_inverted_index.tokenizers | 26 |
| abstract_inverted_index.$1.4\times$ | 158 |
| abstract_inverted_index.$3.8\times$ | 168 |
| abstract_inverted_index.adversarial | 39, 69, 136 |
| abstract_inverted_index.alternative | 50 |
| abstract_inverted_index.conditioned | 57 |
| abstract_inverted_index.particular, | 148 |
| abstract_inverted_index.performance | 64, 113 |
| abstract_inverted_index.redundancy. | 23 |
| abstract_inverted_index.replacement | 180 |
| abstract_inverted_index.single-step | 121, 132 |
| abstract_inverted_index.transformer | 102 |
| abstract_inverted_index.variational | 31 |
| abstract_inverted_index.architecture | 93 |
| abstract_inverted_index.autoencoders | 32 |
| abstract_inverted_index.distillation | 109 |
| abstract_inverted_index.distribution | 54 |
| abstract_inverted_index.limitations, | 85 |
| abstract_inverted_index.KL-regularized | 30 |
| abstract_inverted_index.higher-quality | 186 |
| abstract_inverted_index.reconstruction | 133, 140, 151 |
| abstract_inverted_index.reconstruction, | 36 |
| abstract_inverted_index.state-of-the-art | 6 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 3 |
| citation_normalized_percentile | |
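
The `abstract_inverted_index.*` rows in the payload store the abstract as a mapping from each token to the word positions where it occurs. As a minimal sketch of how such an index can be turned back into readable text, the Python snippet below rebuilds the opening words of the abstract. The function name `rebuild_abstract`, its `limit` parameter, and the `sample` dictionary are illustrative assumptions rather than part of any OpenAlex client; `sample` copies only a handful of rows from the table.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Dict[str, List[int]],
                     limit: Optional[int] = None) -> str:
    """Rebuild text from a token -> positions mapping.

    `limit` (hypothetical helper parameter) keeps only positions below it,
    which is handy when the index passed in is partial.
    """
    by_position: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            if limit is None or pos < limit:
                by_position[pos] = word
    # Emit the tokens in ascending position order, separated by spaces.
    return " ".join(by_position[p] for p in sorted(by_position))


# A few rows copied from the payload table (not the full index).
sample = {
    "Tokenizers": [0],
    "are": [1, 27],
    "a": [2, 47, 74, 88, 178],
    "key": [3],
    "component": [4],
    "of": [5, 65, 114, 165],
    "state-of-the-art": [6],
    "generative": [7, 189],
    "image": [8],
    "models,": [9],
}

print(rebuild_abstract(sample, limit=10))
```

Passing the complete index with `limit=None` would rebuild the whole abstract; the `limit=10` call is used here only because `sample` omits most tokens.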