Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
2025 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2505.17011
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.
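The budgeted allocation step described in the abstract can be illustrated with a small sketch. It assumes a scorer has already produced, for each block, a predicted quality score for every candidate token count, and it picks one token count per block so that the summed score is maximal while the total stays within the budget. The function name `allocate_tokens`, the candidate token counts, and the score values are hypothetical, and the exact objective used by AdapTok is not given in the abstract. For self-containment the sketch solves this small integer program exactly with dynamic programming rather than calling an ILP solver.

```python
from typing import List, Sequence


def allocate_tokens(scores: Sequence[Sequence[float]],
                    token_options: Sequence[int],
                    budget: int) -> List[int]:
    """Choose one token count per block (hypothetical helper).

    scores[b][k] is the predicted quality of block b when it keeps
    token_options[k] tokens. The result maximizes the summed score
    subject to sum(chosen tokens) <= budget.
    """
    n_blocks = len(scores)
    neg_inf = float("-inf")
    # best[b][t]: best summed score over the first b blocks using exactly t tokens.
    best = [[neg_inf] * (budget + 1) for _ in range(n_blocks + 1)]
    # choice[b][t]: option index picked for block b-1 to reach best[b][t].
    choice = [[-1] * (budget + 1) for _ in range(n_blocks + 1)]
    best[0][0] = 0.0

    for b in range(n_blocks):
        for t in range(budget + 1):
            if best[b][t] == neg_inf:
                continue  # this token total is unreachable
            for k, n_tok in enumerate(token_options):
                nt = t + n_tok
                if nt > budget:
                    continue
                cand = best[b][t] + scores[b][k]
                if cand > best[b + 1][nt]:
                    best[b + 1][nt] = cand
                    choice[b + 1][nt] = k

    # Pick the best reachable total that respects the budget.
    t = max(range(budget + 1), key=lambda u: best[n_blocks][u])
    if best[n_blocks][t] == neg_inf:
        raise ValueError("budget is too small for the smallest option per block")

    # Walk backwards through the recorded choices to recover the allocation.
    alloc: List[int] = []
    for b in range(n_blocks, 0, -1):
        k = choice[b][t]
        alloc.append(token_options[k])
        t -= token_options[k]
    alloc.reverse()
    return alloc


# Made-up scores: block 0 is nearly static, block 1 has a lot of motion.
example_scores = [[0.90, 0.92, 0.93],
                  [0.30, 0.60, 0.95]]
print(allocate_tokens(example_scores, token_options=[1, 2, 4], budget=5))  # -> [1, 4]
```

With these illustrative numbers, the static block keeps one token and the high-motion block receives four, which is the kind of content-aware, budget-controlled behaviour the abstract describes.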
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2505.17011 (PDF: https://arxiv.org/pdf/2505.17011)
- OA Status: green
- OpenAlex ID: https://openalex.org/W4415329721
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4415329721 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2505.17011 (Digital Object Identifier)
- Title: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-05-22 (full publication date if available)
- Authors: Mengqi Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang (list of authors in order)
- Landing page: https://arxiv.org/abs/2505.17011 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2505.17011 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2505.17011 (direct OA link when available)
- Cited by: 0 (total citation count in OpenAlex)
Full payload
| id | https://openalex.org/W4415329721 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2505.17011 |
| ids.doi | https://doi.org/10.48550/arxiv.2505.17011 |
| ids.openalex | https://openalex.org/W4415329721 |
| fwci | 0.0 |
| type | preprint |
| title | Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10775 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9984999895095825 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Generative Adversarial Networks and Image Synthesis |
| topics[1].id | https://openalex.org/T11439 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9779999852180481 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Video Analysis and Summarization |
| topics[2].id | https://openalex.org/T11105 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9731000065803528 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Advanced Image Processing Techniques |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2505.17011 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2505.17011 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2505.17011 |
| locations[1].id | doi:10.48550/arxiv.2505.17011 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2505.17011 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100662610 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-9952-3690 |
| authorships[0].author.display_name | Mengqi Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Yan |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5087792832 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-3285-4671 |
| authorships[1].author.display_name | Changyao Tian |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Tian, Changyao |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5113021225 |
| authorships[2].author.orcid | |
| authorships[2].author.display_name | Renqiu Xia |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Xia, Renqiu |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5111696111 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-3764-2555 |
| authorships[3].author.display_name | Ning Liao |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Liao, Ning |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5027639981 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-5037-0972 |
| authorships[4].author.display_name | Weiwei Guo |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Guo, Weiwei |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5087158377 |
| authorships[5].author.orcid | https://orcid.org/0000-0001-9639-7679 |
| authorships[5].author.display_name | Junchi Yan |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Yan, Junchi |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5100732450 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-2664-7975 |
| authorships[6].author.display_name | Hongsheng Li |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Li, Hongsheng |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5026944066 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-6785-0785 |
| authorships[7].author.display_name | Jifeng Dai |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Dai, Jifeng |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5100348486 |
| authorships[8].author.orcid | https://orcid.org/0000-0001-6861-9430 |
| authorships[8].author.display_name | Hao Li |
| authorships[8].author_position | middle |
| authorships[8].raw_author_name | Li, Hao |
| authorships[8].is_corresponding | False |
| authorships[9].author.id | https://openalex.org/A5101548928 |
| authorships[9].author.orcid | https://orcid.org/0000-0001-6053-8855 |
| authorships[9].author.display_name | Xue Yang |
| authorships[9].author_position | last |
| authorships[9].raw_author_name | Yang, Xue |
| authorships[9].is_corresponding | False |
| has_content.pdf | True |
| has_content.grobid_xml | True |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2505.17011 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-19T00:00:00 |
| display_name | Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10775 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9984999895095825 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Generative Adversarial Networks and Image Synthesis |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2505.17011 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2505.17011 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2505.17011 |
| primary_location.id | pmh:oai:arXiv.org:2505.17011 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2505.17011 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2505.17011 |
| publication_date | 2025-05-22 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 25, 40, 91 |
| abstract_inverted_index.We | 0 |
| abstract_inverted_index.an | 3, 59 |
| abstract_inverted_index.is | 22, 69 |
| abstract_inverted_index.of | 34, 49, 55, 109 |
| abstract_inverted_index.on | 18, 65, 102 |
| abstract_inverted_index.to | 44, 72 |
| abstract_inverted_index.and | 39, 85, 100, 104, 121, 132 |
| abstract_inverted_index.can | 10 |
| abstract_inverted_index.for | 14, 82, 97, 129 |
| abstract_inverted_index.our | 110 |
| abstract_inverted_index.the | 46, 107 |
| abstract_inverted_index.Such | 79 |
| abstract_inverted_index.each | 35 |
| abstract_inverted_index.more | 130 |
| abstract_inverted_index.tail | 32 |
| abstract_inverted_index.that | 9, 29 |
| abstract_inverted_index.with | 24 |
| abstract_inverted_index.based | 17, 64 |
| abstract_inverted_index.block | 36, 41 |
| abstract_inverted_index.data, | 115 |
| abstract_inverted_index.drops | 31 |
| abstract_inverted_index.given | 76 |
| abstract_inverted_index.image | 114 |
| abstract_inverted_index.token | 61, 74, 88, 126 |
| abstract_inverted_index.under | 90, 124 |
| abstract_inverted_index.usage | 75 |
| abstract_inverted_index.using | 52 |
| abstract_inverted_index.video | 7, 19, 50, 98, 135 |
| abstract_inverted_index.During | 57 |
| abstract_inverted_index.adjust | 73 |
| abstract_inverted_index.allows | 81 |
| abstract_inverted_index.causal | 6, 42 |
| abstract_inverted_index.design | 80 |
| abstract_inverted_index.during | 37 |
| abstract_inverted_index.frames | 16, 51 |
| abstract_inverted_index.linear | 67 |
| abstract_inverted_index.scorer | 43 |
| abstract_inverted_index.tokens | 13, 33 |
| abstract_inverted_index.AdapTok | 21, 116 |
| abstract_inverted_index.UCF-101 | 103 |
| abstract_inverted_index.Without | 112 |
| abstract_inverted_index.budget. | 94 |
| abstract_inverted_index.dynamic | 87 |
| abstract_inverted_index.further | 70 |
| abstract_inverted_index.integer | 66 |
| abstract_inverted_index.masking | 27 |
| abstract_inverted_index.numbers | 54 |
| abstract_inverted_index.overall | 93 |
| abstract_inverted_index.predict | 45 |
| abstract_inverted_index.propose | 1 |
| abstract_inverted_index.quality | 48, 120 |
| abstract_inverted_index.scores. | 78 |
| abstract_inverted_index.tokens. | 56 |
| abstract_inverted_index.AdapTok, | 2 |
| abstract_inverted_index.adaptive | 4, 60 |
| abstract_inverted_index.allocate | 12 |
| abstract_inverted_index.allowing | 128 |
| abstract_inverted_index.budgets, | 127 |
| abstract_inverted_index.content. | 20 |
| abstract_inverted_index.equipped | 23 |
| abstract_inverted_index.flexibly | 11 |
| abstract_inverted_index.improves | 118 |
| abstract_inverted_index.proposed | 71 |
| abstract_inverted_index.randomly | 30 |
| abstract_inverted_index.scalable | 131 |
| abstract_inverted_index.strategy | 28, 63 |
| abstract_inverted_index.temporal | 5 |
| abstract_inverted_index.Extensive | 95 |
| abstract_inverted_index.approach. | 111 |
| abstract_inverted_index.different | 15, 53, 125 |
| abstract_inverted_index.modeling. | 136 |
| abstract_inverted_index.predicted | 77 |
| abstract_inverted_index.tokenizer | 8 |
| abstract_inverted_index.training, | 38 |
| abstract_inverted_index.additional | 113 |
| abstract_inverted_index.allocation | 62, 89 |
| abstract_inverted_index.block-wise | 26 |
| abstract_inverted_index.generation | 101, 122 |
| abstract_inverted_index.generative | 134 |
| abstract_inverted_index.inference, | 58 |
| abstract_inverted_index.temporally | 86 |
| abstract_inverted_index.demonstrate | 106 |
| abstract_inverted_index.experiments | 96 |
| abstract_inverted_index.performance | 123 |
| abstract_inverted_index.programming | 68 |
| abstract_inverted_index.Kinetics-600 | 105 |
| abstract_inverted_index.consistently | 117 |
| abstract_inverted_index.controllable | 92 |
| abstract_inverted_index.sample-wise, | 83 |
| abstract_inverted_index.effectiveness | 108 |
| abstract_inverted_index.content-aware, | 84 |
| abstract_inverted_index.reconstruction | 47, 99, 119 |
| abstract_inverted_index.token-efficient | 133 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 10 |
| citation_normalized_percentile |
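
The `abstract_inverted_index.*` rows above store the abstract as a map from each word to the positions where it occurs. A minimal sketch of turning such a map back into running text follows. It assumes the index is available as a Python dictionary from word to a list of integer positions; `rebuild_abstract` is a hypothetical helper name, and the sample dictionary is a short excerpt keyed by the first position listed for each word in the table.

```python
from typing import Dict, List


def rebuild_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a word -> positions map (hypothetical helper)."""
    # Invert the map: position -> word.
    by_position: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in positional order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Excerpt covering positions 0-8 of the rows listed in the table.
sample_index = {
    "We": [0], "propose": [1], "AdapTok,": [2], "an": [3],
    "adaptive": [4], "temporal": [5], "causal": [6],
    "video": [7], "tokenizer": [8],
}
print(rebuild_abstract(sample_index))
# -> We propose AdapTok, an adaptive temporal causal video tokenizer
```

Punctuation stays attached to the words, as in the table keys, so the rebuilt text matches the abstract apart from whitespace.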