Temporal Reasoning Transfer from Text to Video
2024 · Open Access
· DOI: https://doi.org/10.48550/arxiv.2410.06166
Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2410.06166, https://arxiv.org/pdf/2410.06166
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4403344663
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4403344663 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2410.06166 (Digital Object Identifier)
- Title: Temporal Reasoning Transfer from Text to Video (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024 (year of publication)
- Publication date: 2024-10-08 (full publication date if available)
- Authors: Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu (list of authors in order)
- Landing page: https://arxiv.org/abs/2410.06166 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2410.06166 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2410.06166 (direct OA link when available)
- Concepts: Computer science, Transfer (computing), Natural language processing, Parallel computing (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4403344663 |
| doi | https://doi.org/10.48550/arxiv.2410.06166 |
| ids.doi | https://doi.org/10.48550/arxiv.2410.06166 |
| ids.openalex | https://openalex.org/W4403344663 |
| fwci | |
| type | preprint |
| title | Temporal Reasoning Transfer from Text to Video |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10181 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9793999791145325 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Natural Language Processing Techniques |
| topics[1].id | https://openalex.org/T11439 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9445000290870667 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Video Analysis and Summarization |
| topics[2].id | https://openalex.org/T11714 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9326000213623047 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.5931463241577148 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C2776175482 |
| concepts[1].level | 2 |
| concepts[1].score | 0.43998560309410095 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q1195816 |
| concepts[1].display_name | Transfer (computing) |
| concepts[2].id | https://openalex.org/C204321447 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3283807039260864 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[2].display_name | Natural language processing |
| concepts[3].id | https://openalex.org/C173608175 |
| concepts[3].level | 1 |
| concepts[3].score | 0.0 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q232661 |
| concepts[3].display_name | Parallel computing |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.5931463241577148 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/transfer |
| keywords[1].score | 0.43998560309410095 |
| keywords[1].display_name | Transfer (computing) |
| keywords[2].id | https://openalex.org/keywords/natural-language-processing |
| keywords[2].score | 0.3283807039260864 |
| keywords[2].display_name | Natural language processing |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2410.06166 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2410.06166 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2410.06166 |
| locations[1].id | doi:10.48550/arxiv.2410.06166 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | cc-by |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | https://openalex.org/licenses/cc-by |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2410.06166 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5112885011 |
| authorships[0].author.orcid | |
| authorships[0].author.display_name | Lei Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Lei |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5044541656 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-0673-4887 |
| authorships[1].author.display_name | Yuanxin Liu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Liu, Yuanxin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5100599428 |
| authorships[2].author.orcid | https://orcid.org/0000-0002-9809-8864 |
| authorships[2].author.display_name | Linli Yao |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Yao, Linli |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5001609288 |
| authorships[3].author.orcid | https://orcid.org/0000-0003-4086-3436 |
| authorships[3].author.display_name | Peiyuan Zhang |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Zhang, Peiyuan |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5055648957 |
| authorships[4].author.orcid | |
| authorships[4].author.display_name | Chenxin An |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | An, Chenxin |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5034611488 |
| authorships[5].author.orcid | |
| authorships[5].author.display_name | Lean Wang |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Wang, Lean |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5101441137 |
| authorships[6].author.orcid | https://orcid.org/0000-0001-8241-9320 |
| authorships[6].author.display_name | Xu Sun |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Sun, Xu |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5014554970 |
| authorships[7].author.orcid | https://orcid.org/0000-0002-9033-2724 |
| authorships[7].author.display_name | Lingpeng Kong |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Kong, Lingpeng |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5115590504 |
| authorships[8].author.orcid | https://orcid.org/0000-0001-6956-5550 |
| authorships[8].author.display_name | Qi Liu |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Liu, Qi |
| authorships[8].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2410.06166 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Temporal Reasoning Transfer from Text to Video |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10181 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9793999791145325 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Natural Language Processing Techniques |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052, https://openalex.org/W4402327032, https://openalex.org/W2382290278 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2410.06166 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2410.06166 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2410.06166 |
| primary_location.id | pmh:oai:arXiv.org:2410.06166 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2410.06166 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2410.06166 |
| publication_date | 2024-10-08 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 139, 177, 199 |
| abstract_inverted_index.T3 | 103, 133 |
| abstract_inverted_index.as | 81, 192 |
| abstract_inverted_index.by | 83 |
| abstract_inverted_index.in | 10, 65, 109 |
| abstract_inverted_index.it | 175 |
| abstract_inverted_index.of | 36, 120, 185, 212 |
| abstract_inverted_index.on | 86, 92, 144, 157, 169, 180 |
| abstract_inverted_index.to | 31, 54, 153, 219 |
| abstract_inverted_index.we | 59, 95 |
| abstract_inverted_index.5.3 | 140 |
| abstract_inverted_index.For | 173 |
| abstract_inverted_index.and | 20, 194, 204 |
| abstract_inverted_index.any | 130 |
| abstract_inverted_index.for | 49 |
| abstract_inverted_index.key | 63 |
| abstract_inverted_index.our | 39, 151 |
| abstract_inverted_index.the | 32, 62, 73, 97, 118, 145, 162, 181, 210 |
| abstract_inverted_index.yet | 13 |
| abstract_inverted_index.49.7 | 178 |
| abstract_inverted_index.even | 50 |
| abstract_inverted_index.find | 60 |
| abstract_inverted_index.from | 72, 113, 217 |
| abstract_inverted_index.have | 6 |
| abstract_inverted_index.poor | 84 |
| abstract_inverted_index.pure | 110 |
| abstract_inverted_index.such | 191 |
| abstract_inverted_index.task | 184, 207 |
| abstract_inverted_index.text | 111, 218 |
| abstract_inverted_index.that | 43, 61 |
| abstract_inverted_index.they | 14 |
| abstract_inverted_index.this | 29, 93 |
| abstract_inverted_index.with | 16, 78, 123 |
| abstract_inverted_index.(T3). | 102 |
| abstract_inverted_index.LLM's | 75 |
| abstract_inverted_index.LLMs' | 67 |
| abstract_inverted_index.LLMs) | 5 |
| abstract_inverted_index.Large | 1 |
| abstract_inverted_index.Video | 0, 66 |
| abstract_inverted_index.While | 25 |
| abstract_inverted_index.about | 22 |
| abstract_inverted_index.data, | 132 |
| abstract_inverted_index.model | 152, 165 |
| abstract_inverted_index.shown | 7 |
| abstract_inverted_index.small | 51 |
| abstract_inverted_index.stems | 71 |
| abstract_inverted_index.study | 41 |
| abstract_inverted_index.tasks | 108 |
| abstract_inverted_index.using | 129 |
| abstract_inverted_index.video | 11, 44, 121, 131, 159, 171, 205, 220 |
| abstract_inverted_index.which | 149 |
| abstract_inverted_index.(Video | 4 |
| abstract_inverted_index.28,000 | 158 |
| abstract_inverted_index.Models | 3 |
| abstract_inverted_index.format | 112 |
| abstract_inverted_index.models | 190 |
| abstract_inverted_index.strong | 200 |
| abstract_inverted_index.tasks. | 90 |
| abstract_inverted_index.visual | 37 |
| abstract_inverted_index.Further | 196 |
| abstract_inverted_index.Textual | 98 |
| abstract_inverted_index.achieve | 55 |
| abstract_inverted_index.between | 202 |
| abstract_inverted_index.changes | 19 |
| abstract_inverted_index.complex | 124 |
| abstract_inverted_index.contain | 46 |
| abstract_inverted_index.diverse | 105 |
| abstract_inverted_index.enables | 150 |
| abstract_inverted_index.inputs, | 38 |
| abstract_inverted_index.perfect | 56 |
| abstract_inverted_index.probing | 52 |
| abstract_inverted_index.reveals | 42, 198 |
| abstract_inverted_index.samples | 122 |
| abstract_inverted_index.textual | 87, 203 |
| abstract_inverted_index.trained | 156 |
| abstract_inverted_index.without | 128 |
| abstract_inverted_index.Building | 91 |
| abstract_inverted_index.Language | 2 |
| abstract_inverted_index.Temporal | 99, 182 |
| abstract_inverted_index.Transfer | 101 |
| abstract_inverted_index.absolute | 141 |
| abstract_inverted_index.accuracy | 142, 179 |
| abstract_inverted_index.achieves | 166, 176 |
| abstract_inverted_index.analysis | 197 |
| abstract_inverted_index.domains. | 221 |
| abstract_inverted_index.efficacy | 211 |
| abstract_inverted_index.encoding | 35 |
| abstract_inverted_index.enhanced | 163 |
| abstract_inverted_index.enhances | 134 |
| abstract_inverted_index.example, | 174 |
| abstract_inverted_index.existing | 114 |
| abstract_inverted_index.inherent | 76 |
| abstract_inverted_index.powerful | 188 |
| abstract_inverted_index.previous | 26 |
| abstract_inverted_index.research | 27 |
| abstract_inverted_index.samples. | 160 |
| abstract_inverted_index.scarcity | 119 |
| abstract_inverted_index.struggle | 15 |
| abstract_inverted_index.temporal | 18, 23, 34, 68, 79, 88, 106, 125, 136, 206, 214 |
| abstract_inverted_index.tracking | 17 |
| abstract_inverted_index.yielding | 138 |
| abstract_inverted_index.LongVA-7B | 164 |
| abstract_inverted_index.Reasoning | 183 |
| abstract_inverted_index.abilities | 216 |
| abstract_inverted_index.accuracy. | 57 |
| abstract_inverted_index.concepts, | 80 |
| abstract_inverted_index.datasets, | 116 |
| abstract_inverted_index.evidenced | 82 |
| abstract_inverted_index.introduce | 96 |
| abstract_inverted_index.promising | 8 |
| abstract_inverted_index.reasoning | 21, 69, 100, 107, 215 |
| abstract_inverted_index.Video-MME, | 186 |
| abstract_inverted_index.addressing | 117 |
| abstract_inverted_index.attributed | 28 |
| abstract_inverted_index.benchmark, | 148 |
| abstract_inverted_index.bottleneck | 64 |
| abstract_inverted_index.capability | 70 |
| abstract_inverted_index.diagnostic | 40 |
| abstract_inverted_index.difficulty | 77 |
| abstract_inverted_index.discovery, | 94 |
| abstract_inverted_index.image-text | 115 |
| abstract_inverted_index.limitation | 30 |
| abstract_inverted_index.outperform | 154 |
| abstract_inverted_index.scenarios. | 126 |
| abstract_inverted_index.sufficient | 47 |
| abstract_inverted_index.surpassing | 187 |
| abstract_inverted_index.underlying | 74 |
| abstract_inverted_index.validating | 209 |
| abstract_inverted_index.LongVA-7B's | 135 |
| abstract_inverted_index.Remarkably, | 127 |
| abstract_inverted_index.TempCompass | 147 |
| abstract_inverted_index.benchmarks. | 172 |
| abstract_inverted_index.challenging | 146 |
| abstract_inverted_index.classifiers | 53 |
| abstract_inverted_index.competitive | 167 |
| abstract_inverted_index.correlation | 201 |
| abstract_inverted_index.improvement | 143 |
| abstract_inverted_index.ineffective | 33 |
| abstract_inverted_index.information | 48 |
| abstract_inverted_index.large-scale | 189 |
| abstract_inverted_index.performance | 85, 168 |
| abstract_inverted_index.synthesizes | 104 |
| abstract_inverted_index.VILA1.5-40B. | 195 |
| abstract_inverted_index.capabilities | 9 |
| abstract_inverted_index.performance, | 208 |
| abstract_inverted_index.transferring | 213 |
| abstract_inverted_index.Additionally, | 161 |
| abstract_inverted_index.Surprisingly, | 58 |
| abstract_inverted_index.comprehensive | 170 |
| abstract_inverted_index.comprehension, | 12 |
| abstract_inverted_index.relationships. | 24 |
| abstract_inverted_index.understanding, | 137 |
| abstract_inverted_index.representations | 45 |
| abstract_inverted_index.ShareGPT4Video-8B | 155 |
| abstract_inverted_index.question-answering | 89 |
| abstract_inverted_index.InternVL-Chat-V1.5-20B | 193 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile |
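
The `abstract_inverted_index.*` rows above store the abstract as a map from each token to the word positions where it occurs. As a minimal sketch, assuming the rows keep the `| abstract_inverted_index.<token> | <positions> |` layout shown in the table, the text can be rebuilt by placing each token at its positions and joining them in order. The helper names `parse_rows` and `rebuild_text` are hypothetical and not part of any OpenAlex tooling.

```python
# Hypothetical helpers: rebuild text from flattened inverted-index table rows.
PREFIX = "abstract_inverted_index."


def parse_rows(rows):
    """Turn table rows into a {token: [positions]} mapping."""
    index = {}
    for row in rows:
        # Drop the outer pipes, then split into the key cell and the value cell.
        cells = [c.strip() for c in row.strip().strip("|").split("|")]
        if len(cells) != 2 or not cells[0].startswith(PREFIX):
            continue  # not an inverted-index row
        # Slice off the prefix only, because tokens may themselves contain dots.
        token = cells[0][len(PREFIX):]
        index[token] = [int(p) for p in cells[1].split(",") if p.strip()]
    return index


def rebuild_text(index):
    """Place every token at each of its positions and join in position order."""
    by_position = {}
    for token, positions in index.items():
        for p in positions:
            by_position[p] = token
    return " ".join(by_position[p] for p in sorted(by_position))


# Rows trimmed to the first six word positions of the abstract for illustration.
rows = [
    "| abstract_inverted_index.Video | 0 |",
    "| abstract_inverted_index.Large | 1 |",
    "| abstract_inverted_index.Language | 2 |",
    "| abstract_inverted_index.Models | 3 |",
    "| abstract_inverted_index.(Video | 4 |",
    "| abstract_inverted_index.LLMs) | 5 |",
]
print(rebuild_text(parse_rows(rows)))  # Video Large Language Models (Video LLMs)
```

Feeding all `abstract_inverted_index.*` rows to the same two functions should reproduce the abstract given near the top of this record, since punctuation is stored as part of each token.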