Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
2025 · Open Access · DOI: https://doi.org/10.48550/arxiv.2501.16786
Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate this debate between the two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across dimensions such as overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE's design considerations and broader impacts as a plug-in module and in image modalities. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
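The abstract names two knobs, temporal receptive field and token compression ratio, without describing STE's layers. As a rough illustration of how those two quantities interact in any stack of strided temporal windows, the sketch below assumes each layer is an unpadded 1D window of size `kernel` and step `stride` over the frame axis. That layer shape, and the names `TemporalLayer` and `stack_stats`, are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TemporalLayer:
    """Hypothetical layer: a window over the time axis (not the paper's design)."""
    kernel: int  # number of consecutive inputs each output sees
    stride: int  # temporal downsampling factor of this layer


def stack_stats(layers, num_frames):
    """Return (receptive_field, compression_ratio, output_length) for a stack.

    Uses standard strided-window arithmetic with no padding.
    """
    receptive_field = 1  # input frames seen by one output token
    jump = 1             # input-frame distance between adjacent outputs
    length = num_frames
    for layer in layers:
        if length < layer.kernel:
            raise ValueError("sequence shorter than the layer's window")
        receptive_field += (layer.kernel - 1) * jump
        jump *= layer.stride
        length = (length - layer.kernel) // layer.stride + 1
    return receptive_field, jump, length


# Two stacked layers, each merging 2 neighbours with stride 2, over 16 frames.
print(stack_stats([TemporalLayer(2, 2), TemporalLayer(2, 2)], 16))
# (4, 4, 4)
```

Under these assumptions, stacking more layers widens the receptive field and multiplies the compression ratio at the same time, which is the kind of trade-off the abstract says STE makes adjustable.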
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2501.16786 (PDF: https://arxiv.org/pdf/2501.16786)
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4406959517
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4406959517 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2501.16786 (Digital Object Identifier)
- Title: Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2025 (year of publication)
- Publication date: 2025-01-28 (full publication date if available)
- Authors: Yun Li, Zhe Liu, Yan‐Chen Kong, Guangrui Li, Jiyuan Zhang, Chao Bian, Feng Liu, Lina Yao, Zhenbang Sun (list of authors in order)
- Landing page: https://arxiv.org/abs/2501.16786 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2501.16786 (direct link to full text PDF)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2501.16786 (direct OA link when available)
- Concepts: Computer science, Natural language processing, Linguistics, Philosophy (top concepts, i.e. fields/topics, attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| id | https://openalex.org/W4406959517 |
|---|---|
| doi | https://doi.org/10.48550/arxiv.2501.16786 |
| ids.doi | https://doi.org/10.48550/arxiv.2501.16786 |
| ids.openalex | https://openalex.org/W4406959517 |
| fwci | |
| type | preprint |
| title | Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T10028 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9616000056266785 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1702 |
| topics[0].subfield.display_name | Artificial Intelligence |
| topics[0].display_name | Topic Modeling |
| topics[1].id | https://openalex.org/T10181 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9320999979972839 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1702 |
| topics[1].subfield.display_name | Artificial Intelligence |
| topics[1].display_name | Natural Language Processing Techniques |
| topics[2].id | https://openalex.org/T11714 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9290000200271606 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Multimodal Machine Learning Applications |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C41008148 |
| concepts[0].level | 0 |
| concepts[0].score | 0.6244266033172607 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[0].display_name | Computer science |
| concepts[1].id | https://openalex.org/C204321447 |
| concepts[1].level | 1 |
| concepts[1].score | 0.3390963077545166 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q30642 |
| concepts[1].display_name | Natural language processing |
| concepts[2].id | https://openalex.org/C41895202 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3250155448913574 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q8162 |
| concepts[2].display_name | Linguistics |
| concepts[3].id | https://openalex.org/C138885662 |
| concepts[3].level | 0 |
| concepts[3].score | 0.05929562449455261 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q5891 |
| concepts[3].display_name | Philosophy |
| keywords[0].id | https://openalex.org/keywords/computer-science |
| keywords[0].score | 0.6244266033172607 |
| keywords[0].display_name | Computer science |
| keywords[1].id | https://openalex.org/keywords/natural-language-processing |
| keywords[1].score | 0.3390963077545166 |
| keywords[1].display_name | Natural language processing |
| keywords[2].id | https://openalex.org/keywords/linguistics |
| keywords[2].score | 0.3250155448913574 |
| keywords[2].display_name | Linguistics |
| keywords[3].id | https://openalex.org/keywords/philosophy |
| keywords[3].score | 0.05929562449455261 |
| keywords[3].display_name | Philosophy |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2501.16786 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2501.16786 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2501.16786 |
| locations[1].id | doi:10.48550/arxiv.2501.16786 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2501.16786 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5100369214 |
| authorships[0].author.orcid | https://orcid.org/0000-0002-5784-1877 |
| authorships[0].author.display_name | Yun Li |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Li, Yun |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5100600813 |
| authorships[1].author.orcid | https://orcid.org/0000-0002-2251-469X |
| authorships[1].author.display_name | Zhe Liu |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Liu, Zhe |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5111133285 |
| authorships[2].author.orcid | https://orcid.org/0009-0004-2006-9870 |
| authorships[2].author.display_name | Yan‐Chen Kong |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Kong, Yajing |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5103035781 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-6535-1464 |
| authorships[3].author.display_name | Guangrui Li |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Li, Guangrui |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5100691286 |
| authorships[4].author.orcid | https://orcid.org/0000-0001-9281-9264 |
| authorships[4].author.display_name | Jiyuan Zhang |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Zhang, Jiyuan |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5048613467 |
| authorships[5].author.orcid | https://orcid.org/0000-0002-8208-8792 |
| authorships[5].author.display_name | Chao Bian |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Bian, Chao |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5100639398 |
| authorships[6].author.orcid | https://orcid.org/0000-0002-5005-9129 |
| authorships[6].author.display_name | Feng Liu |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Liu, Feng |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5037147868 |
| authorships[7].author.orcid | https://orcid.org/0000-0003-2235-7556 |
| authorships[7].author.display_name | Lina Yao |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Yao, Lina |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5113347986 |
| authorships[8].author.orcid | |
| authorships[8].author.display_name | Zhenbang Sun |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Sun, Zhenbang |
| authorships[8].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2501.16786 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T10028 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9616000056266785 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1702 |
| primary_topic.subfield.display_name | Artificial Intelligence |
| primary_topic.display_name | Topic Modeling |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W4391913857, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2501.16786 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2501.16786 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2501.16786 |
| primary_location.id | pmh:oai:arXiv.org:2501.16786 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2501.16786 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2501.16786 |
| publication_date | 2025-01-28 |
| publication_year | 2025 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 105 |
| abstract_inverted_index.To | 43 |
| abstract_inverted_index.We | 95 |
| abstract_inverted_index.as | 86, 104 |
| abstract_inverted_index.in | 109 |
| abstract_inverted_index.of | 118 |
| abstract_inverted_index.on | 31 |
| abstract_inverted_index.or | 35 |
| abstract_inverted_index.to | 6, 13, 16, 125 |
| abstract_inverted_index.we | 51, 75 |
| abstract_inverted_index.LLM | 33 |
| abstract_inverted_index.Our | 112 |
| abstract_inverted_index.STE | 58 |
| abstract_inverted_index.and | 69, 79, 92, 101, 108 |
| abstract_inverted_index.due | 12 |
| abstract_inverted_index.the | 14, 32, 48, 53, 115 |
| abstract_inverted_index.two | 49 |
| abstract_inverted_index.STE, | 74 |
| abstract_inverted_index.also | 96 |
| abstract_inverted_index.need | 15 |
| abstract_inverted_index.role | 117 |
| abstract_inverted_index.such | 85 |
| abstract_inverted_index.this | 45 |
| abstract_inverted_index.with | 64 |
| abstract_inverted_index.Large | 2 |
| abstract_inverted_index.STE's | 98 |
| abstract_inverted_index.Using | 73 |
| abstract_inverted_index.adopt | 24 |
| abstract_inverted_index.image | 110 |
| abstract_inverted_index.model | 17 |
| abstract_inverted_index.token | 70, 89 |
| abstract_inverted_index.video | 7, 127 |
| abstract_inverted_index.(STE). | 57 |
| abstract_inverted_index.MLLMs. | 128 |
| abstract_inverted_index.Models | 4 |
| abstract_inverted_index.across | 20, 83 |
| abstract_inverted_index.debate | 46 |
| abstract_inverted_index.design | 99 |
| abstract_inverted_index.either | 25 |
| abstract_inverted_index.fields | 68 |
| abstract_inverted_index.module | 107 |
| abstract_inverted_index.solely | 30 |
| abstract_inverted_index.(MLLMs) | 5 |
| abstract_inverted_index.Encoder | 56 |
| abstract_inverted_index.advance | 126 |
| abstract_inverted_index.between | 47 |
| abstract_inverted_index.broader | 102 |
| abstract_inverted_index.compare | 77 |
| abstract_inverted_index.enables | 59 |
| abstract_inverted_index.explore | 97 |
| abstract_inverted_index.frames. | 21 |
| abstract_inverted_index.impacts | 103 |
| abstract_inverted_index.overall | 87 |
| abstract_inverted_index.plug-in | 106 |
| abstract_inverted_index.propose | 52 |
| abstract_inverted_index.ratios. | 72 |
| abstract_inverted_index.relying | 29 |
| abstract_inverted_index.Applying | 0 |
| abstract_inverted_index.Existing | 22 |
| abstract_inverted_index.Language | 3 |
| abstract_inverted_index.Temporal | 55 |
| abstract_inverted_index.critical | 116 |
| abstract_inverted_index.decoder, | 34 |
| abstract_inverted_index.explicit | 36, 61, 80, 119 |
| abstract_inverted_index.findings | 113 |
| abstract_inverted_index.flexible | 60 |
| abstract_inverted_index.implicit | 26, 78 |
| abstract_inverted_index.insights | 124 |
| abstract_inverted_index.modeling | 63, 82 |
| abstract_inverted_index.presents | 9 |
| abstract_inverted_index.temporal | 18, 27, 37, 41, 62, 66, 81, 120 |
| abstract_inverted_index.Stackable | 54 |
| abstract_inverted_index.auxiliary | 40 |
| abstract_inverted_index.emphasize | 114 |
| abstract_inverted_index.employing | 39 |
| abstract_inverted_index.encoders. | 42 |
| abstract_inverted_index.modeling, | 28, 38, 121 |
| abstract_inverted_index.providing | 122 |
| abstract_inverted_index.receptive | 67 |
| abstract_inverted_index.relations | 19 |
| abstract_inverted_index.Multimodal | 1 |
| abstract_inverted_index.actionable | 123 |
| abstract_inverted_index.adjustable | 65 |
| abstract_inverted_index.approaches | 23 |
| abstract_inverted_index.challenges | 11 |
| abstract_inverted_index.dimensions | 84 |
| abstract_inverted_index.paradigms, | 50 |
| abstract_inverted_index.compression | 71, 90 |
| abstract_inverted_index.investigate | 44 |
| abstract_inverted_index.modalities. | 111 |
| abstract_inverted_index.significant | 10 |
| abstract_inverted_index.performance, | 88 |
| abstract_inverted_index.understanding | 8 |
| abstract_inverted_index.considerations | 100 |
| abstract_inverted_index.effectiveness, | 91 |
| abstract_inverted_index.systematically | 76 |
| abstract_inverted_index.understanding. | 94 |
| abstract_inverted_index.temporal-specific | 93 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
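
The `abstract_inverted_index.*` rows in the table store the abstract as a map from each word to the positions where it occurs. A minimal sketch of turning such a map back into running text follows; the helper names `parse_positions` and `rebuild_abstract` are illustrative and not part of any OpenAlex client, and the sample uses only a few rows copied from the table, keeping positions below 10 so the output is contiguous.

```python
def parse_positions(cell):
    """Parse a table cell such as '6, 13, 16, 125' into a list of ints."""
    return [int(part) for part in cell.split(",") if part.strip()]


def rebuild_abstract(inverted_index):
    """Turn {word: [positions]} back into text ordered by position."""
    positioned = [
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# A few rows from the payload table (word -> positions cell).
rows = {
    "Applying": "0",
    "Multimodal": "1",
    "Large": "2",
    "Language": "3",
    "Models": "4",
    "(MLLMs)": "5",
    "to": "6, 13, 16, 125",
    "video": "7, 127",
    "understanding": "8",
    "presents": "9",
}

# Keep only the first ten positions so this partial sample stays contiguous.
index = {
    word: [p for p in parse_positions(cell) if p < 10]
    for word, cell in rows.items()
}

print(rebuild_abstract(index))
# Applying Multimodal Large Language Models (MLLMs) to video understanding presents
```

Applied to all rows of the table, the same function would reproduce the abstract shown at the top of the record, since every word position from 0 to 128 appears exactly once.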