Learning Video Context as Interleaved Multimodal Sequences
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2407.21757
Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be public at https://github.com/showlab/MovieSeq.
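As a rough illustration of the idea described in the abstract (character photos provided jointly with names and dialogues rather than video alone), the following minimal sketch builds an interleaved sequence of image, text, and video segments. It is not the MovieSeq implementation: the class names, the `build_interleaved_prompt` and `render` helpers, the segment ordering, and the `<image>`/`<video>` placeholder tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str  # hypothetical path to a character photo


@dataclass
class Video:
    path: str  # hypothetical path to a video clip


Segment = Union[Text, Image, Video]


def build_interleaved_prompt(
    characters: List[Tuple[str, str]],
    subtitles: List[Tuple[str, str]],
    clip_path: str,
    question: str,
) -> List[Segment]:
    """Interleave character photos, names, the clip, subtitles, and a question."""
    seq: List[Segment] = []
    for name, photo in characters:
        # Place each photo right before its name so the two can be associated.
        seq.append(Image(photo))
        seq.append(Text(f"This is {name}."))
    seq.append(Video(clip_path))
    for speaker, line in subtitles:
        seq.append(Text(f"{speaker}: {line}"))
    seq.append(Text(question))
    return seq


def render(seq: List[Segment]) -> str:
    """Flatten the sequence to text, using placeholder tokens for non-text segments."""
    parts = []
    for seg in seq:
        if isinstance(seg, Text):
            parts.append(seg.content)
        elif isinstance(seg, Image):
            parts.append("<image>")
        else:
            parts.append("<video>")
    return " ".join(parts)
```

For example, `render(build_interleaved_prompt([("Alice", "alice.jpg")], [("Alice", "Hello.")], "clip.mp4", "Who is speaking?"))` yields a single string in which the photo placeholder sits next to the character's name, followed by the clip placeholder, the subtitle line, and the question.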
Related Topics
- Type: preprint
- Language: en
- Landing Page: http://arxiv.org/abs/2407.21757 (PDF: https://arxiv.org/pdf/2407.21757)
- OA Status: green
- Related Works: 10
- OpenAlex ID: https://openalex.org/W4401306823
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4401306823 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2407.21757 (Digital Object Identifier)
- Title: Learning Video Context as Interleaved Multimodal Sequences (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-07-31
- Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou (listed in order)
- Landing page: https://arxiv.org/abs/2407.21757 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2407.21757 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2407.21757 (direct OA link)
- Concepts: Context (archaeology), Computer science, Artificial intelligence, Geography, Archaeology (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4401306823 |
| doi | https://doi.org/10.48550/arxiv.2407.21757 |
| ids.doi | https://doi.org/10.48550/arxiv.2407.21757 |
| ids.openalex | https://openalex.org/W4401306823 |
| fwci | |
| type | preprint |
| title | Learning Video Context as Interleaved Multimodal Sequences |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T14081 |
| topics[0].field.id | https://openalex.org/fields/12 |
| topics[0].field.display_name | Arts and Humanities |
| topics[0].score | 0.423799991607666 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1203 |
| topics[0].subfield.display_name | Language and Linguistics |
| topics[0].display_name | Linguistic Education and Pedagogy |
| topics[1].id | https://openalex.org/T13071 |
| topics[1].field.id | https://openalex.org/fields/36 |
| topics[1].field.display_name | Health Professions |
| topics[1].score | 0.38609999418258667 |
| topics[1].domain.id | https://openalex.org/domains/4 |
| topics[1].domain.display_name | Health Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/3616 |
| topics[1].subfield.display_name | Speech and Hearing |
| topics[1].display_name | Digital Storytelling and Education |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C2779343474 |
| concepts[0].level | 2 |
| concepts[0].score | 0.6242024898529053 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q3109175 |
| concepts[0].display_name | Context (archaeology) |
| concepts[1].id | https://openalex.org/C41008148 |
| concepts[1].level | 0 |
| concepts[1].score | 0.5858369469642639 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[1].display_name | Computer science |
| concepts[2].id | https://openalex.org/C154945302 |
| concepts[2].level | 1 |
| concepts[2].score | 0.3401775360107422 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[2].display_name | Artificial intelligence |
| concepts[3].id | https://openalex.org/C205649164 |
| concepts[3].level | 0 |
| concepts[3].score | 0.09392091631889343 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q1071 |
| concepts[3].display_name | Geography |
| concepts[4].id | https://openalex.org/C166957645 |
| concepts[4].level | 1 |
| concepts[4].score | 0.0 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q23498 |
| concepts[4].display_name | Archaeology |
| keywords[0].id | https://openalex.org/keywords/context |
| keywords[0].score | 0.6242024898529053 |
| keywords[0].display_name | Context (archaeology) |
| keywords[1].id | https://openalex.org/keywords/computer-science |
| keywords[1].score | 0.5858369469642639 |
| keywords[1].display_name | Computer science |
| keywords[2].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[2].score | 0.3401775360107422 |
| keywords[2].display_name | Artificial intelligence |
| keywords[3].id | https://openalex.org/keywords/geography |
| keywords[3].score | 0.09392091631889343 |
| keywords[3].display_name | Geography |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2407.21757 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2407.21757 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2407.21757 |
| locations[1].id | doi:10.48550/arxiv.2407.21757 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2407.21757 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5083481253 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-2568-2346 |
| authorships[0].author.display_name | Kevin Qinghong Lin |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Lin, Kevin Qinghong |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5059735251 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-1155-9507 |
| authorships[1].author.display_name | Pengchuan Zhang |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | Zhang, Pengchuan |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5001133932 |
| authorships[2].author.orcid | https://orcid.org/0000-0001-8494-3492 |
| authorships[2].author.display_name | Difei Gao |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Gao, Difei |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5091179204 |
| authorships[3].author.orcid | |
| authorships[3].author.display_name | Xide Xia |
| authorships[3].author_position | middle |
| authorships[3].raw_author_name | Xia, Xide |
| authorships[3].is_corresponding | False |
| authorships[4].author.id | https://openalex.org/A5000312791 |
| authorships[4].author.orcid | https://orcid.org/0000-0003-2021-9475 |
| authorships[4].author.display_name | Joya Chen |
| authorships[4].author_position | middle |
| authorships[4].raw_author_name | Chen, Joya |
| authorships[4].is_corresponding | False |
| authorships[5].author.id | https://openalex.org/A5107808441 |
| authorships[5].author.orcid | https://orcid.org/0009-0005-6815-6209 |
| authorships[5].author.display_name | Ziteng Gao |
| authorships[5].author_position | middle |
| authorships[5].raw_author_name | Gao, Ziteng |
| authorships[5].is_corresponding | False |
| authorships[6].author.id | https://openalex.org/A5044361651 |
| authorships[6].author.orcid | https://orcid.org/0000-0001-5678-4500 |
| authorships[6].author.display_name | Jinheng Xie |
| authorships[6].author_position | middle |
| authorships[6].raw_author_name | Xie, Jinheng |
| authorships[6].is_corresponding | False |
| authorships[7].author.id | https://openalex.org/A5112053801 |
| authorships[7].author.orcid | |
| authorships[7].author.display_name | Xuhong Xiao |
| authorships[7].author_position | middle |
| authorships[7].raw_author_name | Xiao, Xuhong |
| authorships[7].is_corresponding | False |
| authorships[8].author.id | https://openalex.org/A5068937750 |
| authorships[8].author.orcid | https://orcid.org/0000-0002-7681-2166 |
| authorships[8].author.display_name | Mike Zheng Shou |
| authorships[8].author_position | last |
| authorships[8].raw_author_name | Shou, Mike Zheng |
| authorships[8].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2407.21757 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Learning Video Context as Interleaved Multimodal Sequences |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T14081 |
| primary_topic.field.id | https://openalex.org/fields/12 |
| primary_topic.field.display_name | Arts and Humanities |
| primary_topic.score | 0.423799991607666 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1203 |
| primary_topic.subfield.display_name | Language and Linguistics |
| primary_topic.display_name | Linguistic Education and Pedagogy |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2748952813, https://openalex.org/W2390279801, https://openalex.org/W2358668433, https://openalex.org/W4396701345, https://openalex.org/W2376932109, https://openalex.org/W2001405890, https://openalex.org/W4396696052, https://openalex.org/W2382290278, https://openalex.org/W4395014643 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2407.21757 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2407.21757 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2407.21757 |
| primary_location.id | pmh:oai:arXiv.org:2407.21757 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2407.21757 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2407.21757 |
| publication_date | 2024-07-31 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index.a | 33 |
| abstract_inverted_index.In | 27 |
| abstract_inverted_index.To | 129 |
| abstract_inverted_index.as | 3, 56, 77, 105 |
| abstract_inverted_index.at | 165 |
| abstract_inverted_index.be | 163 |
| abstract_inverted_index.by | 67 |
| abstract_inverted_index.in | 8, 45 |
| abstract_inverted_index.is | 52 |
| abstract_inverted_index.of | 43, 100 |
| abstract_inverted_index.on | 103, 137 |
| abstract_inverted_index.or | 72 |
| abstract_inverted_index.to | 12, 38, 53, 89, 120 |
| abstract_inverted_index.we | 30, 107, 133 |
| abstract_inverted_index.For | 97 |
| abstract_inverted_index.Our | 49 |
| abstract_inverted_index.The | 160 |
| abstract_inverted_index.and | 19, 25, 64, 115, 124, 157 |
| abstract_inverted_index.due | 11 |
| abstract_inverted_index.for | 79 |
| abstract_inverted_index.its | 131 |
| abstract_inverted_index.six | 138 |
| abstract_inverted_index.the | 40, 86, 118 |
| abstract_inverted_index.CMD, | 143 |
| abstract_inverted_index.MAD, | 141 |
| abstract_inverted_index.TVC, | 144 |
| abstract_inverted_index.code | 161 |
| abstract_inverted_index.core | 50 |
| abstract_inverted_index.five | 147 |
| abstract_inverted_index.idea | 51 |
| abstract_inverted_index.more | 126 |
| abstract_inverted_index.pose | 5 |
| abstract_inverted_index.rich | 14 |
| abstract_inverted_index.such | 2 |
| abstract_inverted_index.this | 28, 83 |
| abstract_inverted_index.who, | 23 |
| abstract_inverted_index.wide | 41 |
| abstract_inverted_index.will | 162 |
| abstract_inverted_index.with | 91 |
| abstract_inverted_index.(LVU, | 140 |
| abstract_inverted_index.(such | 76 |
| abstract_inverted_index.audio | 151 |
| abstract_inverted_index.model | 36, 88, 119 |
| abstract_inverted_index.names | 114 |
| abstract_inverted_index.range | 42 |
| abstract_inverted_index.their | 13, 113 |
| abstract_inverted_index.these | 122 |
| abstract_inverted_index.using | 73, 93 |
| abstract_inverted_index.video | 9, 47, 104, 155, 158 |
| abstract_inverted_index.(video | 149 |
| abstract_inverted_index.across | 146 |
| abstract_inverted_index.either | 66 |
| abstract_inverted_index.input, | 106 |
| abstract_inverted_index.models | 75 |
| abstract_inverted_index.paper, | 29 |
| abstract_inverted_index.photos | 111 |
| abstract_inverted_index.plots, | 62 |
| abstract_inverted_index.public | 164 |
| abstract_inverted_index.solely | 101 |
| abstract_inverted_index.videos | 55, 92 |
| abstract_inverted_index.Through | 81 |
| abstract_inverted_index.address | 39 |
| abstract_inverted_index.demands | 21 |
| abstract_inverted_index.diverse | 20 |
| abstract_inverted_index.images, | 61 |
| abstract_inverted_index.instead | 99 |
| abstract_inverted_index.jointly | 108 |
| abstract_inverted_index.linking | 68 |
| abstract_inverted_index.movies, | 4 |
| abstract_inverted_index.offline | 74 |
| abstract_inverted_index.provide | 109 |
| abstract_inverted_index.relying | 102 |
| abstract_inverted_index.videos, | 1, 63 |
| abstract_inverted_index.whisper | 78 |
| abstract_inverted_index.MovieQA) | 145 |
| abstract_inverted_index.allowing | 117 |
| abstract_inverted_index.approach | 84 |
| abstract_inverted_index.contexts | 15 |
| abstract_inverted_index.datasets | 139 |
| abstract_inverted_index.elements | 123 |
| abstract_inverted_index.empowers | 85 |
| abstract_inverted_index.example, | 98 |
| abstract_inverted_index.external | 69 |
| abstract_inverted_index.generate | 125 |
| abstract_inverted_index.interact | 90 |
| abstract_inverted_index.language | 35, 87 |
| abstract_inverted_index.reason). | 26 |
| abstract_inverted_index.settings | 148 |
| abstract_inverted_index.validate | 134 |
| abstract_inverted_index.(identify | 22 |
| abstract_inverted_index.MovieSeq, | 32 |
| abstract_inverted_index.Movienet, | 142 |
| abstract_inverted_index.Narrative | 0 |
| abstract_inverted_index.alongside | 112 |
| abstract_inverted_index.associate | 121 |
| abstract_inverted_index.character | 110 |
| abstract_inverted_index.contexts. | 48 |
| abstract_inverted_index.databases | 71 |
| abstract_inverted_index.developed | 37 |
| abstract_inverted_index.introduce | 31 |
| abstract_inverted_index.knowledge | 70 |
| abstract_inverted_index.represent | 54 |
| abstract_inverted_index.sequences | 59 |
| abstract_inverted_index.(including | 60 |
| abstract_inverted_index.MovieSeq's | 135 |
| abstract_inverted_index.challenges | 7, 44 |
| abstract_inverted_index.dialogues, | 17, 116 |
| abstract_inverted_index.multimodal | 34, 58, 95 |
| abstract_inverted_index.responses. | 128 |
| abstract_inverted_index.retrieval, | 154 |
| abstract_inverted_index.video-text | 153 |
| abstract_inverted_index.captioning, | 156 |
| abstract_inverted_index.demonstrate | 130 |
| abstract_inverted_index.interleaved | 57, 94 |
| abstract_inverted_index.performance | 136 |
| abstract_inverted_index.significant | 6 |
| abstract_inverted_index.storylines) | 18 |
| abstract_inverted_index.subtitles), | 65 |
| abstract_inverted_index.subtitles). | 80 |
| abstract_inverted_index.(characters, | 16 |
| abstract_inverted_index.description, | 152 |
| abstract_inverted_index.comprehensive | 127 |
| abstract_inverted_index.instructions. | 96 |
| abstract_inverted_index.relationship, | 24 |
| abstract_inverted_index.understanding | 10, 46 |
| abstract_inverted_index.effectiveness, | 132 |
| abstract_inverted_index.classification, | 150 |
| abstract_inverted_index.instruction-tuning, | 82 |
| abstract_inverted_index.question-answering). | 159 |
| abstract_inverted_index.https://github.com/showlab/MovieSeq. | 166 |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 9 |
| citation_normalized_percentile | |
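
The `abstract_inverted_index.*` rows above store the abstract as a mapping from each token to the 0-based word positions where it occurs. A minimal sketch of how such a mapping can be turned back into running text is shown below; the function name `reconstruct_abstract` is an assumption for illustration, and the sample is trimmed to positions 0 through 7 of the rows above (later positions of repeated tokens are left out so the excerpt stays contiguous).

```python
from typing import Dict, List


def reconstruct_abstract(inverted_index: Dict[str, List[int]]) -> str:
    """Rebuild text from a mapping of token -> list of 0-based word positions."""
    by_position = {}
    for token, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = token
    # Join tokens in ascending position order to recover the original word order.
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Sample restricted to positions 0-7 of the payload rows above.
sample = {
    "Narrative": [0],
    "videos,": [1],
    "such": [2],
    "as": [3],
    "movies,": [4],
    "pose": [5],
    "significant": [6],
    "challenges": [7],
}

print(reconstruct_abstract(sample))
# Narrative videos, such as movies, pose significant challenges
```

Punctuation stays attached to its token (for example `videos,`), so joining on single spaces is enough to recover the sentence as displayed.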