Progress-Aware Video Frame Captioning
2024 · Open Access · DOI: https://doi.org/10.48550/arxiv.2412.02071
While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.
Overview
- Type: preprint
- Language: en
- Landing page: http://arxiv.org/abs/2412.02071
- PDF: https://arxiv.org/pdf/2412.02071
- OA status: green
- Related works: 10
- OpenAlex ID: https://openalex.org/W4405037057
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W4405037057 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.48550/arxiv.2412.02071 (Digital Object Identifier)
- Title: Progress-Aware Video Frame Captioning (work title)
- Type: preprint (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2024
- Publication date: 2024-12-03
- Authors: Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman (in order)
- Landing page: https://arxiv.org/abs/2412.02071 (publisher landing page)
- PDF URL: https://arxiv.org/pdf/2412.02071 (direct link to full-text PDF)
- Open access: Yes (a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://arxiv.org/pdf/2412.02071 (direct OA link)
- Concepts: Closed captioning, Frame (networking), Computer science, Multimedia, Artificial intelligence, Telecommunications, Image (mathematics) (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works: 10 (other works algorithmically related by OpenAlex)
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W4405037057 |
| doi | https://doi.org/10.48550/arxiv.2412.02071 |
| ids.doi | https://doi.org/10.48550/arxiv.2412.02071 |
| ids.openalex | https://openalex.org/W4405037057 |
| fwci | |
| type | preprint |
| title | Progress-Aware Video Frame Captioning |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T11439 |
| topics[0].field.id | https://openalex.org/fields/17 |
| topics[0].field.display_name | Computer Science |
| topics[0].score | 0.9987999796867371 |
| topics[0].domain.id | https://openalex.org/domains/3 |
| topics[0].domain.display_name | Physical Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/1707 |
| topics[0].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[0].display_name | Video Analysis and Summarization |
| topics[1].id | https://openalex.org/T10531 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9983000159263611 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Advanced Vision and Imaging |
| topics[2].id | https://openalex.org/T10812 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9937000274658203 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Human Pose and Action Recognition |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C157657479 |
| concepts[0].level | 3 |
| concepts[0].score | 0.9478937983512878 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q2367247 |
| concepts[0].display_name | Closed captioning |
| concepts[1].id | https://openalex.org/C126042441 |
| concepts[1].level | 2 |
| concepts[1].score | 0.718812108039856 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q1324888 |
| concepts[1].display_name | Frame (networking) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.6333023905754089 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C49774154 |
| concepts[3].level | 1 |
| concepts[3].score | 0.3750544488430023 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q131765 |
| concepts[3].display_name | Multimedia |
| concepts[4].id | https://openalex.org/C154945302 |
| concepts[4].level | 1 |
| concepts[4].score | 0.25773757696151733 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q11660 |
| concepts[4].display_name | Artificial intelligence |
| concepts[5].id | https://openalex.org/C76155785 |
| concepts[5].level | 1 |
| concepts[5].score | 0.24631640315055847 |
| concepts[5].wikidata | https://www.wikidata.org/wiki/Q418 |
| concepts[5].display_name | Telecommunications |
| concepts[6].id | https://openalex.org/C115961682 |
| concepts[6].level | 2 |
| concepts[6].score | 0.10270115733146667 |
| concepts[6].wikidata | https://www.wikidata.org/wiki/Q860623 |
| concepts[6].display_name | Image (mathematics) |
| keywords[0].id | https://openalex.org/keywords/closed-captioning |
| keywords[0].score | 0.9478937983512878 |
| keywords[0].display_name | Closed captioning |
| keywords[1].id | https://openalex.org/keywords/frame |
| keywords[1].score | 0.718812108039856 |
| keywords[1].display_name | Frame (networking) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.6333023905754089 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/multimedia |
| keywords[3].score | 0.3750544488430023 |
| keywords[3].display_name | Multimedia |
| keywords[4].id | https://openalex.org/keywords/artificial-intelligence |
| keywords[4].score | 0.25773757696151733 |
| keywords[4].display_name | Artificial intelligence |
| keywords[5].id | https://openalex.org/keywords/telecommunications |
| keywords[5].score | 0.24631640315055847 |
| keywords[5].display_name | Telecommunications |
| keywords[6].id | https://openalex.org/keywords/image |
| keywords[6].score | 0.10270115733146667 |
| keywords[6].display_name | Image (mathematics) |
| language | en |
| locations[0].id | pmh:oai:arXiv.org:2412.02071 |
| locations[0].is_oa | True |
| locations[0].source.id | https://openalex.org/S4306400194 |
| locations[0].source.issn | |
| locations[0].source.type | repository |
| locations[0].source.is_oa | True |
| locations[0].source.issn_l | |
| locations[0].source.is_core | False |
| locations[0].source.is_in_doaj | False |
| locations[0].source.display_name | arXiv (Cornell University) |
| locations[0].source.host_organization | https://openalex.org/I205783295 |
| locations[0].source.host_organization_name | Cornell University |
| locations[0].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[0].license | |
| locations[0].pdf_url | https://arxiv.org/pdf/2412.02071 |
| locations[0].version | submittedVersion |
| locations[0].raw_type | text |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | False |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | http://arxiv.org/abs/2412.02071 |
| locations[1].id | doi:10.48550/arxiv.2412.02071 |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306400194 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | True |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | arXiv (Cornell University) |
| locations[1].source.host_organization | https://openalex.org/I205783295 |
| locations[1].source.host_organization_name | Cornell University |
| locations[1].source.host_organization_lineage | https://openalex.org/I205783295 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | article |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.48550/arxiv.2412.02071 |
| indexed_in | arxiv, datacite |
| authorships[0].author.id | https://openalex.org/A5059038346 |
| authorships[0].author.orcid | https://orcid.org/0000-0001-7394-5169 |
| authorships[0].author.display_name | Zihui Xue |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Xue, Zihui |
| authorships[0].is_corresponding | False |
| authorships[1].author.id | https://openalex.org/A5050585004 |
| authorships[1].author.orcid | https://orcid.org/0000-0003-0418-900X |
| authorships[1].author.display_name | Joungbin An |
| authorships[1].author_position | middle |
| authorships[1].raw_author_name | An, Joungbin |
| authorships[1].is_corresponding | False |
| authorships[2].author.id | https://openalex.org/A5091064356 |
| authorships[2].author.orcid | https://orcid.org/0000-0003-4372-241X |
| authorships[2].author.display_name | Xitong Yang |
| authorships[2].author_position | middle |
| authorships[2].raw_author_name | Yang, Xitong |
| authorships[2].is_corresponding | False |
| authorships[3].author.id | https://openalex.org/A5012765543 |
| authorships[3].author.orcid | https://orcid.org/0000-0002-9591-5873 |
| authorships[3].author.display_name | Kristen Grauman |
| authorships[3].author_position | last |
| authorships[3].raw_author_name | Grauman, Kristen |
| authorships[3].is_corresponding | False |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://arxiv.org/pdf/2412.02071 |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | Progress-Aware Video Frame Captioning |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T11439 |
| primary_topic.field.id | https://openalex.org/fields/17 |
| primary_topic.field.display_name | Computer Science |
| primary_topic.score | 0.9987999796867371 |
| primary_topic.domain.id | https://openalex.org/domains/3 |
| primary_topic.domain.display_name | Physical Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/1707 |
| primary_topic.subfield.display_name | Computer Vision and Pattern Recognition |
| primary_topic.display_name | Video Analysis and Summarization |
| related_works | https://openalex.org/W4391375266, https://openalex.org/W2899084033, https://openalex.org/W2748952813, https://openalex.org/W4210416330, https://openalex.org/W2775506363, https://openalex.org/W3088136942, https://openalex.org/W2963177403, https://openalex.org/W4290852288, https://openalex.org/W2949362007, https://openalex.org/W4283207562 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | pmh:oai:arXiv.org:2412.02071 |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306400194 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | True |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | arXiv (Cornell University) |
| best_oa_location.source.host_organization | https://openalex.org/I205783295 |
| best_oa_location.source.host_organization_name | Cornell University |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| best_oa_location.license | |
| best_oa_location.pdf_url | https://arxiv.org/pdf/2412.02071 |
| best_oa_location.version | submittedVersion |
| best_oa_location.raw_type | text |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | http://arxiv.org/abs/2412.02071 |
| primary_location.id | pmh:oai:arXiv.org:2412.02071 |
| primary_location.is_oa | True |
| primary_location.source.id | https://openalex.org/S4306400194 |
| primary_location.source.issn | |
| primary_location.source.type | repository |
| primary_location.source.is_oa | True |
| primary_location.source.issn_l | |
| primary_location.source.is_core | False |
| primary_location.source.is_in_doaj | False |
| primary_location.source.display_name | arXiv (Cornell University) |
| primary_location.source.host_organization | https://openalex.org/I205783295 |
| primary_location.source.host_organization_name | Cornell University |
| primary_location.source.host_organization_lineage | https://openalex.org/I205783295 |
| primary_location.license | |
| primary_location.pdf_url | https://arxiv.org/pdf/2412.02071 |
| primary_location.version | submittedVersion |
| primary_location.raw_type | text |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | http://arxiv.org/abs/2412.02071 |
| publication_date | 2024-12-03 |
| publication_year | 2024 |
| referenced_works_count | 0 |
| abstract_inverted_index | (omitted: token-to-position map duplicating the abstract above) |
| cited_by_percentile_year | |
| countries_distinct_count | 0 |
| institutions_distinct_count | 4 |
| citation_normalized_percentile | |
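OpenAlex does not return abstracts as plain text; the `abstract_inverted_index` field maps each token to the list of word positions where it occurs. A minimal sketch of turning that structure back into readable text (a toy index is used here for illustration):

```python
def reconstruct_abstract(inverted_index: dict) -> str:
    """Rebuild a plain-text abstract from an OpenAlex abstract_inverted_index.

    The index maps token -> list of 0-based word positions; inverting it
    and sorting by position recovers the original word order.
    """
    positions = {}  # position -> token
    for token, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = token
    return " ".join(positions[i] for i in sorted(positions))

# Toy example with the same shape as the real field:
toy = {"While": [0], "image": [1], "captioning": [2, 4], "and": [3]}
print(reconstruct_abstract(toy))  # While image captioning and captioning
```

Applied to the full `abstract_inverted_index` of this record, the function yields the abstract shown at the top of this page.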