LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING
The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Video understanding has therefore become one of the fundamental research topics in computer vision. Encouraged by the success of deep neural networks on image classification, many efforts have been made in recent years to extend deep networks to video understanding. However, new challenges arise when the temporal characteristic of videos is taken into account. In this dissertation, we study two long-standing problems that play important roles in effective temporal modeling in videos: (1) How can motion information be extracted from raw video frames? (2) How can long-range dependencies in time be captured and their temporal dynamics modeled?

To address these issues, we first introduce hierarchical contrastive motion learning, a novel self-supervised framework for extracting effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features, from low-level pixel movements to higher-level semantic dynamics, in a fully self-supervised manner.

Next, we investigate the self-attention mechanism for long-range temporal modeling and demonstrate that entangled modeling of spatio-temporal information fails to capture temporal relationships among frames explicitly. To this end, we propose Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix intended to encode temporal structures that generalize across different samples.

While the above methods significantly improve video action recognition, they are still restricted to modeling temporal information within short clips. To overcome this limitation, we introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. The proposed framework is end-to-end trainable and significantly improves video classification accuracy with negligible computational overhead.

Finally, we present a spatio-temporal progressive learning framework (STEP) for spatio-temporal action detection. Our approach performs a multi-step optimization process that progressively refines initial proposals towards the final solution. In this way, it can effectively exploit long-term temporal information by handling the spatial displacement problem in long action tubes.
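As an illustration of the decoupled attention idea described above, the following is a minimal sketch, not the dissertation's implementation, of GTA-style global temporal attention in PyTorch: spatial self-attention runs within each frame, and a single learned T × T matrix, shared across all samples, then mixes features across time. The class name `GlobalTemporalAttention`, the tensor layout, and all hyperparameters are assumptions made for illustration.

```python
# Sketch of decoupled spatial attention + global temporal attention (GTA-style).
# Illustrative only, under assumed shapes; not the dissertation's actual code.
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Per-frame spatial self-attention followed by a learned, input-independent
    T x T temporal attention matrix shared across all samples (hypothetical layer)."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global temporal attention: learned logits, not computed from queries/keys.
        self.temporal_logits = nn.Parameter(torch.zeros(num_frames, num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- spatial tokens per frame.
        b, t, n, d = x.shape
        # 1) Spatial attention within each frame (frames folded into the batch).
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # 2) Global temporal attention: the same learned T x T matrix for every sample.
        attn = torch.softmax(self.temporal_logits, dim=-1)   # (t, t)
        x = torch.einsum("st,btnd->bsnd", attn, x)           # mix features across time
        return x

if __name__ == "__main__":
    layer = GlobalTemporalAttention(dim=64, num_frames=8)
    clip = torch.randn(2, 8, 49, 64)  # 2 clips, 8 frames, 7x7 spatial tokens, 64-d features
    print(layer(clip).shape)          # torch.Size([2, 8, 49, 64])
```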
- Type: dissertation
- Language: en
- Landing Page: https://drum.lib.umd.edu/handle/1903/27783
- OA Status: green
- Related Works: 3
- OpenAlex ID: https://openalex.org/W3198981841
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W3198981841 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.13016/g5fj-kgxo (Digital Object Identifier)
- Title: LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING (work title)
- Type: dissertation (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2021 (year of publication)
- Publication date: 2021-01-01 (full publication date if available)
- Authors: Xitong Yang (list of authors in order)
- Landing page: https://drum.lib.umd.edu/handle/1903/27783 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://doi.org/10.13016/g5fj-kgxo (direct OA link when available)
- Concepts: Term (time), Action (physics), Computer science, Physics, Quantum mechanics (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 3 (other works algorithmically related by OpenAlex)
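The record above can also be retrieved programmatically. Below is a small sketch using the public OpenAlex API endpoint `https://api.openalex.org/works/{id}` and the `requests` package (an assumption about tooling, not something this page prescribes); the field names match those listed in the full payload below.

```python
# Sketch: fetch this work's metadata from the public OpenAlex API.
# Assumes the `requests` package is installed; field names follow the payload below.
import requests

WORK_ID = "W3198981841"

resp = requests.get(f"https://api.openalex.org/works/{WORK_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])              # LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING
print(work["type"], work["language"])    # dissertation en
print(work["publication_year"])          # 2021
print(work["open_access"]["oa_status"])  # green
print(work["open_access"]["oa_url"])     # https://doi.org/10.13016/g5fj-kgxo
for auth in work["authorships"]:
    print(auth["author"]["display_name"])  # Xitong Yang
```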
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W3198981841 |
| doi | https://doi.org/10.13016/g5fj-kgxo |
| ids.doi | https://doi.org/10.13016/g5fj-kgxo |
| ids.mag | 3198981841 |
| ids.openalex | https://openalex.org/W3198981841 |
| fwci | |
| type | dissertation |
| title | LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T12720 |
| topics[0].field.id | https://openalex.org/fields/33 |
| topics[0].field.display_name | Social Sciences |
| topics[0].score | 0.9366999864578247 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3312 |
| topics[0].subfield.display_name | Sociology and Political Science |
| topics[0].display_name | Multimedia Communication and Technology |
| topics[1].id | https://openalex.org/T11439 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9254000186920166 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Video Analysis and Summarization |
| topics[2].id | https://openalex.org/T11165 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9215999841690063 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Image and Video Quality Assessment |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C61797465 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7061080932617188 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1188986 |
| concepts[0].display_name | Term (time) |
| concepts[1].id | https://openalex.org/C2780791683 |
| concepts[1].level | 2 |
| concepts[1].score | 0.4950477182865143 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q846785 |
| concepts[1].display_name | Action (physics) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.43856024742126465 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C121332964 |
| concepts[3].level | 0 |
| concepts[3].score | 0.08814746141433716 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[3].display_name | Physics |
| concepts[4].id | https://openalex.org/C62520636 |
| concepts[4].level | 1 |
| concepts[4].score | 0.0 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[4].display_name | Quantum mechanics |
| keywords[0].id | https://openalex.org/keywords/term |
| keywords[0].score | 0.7061080932617188 |
| keywords[0].display_name | Term (time) |
| keywords[1].id | https://openalex.org/keywords/action |
| keywords[1].score | 0.4950477182865143 |
| keywords[1].display_name | Action (physics) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.43856024742126465 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/physics |
| keywords[3].score | 0.08814746141433716 |
| keywords[3].display_name | Physics |
| language | en |
| locations[0].id | mag:3198981841 |
| locations[0].is_oa | False |
| locations[0].source | |
| locations[0].license | |
| locations[0].pdf_url | |
| locations[0].version | |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://drum.lib.umd.edu/handle/1903/27783 |
| locations[1].id | doi:10.13016/g5fj-kgxo |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306402644 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | False |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | Digital Repository at the University of Maryland (University of Maryland College Park) |
| locations[1].source.host_organization | https://openalex.org/I66946132 |
| locations[1].source.host_organization_name | University of Maryland, College Park |
| locations[1].source.host_organization_lineage | https://openalex.org/I66946132 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | thesis |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.13016/g5fj-kgxo |
| indexed_in | datacite |
| authorships[0].author.id | https://openalex.org/A5091064356 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4372-241X |
| authorships[0].author.display_name | Xitong Yang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Xitong Yang |
| authorships[0].is_corresponding | True |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.13016/g5fj-kgxo |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T12720 |
| primary_topic.field.id | https://openalex.org/fields/33 |
| primary_topic.field.display_name | Social Sciences |
| primary_topic.score | 0.9366999864578247 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3312 |
| primary_topic.subfield.display_name | Sociology and Political Science |
| primary_topic.display_name | Multimedia Communication and Technology |
| related_works | https://openalex.org/W3037564206, https://openalex.org/W2280866249, https://openalex.org/W3097484674 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | doi:10.13016/g5fj-kgxo |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306402644 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | False |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | Digital Repository at the University of Maryland (University of Maryland College Park) |
| best_oa_location.source.host_organization | https://openalex.org/I66946132 |
| best_oa_location.source.host_organization_name | University of Maryland, College Park |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I66946132 |
| best_oa_location.license | |
| best_oa_location.pdf_url | |
| best_oa_location.version | |
| best_oa_location.raw_type | thesis |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.13016/g5fj-kgxo |
| primary_location.id | mag:3198981841 |
| primary_location.is_oa | False |
| primary_location.source | |
| primary_location.license | |
| primary_location.pdf_url | |
| primary_location.version | |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://drum.lib.umd.edu/handle/1903/27783 |
| publication_date | 2021-01-01 |
| publication_year | 2021 |
| referenced_works_count | 0 |
| abstract_inverted_index | Word-to-position index of the abstract (duplicates the abstract text reproduced above; a decoding sketch follows the table) |
| cited_by_percentile_year | |
| corresponding_author_ids | https://openalex.org/A5091064356 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 1 |
| citation_normalized_percentile |
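The `abstract_inverted_index` field summarized in the payload maps each word of the abstract to the token positions at which it occurs. Below is a minimal sketch of how such an index can be decoded back into plain text; the `decode_abstract` helper and the tiny example dictionary are illustrative, not part of the OpenAlex API.

```python
# Sketch: reconstruct an abstract from an OpenAlex-style inverted index
# (word -> list of token positions). `inverted_index` would normally come from
# the work payload's `abstract_inverted_index` field, e.g. via the API call above.
def decode_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    # Order the words by position and join them back into a single string.
    return " ".join(positions[i] for i in sorted(positions))

# Tiny hypothetical example (not the full index, which spans the entire abstract above):
example = {"The": [0], "tremendous": [1], "growth": [2], "in": [3], "video": [4], "data,": [5]}
print(decode_abstract(example))  # The tremendous growth in video data,
```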