LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING
The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Video understanding has therefore become one of the fundamental research topics in computer vision. Encouraged by the success of deep neural networks on image classification, many efforts have been made in recent years to extend deep networks to video understanding. However, new challenges arise when the temporal characteristic of videos is taken into account. In this dissertation, we study two long-standing problems that play important roles in effective temporal modeling in videos: (1) How can motion information be extracted from raw video frames? (2) How can long-range dependencies in time be captured and their temporal dynamics modeled?

To address these issues, we first introduce hierarchical contrastive motion learning, a novel self-supervised framework for extracting effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features, from low-level pixel movements to higher-level semantic dynamics, in a fully self-supervised manner.

Next, we investigate the self-attention mechanism for long-range temporal modeling and demonstrate that entangled modeling of spatio-temporal information fails to capture temporal relationships among frames explicitly. To this end, we propose Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA directly learns a global attention matrix intended to encode temporal structures that generalize across different samples.

While the above methods significantly improve video action recognition, they are still restricted to modeling temporal information within short clips. To overcome this limitation, we introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. The proposed framework is end-to-end trainable and significantly improves video classification accuracy with negligible computational overhead.

Finally, we present a spatio-temporal progressive learning framework (STEP) for spatio-temporal action detection. Our approach performs a multi-step optimization process that progressively refines initial proposals towards the final solution. In this way, it can effectively exploit long-term temporal information by handling the spatial displacement problem in long action tubes.
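As an illustration of the decoupled attention idea described above, the following is a minimal sketch, not the dissertation's implementation, of GTA-style global temporal attention in PyTorch: spatial self-attention runs within each frame, and a single learned T × T matrix, shared across all samples, then mixes features across time. The class name `GlobalTemporalAttention`, the tensor layout, and all hyperparameters are assumptions made for illustration.

```python
# Sketch of decoupled spatial attention + global temporal attention (GTA-style).
# Illustrative only, under assumed shapes; not the dissertation's actual code.
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Per-frame spatial self-attention followed by a learned, input-independent
    T x T temporal attention matrix shared across all samples (hypothetical layer)."""

    def __init__(self, dim: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global temporal attention: learned logits, not computed from queries/keys.
        self.temporal_logits = nn.Parameter(torch.zeros(num_frames, num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- spatial tokens per frame.
        b, t, n, d = x.shape
        # 1) Spatial attention within each frame (frames folded into the batch).
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # 2) Global temporal attention: the same learned T x T matrix for every sample.
        attn = torch.softmax(self.temporal_logits, dim=-1)   # (t, t)
        x = torch.einsum("st,btnd->bsnd", attn, x)           # mix features across time
        return x

if __name__ == "__main__":
    layer = GlobalTemporalAttention(dim=64, num_frames=8)
    clip = torch.randn(2, 8, 49, 64)  # 2 clips, 8 frames, 7x7 spatial tokens, 64-d features
    print(layer(clip).shape)          # torch.Size([2, 8, 49, 64])
```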
- Type: dissertation
- Language: en
- Landing Page: https://drum.lib.umd.edu/handle/1903/27783
- OA Status: green
- Related Works: 3
- OpenAlex ID: https://openalex.org/W3198981841
Raw OpenAlex JSON
- OpenAlex ID: https://openalex.org/W3198981841 (canonical identifier for this work in OpenAlex)
- DOI: https://doi.org/10.13016/g5fj-kgxo (Digital Object Identifier)
- Title: LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING (work title)
- Type: dissertation (OpenAlex work type)
- Language: en (primary language)
- Publication year: 2021 (year of publication)
- Publication date: 2021-01-01 (full publication date if available)
- Authors: Xitong Yang (list of authors in order)
- Landing page: https://drum.lib.umd.edu/handle/1903/27783 (publisher landing page)
- Open access: Yes (whether a free full text is available)
- OA status: green (open access status per OpenAlex)
- OA URL: https://doi.org/10.13016/g5fj-kgxo (direct OA link when available)
- Concepts: Term (time), Action (physics), Computer science, Physics, Quantum mechanics (top concepts attached by OpenAlex)
- Cited by: 0 (total citation count in OpenAlex)
- Related works (count): 3 (other works algorithmically related by OpenAlex)
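The record above can also be retrieved programmatically. Below is a small sketch using the public OpenAlex API endpoint `https://api.openalex.org/works/{id}` and the `requests` package (an assumption about tooling, not something this page prescribes); the field names match those listed in the full payload below.

```python
# Sketch: fetch this work's metadata from the public OpenAlex API.
# Assumes the `requests` package is installed; field names follow the payload below.
import requests

WORK_ID = "W3198981841"

resp = requests.get(f"https://api.openalex.org/works/{WORK_ID}", timeout=30)
resp.raise_for_status()
work = resp.json()

print(work["display_name"])              # LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING
print(work["type"], work["language"])    # dissertation en
print(work["publication_year"])          # 2021
print(work["open_access"]["oa_status"])  # green
print(work["open_access"]["oa_url"])     # https://doi.org/10.13016/g5fj-kgxo
for auth in work["authorships"]:
    print(auth["author"]["display_name"])  # Xitong Yang
```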
Full payload
| Field | Value |
|---|---|
| id | https://openalex.org/W3198981841 |
| doi | https://doi.org/10.13016/g5fj-kgxo |
| ids.doi | https://doi.org/10.13016/g5fj-kgxo |
| ids.mag | 3198981841 |
| ids.openalex | https://openalex.org/W3198981841 |
| fwci | |
| type | dissertation |
| title | LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING |
| biblio.issue | |
| biblio.volume | |
| biblio.last_page | |
| biblio.first_page | |
| topics[0].id | https://openalex.org/T12720 |
| topics[0].field.id | https://openalex.org/fields/33 |
| topics[0].field.display_name | Social Sciences |
| topics[0].score | 0.9366999864578247 |
| topics[0].domain.id | https://openalex.org/domains/2 |
| topics[0].domain.display_name | Social Sciences |
| topics[0].subfield.id | https://openalex.org/subfields/3312 |
| topics[0].subfield.display_name | Sociology and Political Science |
| topics[0].display_name | Multimedia Communication and Technology |
| topics[1].id | https://openalex.org/T11439 |
| topics[1].field.id | https://openalex.org/fields/17 |
| topics[1].field.display_name | Computer Science |
| topics[1].score | 0.9254000186920166 |
| topics[1].domain.id | https://openalex.org/domains/3 |
| topics[1].domain.display_name | Physical Sciences |
| topics[1].subfield.id | https://openalex.org/subfields/1707 |
| topics[1].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[1].display_name | Video Analysis and Summarization |
| topics[2].id | https://openalex.org/T11165 |
| topics[2].field.id | https://openalex.org/fields/17 |
| topics[2].field.display_name | Computer Science |
| topics[2].score | 0.9215999841690063 |
| topics[2].domain.id | https://openalex.org/domains/3 |
| topics[2].domain.display_name | Physical Sciences |
| topics[2].subfield.id | https://openalex.org/subfields/1707 |
| topics[2].subfield.display_name | Computer Vision and Pattern Recognition |
| topics[2].display_name | Image and Video Quality Assessment |
| is_xpac | False |
| apc_list | |
| apc_paid | |
| concepts[0].id | https://openalex.org/C61797465 |
| concepts[0].level | 2 |
| concepts[0].score | 0.7061080932617188 |
| concepts[0].wikidata | https://www.wikidata.org/wiki/Q1188986 |
| concepts[0].display_name | Term (time) |
| concepts[1].id | https://openalex.org/C2780791683 |
| concepts[1].level | 2 |
| concepts[1].score | 0.4950477182865143 |
| concepts[1].wikidata | https://www.wikidata.org/wiki/Q846785 |
| concepts[1].display_name | Action (physics) |
| concepts[2].id | https://openalex.org/C41008148 |
| concepts[2].level | 0 |
| concepts[2].score | 0.43856024742126465 |
| concepts[2].wikidata | https://www.wikidata.org/wiki/Q21198 |
| concepts[2].display_name | Computer science |
| concepts[3].id | https://openalex.org/C121332964 |
| concepts[3].level | 0 |
| concepts[3].score | 0.08814746141433716 |
| concepts[3].wikidata | https://www.wikidata.org/wiki/Q413 |
| concepts[3].display_name | Physics |
| concepts[4].id | https://openalex.org/C62520636 |
| concepts[4].level | 1 |
| concepts[4].score | 0.0 |
| concepts[4].wikidata | https://www.wikidata.org/wiki/Q944 |
| concepts[4].display_name | Quantum mechanics |
| keywords[0].id | https://openalex.org/keywords/term |
| keywords[0].score | 0.7061080932617188 |
| keywords[0].display_name | Term (time) |
| keywords[1].id | https://openalex.org/keywords/action |
| keywords[1].score | 0.4950477182865143 |
| keywords[1].display_name | Action (physics) |
| keywords[2].id | https://openalex.org/keywords/computer-science |
| keywords[2].score | 0.43856024742126465 |
| keywords[2].display_name | Computer science |
| keywords[3].id | https://openalex.org/keywords/physics |
| keywords[3].score | 0.08814746141433716 |
| keywords[3].display_name | Physics |
| language | en |
| locations[0].id | mag:3198981841 |
| locations[0].is_oa | False |
| locations[0].source | |
| locations[0].license | |
| locations[0].pdf_url | |
| locations[0].version | |
| locations[0].raw_type | |
| locations[0].license_id | |
| locations[0].is_accepted | False |
| locations[0].is_published | |
| locations[0].raw_source_name | |
| locations[0].landing_page_url | https://drum.lib.umd.edu/handle/1903/27783 |
| locations[1].id | doi:10.13016/g5fj-kgxo |
| locations[1].is_oa | True |
| locations[1].source.id | https://openalex.org/S4306402644 |
| locations[1].source.issn | |
| locations[1].source.type | repository |
| locations[1].source.is_oa | False |
| locations[1].source.issn_l | |
| locations[1].source.is_core | False |
| locations[1].source.is_in_doaj | False |
| locations[1].source.display_name | Digital Repository at the University of Maryland (University of Maryland College Park) |
| locations[1].source.host_organization | https://openalex.org/I66946132 |
| locations[1].source.host_organization_name | University of Maryland, College Park |
| locations[1].source.host_organization_lineage | https://openalex.org/I66946132 |
| locations[1].license | |
| locations[1].pdf_url | |
| locations[1].version | |
| locations[1].raw_type | thesis |
| locations[1].license_id | |
| locations[1].is_accepted | False |
| locations[1].is_published | |
| locations[1].raw_source_name | |
| locations[1].landing_page_url | https://doi.org/10.13016/g5fj-kgxo |
| indexed_in | datacite |
| authorships[0].author.id | https://openalex.org/A5091064356 |
| authorships[0].author.orcid | https://orcid.org/0000-0003-4372-241X |
| authorships[0].author.display_name | Xitong Yang |
| authorships[0].author_position | first |
| authorships[0].raw_author_name | Xitong Yang |
| authorships[0].is_corresponding | True |
| has_content.pdf | False |
| has_content.grobid_xml | False |
| is_paratext | False |
| open_access.is_oa | True |
| open_access.oa_url | https://doi.org/10.13016/g5fj-kgxo |
| open_access.oa_status | green |
| open_access.any_repository_has_fulltext | False |
| created_date | 2025-10-10T00:00:00 |
| display_name | LONG-TERM TEMPORAL MODELING FOR VIDEO ACTION UNDERSTANDING |
| has_fulltext | False |
| is_retracted | False |
| updated_date | 2025-11-06T06:51:31.235846 |
| primary_topic.id | https://openalex.org/T12720 |
| primary_topic.field.id | https://openalex.org/fields/33 |
| primary_topic.field.display_name | Social Sciences |
| primary_topic.score | 0.9366999864578247 |
| primary_topic.domain.id | https://openalex.org/domains/2 |
| primary_topic.domain.display_name | Social Sciences |
| primary_topic.subfield.id | https://openalex.org/subfields/3312 |
| primary_topic.subfield.display_name | Sociology and Political Science |
| primary_topic.display_name | Multimedia Communication and Technology |
| related_works | https://openalex.org/W3037564206, https://openalex.org/W2280866249, https://openalex.org/W3097484674 |
| cited_by_count | 0 |
| locations_count | 2 |
| best_oa_location.id | doi:10.13016/g5fj-kgxo |
| best_oa_location.is_oa | True |
| best_oa_location.source.id | https://openalex.org/S4306402644 |
| best_oa_location.source.issn | |
| best_oa_location.source.type | repository |
| best_oa_location.source.is_oa | False |
| best_oa_location.source.issn_l | |
| best_oa_location.source.is_core | False |
| best_oa_location.source.is_in_doaj | False |
| best_oa_location.source.display_name | Digital Repository at the University of Maryland (University of Maryland College Park) |
| best_oa_location.source.host_organization | https://openalex.org/I66946132 |
| best_oa_location.source.host_organization_name | University of Maryland, College Park |
| best_oa_location.source.host_organization_lineage | https://openalex.org/I66946132 |
| best_oa_location.license | |
| best_oa_location.pdf_url | |
| best_oa_location.version | |
| best_oa_location.raw_type | thesis |
| best_oa_location.license_id | |
| best_oa_location.is_accepted | False |
| best_oa_location.is_published | False |
| best_oa_location.raw_source_name | |
| best_oa_location.landing_page_url | https://doi.org/10.13016/g5fj-kgxo |
| primary_location.id | mag:3198981841 |
| primary_location.is_oa | False |
| primary_location.source | |
| primary_location.license | |
| primary_location.pdf_url | |
| primary_location.version | |
| primary_location.raw_type | |
| primary_location.license_id | |
| primary_location.is_accepted | False |
| primary_location.is_published | False |
| primary_location.raw_source_name | |
| primary_location.landing_page_url | https://drum.lib.umd.edu/handle/1903/27783 |
| publication_date | 2021-01-01 |
| publication_year | 2021 |
| referenced_works_count | 0 |
| abstract_inverted_index | Word-to-position index of the abstract (duplicates the abstract text reproduced above; a decoding sketch follows the table) |
| cited_by_percentile_year | |
| corresponding_author_ids | https://openalex.org/A5091064356 |
| countries_distinct_count | 0 |
| institutions_distinct_count | 1 |
| citation_normalized_percentile |
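The `abstract_inverted_index` field summarized in the payload maps each word of the abstract to the token positions at which it occurs. Below is a minimal sketch of how such an index can be decoded back into plain text; the `decode_abstract` helper and the tiny example dictionary are illustrative, not part of the OpenAlex API.

```python
# Sketch: reconstruct an abstract from an OpenAlex-style inverted index
# (word -> list of token positions). `inverted_index` would normally come from
# the work payload's `abstract_inverted_index` field, e.g. via the API call above.
def decode_abstract(inverted_index: dict[str, list[int]]) -> str:
    positions = {}
    for word, idxs in inverted_index.items():
        for i in idxs:
            positions[i] = word
    # Order the words by position and join them back into a single string.
    return " ".join(positions[i] for i in sorted(positions))

# Tiny hypothetical example (not the full index, which spans the entire abstract above):
example = {"The": [0], "tremendous": [1], "growth": [2], "in": [3], "video": [4], "data,": [5]}
print(decode_abstract(example))  # The tremendous growth in video data,
```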