Publication:
Video Representation Learning via Actons

Date

2022-05-25

Citation

Lam, Darius. 2022. Video Representation Learning via Actons. Bachelor's thesis, Harvard College.

Abstract

In this paper, we propose a novel method for dense video representation learning. Our method learns compressed frame representations that we call “actons”. By extracting actons, we strike a middle ground between expressive but computationally demanding frame-wise features and low-information whole-video features. Our model consists of two-branch processing using a VQ-Codebook and a Transformer encoder trained with the masked-language-modeling objective. To fit within the scope of this work, we extract small-scale datasets from TinyVIRAT and UCF101, both action recognition datasets, which we use to evaluate our methods. We find that our acton representations are far shorter than the original videos, reaching compression ratios of up to 18x, while remaining more expressive than frame-wise features. We also find that fine-tuning on our representations achieves better test-set accuracy on action classification than a C3D baseline.
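
The abstract names the components but not their wiring, so below is a minimal PyTorch sketch of how such a two-branch model could fit together. Every class name, dimension, the 15% masking rate, and the straight-through gradient estimator are illustrative assumptions, not the thesis's actual implementation.

    # Hedged sketch of a VQ-Codebook + masked-modeling Transformer over frame
    # features. All hyperparameters and module names are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VQCodebook(nn.Module):
        """Quantizes per-frame features to the nearest codebook entry."""
        def __init__(self, num_codes=512, dim=256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):                          # z: (batch, frames, dim)
            d = torch.cdist(z, self.codebook.weight)   # distance to each code
            idx = d.argmin(dim=-1)                     # nearest-code indices
            q = self.codebook(idx)                     # quantized features
            q = z + (q - z).detach()                   # straight-through gradient
            return q, idx

    class ActonModel(nn.Module):
        """Projection -> VQ branch + Transformer branch trained by predicting
        the code indices of masked frames (masked-language-modeling style)."""
        def __init__(self, feat_dim=512, dim=256, num_codes=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, dim)
            self.vq = VQCodebook(num_codes, dim)
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(dim, num_codes)      # predicts code indices

        def forward(self, frames, mask_prob=0.15):     # frames: (B, T, feat_dim)
            z = self.proj(frames)
            q, idx = self.vq(z)
            mask = torch.rand(idx.shape, device=idx.device) < mask_prob
            q_in = q.masked_fill(mask.unsqueeze(-1), 0.0)  # zero masked frames
            logits = self.head(self.encoder(q_in))
            loss = F.cross_entropy(logits[mask], idx[mask])
            return loss, idx                           # idx = acton sequence

    model = ActonModel()
    frames = torch.randn(2, 64, 512)  # e.g. precomputed per-frame features
    loss, actons = model(frames)
    loss.backward()

In this sketch the sequence of nearest-code indices stands in for the acton sequence; presumably, collapsing runs of repeated indices is what would yield compression ratios like the 18x the abstract reports, though the thesis may derive them differently.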

Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
