Automated Activity Discovery and Object Detection With Computer Vision: Towards Unsupervised Learning for Breakfast to Surgery
Citation: Zhang, Michael. 2020. Automated Activity Discovery and Object Detection With Computer Vision: Towards Unsupervised Learning for Breakfast to Surgery. Bachelor's thesis, Harvard College.
Abstract: While deep learning in computer vision has enjoyed tremendous success in recent years, much progress remains to be made in important tasks. Many state-of-the-art vision models follow a supervised-learning paradigm, making them dependent on a large amount of labeled training data. This limits both their potential learning capabilities and their application domains. In this work, we present contributions to expand the state of the art in computer vision along both these dimensions, focusing on how to perceive both activities and objects in the video domain.
We first tackle the problem of unsupervised activity discovery, where given a collection of untrimmed and unlabeled videos, we wish to learn a semantically meaningful embedding of the data which can be used to segment the data into "discovered" activities. To expand the learning capabilities of computer vision models, we avoid the labor-intensive bottleneck of amassing a large annotated dataset and seek to learn without explicit training labels. We consider unsupervised activity discovery from the perspective of the inherently hierarchical nature of activities, e.g., a single complex activity may be modeled as a sequence of smaller sub-activities. Recognizing dependencies across this spectrum of complexities may therefore lead to greater video understanding.
Accordingly, we introduce a hyperbolic embedding representation for video data that simultaneously captures hierarchical and semantic relationships. While hyperbolic representations are motivated by their prior success in modeling explicitly hierarchical data found in language, here we show how to leverage them for the implicitly hierarchical nature of video data. We demonstrate that our hyperbolic video embedding approach learns representations that significantly outperform the previous state of the art for unsupervised activity segmentation on the Breakfast and 50Salads datasets, and that our hierarchical embeddings naturally allow discovery of activities at multiple levels of complexity.
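As a rough illustration of why hyperbolic space suits hierarchical data, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model the thesis uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made only for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Assumption: the Poincare ball model is used here for illustration only;
    the thesis abstract does not specify which hyperbolic model it adopts.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# The same Euclidean gap costs more distance near the boundary than near the
# origin, which leaves room for many "leaf" points far from a central "root".
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.95, 0.0])
```

In this model, distances grow rapidly toward the boundary of the ball, so general concepts can sit near the origin while many fine-grained ones spread out near the edge.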
We next consider the second challenge of expanding the application domains of state-of-the-art computer vision models. We focus on open, or non-laparoscopic, surgery, which represents the vast majority of all operating room procedures. Despite this prominence, few tools exist to objectively evaluate these techniques at scale, and current efforts involve human expert-based visual assessment. We therefore leverage a state-of-the-art convolutional neural network architecture for object detection to detect operating hands in open surgery videos. We expand automated assessment by combining model predictions with a fast object tracker to enable surgeon-specific hand tracking. To train our model, we used publicly available videos of open surgery from YouTube and annotated these with spatial bounding boxes of operating hands. Our model's spatial detections of operating hands significantly outperform the detections achieved using pre-existing hand-detection datasets, and allow for insights into intra-operative movement patterns and economy of motion.
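The abstract does not describe how detections are matched to tracked hands. As a hedged sketch of one common approach, the code below scores bounding-box overlap with intersection over union (IoU) and picks the detection that best overlaps a track's last known box. The function names, the `(x_min, y_min, x_max, y_max)` box layout, and the threshold value are all hypothetical and not taken from the thesis.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detection(track_box, detections, threshold=0.3):
    """Return the index of the detection that best overlaps track_box.

    Returns None when no detection reaches the (hypothetical) IoU threshold.
    """
    best_index, best_score = None, threshold
    for index, detection in enumerate(detections):
        score = iou(track_box, detection)
        if score >= best_score:
            best_index, best_score = index, score
    return best_index


# Usage: the second detection overlaps the tracked hand box, the first does not.
matched = match_detection((0, 0, 2, 2), [(5, 5, 6, 6), (0, 0, 2, 2.5)])
```

Greedy matching like this is simple but can mis-assign boxes when hands cross; it is shown only to make the detect-then-track idea concrete.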
Finally, we consider how to combine both advances in capability and application domain to focus on automatic activity discovery in surgery videos. Using our developed unsupervised learning algorithm, we demonstrate that we can discover multiple surgical activities in various operating procedures from just a collection of untrimmed and unlabeled YouTube surgery videos. We leverage a state-of-the-art deep learning architecture to extract base features, which we then encode into temporally and hierarchically aware hyperbolic embeddings. We can then discover an arbitrary number of activities based on our clustering and decoding procedure. Through qualitative evaluation we show how our segmentations align with changes in video activity over time. Based on the alignment of our video segmentations, we present further insights into and evidence for the hierarchical nature of activities. In all, our contributions present a step towards fully automated video activity understanding and discovery in real-world domains.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364735
- FAS Theses and Dissertations