Publication:
Multimodal Sparse Representation Learning and Cross-Modal Synthesis

Date

2019-05-16

Citation

Cha, Miriam. 2019. Multimodal Sparse Representation Learning and Cross-Modal Synthesis. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Abstract

Humans have a natural ability to process and relate concurrent sensations across different sensory modalities such as vision, hearing, smell, and taste. For artificial intelligence to be more human-like in its capabilities, it needs to be able to interpret and translate multimodal information. However, multimodal data are heterogeneous, and the relationships between modalities are often complex. For example, there are many correct ways to draw an image from a text description, and likewise many text descriptions can be valid for a single image. Summarizing and translating multimodal data is therefore challenging. In this thesis, I describe multimodal sparse coding schemes that learn to represent multiple data modalities jointly. A key premise behind joint sparse coding is its representational power: it captures complementary information across modalities while reducing statistical redundancy. As a result, my schemes improve the performance of classification and retrieval tasks involving co-occurring data modalities. Building on the deep learning framework, I also present probabilistic generative models that produce new data conditioned on an input from another data modality. Specifically, I develop text-to-image synthesis models based on generative adversarial networks (GANs). To improve the visual realism and diversity of generated images, I propose additional objective functions and a new GAN architecture. Furthermore, I propose a novel sampling strategy for training data that promotes output diversity in the adversarial setting.
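
As a minimal illustrative sketch, not the exact formulation in the thesis, a generic joint sparse coding objective ties paired observations x_i^{(1)} (e.g., an image feature) and x_i^{(2)} (e.g., a text feature) to a single shared sparse code a_i over modality-specific dictionaries D_1 and D_2. The shared code is what captures complementary cross-modal information, while the l1 penalty reduces statistical redundancy:

\[
\min_{D_1,\, D_2,\, \{a_i\}} \; \sum_i \Big( \big\|x_i^{(1)} - D_1 a_i\big\|_2^2 \;+\; \big\|x_i^{(2)} - D_2 a_i\big\|_2^2 \;+\; \lambda \big\|a_i\big\|_1 \Big)
\]

For the text-to-image direction, the sketch below shows one alternating training step of a text-conditioned GAN in PyTorch. All names, dimensions, and architectures here are hypothetical placeholders chosen for brevity; the thesis's actual objective functions, GAN architecture, and sampling strategy are not reproduced.

```python
# Hypothetical minimal sketch of a text-conditioned GAN training step.
# Dimensions, networks, and data below are placeholders, not the thesis's models.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=64, text_dim=128, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, t):
        # Condition on the text embedding by concatenating it with the noise.
        return self.net(torch.cat([z, t], dim=1))

class Discriminator(nn.Module):
    def __init__(self, text_dim=128, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + text_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, x, t):
        # Score whether the image is real AND matches its text description.
        return self.net(torch.cat([x, t], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_imgs = torch.randn(16, 784)   # placeholder image batch
text_emb  = torch.randn(16, 128)   # placeholder text embeddings
z = torch.randn(16, 64)            # noise vectors

# Discriminator step: real (image, text) pairs vs. generated images.
d_real = D(real_imgs, text_emb)
d_fake = D(G(z, text_emb).detach(), text_emb)
loss_d = bce(d_real, torch.ones_like(d_real)) + \
         bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make conditioned samples score as real.
g_out = D(G(z, text_emb), text_emb)
loss_g = bce(g_out, torch.ones_like(g_out))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Concatenating the text embedding with both the generator's noise input and the discriminator's image input is one common way to express conditioning; the discriminator then penalizes images that look unrealistic or mismatch their description.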

Keywords

multimodal learning, generative adversarial net, sparse coding

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
