Publication: Multimodal Sparse Representation Learning and Cross-Modal Synthesis
Abstract
Humans have a natural ability to process and relate concurrent sensations in different sensory modalities such as vision, hearing, smell, and taste. For artificial intelligence to be more human-like in its capabilities, it must be able to interpret and translate multimodal information. However, multimodal data are heterogeneous, and the relationship between modalities is often complex. For example, there exist a number of correct ways to draw an image given a text description. Similarly, many text descriptions can be valid for an image. Summarizing and translating multimodal data is therefore challenging. In this thesis, I describe multimodal sparse coding schemes that can learn to represent multiple data modalities jointly. A key premise behind joint sparse coding is its representational power, which captures complementary information across modalities while reducing statistical redundancy. As a result, my schemes can improve the performance of classification and retrieval tasks involving co-occurring data modalities. Building on the deep learning framework, I also present probabilistic generative models that produce new data conditioned on an input from another data modality. Specifically, I develop text-to-image synthesis models based on generative adversarial networks (GANs). To improve the visual realism and diversity of generated images, I propose additional objective functions and a new GAN architecture. Furthermore, I propose a novel sampling strategy for training data that promotes output diversity under an adversarial setting.
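To make the idea of joint sparse coding concrete, the following is a minimal sketch, not the thesis's exact formulation: two co-occurring modalities share a single sparse code under modality-specific dictionaries, which reduces to ordinary sparse coding on the concatenated observation. The dictionaries, dimensions, and the ISTA solver here are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration of joint sparse coding: modalities x_a and x_b
# share one sparse code s with modality-specific dictionaries D_a and D_b,
# solving  min_s 0.5*||x_a - D_a s||^2 + 0.5*||x_b - D_b s||^2 + lam*||s||_1
# via ISTA (iterative shrinkage-thresholding).

def soft_threshold(v, t):
    # Proximal operator of the l1 penalty.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def joint_sparse_code(x_a, x_b, D_a, D_b, lam=0.1, n_iter=200):
    # Stacking modalities turns the joint problem into standard
    # sparse coding on the concatenated observation and dictionary.
    D = np.vstack([D_a, D_b])
    x = np.concatenate([x_a, x_b])
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    s = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ s - x)         # gradient of the quadratic terms
        s = soft_threshold(s - grad / L, lam / L)
    return s

# Synthetic example: a sparse shared code generates both "modalities".
rng = np.random.default_rng(0)
k = 32                                           # shared code dimension
D_a = rng.standard_normal((64, k)) / np.sqrt(64)  # e.g. image dictionary
D_b = rng.standard_normal((16, k)) / np.sqrt(16)  # e.g. text dictionary
s_true = np.zeros(k)
s_true[[3, 11, 20]] = [1.5, -2.0, 1.0]
x_a, x_b = D_a @ s_true, D_b @ s_true
s_hat = joint_sparse_code(x_a, x_b, D_a, D_b)
print(np.count_nonzero(np.abs(s_hat) > 0.1))     # only a few atoms stay active
```

Because the two reconstruction terms are tied through one code, atoms that explain both modalities at once are favored, which is one way to capture complementary information while suppressing redundancy.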