Publication:
Information Theory for Vector Databases

Date

2024-11-26

Published Version

The Harvard community has made this article openly available.

Citation

Rakhamimov, Joel. 2024. Information Theory for Vector Databases. Bachelor's thesis, Harvard University Engineering and Applied Sciences.

Abstract

Vector embeddings are a relatively new tool that has been rapidly gaining popularity in data science. An embedding is the output of a deep learning model that typically takes an image or a string of text as input, and is an element of $\mathbb{R}^n$, a space in which operations such as similarity search are much easier to perform than in the original, typically high-dimensional image space. To date, there is no rigorous information-theoretic treatment of vector databases or vector embeddings, leaving their theoretical foundations weak. In this work, carried out as part of research in the Flavio Calmon group, we take a first step toward applying information theory to vector databases, using CLIP as our embedding model and CIFAR-10 as our image dataset. We first formulate the problem of vector embedding, covering the theory underlying the embeddings, treating embedding as a transformation, and viewing it as a Gromov-Wasserstein alignment. We then examine curious statistical properties of the embeddings, including their distribution and mutual distances. Among our findings, computing distances between vector embeddings yields significantly faster and more accurate sorting by similarity than doing the same with the original images; however, the absolute cosine similarity between embeddings of similar images remains very low. Throughout, we discuss dimensionality reduction techniques, product quantization codes, the cone effect, and the Gaussianity of embeddings, observing only weak evidence for the last two.
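
As a concrete illustration of the distance computations described in the abstract, the following is a minimal sketch, not reproduced from the thesis, of embedding two CIFAR-10 images with CLIP and comparing them by cosine similarity. It assumes the Hugging Face transformers implementation of CLIP with the openai/clip-vit-base-patch32 checkpoint and torchvision's CIFAR-10 loader; the thesis does not specify which CLIP implementation or checkpoint it uses.

# Minimal sketch (not the thesis code): embed two CIFAR-10 images with CLIP
# and compare them by cosine similarity. The checkpoint and libraries below
# are illustrative assumptions, not taken from the thesis.
import torch
from torchvision.datasets import CIFAR10
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Without a transform, CIFAR-10 returns PIL images, which CLIPProcessor accepts.
dataset = CIFAR10(root="./data", train=False, download=True)
images = [dataset[i][0] for i in range(2)]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = model.get_image_features(**inputs)  # shape: (2, 512)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

# Cosine similarity of unit-normalized embeddings is their dot product.
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"cosine similarity: {similarity:.4f}")

Replacing the two images with the full test split turns this into the kind of embedding-space nearest-neighbor sorting the abstract compares against pixel-space distances.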

Keywords

dimensionality reduction, information theory, vector databases, vector embeddings, Electrical engineering, Applied mathematics

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth in the Terms of Service.
