Publication: Information Theory for Vector Databases
No Thumbnail Available
Open/View Files
Date
2024-11-26
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
The Harvard community has made this article openly available. Please share how this access benefits you.
Citation
Rakhamimov, Joel. 2024. Information Theory for Vector Databases. Bachelor's thesis, Harvard University Engineering and Applied Sciences.
Research Data
Abstract
Vector embeddings are a relatively novel concept that have been rapidly increasing in popularity for uses in data science. They are the output of a deep learning model, which usually takes an image or a string of text, as an input. An embedding is an element of $\mathbb{R}^n$, for which it is much easier to perform actions like similarity search than the original image space, which is likely high dimensional.
As of yet, there is no rigorous information theoretic treatment of vector databases or vector embeddings, leaving the theoretical foundations weak. In this work, as part of research done in the Flavio Calmon group, we take the first step to apply information theory to vector databases, using CLIP as our embedding model and CIFAR-10 as our image dataset. First, we cover the formulation of the problem of vector embedding, including the theory underlying the embeddings, considering it as a transformation, and viewing the problem of embedding as a Gromov-Wasserstein alignment.
We then discuss curious statistical properties of the embeddings, including their distribution and mutual distances. Part of our findings is that calculating distances between vector embeddings results in significantly faster and more accurate sorting, rather than doing the same for their original images. However, the absolute cosine similarity between vectors from similar images is still very low.
Throughout, we describe dimensionality reduction techniques, product quantization codes, the cone effect, and Gaussianity of embeddings, the last two of which we observe weak results for.
Description
Other Available Sources
Keywords
dimensionality reduction, information theory, vector databases, vector embeddings, Electrical engineering, Applied mathematics
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service