Publication: Information Theory for Vector Databases
Abstract
Vector embeddings are a relatively novel tool that has been rapidly gaining popularity in data science. An embedding is the output of a deep learning model that usually takes an image or a string of text as input. It is an element of $\mathbb{R}^n$, in which operations such as similarity search are much easier to perform than in the original image space, which is typically very high dimensional.
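As a minimal illustration of why similarity search is convenient in $\mathbb{R}^n$, the cosine similarity of two embedding vectors reduces to a single normalized dot product (the vectors below are toy stand-ins, not real CLIP embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors in R^n."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for embeddings (CLIP embeddings are much higher-dimensional).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(a, b))  # 0.5
```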
As of yet, there is no rigorous information-theoretic treatment of vector databases or vector embeddings, leaving their theoretical foundations weak. In this work, carried out as part of research in the Flavio Calmon group, we take a first step toward applying information theory to vector databases, using CLIP as our embedding model and CIFAR-10 as our image dataset. First, we formulate the problem of vector embedding, covering the theory underlying the embeddings, viewing embedding as a transformation, and casting the embedding problem as a Gromov-Wasserstein alignment.
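For reference, the standard discrete Gromov-Wasserstein objective (stated here in its common form; the notation is ours, not necessarily the paper's) seeks a coupling between two metric measure spaces that best preserves pairwise distances:

```latex
% Discrete Gromov-Wasserstein alignment between an image space (X, d_X, \mu)
% and an embedding space (Y, d_Y, \nu): find a coupling \pi minimizing the
% distortion of pairwise distances.
\mathrm{GW}(\mu, \nu) \;=\; \min_{\pi \in \Pi(\mu, \nu)}
\sum_{i,k} \sum_{j,l} \bigl| d_X(x_i, x_k) - d_Y(y_j, y_l) \bigr|^2 \, \pi_{ij} \, \pi_{kl}
```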
We then discuss curious statistical properties of the embeddings, including their distribution and mutual distances. Among our findings: sorting images by distances between their vector embeddings is significantly faster and more accurate than sorting by distances computed on the original images. However, the absolute cosine similarity between embeddings of similar images is still very low.
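A sketch of ranking a database by embedding distance (function names and dimensions here are illustrative, not from the paper; real CLIP embeddings are 512-dimensional or larger):

```python
import numpy as np

def rank_by_similarity(query, embeddings):
    """Return indices of embeddings sorted by descending cosine similarity to query."""
    q = query / np.linalg.norm(query)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ q  # cosine similarities of every row to the query
    return np.argsort(-sims)

# Synthetic 64-d "embeddings"; the query is a lightly perturbed copy of row 3.
rng = np.random.default_rng(0)
db = rng.normal(size=(100, 64))
query = db[3] + 0.01 * rng.normal(size=64)
ranking = rank_by_similarity(query, db)
print(ranking[0])  # 3 — the near-duplicate ranks first
```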
Throughout, we describe dimensionality reduction techniques, product quantization codes, the cone effect, and the Gaussianity of embeddings; for the last two we observe only weak results.
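A toy sketch of product quantization, assuming the usual construction (split each vector into subvectors and quantize each subspace with its own small codebook); all names, sizes, and the simple Lloyd iteration below are illustrative, not the paper's implementation:

```python
import numpy as np

def train_pq(X, m=4, k=16, iters=10, seed=0):
    """Train product-quantization codebooks: split each d-dim vector into m
    subvectors and run a few Lloyd (k-means) iterations per subspace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ds = d // m
    codebooks = []
    for j in range(m):
        sub = X[:, j * ds:(j + 1) * ds]
        C = sub[rng.choice(n, k, replace=False)]  # init centroids from data
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                pts = sub[assign == c]
                if len(pts):
                    C[c] = pts.mean(axis=0)
        codebooks.append(C)
    return codebooks

def encode_pq(X, codebooks):
    """Encode each vector as m small centroid indices (its PQ code)."""
    ds = X.shape[1] // len(codebooks)
    codes = [np.argmin(((X[:, None, j * ds:(j + 1) * ds] - C[None]) ** 2).sum(-1), axis=1)
             for j, C in enumerate(codebooks)]
    return np.stack(codes, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))
codes = encode_pq(X, train_pq(X))
print(codes.shape)  # (200, 4): each 32-d vector compressed to 4 byte-sized indices
```

Each vector is thus stored as $m$ indices into small codebooks, which is what makes approximate distance computations in vector databases cheap.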