Publication: Zone Map Layout Optimization
No Thumbnail Available
Open/View Files
Date
2020-06-17
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
The Harvard community has made this article openly available. Please share how this access benefits you.
Citation
Edjah, Nenya. 2020. Zone Map Layout Optimization. Bachelor's thesis, Harvard College.
Research Data
Abstract
A zone map is a lightweight data structure that stores multiple summary statistic tuples (e.g. min/max) for multiple contiguous regions of an array called “zones”. The most common application of a zone map, and the one that we consider for most of this work, is as an index to accelerate range queries in relational databases by determining which zones should be skipped during a scan. Most prior academic work in zone maps has only discussed uniformly sized zones that were drawn without much consideration of the underlying data. However, a lot of real-world data exhibits a high degree of clustering and partial-orderedness. Uniform-sized zones are incapable of recognizing this structure which limits their ability to effectively skip data. To enable the maximum amount of data skipping, non-uniformly sized zones are required.
In this work, we propose a novel technique for statically generating optimal non-uniformly sized zone map layouts via the minimization of a cost function that models the expected amount of data that needs to be scanned for a random query. We show that the layouts generated by this technique have better expected performance than any uniformly-sized zone map, and depending on the distribution of the data, can obtain up to 6.4x lower query latencies than the best uniformly-sized zone map.
We additionally implement this system in a mature, open source, column-oriented database called MonetDB, and compare its performance to that of MonetDB’s native lightweight indexing data structures: imprints. We show that in comparison to MonetDB’s imprints, optimized zone maps can be built just as quickly, take up almost 100x less space, and, depending on the column’s data distribution, offer up to 10x lower latencies for medium to high selectivity queries.
Description
Other Available Sources
Keywords
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service