Zone Map Layout Optimization
CitationEdjah, Nenya. 2020. Zone Map Layout Optimization. Bachelor's thesis, Harvard College.
AbstractA zone map is a lightweight data structure that stores multiple summary statistic tuples (e.g. min/max) for multiple contiguous regions of an array called “zones”. The most common application of a zone map, and the one that we consider for most of this work, is as an index to accelerate range queries in relational databases by determining which zones should be skipped during a scan. Most prior academic work in zone maps has only discussed uniformly sized zones that were drawn without much consideration of the underlying data. However, a lot of real-world data exhibits a high degree of clustering and partial-orderedness. Uniform-sized zones are incapable of recognizing this structure which limits their ability to effectively skip data. To enable the maximum amount of data skipping, non-uniformly sized zones are required.
In this work, we propose a novel technique for statically generating optimal non-uniformly sized zone map layouts via the minimization of a cost function that models the expected amount of data that needs to be scanned for a random query. We show that the layouts generated by this technique have better expected performance than any uniformly-sized zone map, and depending on the distribution of the data, can obtain up to 6.4x lower query latencies than the best uniformly-sized zone map.
We additionally implement this system in a mature, open source, column-oriented database called MonetDB, and compare its performance to that of MonetDB’s native lightweight indexing data structures: imprints. We show that in comparison to MonetDB’s imprints, optimized zone maps can be built just as quickly, take up almost 100x less space, and, depending on the column’s data distribution, offer up to 10x lower latencies for medium to high selectivity queries.
Citable link to this pagehttps://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364697
- FAS Theses and Dissertations