Publication:

Efficient Filtering: Stacking and Hash Functions as Data

Loading...
Thumbnail Image

Date

2020-06-17

Published Version

Published Version

Journal Title

Journal ISSN

Volume Title

Publisher

The Harvard community has made this article openly available. Please share how this access benefits you.

Research Projects

Organizational Units

Journal Issue

Citation

Deeds, Kyle Boyd. 2020. Efficient Filtering: Stacking and Hash Functions as Data. Bachelor's thesis, Harvard College.

Abstract

Approximate filtering is central to a wide variety of high performance systems in computer science. Data structures like Bloom Filters which perform this filtering answer set membership queries while guaranteeing zero false negatives and some small rate of false positives. Traditionally, these filters represent a set of elements, referred to as positives, which require expensive processing, and they protect the system from having to perform that processing on elements outside of that set, referred to as negatives. To improve overall system performance, the overhead of evaluating the filter must be less than the saved processing time, so filters must fit in memory while still producing very few false positives.

Worryingly, there is a strong lower bound on the size of a filter for a given false positive rate, and state-of-the-art filters are quickly approaching it. However, this lower bound requires strong assumptions about the available workload knowledge which do not hold in many real world contexts. Seeking to break this lower bound, several papers have suggested ways to incorporate additional workload knowledge to create better filters\cite{kraska2018case}\cite{lim2015complement}\cite{carrea2016yes}. However, current suggestions are generally either computationally infeasible for most filtering applications or do not take full advantage of the available workload knowledge.

In this thesis, two original methods are presented for reducing the size of filters by incorporating workload knowledge. The first method, Stacked Filters (SFs), can adapt any existing filter design to drastically reduce the false positive rate. The utility of stacking is demonstrated using URL blacklisting, IP filtering, and synthetic data, showing that SFs achieve FPRs 10x lower than traditional filters on real world datasets. Then, a theoretical analysis is presented showing the memory footprint, computational cost, and robustness of SFs. Lastly, a method for incorporating false positives in an online manner is presented alongside experimental results demonstrating its efficacy in increasing overall performance.

The second method, Variable Hash Filters (VHFs), introduces an entirely new filter design based on a data structure from the networking literature \cite{zhou2015scaling}. VHFs are more space-efficient than state-of-the-art traditional filters for low memory budgets even in the absence of workload knowledge, and they very efficiently incorporate workload knowledge. Initial experiments and theoretical results are presented which show the FPR and size of the filters as compared to traditional methods.

Description

Other Available Sources

Research Data

Keywords

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service

Endorsement

Review

Supplemented By

Related Stories