Approximate filtering is central to a wide variety of high-performance systems in computer science. Data structures such as Bloom filters, which perform this filtering, answer set membership queries while guaranteeing zero false negatives and a small rate of false positives. Traditionally, these filters represent a set of elements, referred to as positives, which require expensive processing, and they protect the system from having to perform that processing on elements outside of that set, referred to as negatives. To improve overall system performance, the overhead of evaluating the filter must be less than the processing time saved, so filters must fit in memory while still producing very few false positives.
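To make the guarantee concrete, the behavior described above can be sketched with a minimal Bloom filter. This is an illustrative toy, not one of the designs studied in this thesis; the class name, parameters, and hashing scheme are all assumptions made for the sketch:

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: zero false negatives, small false positive rate.

    Illustrative sketch only; parameter choices are arbitrary.
    """

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits      # size of the bit array
        self.k = k_hashes    # number of hash functions
        self.bits = 0        # integer used as an m-bit array

    def _indexes(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        # Inserting a positive sets all k of its bit positions.
        for idx in self._indexes(item):
            self.bits |= 1 << idx

    def might_contain(self, item):
        # True may be a false positive (all k bits set by other items);
        # False is always correct, i.e. there are no false negatives.
        return all(self.bits >> idx & 1 for idx in self._indexes(item))
```

In a blacklisting workload, `might_contain` returning False lets the system skip the expensive check entirely, while the occasional false positive only costs one unnecessary lookup.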
Worryingly, there is a strong lower bound on the size of a filter for a given false positive rate, and state-of-the-art filters are quickly approaching it. However, this lower bound relies on strong assumptions about the available workload knowledge which do not hold in many real-world contexts. Seeking to break this lower bound, several papers have suggested ways to incorporate additional workload knowledge to create better filters \cite{kraska2018case,lim2015complement,carrea2016yes}. However, current suggestions are generally either computationally infeasible for most filtering applications or do not take full advantage of the available workload knowledge.
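For context, the classical space lower bound for approximate membership (which assumes no knowledge of which negatives will be queried) can be stated as follows; the comparison to Bloom filters is a standard fact included here for orientation:

```latex
% Any data structure answering membership queries over a set of $n$
% elements with false positive rate $\epsilon$ (and no false negatives)
% must use at least
\[
  n \log_2 \frac{1}{\epsilon} \;\text{bits}.
\]
% A standard Bloom filter uses roughly $1.44\, n \log_2 \frac{1}{\epsilon}$
% bits, i.e.\ a constant factor of about $1.44$ above this optimum.
```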
In this thesis, two original methods are presented for reducing the size of filters by incorporating workload knowledge. The first method, Stacked Filters (SFs), can adapt any existing filter design to drastically reduce the false positive rate. The utility of stacking is demonstrated using URL blacklisting, IP filtering, and synthetic data, showing that SFs achieve false positive rates (FPRs) 10x lower than traditional filters on real-world datasets. A theoretical analysis is then presented characterizing the memory footprint, computational cost, and robustness of SFs. Lastly, a method for incorporating observed false positives into the filter in an online manner is presented, alongside experimental results demonstrating its efficacy in increasing overall performance.
The second method, Variable Hash Filters (VHFs), introduces an entirely new filter design based on a data structure from the networking literature \cite{zhou2015scaling}. VHFs are more space-efficient than state-of-the-art traditional filters at low memory budgets even in the absence of workload knowledge, and they incorporate workload knowledge very efficiently. Initial experimental and theoretical results are presented comparing the FPR and size of VHFs to those of traditional filters.