Show simple item record

dc.contributor.authorDeeds, Kyle Boyd
dc.date.accessioned2020-08-28T10:28:43Z
dash.embargo.terms2021-05-01
dc.date.created2020-05
dc.date.issued2020-06-17
dc.date.submitted2020
dc.identifier.citationDeeds, Kyle Boyd. 2020. Efficient Filtering: Stacking and Hash Functions as Data. Bachelor's thesis, Harvard College.
dc.identifier.urihttps://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364685*
dc.description.abstractApproximate filtering is central to a wide variety of high performance systems in computer science. Data structures like Bloom Filters which perform this filtering answer set membership queries while guaranteeing zero false negatives and some small rate of false positives. Traditionally, these filters represent a set of elements, referred to as positives, which require expensive processing, and they protect the system from having to perform that processing on elements outside of that set, referred to as negatives. To improve overall system performance, the overhead of evaluating the filter must be less than the saved processing time, so filters must fit in memory while still producing very few false positives. Worryingly, there is a strong lower bound on the size of a filter for a given false positive rate, and state-of-the-art filters are quickly approaching it. However, this lower bound requires strong assumptions about the available workload knowledge which do not hold in many real world contexts. Seeking to break this lower bound, several papers have suggested ways to incorporate additional workload knowledge to create better filters\cite{kraska2018case}\cite{lim2015complement}\cite{carrea2016yes}. However, current suggestions are generally either computationally infeasible for most filtering applications or do not take full advantage of the available workload knowledge. In this thesis, two original methods are presented for reducing the size of filters by incorporating workload knowledge. The first method, Stacked Filters (SFs), can adapt any existing filter design to drastically reduce the false positive rate. The utility of stacking is demonstrated using URL blacklisting, IP filtering, and synthetic data, showing that SFs achieve FPRs 10x lower than traditional filters on real world datasets. Then, a theoretical analysis is presented showing the memory footprint, computational cost, and robustness of SFs. Lastly, a method for incorporating false positives in an online manner is presented alongside experimental results demonstrating its efficacy in increasing overall performance. The second method, Variable Hash Filters (VHFs), introduces an entirely new filter design based on a data structure from the networking literature \cite{zhou2015scaling}. VHFs are more space-efficient than state-of-the-art traditional filters for low memory budgets even in the absence of workload knowledge, and they very efficiently incorporate workload knowledge. Initial experiments and theoretical results are presented which show the FPR and size of the filters as compared to traditional methods.
dc.description.sponsorshipComputer Science
dc.description.sponsorshipComputer Science
dc.format.mimetypeapplication/pdf
dc.language.isoen
dash.licenseLAA
dc.titleEfficient Filtering: Stacking and Hash Functions as Data
dc.typeThesis or Dissertation
dash.depositing.authorDeeds, Kyle Boyd
dash.embargo.until2021-05-01
dc.date.available2020-08-28T10:28:43Z
thesis.degree.date2020
thesis.degree.grantorHarvard College
thesis.degree.grantorHarvard College
thesis.degree.levelUndergraduate
thesis.degree.levelUndergraduate
thesis.degree.nameAB
thesis.degree.nameAB
dc.type.materialtext
thesis.degree.departmentComputer Science
thesis.degree.departmentComputer Science
dash.identifier.vireo
dc.identifier.orcid0000-0003-2267-3276
dash.author.emailkylebd99@gmail.com


Files in this item

Thumbnail
Thumbnail

This item appears in the following Collection(s)

Show simple item record