Publication:
Real-Time Tf-Idf Clustering Using Simhash, Approximate Nearest Neighbors, and DBSCAN

Date

2016-06-21

Published Version

The Harvard community has made this article openly available.

Abstract

Modern news aggregators and content providers such as Facebook and LinkedIn need to give their users the experience of a diversified, highly personalized news feed. The diversity requirement implies that a feed must be clustered into categories, and the personalization requirement implies that feeds must be unique. News aggregators therefore have to periodically run offline algorithms to cluster each user’s assigned corpus of news stories, whether or not the user logs in. In this thesis, we offer a novel approach to clustering a user’s news feed that is fast enough to run in real time at the moment of log-in; modern news aggregators could use this approach to save a significant amount of computing resources and power. The thesis introduces a novel randomized algorithm capable of clustering tens of thousands of tf-idf vectors in real time on a standard modern machine. The algorithm consists of three novel parts: the first is an efficient dimensionality reduction technique; the second is a randomized approximate nearest neighbor computation that collects representative samples of the data in order to tune its own hyper-parameters; the third is a simplified run of DBSCAN to produce a clustering assignment. In addition, we present a series of optimizations that speed up the creation of tf-idf vectors from text documents. Using corpora of text documents pulled from Stack Exchange’s API, the algorithm achieves an 85% accuracy rate when clustering 65K tf-idf vectors of 58K entries each, in less than 3 seconds; we also speed up the basic tf-idf creation process by over 3.3x. Finally, we describe two variations on the approximate nearest neighbor computation. Although both yield strong results on average, one variation is more likely to terminate quickly at the desired level of accuracy, at the cost of higher variability in that accuracy. We analyze both variations on different data distributions and empirically expose the tradeoffs involved.
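The three-stage pipeline named in the abstract can be sketched in miniature. The following is an illustrative sketch only, not the thesis's implementation: the 64-bit signature width, the Hamming radius of 16, the brute-force neighbor scan, and all function names are assumptions chosen for clarity (a real-time system would replace the quadratic scan with the randomized ANN search the thesis describes).

```python
# Sketch of: simhash dimensionality reduction -> Hamming-space neighbors
# -> simplified DBSCAN. Parameters below are illustrative assumptions.
import random
from collections import defaultdict, deque

random.seed(0)
NUM_BITS = 64  # simhash signature width (an assumed value)

# One random Gaussian hyperplane per vocabulary term, generated lazily.
planes = defaultdict(lambda: [random.gauss(0.0, 1.0) for _ in range(NUM_BITS)])

def simhash(tfidf):
    """Reduce a sparse tf-idf vector {term: weight} to a NUM_BITS-bit
    signature by projecting onto random hyperplanes and keeping sign bits."""
    sums = [0.0] * NUM_BITS
    for term, weight in tfidf.items():
        plane = planes[term]
        for b in range(NUM_BITS):
            sums[b] += weight * plane[b]
    return sum(1 << b for b in range(NUM_BITS) if sums[b] > 0.0)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

def neighbor_lists(sigs, radius):
    """Brute-force epsilon-neighborhoods in Hamming space; stands in for
    the thesis's randomized approximate nearest neighbor computation."""
    nbrs = defaultdict(set)
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            if hamming(sigs[i], sigs[j]) <= radius:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs

def simple_dbscan(n, nbrs, min_pts):
    """Simplified DBSCAN: grow a cluster from each unvisited core point
    by breadth-first expansion; points never reached stay noise (-1)."""
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_pts:
            continue
        labels[i] = cluster
        queue = deque(nbrs[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster
                if len(nbrs[j]) >= min_pts:
                    queue.extend(nbrs[j])
        cluster += 1
    return labels

# Toy corpus: two obvious topics, given as pre-computed tf-idf vectors.
docs = [
    {"python": 0.9, "compiler": 0.5},
    {"python": 0.8, "compiler": 0.6},
    {"soccer": 0.9, "goalkeeper": 0.7},
    {"soccer": 0.8, "goalkeeper": 0.6},
]
sigs = [simhash(d) for d in docs]
labels = simple_dbscan(len(docs), neighbor_lists(sigs, radius=16), min_pts=1)
```

Because documents sharing terms produce signatures that agree on most sign bits, the two topic pairs land within the Hamming radius of each other while the cross-topic distances concentrate around NUM_BITS/2, so DBSCAN recovers the two clusters.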

Keywords

Computer Science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
