Publication:

Stylometric Features for Multiple Authorship Attribution

Date

2019-08-23

The Harvard community has made this article openly available.

Citation

Yu, Brian. 2019. Stylometric Features for Multiple Authorship Attribution. Bachelor's thesis, Harvard College.

Abstract

In computational linguistics, authorship attribution is the task of predicting the author of a document of unknown authorship. This task is generally performed via the analysis of stylometric features: particular characteristics of an author’s writing that can be used to distinguish his or her works from the works of other authors. A wide variety of features have been proposed in the literature for this classification task, many of which are based on the analysis of lexical, syntactic, or semantic properties of a text. I propose an extension to existing authorship attribution models aimed at solving the related problem of multiple author attribution, which performs authorship classification on documents that may be jointly written by multiple authors rather than one, and which aims to predict sentence-level and section-level authorship within the document. To do so, I propose a model that first uses a sentence-level Bayesian classifier to predict the most likely author of each sentence in the composite document, and then uses those sentence-level predictions to estimate likely section boundaries between authors. The model is tested against a set of synthesized multi-author documents generated from a corpus of sentences of known authorship from literature. A Hidden Markov Model-based procedure is proposed for estimating section boundaries, and new syntax-based feature sets for classifier training, including function word embedded subtrees and a variant of syntactic n-grams, are proposed and demonstrated to improve predictive accuracy when solving multi-author attribution problems.
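
The two-stage pipeline described in the abstract (per-sentence Bayesian scoring, then HMM-based boundary estimation) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not specify the classifier's features or the HMM's parameters, so the sketch assumes a bag-of-words Naive Bayes model with Laplace smoothing in place of the syntax-based feature sets, a uniform prior over authors, and a single hypothetical `stay_prob` parameter for the chance that consecutive sentences share an author. All function names are illustrative.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_sentences, alpha=1.0):
    """Count word frequencies per author (bag-of-words stand-in for real features)."""
    counts, vocab = {}, set()
    for author, sentence in labeled_sentences:
        tokens = sentence.lower().split()
        counts.setdefault(author, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab, alpha


def sentence_log_likelihoods(model, sentence):
    """Return {author: log P(sentence | author)} with Laplace smoothing."""
    counts, vocab, alpha = model
    tokens = sentence.lower().split()
    scores = {}
    for author, c in counts.items():
        # +1 reserves probability mass for words never seen in training.
        total = sum(c.values()) + alpha * (len(vocab) + 1)
        scores[author] = sum(math.log((c[t] + alpha) / total) for t in tokens)
    return scores


def viterbi_authors(emissions, stay_prob=0.9):
    """Most likely author sequence under an HMM whose hidden states are authors.

    emissions: one {author: log-likelihood} dict per sentence.
    stay_prob: assumed probability that the next sentence keeps the same author.
    """
    authors = sorted(emissions[0])
    n = len(authors)
    stay = math.log(stay_prob)
    switch = math.log((1 - stay_prob) / (n - 1)) if n > 1 else float("-inf")

    def trans(p, a):
        return stay if p == a else switch

    best = {a: -math.log(n) + emissions[0][a] for a in authors}  # uniform prior
    back = []
    for e in emissions[1:]:
        new_best, pointers = {}, {}
        for a in authors:
            prev = max(authors, key=lambda p: best[p] + trans(p, a))
            new_best[a] = best[prev] + trans(prev, a) + e[a]
            pointers[a] = prev
        best = new_best
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(authors, key=lambda a: best[a])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def section_boundaries(path):
    """Indices of sentences at which the predicted author changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]
```

In use, each sentence of a composite document would be scored with `sentence_log_likelihoods`, and the resulting list passed to `viterbi_authors`. Because switching authors is penalized, an isolated sentence that weakly favors a different author tends to be absorbed into the surrounding section instead of creating a spurious boundary, which independent per-sentence classification cannot do.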

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
