Publication: Stylometric Features for Multiple Authorship Attribution
Abstract
In computational linguistics, authorship attribution is the task of predicting the author of a document of unknown authorship. This task is generally performed by analyzing stylometric features: particular characteristics of an author's writing that can be used to distinguish his or her works from those of other authors. A wide variety of features have been proposed in the literature for this classification task, many of which are based on lexical, syntactic, or semantic properties of a text. I propose an extension to existing authorship attribution models aimed at the related problem of multiple author attribution, in which a document may be jointly written by several authors and the goal is to predict authorship at the sentence and section level within the document. The proposed model first uses a sentence-level Bayesian classifier to predict the most likely author of each sentence in the composite document, and then uses those sentence-level predictions to estimate likely section boundaries between authors. A Hidden Markov Model-based procedure is proposed for estimating section boundaries. The model is tested against a set of synthesized multi-author documents generated from a corpus of literary sentences of known authorship. New syntax-based feature sets for classifier training, including function word embedded subtrees and a variant of syntactic n-grams, are proposed and demonstrated to improve predictive accuracy when solving multi-author attribution problems.
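The two-stage idea described in the abstract (a sentence-level Bayesian classifier followed by a Hidden Markov Model over the sentence sequence) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify the feature set or the HMM parameters, so the sketch assumes plain word counts with add-one smoothing as features, a uniform initial distribution, and a hypothetical "sticky" transition probability `stay`. All function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def train(sentences_by_author):
    """Collect per-author word counts (assumed features: lowercase word unigrams)."""
    counts = {a: Counter(w for s in sents for w in s.lower().split())
              for a, sents in sentences_by_author.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(sentence, author, counts, vocab):
    """log P(sentence | author) under multinomial naive Bayes, add-one smoothing."""
    c = counts[author]
    total = sum(c.values()) + len(vocab) + 1  # +1 reserves mass for unseen words
    return sum(math.log((c[w] + 1) / total) for w in sentence.lower().split())

def viterbi_authors(sentences, counts, vocab, stay=0.9):
    """Most likely author per sentence; `stay` is an assumed self-transition prob."""
    authors = sorted(counts)
    switch = (1 - stay) / max(len(authors) - 1, 1)

    def trans(p, a):
        return math.log(stay if p == a else switch)

    # Uniform initial distribution over authors (an assumption).
    score = {a: -math.log(len(authors))
                + log_likelihood(sentences[0], a, counts, vocab)
             for a in authors}
    backpointers = []
    for s in sentences[1:]:
        new_score, ptr = {}, {}
        for a in authors:
            best = max(authors, key=lambda p: score[p] + trans(p, a))
            ptr[a] = best
            new_score[a] = (score[best] + trans(best, a)
                            + log_likelihood(s, a, counts, vocab))
        score = new_score
        backpointers.append(ptr)

    # Trace the best path backwards from the best final state.
    path = [max(authors, key=score.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

def boundaries(path):
    """Indices of sentences where the predicted author changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

# Toy corpus of known authorship and a synthesized two-author document.
corpus = {
    "A": ["the whale swam in the sea", "the ship chased the whale"],
    "B": ["her heart was full of love", "love and marriage filled her letters"],
}
document = ["the whale chased the ship", "the sea", "the whale swam",
            "her love", "love and marriage", "her heart"]

counts, vocab = train(corpus)
path = viterbi_authors(document, counts, vocab)
print(path, boundaries(path))
```

The sticky transition is what turns noisy per-sentence predictions into contiguous sections: a single ambiguous sentence is unlikely to trigger a switch unless its likelihood strongly favors another author. Richer features, such as the syntax-based ones the abstract proposes, would replace the word counts in `log_likelihood`.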