Using Jockers's method, G. Bruce Schaalje et al. show that it falsely attributes the Federalist Papers to Mormon writers.

Date
Apr 2011
Type
Academic / Technical Report
Source
G. Bruce Schaalje (LDS)
Hearsay
Direct
Reference

G. Bruce Schaalje, Paul J. Fields, Matthew Roper, and Gregory L. Snow, "Extended nearest shrunken centroid classification: A new method for open-set authorship attribution of texts of varying sizes," Literary and Linguistic Computing 26, no. 1 (April 2011): 71–88.

Scribe/Publisher
Literary and Linguistic Computing, Oxford University Press
People
Paul J. Fields, G. Bruce Schaalje, Matthew Roper, Gregory L. Snow
Audience
General Public
Transcription

To further illustrate the problem of probability inflation, we created an artificial authorship attribution problem in which the style of the test texts was deliberately chosen to be far different from the styles of all the training authors. As training data, we computed 130 literary features for word blocks from six nineteenth-century authors connected with early Mormonism: Joseph Smith, early Sidney Rigdon (1831–46), late Sidney Rigdon (1863–73), Solomon Spalding, Oliver Cowdery, and Parley P. Pratt (see Appendices A and B). As test data, we calculated the same features for the 51 Federalist papers authored by Alexander Hamilton. We then naïvely used the closed-set NSC procedure to calculate posterior probabilities and classifications for the Hamilton texts (as if they were anonymous). The 130 literary features included relative frequencies of 93 non-contextual words, 35 word-pattern ratios (Hilton 1990), and 2 vocabulary richness measures (Holmes 1992).
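
The closed-set setup described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' pipeline: the function-word list, toy word blocks, and author labels are placeholders, and scikit-learn's NearestCentroid (whose shrink_threshold option gives nearest shrunken centroids) stands in for whatever NSC implementation the authors used. The full 130-feature set is not reproduced.

```python
# Minimal sketch of a closed-set NSC experiment; placeholder data throughout.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestCentroid

# A few non-contextual (function) words; the study used 93 of these plus
# 35 word-pattern ratios and 2 vocabulary-richness measures (130 features).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "for"]

def relative_freqs(texts):
    """Relative frequency of each chosen function word in each text block."""
    counts = CountVectorizer(vocabulary=FUNCTION_WORDS).fit_transform(texts)
    totals = np.array([len(t.split()) for t in texts], dtype=float)
    return counts.toarray() / totals[:, None]

train_texts = [  # toy stand-ins for the six training authors' word blocks
    "and it came to pass that the people went forth in the land",
    "it came to pass in that day that the church did prosper",
    "the purpose of the revival and of the press is plain to all",
    "the duty of the editor is to speak to the people in plainness",
]
train_labels = ["Smith", "Smith", "Rigdon", "Rigdon"]
test_texts = [  # stand-in for the 51 Hamilton Federalist papers
    "to the people of the state of new york the utility of the union",
]

# shrink_threshold turns plain nearest-centroid into nearest *shrunken*
# centroid, the classifier underlying closed-set NSC.
clf = NearestCentroid(shrink_threshold=0.1)
clf.fit(relative_freqs(train_texts), train_labels)
# Closed-set NSC must pick one of the training authors, however poor the fit.
print(clf.predict(relative_freqs(test_texts)))
```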

Early or late Rigdon was falsely chosen as the author of 28 of the 51 Hamilton texts with inflated posterior probabilities ranging as high as 0.9999 (Fig. 2). Pratt was falsely chosen as the author of 12 of the papers, and Cowdery was falsely chosen as the author of the remaining 11 papers. These results dramatically demonstrate the danger of misapplying closed-set NSC.
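
The inflation itself is easy to see from the arithmetic. In the standard NSC formulation (Tibshirani et al. 2002), the posterior for author k is proportional to exp(-d_k/2) and is normalized over the candidate authors alone, so the posteriors must sum to 1 even when every candidate is a terrible fit. The distances below are hypothetical, chosen only to show the effect:

```python
# Hypothetical illustration of posterior inflation under closed-set NSC.
import numpy as np

# Hypothetical squared standardized distances from one test text to each of
# six training authors' shrunken centroids: all huge (the text fits nobody).
dist_sq = np.array([410.0, 400.0, 430.0, 445.0, 460.0, 452.0])

scores = np.exp(-0.5 * (dist_sq - dist_sq.min()))  # shift for numerical stability
posteriors = scores / scores.sum()                 # forced to sum to 1
print(posteriors.round(4))  # author 2 gets ~0.993 despite a terrible fit
```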

One message of this example is that before applying closed-set NSC to any authorship attribution problem, an initial examination of the data must be carried out to see if the x vectors for the test texts are reasonably near the distribution of x vectors in the training set for at least one of the candidate authors. A dimension reduction procedure such as principal components analysis (PCA) or a high-dimensional dynamic visualization program such as GGobi (Buja et al. 2003) must be used for this even though some information is lost when visualizing a high-dimensional data set in two or three dimensions. A principal components plot (Fig. 3) shows, as expected, that the test (Hamilton) texts are highly distinct from those of every author in the training data; thus, one expects that inflation of posterior authorship probabilities will be a serious problem for naïve authorship attribution of these test texts using closed-set NSC. Although the training authors do not appear very distinct in Fig. 3, a principal components plot (not shown) of only the training author data shows that they are reasonably distinct.
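
A minimal sketch of this pre-check, assuming synthetic feature matrices (130 columns, matching the paper's feature count) in place of the real word-block data; GGobi is an interactive tool, so an ordinary PCA scatter plot with scikit-learn and matplotlib is used instead:

```python
# Sketch of a PCA pre-check on synthetic stand-in data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(120, 130))  # placeholder training blocks
X_test = rng.normal(2.0, 1.0, size=(51, 130))    # placeholder test texts, far away

# Fit PCA on the pooled data so a gross train/test separation, if present,
# dominates the leading components (as in a Fig. 3-style plot).
pca = PCA(n_components=2).fit(np.vstack([X_train, X_test]))
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

plt.scatter(Z_train[:, 0], Z_train[:, 1], label="training authors")
plt.scatter(Z_test[:, 0], Z_test[:, 1], marker="x", label="test texts")
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
# If the x's sit far from every training cluster, closed-set NSC posteriors
# for those test texts cannot be trusted.
```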

Copyright © B. H. Roberts Foundation
The B. H. Roberts Foundation is not owned by, operated by, or affiliated with the Church of Jesus Christ of Latter-day Saints.