Authorship Attribution and Optical Character Recognition Errors
Patrick Juola*, **, John I. Noecker Jr**, Michael V. Ryan**
* Evaluating Variations in Language Laboratory
** Juola & Associates
Stylometric authorship attribution is a fundamental problem. The basic idea behind the research is that the author of a document can be identified from the cognitive and linguistic quirks that are unique to that person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test how sensitive the authorship attribution process is to character misrecognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
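As a rough illustration of what simulating random character misrecognition might look like, the sketch below corrupts a text at a chosen error rate. The abstract does not specify the error model, so everything here is an assumption: the function names `add_ocr_noise` and `character_error_rate` are hypothetical, and the noise is a uniform random substitution over lowercase ASCII letters that leaves whitespace untouched.

```python
import random
import string


def add_ocr_noise(text, error_rate, alphabet=string.ascii_lowercase, seed=None):
    """Return a copy of `text` with simulated character misrecognitions.

    Assumption (not from the paper): each non-whitespace character is
    independently replaced, with probability `error_rate`, by a different
    character drawn uniformly from `alphabet`.
    """
    rng = random.Random(seed)  # a seeded generator keeps runs reproducible
    noisy = []
    for ch in text:
        if not ch.isspace() and rng.random() < error_rate:
            # Exclude the original character so every error is a real change.
            candidates = [c for c in alphabet if c != ch]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(ch)
    return "".join(noisy)


def character_error_rate(original, noisy):
    """Fraction of positions that differ between two equal-length strings."""
    if not original:
        return 0.0
    differing = sum(a != b for a, b in zip(original, noisy))
    return differing / len(original)
```

One might then corrupt each document in a corpus at several rates, for example `add_ocr_noise(document, 0.05, seed=0)`, rerun the attribution method on the corrupted copies, and compare accuracy against the clean baseline. Real OCR errors are not uniform (visually similar characters are confused more often), so this substitution model is only a simplification.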
Volume 53 2012
3. Du bruit dans le signal : gestion des erreurs en traitement automatique des langues (Noise in the Signal: Error Handling in Natural Language Processing)