
Authorship Attribution and Optical Character Recognition Errors

Patrick Juola*, **, John I. Noecker Jr**, Michael V. Ryan**

* Evaluating Variations in Language Laboratory
Duquesne University
Pittsburgh
Pennsylvania
USA

** Juola & Associates
Pittsburgh
Pennsylvania
USA


Stylometric authorship attribution is a fundamental problem. The basic idea behind the research is that one can determine the authorship of a document on the basis of cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test to see how sensitive the authorship attribution process is to character mis-recognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
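The simulated-error setup described above can be illustrated with a minimal sketch. The abstract does not specify the noise model, so everything here is an assumption made for illustration: the function name `add_ocr_noise`, the `error_rate` and `seed` parameters, and the choice to replace each letter independently with a uniformly chosen different lowercase letter. Real OCR errors are not uniform (visually similar characters are confused more often), so this should be read as a rough stand-in rather than the authors' procedure.

```python
import random
import string
from typing import Optional


def add_ocr_noise(text: str, error_rate: float, seed: Optional[int] = None) -> str:
    """Return a copy of `text` with simulated character mis-recognition.

    Hypothetical noise model (an assumption, not taken from the paper):
    each alphabetic character is replaced, independently and with
    probability `error_rate`, by a different random lowercase ASCII letter.
    Digits, punctuation, and whitespace are left unchanged.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")

    rng = random.Random(seed)  # local generator so runs are reproducible
    noisy_chars = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            # Exclude the original letter so every error is a real change.
            candidates = [c for c in string.ascii_lowercase if c != ch.lower()]
            noisy_chars.append(rng.choice(candidates))
        else:
            noisy_chars.append(ch)
    return "".join(noisy_chars)


if __name__ == "__main__":
    sample = "It is a truth universally acknowledged."
    for rate in (0.0, 0.05, 0.20):
        print(rate, add_ocr_noise(sample, rate, seed=42))
```

Under these assumptions, a sensitivity experiment would apply the function to each document in the corpus at several error rates and compare attribution accuracy on the noisy copies against the clean baseline.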



TAL, Volume 53, No. 3 (2012). Du bruit dans le signal : gestion des erreurs en traitement automatique des langues (Noise in the signal: handling errors in natural language processing)

Last updated: 15 July 2013, by the Editors-in-Chief.