The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon

Barbara McGillivray^*, Marco Passarotti^** and Paolo Ruffolo^**

^*University of Pisa, Italy; b.mcgillivray@ling.unipi.it

^**Catholic University of the Sacred Heart, Milan, Italy; marco.passarotti@unicatt.it, paolo.ruffolo@poste.it

Abstract (in English)

We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the texts of Thomas Aquinas. We briefly describe the annotation guidelines, which are shared with the Latin Dependency Treebank (LDT). We then report on the application of data-driven dependency parsers to IT-TB and LDT data, presenting training and parsing results on several datasets and evaluating learning algorithms and techniques. Finally, we introduce the IT-TB valency lexicon extracted from the treebank, report quantitative data on the lexicon, and provide some statistical measures on subcategorisation structures.

Published in

Traitement automatique des langues et langues anciennes

Document

TAL_50_2_4.pdf
