The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon

Barbara McGillivray*, Marco Passarotti**, Paolo Ruffolo**

* University of Pisa, Italy

** Catholic University of the Sacred Heart, Milan, Italy

We present an overview of the Index Thomisticus Treebank project (IT-TB). The IT-TB consists of around 60,000 tokens from the Index Thomisticus by Roberto Busa SJ, an 11-million-token Latin corpus of the works of Thomas Aquinas. We briefly describe the annotation guidelines, which are shared with the Latin Dependency Treebank (LDT). We then report on the application of data-driven dependency parsers to IT-TB and LDT data: we present training and parsing results on several datasets and evaluate learning algorithms and techniques. Finally, we introduce the IT-TB valency lexicon extracted from the treebank, report quantitative data on the lexicon, and provide some statistical measures of subcategorisation structures.

TAL, Volume 50, No. 2, 2009. Traitement automatique des langues et langues anciennes

