Identification of Cognates and Recurrent Sound Correspondences in Word Lists

Grzegorz Kondrak

Department of Computing Science
University of Alberta
Edmonton, AB T6G 2E8, Canada

Identification of cognates and recurrent sound correspondences is a component of two principal tasks of historical linguistics: demonstrating the relatedness of languages, and reconstructing the histories of language families. We propose methods for detecting and quantifying three characteristics of cognates: recurrent sound correspondences, phonetic similarity, and semantic affinity. The ultimate goal is to identify cognates and correspondences directly from lists of words representing pairs of languages that are known to be related. The proposed solutions are language-independent and are evaluated against authentic linguistic data. The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac language families indicate that our methods are more accurate than comparable programs and achieve high precision and recall on various test sets. The results also suggest that combining various types of evidence substantially increases cognate identification accuracy.

TAL, Volume 50, 2009, no. 2. Traitement automatique des langues et langues anciennes

