European Thematic Network for Doctoral Education in Computing

This dissertation analyzes statistical language models and their application to automatic continuous speech recognition of Lithuanian, a highly inflected language. The target task is a general-purpose dictation system for continuous Lithuanian speech.
The main methods of statistical language modeling are thoroughly investigated for Lithuanian. The out-of-vocabulary rate is presented as a function of vocabulary size, and the perplexity of standard n-gram models is evaluated. Compound models of Lithuanian, such as class-based, skip, adaptive cache, and topic mixture models, are also built and evaluated. Several types of morphology-based models are reviewed. A new statistical modeling method that uses both the left and the right context of a word to model that word is introduced, and its performance is reported. Results are given in terms of perplexity and word error rate. Experiments were performed with very large vocabularies of more than 1 million distinct word forms.
A prototype large-vocabulary continuous speech recognition system for Lithuanian was built. Its accuracy is evaluated as a function of the size of the pronunciation vocabulary and for several types of statistical language models. In addition, a recognition system based on word particles was built, and its results are presented.
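
The two corpus measures named in the abstract, out-of-vocabulary rate as a function of vocabulary size and n-gram perplexity, can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names `oov_rate` and `bigram_perplexity`, the whitespace tokenization, the toy data, and the add-one smoothing are all assumptions made for illustration, and the abstract does not state which smoothing or toolkit was used.

```python
import math
from collections import Counter


def oov_rate(train_tokens, test_tokens, vocab_size):
    """Share of test tokens missing from the vocab_size most frequent training words."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    misses = sum(1 for w in test_tokens if w not in vocab)
    return misses / len(test_tokens)


def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of a bigram model with add-one smoothing (an assumed, toy choice)."""
    # Closed vocabulary for the toy example: every word seen in either text.
    vocab = set(train_tokens) | set(test_tokens)
    # Count each word only where it acts as a history (it has a successor).
    histories = Counter(train_tokens[:-1])
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))

    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(pairs))


if __name__ == "__main__":
    train = "a b a b a c".split()
    test = "a b c d".split()
    for size in (2, 3):
        print(size, oov_rate(train, test, size))
    print(bigram_perplexity(train, "a b a b".split()))
```

In a highly inflected language each lemma has many surface forms, so the out-of-vocabulary rate falls more slowly with vocabulary size than in a weakly inflected language, which is one reason very large vocabularies are relevant here.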
