Machine Learning Techniques for Modeling of Language Varieties
Language varieties are a primary means of expressing a person's social affiliation and identity. Hence, computer systems that adapt to the user by displaying a familiar socio-cultural identity are expected to gain markedly higher acceptance in certain contexts and among certain target groups. However, current systems are far from achieving the fidelity required to realize these benefits. For example, promising early results have been obtained in speech synthesis through localized pronunciation, but language variety is a multi-faceted concept that involves deviations from the standard language on several linguistic levels. Our goal is to develop algorithmic methods capable of capturing and reproducing all major idiosyncrasies displayed by a language variety, be they syntactic, lexical or phonetic in nature. Conceptually, part of this task can be understood as a machine translation problem, albeit one with unique properties.
For one thing, substantial written corpora of a language variety are rare, and orthographic conventions can vary greatly among existing resources. Parallel resources for a standard language and one of its varieties are even less common. Yet such parallel data is the workhorse of modern machine translation systems and key to producing sufficiently natural utterances. Current methods therefore seem inadequate for the task at hand. On the other hand, the relative proximity between a standard language and its varieties works to our advantage. Furthermore, steady progress in machine learning has opened up new possibilities. We argue that, by exploiting these facts, it will be possible to draw on statistical machine translation techniques despite data sparsity.
In particular, we investigate principled bootstrapping of parallel data and statistical models using active learning techniques. Such a strategy reduces manual effort by automatically selecting, for annotation by a human, those sentences that are expected to yield the largest gain in quality. Within this overall framework, we aim to employ several recent advances in machine learning and statistical machine translation. Notably, factored translation models are used to translate different linguistic levels separately, thereby mitigating the combinatorial explosion incurred by translating full word forms. Moreover, using discriminative machine learning techniques, we aim to incorporate rich linguistic hints that can greatly enhance generalization performance.
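The selection step of such an active learning loop can be illustrated with a minimal sketch. It assumes a very simple selection criterion, namely the share of a sentence's word bigrams that the already annotated data does not yet cover; this heuristic, the function names and the toy sentences are illustrative assumptions and not the project's actual method, which would more plausibly rely on model-based estimates of expected quality gain.

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty(sentence, seen, n=2):
    """Fraction of the sentence's n-grams not yet covered by annotated data.

    Assumption: uncovered n-grams serve as a crude proxy for how much a
    human annotation of this sentence would teach the model.
    """
    grams = ngrams(sentence.split(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in seen) / len(grams)


def select_for_annotation(pool, annotated, budget, n=2):
    """Greedily pick up to `budget` sentences from `pool` for annotation.

    After each pick, the chosen sentence's n-grams count as covered, so
    near-duplicates of an already chosen sentence score lower.
    """
    seen = set()
    for s in annotated:
        seen.update(ngrams(s.split(), n))

    remaining = list(pool)
    chosen = []
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: novelty(s, seen, n))
        chosen.append(best)
        remaining.remove(best)
        seen.update(ngrams(best.split(), n))
    return chosen


# Toy usage with hypothetical Standard German sentences.
annotated = ["das ist gut"]
pool = ["das ist gut", "das ist gut so", "wir gehen heute ins kino"]
picked = select_for_annotation(pool, annotated, budget=2)
```

In a full loop, the picked sentences would be translated into the variety by a human, added to the parallel data, and the translation model retrained before the next selection round.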
Standard German and its Viennese varieties serve as a test bed for implementing and exploring our techniques. However, our goal is to establish methods that are general enough to greatly facilitate similar endeavors in the future. Moreover, the resulting prototype will be of great value for the localization of information systems within the wider region of Vienna.
- Adolfo Hernández
- Friedrich Neubarth
- Sylvia Moosmüller, Acoustics Research Institute (ARI) of the Austrian Academy of Sciences
- Philipp Koehn, Institute for Language, Cognition and Computation (ILCC), School of Informatics of the University of Edinburgh