Lab of the Month:
The Institute for Applied Linguistics at Eurac Research
by Alexander König & Lionel Nicolas
The Institute for Applied Linguistics (IAL) is one of 11 institutes at Eurac Research, a research centre situated in Bolzano, South Tyrol, right in the heart of the Dolomites. This particular location brings with it a strong research focus on local multilingualism. In South Tyrol, Italian, German and the smaller Ladin language are all recognized as official languages. They interact with each other in many interesting ways and create particular demands, such as multilingual legal and administrative terminology or specific approaches to teaching in multilingual environments. The Institute for Applied Linguistics tackles several of these challenges through three research groups: Specialised Communication, Bilingualism and Multilingualism, and Language Technologies. The last of these is the IAL's main participant in enetCollect.
The IAL’s Language Technologies (LT) group currently has seven members: Jennifer-Carmen Frey, Alexander König, Verena Lyding, Lionel Nicolas, Nadezda Okinina, Monica Pretti and Egon Stemle. The LT group supports the other two groups with their computer science needs and also pursues its own research, mostly situated within the fields of Computational Linguistics, Natural Language Processing and Linguistic Infrastructure, with a special focus on the design of Linguistic Corpora and on Crowdsourcing.
Crowdsourcing is a relatively new focus for our work. We played an essential part in setting up enetCollect, and the COST Action is coordinated by the LT group at the IAL. As there is a strong imbalance in NLP coverage among the three official languages of South Tyrol, we aim to crowdsource varied NLP resources, starting with wide-coverage Part-of-Speech (POS) lexica for the South Tyrolean variety of German and for Ladin. We envision doing so by implicitly crowdsourcing content via automatically generated exercises, where the answers provided by learners are used to extend and correct the NLP datasets from which the exercises are generated.
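To make the idea of implicit crowdsourcing more concrete, here is a minimal Python sketch of one way such a loop could work. It is an illustration under our own assumptions, not a description of an existing system: the tag set, the function names (make_exercise, aggregate, update_lexicon), the majority-vote rule and its thresholds are all hypothetical, and the words are toy examples.

```python
from collections import Counter, defaultdict

# Hypothetical coarse tag set; a real lexicon would likely use a richer one.
TAGS = ["NOUN", "VERB", "ADJ", "ADV"]


def make_exercise(word):
    """Build a multiple-choice exercise asking learners for the POS of `word`."""
    return {
        "word": word,
        "prompt": f"Which part of speech is '{word}'?",
        "options": list(TAGS),
    }


def aggregate(answers, min_votes=3, min_agreement=0.7):
    """Majority vote over (word, tag) answers.

    A tag is accepted only if the word received at least `min_votes`
    valid answers and the winning tag reaches `min_agreement` of them.
    Both thresholds are assumptions chosen for illustration.
    """
    votes = defaultdict(Counter)
    for word, tag in answers:
        if tag in TAGS:  # ignore answers outside the tag set
            votes[word][tag] += 1

    accepted = {}
    for word, counter in votes.items():
        tag, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if total >= min_votes and count / total >= min_agreement:
            accepted[word] = tag
    return accepted


def update_lexicon(lexicon, accepted):
    """Write accepted tags into `lexicon`; return (added, corrected) words."""
    added, corrected = [], []
    for word, tag in accepted.items():
        if word not in lexicon:
            added.append(word)       # the lexicon is extended
        elif lexicon[word] != tag:
            corrected.append(word)   # an existing entry is corrected
        lexicon[word] = tag
    return added, corrected
```

In this sketch, exercises are generated from words in (or missing from) the lexicon, learner answers are pooled per word, and only answers with sufficient agreement flow back into the lexicon. A real deployment would also need to weigh learner proficiency and handle words with more than one valid tag, which this simple vote does not attempt.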
With regard to Linguistic Infrastructure, we have recently become very active in the CLARIN community, a pan-European initiative to increase the interoperability and availability of linguistic data and tools. We have set up the Eurac Research CLARIN Centre (ERCC), where we are preparing to host all corpora collected at the institute. We have also created a dockerized version of the popular CLARIN DSpace repository software, which we hope will enable even more institutions to easily set up their own repository. In addition, we actively collaborate with the ELEXIS project, which fosters cooperation and knowledge exchange in lexicography in order to bridge the gap between lesser-resourced languages and those for which advanced experience in e-lexicography exists.
With regard to both Linguistic Infrastructure and Linguistic Corpora, we also support the creation of Learner Corpora by adapting tools, developing semi-automatic approaches and building a dedicated portal that allows users to browse and query the Learner Corpora created and curated by the Bilingualism and Multilingualism group. We are also actively involved in the CMC (Computer-Mediated Communication) community: we released a South Tyrolean CMC corpus named DIDI and organized the 2017 edition of the CMC conference.
Last but not least, our involvement in enetCollect has already started to influence other projects. Within the CLARIN-oriented local infrastructure project DI-ÖSS, in which we work on integrating and networking smaller “language-related institutions” (e.g. libraries and newspapers) within South Tyrol and on exploiting mutually beneficial synergies, we are currently starting a crowdsourcing initiative in which we plan to involve the local population in annotating and transcribing a linguistic corpus of historical letters.