CURLICAT: Curated Multilingual Language Resources for CEF AT
Funding institution: Innovation and Networks Executive Agency

CURLICAT will compile curated monolingual datasets in the seven target languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak, and Slovenian) for areas relevant to European digital service infrastructures (DSI) to support the development of CEF AT. The primary data source will be the national or reference corpora of these languages.
The initiative will publish at least 14 million sentences (estimated to contain at least 140 million words) from domains such as science, culture, healthcare, economy, and finance.
Additionally, the action will address a shortcoming of machine translation technology, which relies heavily on the availability of high-quality, domain-specific language resources for these medium-resourced languages.
Participating institutions

Institute for Bulgarian Language "Prof. Lyubomir Andreychin"

University of Zagreb, Faculty of Humanities and Social Sciences

Institute of Computer Science, Polish Academy of Sciences

Institutul de Cercetari pentru Inteligenta Artificiala, Academia Romana

Jazykovedný ústav Ľ. Štúra Slovenskej akadémie vied
