CURLICAT: Curated Multilingual Language Resources for CEF AT
Funding institution: Innovation and Networks Executive Agency
CURLICAT will compile curated monolingual datasets in the seven target languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak, and Slovenian) in domains relevant to European digital service infrastructures (DSIs) to support the development of CEF AT. The primary data sources will be the national or reference corpora of these languages.
The initiative will publish at least 14 million sentences (estimated to contain at least 140 million words) from domains such as science, culture, healthcare, economy, and finance.
Additionally, the action will address a shortcoming of machine translation technology, which relies heavily on high-quality, domain-specific linguistic resources, by providing such resources for these medium-resourced languages.