Language Technology Research Group
Research area

The predecessor of the Language Technology Research Group, the Corpus Linguistics Department, was founded in 1997 in formal recognition of the language technology research and development work that had already been under way for several years. Since then, the research group has gained significant experience in many areas of language technology: it has achieved outstanding results in building language resources, developing language technology tools and, more recently, training language models.

Among the linguistic resources, the first version of the Hungarian National Corpus (HNC or MNSZ) deserves special mention. This annotated, representative text corpus of 187.6 million words was completed in 2005 and also includes language variants from territories outside the present borders of Hungary. It was the first major database of its kind for the Hungarian language. MNSZ2, the improved version published in 2014, not only contains almost ten times more words (1.5 billion), but also covers important new text types, such as social media. The quality of the linguistic analysis also improved considerably compared to the previous version.

Over the years, the members of the research group have developed a number of tools. One of the most significant of these is the spelling service portal, which uses automatic tools to help language users follow the complex rules of the Hungarian spelling system. Also worth emphasising are e-magyar, the digital language processing toolchain, and its improved, modularized version, emtsv, both of which enable extensive analysis of texts.

The Hungarian version of the WordNet lexical database is another important resource, built with the cooperation of the research group. HuWordNet is the result of a three-year process and describes the Hungarian vocabulary from a semantic point of view: it includes both synonymous words and the relations between them.

The scientific paradigm shifts that took place in 2013 and then in 2018 had a serious impact on the work carried out in the research group. Following the leading international research, we created the Hungarian versions of the neural language models developed primarily for English. Initially, this meant static word embeddings, but now we also have several transformer-based contextual language models. Examples include HILBERT, which is a BERT-Large language model, and PULI-GPT-3SX (7 billion parameters), the Hungarian version of GPT-3.

An important element of the current activity of the research group is exploring new learning paradigms related to language models, such as zero-shot and few-shot learning or prompt programming. Another important research direction is improving the quality of machine translation with transformer-based neural networks. Our specific applications related to language models can be tried on this demo page.
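The difference between zero-shot and few-shot prompting can be illustrated with a minimal sketch. The function name, the prompt layout and the Hungarian sample sentences below are invented for illustration and do not describe the group's own tooling; the sketch only shows how a prompt is assembled from an instruction, optional worked examples and an open query.

```python
def build_prompt(instruction, examples, query):
    """Assemble a text prompt for a generative language model.

    With an empty `examples` list this is a zero-shot prompt;
    with one or more (text, label) pairs it is a few-shot prompt.
    The "Text:/Label:" layout is an assumption made for this sketch.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query is left open so the model continues with the label.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical sentiment examples (invented for this sketch).
few_shot = build_prompt(
    "Classify the sentiment of the text as positive or negative.",
    [("Ez a film nagyszerű volt.", "positive"),
     ("Unalmas és hosszú.", "negative")],
    "Nagyon tetszett a könyv.",
)
zero_shot = build_prompt(
    "Classify the sentiment of the text as positive or negative.",
    [],
    "Nagyon tetszett a könyv.",
)
```

The resulting string would then be passed to a generative model; no model call is shown here because the interface depends on the model being used.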

Another priority of the research group is to create test databases for Hungarian, so-called benchmark corpora, which, when embedded in a web service, enable a simple yet multi-criteria evaluation of neural network-based technologies, as well as the comparison and publication of the results. For this purpose, the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU) was created, based on the GLUE and SuperGLUE test database infrastructure developed for the English language.
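As a rough illustration of what a multi-criteria evaluation on a benchmark task can look like, the sketch below scores a list of predicted labels against gold labels with two common measures, accuracy and macro-averaged F1. It is a generic example written for this page, not the HuLU implementation; the function name and the sample labels are assumptions.

```python
def evaluate(gold, predicted):
    """Return accuracy and macro-averaged F1 for two equal-length label lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")

    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    f1_scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)

    # Macro averaging gives every label the same weight, regardless of frequency.
    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Invented sample data: four items of a binary task.
scores = evaluate([1, 0, 1, 1], [1, 0, 0, 1])
```

Reporting both numbers side by side is useful because accuracy alone can hide weak performance on a rare label, which macro F1 makes visible.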

Research Group Leader:

Enikő Héja, PhD
Email: urwn.ravxb@alghq.uha-era.uh
Phone: +36 (1) 3429372 / 6043
Current national tenders | Start – end
Supporting the digital sustainability of the Hungarian language | 2020.12.01. – 2026.11.30.
Digital support for the Hungarian language in the service of Hungarian science | 2020.12.01. – 2026.11.30.
Major closed international tenders | Start – end
CURLICAT: Curated Multilingual Language Resources for CEF AT | 2020.06.01. – 2022.11.30.
MARCELL: Multilingual Resources for CEF.AT in the Legal Domain | 2018.10.01. – 2021.03.31.
Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams (TrendMiner) | 2013 – 2014
Innovative Networking in Infrastructure for Endangered Languages (INNET) | 2011 – 2013
European Media Monitor – Hungarian module | 2012
Central and South-East European Resources (CESAR) | 2011 – 2013
Internet Translators for all European Languages (iTranslate4) | 2010 – 2012

Major closed national tenders | Start – end
Building an open, integrated Hungarian language technology research infrastructure | 2015.01.01. – 2016.06.30.
Hungarian Generative Diachronic Syntax 2 | 2015 – 2019
helyesírá – Online spelling consultation portal | 2008 – 2013
Disclosure of BSI-2 | 2008 – 2012
Dictionary of Hungarian Verb Phrase Constructions | 2008 – 2010
Building of the Hungarian WordNet ontology and its applications in information extraction systems (Hungarian WordNet) | 2005 – 2007

*A detailed list of the closed tenders can be found here.

Enikő HÉJA
research group leader, research fellow

Institute for Language Technologies and Applied Linguistics

junior research fellow

Institute for Language Technologies and Applied Linguistics

director general, research professor

Institute for Language Technologies and Applied Linguistics

IT director

Institute for Language Technologies and Applied Linguistics

IT specialist (Linux / Unix Supervisor, Devops architect)

Institute for Language Technologies and Applied Linguistics

research fellow

Institute for Language Technologies and Applied Linguistics

László János LAKI
research fellow

Institute for Language Technologies and Applied Linguistics

research fellow

Institute for Language Technologies and Applied Linguistics

junior research fellow

Institute for Language Technologies and Applied Linguistics

deputy director-general, director, senior research fellow

Institute for Language Technologies and Applied Linguistics

Zijian Győző YANG
research fellow

Institute for Language Technologies and Applied Linguistics

Building a data infrastructure by correcting OCR errors in curated texts

Producing language models requires a corpus of billions of words, the most obvious source of which is the Internet. However, most of the texts available there are of uncertain origin and quality, often with little metadata. As part of the cooperation with the Arcanum Database Publisher, we have a collection of curated texts of approximately nine billion words at our disposal. This collection is the result of the publisher’s many years of optical character recognition (OCR) scanning. Yet, ...

Building and publishing benchmark corpora

One of the prerequisites for keeping pace with cutting-edge NLP is the standardized measurement of development results for the Hungarian language. This requires a whole series of test databases, so-called benchmark corpora, created according to a strict methodology, which serve as a reference for measuring the level of development of new technologies and tools. However, benchmark databases serve more than just the purpose of comparing the performance of different language models. Their important new rol ...

Development of language-centered artificial intelligence (language models)

The neural language models that have become dominant over the last decade have brought about a paradigm shift in language technology as a whole. Creating these general-purpose language models requires extraordinary computing capacity and enormous amounts of data. Our main task is to adapt world-class language models to the Hungarian language and make them available to the Hungarian language technology sector. The latest type of large-scale language models have already taken a significant step t ...

Machine translation

Machine translation is one of the important fields of language technology. Transformer-based language representation, today’s leading technology, was first created in the field of machine translation. From there, it became a defining tool not only of NLP, but also of speech processing and even image recognition. The aim of the research is to further develop the transformer-based machine translation system created for the English-Hun ...