Language Technology Research Group • Research area
The predecessor of the Language Technology Research Group, the Corpus Linguistics Department, was founded in 1997 as formal recognition of the language technology research and development work that had already been going on for several years. Since then, the research group has gained significant experience in many areas of language technology: it has achieved outstanding results in building language resources, developing language technology tools and, more recently, in training language models.
In terms of linguistic resources, the first version of the Hungarian National Corpus (HNC or MNSZ) must be highlighted: a 187.6-million-word annotated, representative text corpus, completed in 2005, which also includes language variants from territories outside the present borders of Hungary. It was the first major database of its kind for the Hungarian language. The improved version, published in 2014, not only contains almost ten times more words (1.5 billion) but also covers new, important text types, such as social media. Furthermore, the quality of the linguistic analysis improved considerably compared to the previous version.
Over the years, the members of the research group have developed a number of tools. One of the most significant of these is a spelling-support tool created to help language users follow the complex rules of the Hungarian orthographic system with automatic aids. A text-analysis tool and its improved, modularized version, which enable extensive analysis of texts, should also be emphasised.
The Hungarian version of the WordNet lexical database, created with the cooperation of the research group, is another important resource. It was the result of a three-year process and depicts the Hungarian vocabulary from a semantic point of view: it includes both synonymous words and the relations between them.
The scientific paradigm shifts that took place in 2013 and then in 2018 had a serious impact on the work carried out in the research group. Following the leading international research, we created Hungarian versions of the neural language models developed primarily for English. Initially, this meant static word embeddings, but by now we also have several transformer-based contextual language models. Examples include HILBERT, a BERT-Large language model, and a larger model with 7 billion parameters.
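The shift from static word embeddings to contextual models can be illustrated with a toy sketch (the vectors and the trivial stand-in for a model below are invented for illustration, not our actual systems): a static embedding returns the same vector for a word form in every sentence, while a contextual model's output depends on the surrounding words.

```python
# Toy illustration: static vs. contextual word representations.
# All vectors and rules below are made up for the sketch.

# A static embedding assigns one vector per word form, regardless of context.
static = {
    "bank": [0.2, 0.9],  # one vector must cover both "river bank" and "money bank"
}

def static_embed(word, sentence):
    # The sentence is ignored: the same vector is returned every time.
    return static[word]

def contextual_embed(word, sentence):
    # A contextual model (e.g. a BERT-style transformer) conditions on the
    # whole sentence. Here a trivial rule fakes that behaviour.
    if "river" in sentence:
        return [0.1, 0.0]
    return [0.9, 0.8]

s1 = ["the", "river", "bank"]
s2 = ["the", "bank", "loan"]
assert static_embed("bank", s1) == static_embed("bank", s2)          # identical
assert contextual_embed("bank", s1) != contextual_embed("bank", s2)  # differs
```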
An important element of the research group's current activity is the exploration of new paradigms for teaching and using language models, such as zero-shot and few-shot learning or prompt programming. Another important research direction is improving the quality of machine translation with transformer-based neural networks. Our applications built on language models can be tried on this demo page.
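As an illustration of prompt programming, a few-shot prompt can be assembled by concatenating a task description, a handful of labelled examples and the new query; the task, labels and examples below are invented for the sketch.

```python
# Hedged sketch: assembling a few-shot classification prompt for a
# generative language model. Task and examples are illustrative only.

def build_few_shot_prompt(task_description, examples, query):
    """Concatenate the task description, labelled examples and the new query."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")          # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each text as positive or negative.",
    [("Great service!", "positive"), ("Terrible food.", "negative")],
    "I loved the concert.",
)
print(prompt)
```

In the zero-shot variant the `examples` list is simply empty: the model must solve the task from the description alone.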
Another priority of the research group is to create test databases for Hungarian, so-called benchmark corpora, which, when embedded in a web service, enable a simple yet multi-criteria evaluation of neural network-based technologies, as well as the comparison and publication of the results. For this purpose, a Hungarian benchmark suite was created, based on the GLUE and SuperGLUE test database infrastructure developed for the English language.
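A GLUE-style evaluation service essentially computes a score per task and an aggregate across tasks; a minimal sketch (with invented task names, labels and predictions) might look like this:

```python
# Hedged sketch of multi-criteria benchmark scoring in the GLUE style:
# per-task accuracy plus a macro average. All data below is invented.

def accuracy(gold, pred):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def evaluate(benchmark, predictions):
    """benchmark: task name -> gold labels; predictions: task name -> outputs."""
    scores = {task: accuracy(gold, predictions[task])
              for task, gold in benchmark.items()}
    scores["macro_avg"] = sum(scores.values()) / len(scores)
    return scores

benchmark = {"sentiment": [1, 0, 1, 1], "nli": ["ent", "neu", "con"]}
predictions = {"sentiment": [1, 0, 0, 1], "nli": ["ent", "neu", "neu"]}
print(evaluate(benchmark, predictions))
```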
Research Group Leader:
Phone: +36 (1) 3429372 / 6043

Current national tenders (start – end):
- 2020.12.01. – 2026.11.30.
- 2020.12.01. – 2026.11.30.

Major closed international tenders (start – end):
- 2020.06.01. – 2022.11.30.
- 2018.10.01. – 2021.03.31.
- 2013 – 2014
- 2011 – 2013
- 2011 – 2013
- 2010 – 2012
Language Technology Research Group • Staff
Language Technology Research Group • Research
Building a data infrastructure by correcting OCR errors in curated texts
The production of language models requires a corpus of billions of words, the most obvious source of which is the Internet. However, most of the texts available there are of uncertain origin and quality, often with little metadata. As part of our cooperation with the Arcanum Database Publisher, we have at our disposal a collection of curated texts of approximately nine billion words, the result of the publisher's many years of OCR (Optical Character Recognition) scanning. Yet, ...
Building and publishing benchmark corpora
One of the prerequisites for keeping up with cutting-edge NLP is the standardized measurement of development results for the Hungarian language. This requires a whole series of test databases, so-called benchmark corpora, created according to a strict methodology, which serve as a reference for measuring the level of development of new technologies and tools. However, benchmark databases serve more than just the purpose of comparing the performance of different language models. Their important new role ...
Development of language-centered artificial intelligence (language models)
Neural language models, which have become dominant in the last decade, have brought about a paradigm shift in language technology as a whole. The creation of these general-purpose language models requires extraordinary computing capacity and enormous amounts of data. Our main task is to adapt world-class language models to the Hungarian language and make them available to the Hungarian language technology sector. The latest type of large-scale language models has already taken a significant step t ...
One of the important fields of language technology is machine translation. Transformer-based language representation, today's market-leading technology, was first created in the field of machine translation. Starting from there, it became not only the defining tool of NLP but also a defining tool of speech processing and even image recognition. The aim of the research is to further develop the transformer-based machine translation system created for the English–Hungarian ...
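The core operation of the transformer architecture referenced above is scaled dot-product attention; a framework-free sketch with toy two-dimensional vectors (invented for illustration) shows the idea: each query forms a weighted average of the value vectors, with weights derived from query–key similarity.

```python
import math

# Pedagogical sketch of scaled dot-product attention, the transformer's
# core operation. Pure Python; vectors are plain lists of floats.

def softmax(xs):
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """queries/keys/values: lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of the query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# One query attending over two key/value pairs:
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))
```

Because the query matches the first key more closely, the output lies nearer the first value vector than the second.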