Language Technology Research Group - Research Centre for Linguistics

Research area

The predecessor of the Language Technology Research Group, the Corpus Linguistics Department, was established in 1997 in formal recognition of several years of ongoing research and development work in language technology. Since then, the research group has gained broad experience in building linguistic resources, developing language technology tools, and, more recently, training large language models (LLMs).

The scientific paradigm shifts of the 2010s profoundly affected the work of our research group. Following the most influential international advances, we developed Hungarian versions of neural language models originally designed for English. Starting with static word embeddings, our scope has steadily expanded to include numerous transformer-based and generative contextual language models for Hungarian. Notably, we have developed HILBERT (a BERT-Large language model), PULI-GPT-3SX (the Hungarian version of GPT-3), and most recently, PULI LlumiX 32K (a Llama-2 model fine-tuned for Hungarian). One of our most important recent initiatives is the development of instruction-following models, which has resulted in the ParancsPULI and PULI LlumiX 32K Instruct models. Specific applications of these language models can be tested on our demo page.

The development of high-quality LLMs requires multi-faceted test datasets in Hungarian, offering comprehensive information on the models’ accuracy. Therefore, the creation of Hungarian test datasets, so-called benchmark corpora, is a key focus of our research. These datasets, integrated into a web service, simplify the intricate evaluation of neural network-based technologies and enable easy comparison and publication of results. To this end, we’ve developed the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU), modeled after the infrastructure of the GLUE and SuperGLUE test databases for English. Additionally, we’re currently in the process of developing benchmark datasets tailored for generative language models.
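As a rough illustration of what comparing models on a benchmark involves, the sketch below scores the predictions of two models against gold labels and ranks them by accuracy. It is a minimal example under stated assumptions: the function names, model names, and labels are hypothetical and do not reflect the actual HuLU tasks, data format, or web service.

```python
def accuracy(gold, predicted):
    """Share of predictions that match the gold labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def rank_models(gold, predictions_by_model):
    """Return (model name, accuracy) pairs, best model first."""
    scores = {
        name: accuracy(gold, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical binary classification task with four test items.
gold_labels = [1, 0, 1, 1]
predictions = {
    "model_b": [0, 0, 0, 1],  # hypothetical system, 2 of 4 correct
    "model_a": [1, 0, 0, 1],  # hypothetical system, 3 of 4 correct
}
ranking = rank_models(gold_labels, predictions)
```

Real benchmark kits usually report several metrics per task; accuracy is used here only because it is the simplest to read.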

It has become clear in recent years that a vast amount of linguistic data is essential for LLMs to grasp the fundamental patterns of language. Consequently, data is becoming increasingly valuable in the digital realm, as it enables machine learning algorithms to learn, predict, and make informed decisions. A balanced corpus that covers a diverse array of linguistic phenomena equips language models to handle texts across different subjects and styles. Thus, the quantity and quality of accessible linguistic data directly influence the effectiveness and adaptability of LLMs.

The Language Technology Research Group has nearly two decades of experience in constructing corpora. The first major textual database for Hungarian, the Hungarian National Corpus (MNSZ), was finalized in 2005. Consisting of 187.6 million words, MNSZ includes varieties of Hungarian from beyond the borders. An enhanced version of the Hungarian National Corpus, MNSZ2, was released in 2014. MNSZ2 not only contains about eight times the amount of text (1.5 billion words) but also covers new and important text types, such as social media. Furthermore, the quality of linguistic analysis has significantly improved compared to its predecessor.

The importance of large volumes of high-quality data motivates our ongoing corpus construction efforts: as part of the Science for the Hungarian Language National Program (Tudomány a magyar nyelvért nemzeti program), our goal is to create MNSZ3, an extended version of MNSZ2 that will contain 10 billion words while preserving the variety of genres and dialects in MNSZ2.

Another key objective is to collect textual data directly for pretraining LLMs. To this end, the Hungarian-language textual content of Common Crawl is downloaded and preprocessed. Common Crawl is a nonprofit organization that provides access to large amounts of textual content by regularly crawling websites and making the data available via Amazon Web Services.
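The preprocessing step is not described in detail here, so the following is only a minimal sketch of two operations such a pipeline commonly includes: keeping documents that look Hungarian and dropping exact duplicates. The stopword list, the threshold, and the function names are illustrative assumptions, not the group's actual method; a production pipeline would use a proper language identifier and near-duplicate detection.

```python
import hashlib
import re

# Small, illustrative set of frequent Hungarian function words (an assumption).
HU_STOPWORDS = {"az", "és", "hogy", "nem", "egy", "meg", "de", "van", "ez", "volt"}


def looks_hungarian(text, threshold=0.15):
    """Heuristic: the share of tokens that are Hungarian stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in HU_STOPWORDS)
    return hits / len(tokens) >= threshold


def preprocess(documents):
    """Normalize whitespace, keep Hungarian-looking texts, drop exact duplicates."""
    seen_hashes = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # collapse runs of whitespace
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes or not looks_hungarian(text):
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept


sample = [
    "Ez egy rövid mondat, és nem is olyan hosszú.",
    "Ez egy rövid  mondat, és nem is olyan hosszú.",  # duplicate after normalization
    "This is an English sentence about a dog.",
]
cleaned = preprocess(sample)
```

In the sample, only the first Hungarian sentence is kept: the second is identical once whitespace is normalized, and the English sentence contains none of the listed stopwords.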

However, we also focus on more normative, curated texts. To this end, in collaboration with the Library and Information Centre of the Hungarian Academy of Sciences, and also as part of the Science for the Hungarian Language National Program, the textual content of the REAL repository is being processed with NLP tools. Our main objective is to make a massive volume of scientific publications in PDF format more searchable by processing the content of the PDF files and providing automatically extracted metadata, such as authors, affiliations, named entities, and terminology. We hope that processing the REAL repository's content will not only help researchers in various fields use the collection but also benefit any interested, knowledgeable reader.

Over the years, our research group has developed numerous tools. One of the most significant is the Spelling Advisory Portal (helyesiras.mta.hu), created to automatically assist with the normative spelling of Hungarian. Supported by the Hungarian Academy of Sciences, the portal was launched in 2013. While it was cutting-edge at the time, it has since become outdated and requires renovation both in terms of its software platform and user experience. This renovation work is currently underway.

Another important tool, developed in collaboration with numerous partner institutions, is the e-magyar Digital Language Processing Toolchain and its enhanced, modularized successor, emtsv, which enable comprehensive analysis of natural language texts in Hungarian.

The research group also contributed to HuWordNet, the Hungarian version of the Princeton WordNet lexical database. HuWordNet, the result of three years of work, maps the Hungarian vocabulary onto a hierarchical structure according to the meaning of lexical items. First, words are organized into synonym sets, then the synonym sets are ordered based on various semantic relations.
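The two-step organization described above (synonym sets first, then semantic relations between them) can be pictured with a small data structure. This is a minimal sketch covering only one relation, the link to a more general concept; the class, the function, and the example words are illustrative assumptions and are not taken from HuWordNet's actual data or format.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Synset:
    """A set of synonymous words that share one meaning."""
    lemmas: list[str]
    hypernym: Optional[Synset] = None  # the more general concept, if any


def hypernym_chain(synset: Synset) -> list[list[str]]:
    """Walk from a synset up to its most general ancestor."""
    chain = []
    current: Optional[Synset] = synset
    while current is not None:
        chain.append(current.lemmas)
        current = current.hypernym
    return chain


# Illustrative entries: "kutya" and "eb" both mean "dog".
animal = Synset(["állat"])                     # animal
mammal = Synset(["emlős"], hypernym=animal)    # mammal
dog = Synset(["kutya", "eb"], hypernym=mammal)

chain = hypernym_chain(dog)
```

Here `chain` lists the synonym sets from the specific to the general: the dog synset, then mammal, then animal.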

The research group was also involved in machine translation. Our basic objective was to extend the transformer-based machine translation system created for the English-Hungarian language pair in a multilingual direction, enabling translation not only between two languages but from multiple input languages to one or more target languages. Improving translation quality in existing systems was also among our top priorities, especially with Hungarian as the target language.

Research Group Leader:

Noémi Ligeti-Nagy, PhD
Email: ligeti-nagy.noemi@nytud.elte.hu
Phone: +36 (1) 3429372 / 6118

Current international project proposals	Start – end
Alliance for Language Technologies European Digital Infrastructure Consortium	2024.05.27. –

Current national project proposals	Start – end
Supporting the digital sustainability of the Hungarian language	2020.12.01. – 2026.11.30.
Digital support for the Hungarian language and Hungarian in the service of science	2020.12.01. – 2026.11.30.

Major closed international project proposals	Start – end
CURLICAT: Curated Multilingual Language Resources for CEF AT	2020.06.01. – 2022.11.30.
MARCELL: Multilingual Resources for CEF.AT in the Legal Domain	2018.10.01. – 2021.03.31.
Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams (TrendMiner)	2013 – 2014
Innovative Networking in Infrastructure for Endangered Languages (INNET)	2011 – 2013
European Media Monitor – Hungarian module	2012
Central and South-East European Resources (CESAR)	2011 – 2013
Internet Translators for all European Languages (iTranslate4)	2010 – 2012

Major closed national project proposals	Start – end
e-magyar.hu: Open, integrated Hungarian language technology research building infrastructure.	2015.01.01. – 2016.06.30.
Hungarian Generative Diachronic Syntax 2	2014 – 2018
helyesiras.mta.hu – Spelling Advisory Portal	2008 – 2013
Disclosure of BSI-2	2008 – 2012
Dictionary of Hungarian Verb Phrase Constructions	2008 – 2010
Building of the Hungarian WordNet ontology and its applications in information extraction systems (Hungarian WordNet)	2005 – 2007

*A detailed list of the closed tenders can be found here.