Language Technology Research Group•Research area
The predecessor of the Language Technology Research Group, the Corpus Linguistics Department, was established in 1997 as a formal recognition of several years of ongoing research and development work in the field of language technology. Since then, the research group has accumulated widespread experience across various areas, such as building linguistic resources, developing language technology tools, and more recently, in training large language models (LLMs).
In terms of linguistic resources, the first version of the Hungarian National Corpus (HNC or MNSZ) must be mentioned. The significant scientific paradigm shifts in the 2010s profoundly impacted the activity of our research group. By focusing on the most influential international advancements, we developed Hungarian versions of neural language models initially designed for English. Starting with static word embeddings, our scope has continuously expanded to include numerous transformer-based and generative contextual language models for Hungarian. Notably, we have developed HILBERT (a BERT-Large language model), PULI-GPT-3SX (the Hungarian version of GPT-3), and most recently, PULI LlumiX 32K (a Hungarian fine-tuned Llama-2 model). One of our most important recent initiatives is developing instruction-following models, resulting in the creation of the ParancsPULI and PULI LlumiX 32K Instruct models. Specific applications related to these language models can be tested on our demo page.
The development of high-quality LLMs requires multi-faceted test datasets in Hungarian, offering comprehensive information on the models’ accuracy. Therefore, the creation of Hungarian test datasets, so-called benchmark corpora, is a key focus of our research. These datasets, integrated into a web service, simplify the intricate evaluation of neural network-based technologies and enable easy comparison and publication of results. To this end, we’ve developed the Hungarian Language Understanding Evaluation Benchmark Kit (HuLU), modeled after the infrastructure of the GLUE and SuperGLUE test databases for English. Additionally, we’re currently in the process of developing benchmark datasets tailored for generative language models.
In recent years, it has become clear that a vast amount of linguistic data is essential for LLMs to grasp the fundamental patterns of language. Consequently, data is becoming increasingly valuable in the digital realm, as it enables machine learning algorithms to learn, predict, and make informed decisions. A balanced corpus, encompassing a diverse array of linguistic phenomena, equips language models to handle texts across different subjects and styles. Thus, the quantity and quality of accessible linguistic data directly influence the effectiveness and adaptability of LLMs.
The Language Technology Research Group has nearly two decades of experience in constructing corpora. The first major textual database for Hungarian, the Hungarian National Corpus (MNSZ), was finalized in 2005. Consisting of 187.6 million words, MNSZ includes varieties of Hungarian from beyond the borders. An enhanced version of the Hungarian National Corpus, MNSZ2, was released in 2014. MNSZ2 not only contains nearly ten times the amount of text (1.5 billion words) but also covers new and important text types, such as social media. Furthermore, the quality of linguistic analysis has significantly improved compared to its predecessor.
The importance of large amounts of high-quality data motivates our ongoing corpus construction efforts: as part of the Science for the Hungarian Language National Program (Tudomány a magyar nyelvért nemzeti program), our goal is to create MNSZ3, the extended version of MNSZ2, which will include 10 billion words while preserving the variety of genres and dialects in MNSZ2.
Another key objective is to collect textual data specifically for pretraining LLMs. To this end, the Hungarian-language textual content of Common Crawl is downloaded and preprocessed. Common Crawl is a nonprofit organization that provides access to large amounts of textual content by regularly crawling websites and making the data available via Amazon Web Services.
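As an illustration of what one preprocessing step on crawled text might look like, the following minimal Python sketch keeps only documents that appear to be Hungarian. It is not the group's actual pipeline: the stopword list, the `hungarian_score` and `keep_hungarian` helpers, and the 0.2 threshold are all assumptions made for this example, and a production system would more likely rely on a trained language identifier.

```python
import re

# Hypothetical, deliberately tiny list of frequent Hungarian function words.
# It is an assumption for this sketch, not a resource used by the group.
HU_STOPWORDS = {"az", "és", "hogy", "nem", "egy", "van", "meg", "de", "ez", "volt"}

# \w matches Unicode letters in Python 3, so accented characters are kept.
TOKEN_RE = re.compile(r"\w+")


def hungarian_score(text: str) -> float:
    """Return the share of tokens that are frequent Hungarian function words."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in HU_STOPWORDS)
    return hits / len(tokens)


def keep_hungarian(docs, threshold=0.2):
    """Keep documents whose stopword share reaches the assumed threshold."""
    return [d for d in docs if hungarian_score(d) >= threshold]


if __name__ == "__main__":
    docs = [
        "Ez egy mondat, és nem is rövid, hogy a teszt jó legyen.",
        "This is an English sentence about the weather.",
    ]
    for doc in keep_hungarian(docs):
        print(doc)
```

A stopword ratio is cheap to compute over very large crawls, but it is only a rough filter: short texts and closely mixed-language pages would need a more robust classifier.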
However, we also focus on more normative, curated texts. To this end, in collaboration with the Library and Information Centre of the Hungarian Academy of Sciences, also as part of the Science for the Hungarian Language National Program, the textual content of the REAL repository is being processed with NLP tools. Our main objective is to make a massive volume of scientific publications in PDF format more searchable by processing the content of the PDF files and providing automatically extracted metadata, such as authors, affiliations, named entities, and terminology. We hope that processing the REAL repository's content will help not only researchers in various fields use the collection but also any interested non-specialist.
Over the years, our research group has developed numerous tools. One of the most significant is the Spelling Advisory Portal (helyesiras.mta.hu), created to automatically assist with the normative spelling of Hungarian. Supported by the Hungarian Academy of Sciences, the portal was launched in 2013. While it was cutting-edge at the time, it has since become outdated and requires renovation both in terms of its software platform and user experience. This renovation work is currently underway.
Another important tool, developed in collaboration with numerous partner institutions, is the e-magyar Digital Language Processing Toolchain and its enhanced, modularized successor, emtsv, which enable comprehensive analysis of natural language texts in Hungarian.
The research group also contributed to HuWordNet, the Hungarian version of the Princeton WordNet lexical database. HuWordNet, the result of three years of work, maps the Hungarian vocabulary onto a hierarchical structure according to the meaning of lexical items. First, words are organized into synonym sets, then the synonym sets are ordered based on various semantic relations.
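The two-level organization described above (words grouped into synonym sets, and synonym sets linked by semantic relations) can be sketched as a small data structure. The following Python example is only an illustration: the `Synset` class, the `hypernym_chain` helper, and the toy entries are invented for this sketch and are not taken from HuWordNet or its file format. It models a single relation type, the link from a synset to a more general one.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words plus links to more general synsets."""
    synset_id: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # ids of more general synsets


def hypernym_chain(synsets, start_id):
    """Follow the first hypernym link upwards until a root is reached."""
    chain = [start_id]
    seen = {start_id}
    current = synsets[start_id]
    while current.hypernyms:
        parent_id = current.hypernyms[0]
        if parent_id in seen:  # guard against cycles in malformed data
            break
        chain.append(parent_id)
        seen.add(parent_id)
        current = synsets[parent_id]
    return chain


# Toy data, invented for illustration (not taken from HuWordNet).
SYNSETS = {
    "animal": Synset("animal", ["állat"]),
    "dog": Synset("dog", ["kutya", "eb"], hypernyms=["animal"]),
    "puppy": Synset("puppy", ["kiskutya", "kölyökkutya"], hypernyms=["dog"]),
}

if __name__ == "__main__":
    print(hypernym_chain(SYNSETS, "puppy"))
```

Storing relations as lists of synset identifiers keeps the structure easy to extend with further relation types, at the cost of one dictionary lookup per step when walking the hierarchy.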
The research group has also been involved in machine translation. Our basic objective was to extend the transformer-based machine translation system created for the English-Hungarian language pair in a multilingual direction, enabling translation not only between two languages but from multiple input languages to one or more target languages. Improving translation quality in existing systems was also among our top priorities, especially with Hungarian as the target language.
Research Group Leader: Enikő Héja, PhD
Email: urwn.ravxb@alghq.uha-era.uh
Phone: +36 (1) 3429372 / 6043
Current international project proposals | Start – end |
Alliance for Language Technologies European Digital Infrastructure Consortium | 2024.05.27. – |
Current national project proposals | Start – end |
Supporting the digital sustainability of the Hungarian language | 2020.12.01. – 2026.11.30. |
Digital support for the Hungarian language and Hungarian in the service of science | 2020.12.01. – 2026.11.30. |
Major closed international project proposals | Start – end |
CURLICAT: Curated Multilingual Language Resources for CEF AT | 2020.06.01. – 2022.11.30. |
MARCELL: Multilingual Resources for CEF.AT in the Legal Domain | 2018.10.01. – 2021.03.31. |
Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams (TrendMiner) | 2013 – 2014 |
Innovative Networking in Infrastructure for Endangered Languages (INNET) | 2011 – 2013 |
European Media Monitor – Hungarian module | 2012 |
Central and South-East European Resources (CESAR) | 2011 – 2013 |
Internet Translators for all European Languages (iTranslate4) | 2010 – 2012 |
Major closed national project proposals | Start – end |
e-magyar.hu: Open, integrated Hungarian language technology research building infrastructure. | 2015.01.01. – 2016.06.30. |
Hungarian Generative Diachronic Syntax 2 | 2014 – 2018 |
helyesiras.mta.hu – Spelling Advisory Portal | 2008 – 2013 |
Disclosure of BSI-2 | 2008 – 2012 |
Dictionary of Hungarian Verb Phrase Constructions | 2008 – 2010 |
Building of the Hungarian WordNet ontology and its applications in information extraction systems (Hungarian WordNet) | 2005 – 2007 |
*A detailed list of the closed tenders can be found here.
Language Technology Research Group•Staff
Institute for Language Technologies and Applied Linguistics
Language Technology Research Group•Research
Building a data infrastructure by correcting OCR errors in curated texts
The production of language models requires a corpus of billions of words, the most obvious source of which is the Internet. However, most of the texts available there are of uncertain origin and quality, often with little metadata. As part of our cooperation with the Arcanum Database Publisher, we have a collection of curated texts of approximately nine billion words at our disposal. This collection is the result of the publisher's many years of optical character recognition (OCR) scanning. Yet, ...
Building and publishing benchmark corpora
One of the prerequisites for keeping pace with cutting-edge NLP is the standardized measurement of development results for the Hungarian language. This requires a whole series of test databases, so-called benchmark corpora, created according to a strict methodology, which serve as a reference for measuring the level of development of new technologies and devices. However, benchmark databases serve more than just the purpose of comparing the performance of different language models. Their important new rol ...
Development of language-centered artificial intelligence (language models)
The neural language models that have become dominant in the last decade have brought about a paradigm shift in language technology as a whole. The creation of these general-purpose language models requires extraordinary computing capacity and enormous amounts of data. Our main task is to adapt world-class language models to the Hungarian language and make them available to the Hungarian language technology sector. The latest type of large-scale language models have already taken a significant step t ...
Language Technology Research Group•Contacts
Partner institutions
On May 27, 2024, Hungary was elected as a member of the European Digital Infrastructure Consortium Alliance for Language Technologies (ALT-EDIC). The representation of Hungary will be handled by the Research Institute for Linguistics of HUN-REN, commissioned by the Ministry of Culture and Innovation.
Tamás Váradi has been the secretary of EFNIL since 2010, and the institute has handled the organization's secretarial tasks since then.
The Research Center organized the European Language Resource Coordination (ELRC) workshop in Hungary, where we engage in dialogue with industry and government stakeholders about the state and prospects of Hungarian language technology. Developers and users of language technology share their experiences, needs, and ideas on how language technology solutions can support the digital interactions of a multilingual Europe.
In an ongoing collaboration between NYTK and Indamedia Sales Kft., NYTK has received and processed the entire content of the news portal index.hu. Active negotiations are underway to expand the collaboration to apply Hungarian-language artificial intelligence in publishing work.
With the help of language technology tools, the material of the MTA Library REAL repository can be made searchable more efficiently than it is at present. Processing the content of PDF texts is already underway: we are making the content of a massive volume of scientific publications easily searchable by automatically extracting metadata (such as authors, affiliations, named entities, and terminology) from the PDF files.
In a successful collaboration, NYTK and MNL processed more than 600,000 personal records of Hungarian prisoners of war deported to camps in the Soviet Union. In an ongoing joint project, they are processing a database containing approximately 5 million index cards collected for the Comprehensive Dictionary of the Hungarian Language, about 50% of which are handwritten. The optical character recognition of Hungarian handwriting has opened a new dimension in the development of artificial intelligence.
The Research Group provides language technology assistance for processing the materials of the National Széchényi Library (OSZK) in exchange for access to the OSZK's web harvesting and other digital collections.
Advisory services for the development of T-COM's artificial intelligence-based applications.