Digital support for the Hungarian language in the service of Hungarian-language science

Funding institution: Hungarian Academy of Sciences

ID: TMNYNP 2
Domestic tender; consortial tender

The Hungarian language, as a carrier of our thousand-year-old culture and a central element of our national identity, has proven its vitality over the centuries. However, in the era of globalization and pervasive digital interaction, our mother tongue faces new challenges. The threat is not the extinction of the language but the risk that, lacking technological support, Hungarian could be marginalized in the digital realm: for example, if certain services are not available in Hungarian on mobile devices, Hungarian speakers may be forced to use a foreign language instead. The use of Hungarian is already declining in some areas. One of these is science, which fundamentally depends on the rapid and widespread dissemination of results. The use of a lingua franca, currently English, is an inevitable and unstoppable development in the global scientific community. Nevertheless, we cannot give up communicating scientific results and teaching in our mother tongue.

NYTK has significant experience in both building text corpora and digitally processing documents. Notable corpus-building projects include the Hungarian National Corpus (versions 1 and 2) and the MARCELL project. The current program, however, emphasizes aspects other than those of classic corpus-building (such as representativeness and balance): the extraction and organization of metadata, and the design and construction of appropriate structures. Materials held in public collections are receiving increasing attention as sources of texts for corpus-building (see, for example, the event dedicated to this topic: CLARIN and Libraries Workshop, The Hague, May 9-10, 2022).

Among domestic collections of textual documents, REAL, the repository of the Library of the Hungarian Academy of Sciences, holds a prominent position in terms of both the quantity of texts it contains and the scope of its collection (scientific, peer-reviewed texts). The Library and Information Centre of the Hungarian Academy of Sciences (MTA KIK) was among the first to start building a domestic repository and has significant experience in collecting and managing materials. In this respect, we participate in various international collaborations (e.g., Confederation of Open Access Repositories, European Open Science Cloud). The experience of MTA KIK with Persistent Identifiers and with the development of national bibliographic databases is relevant to the application of the tools to be developed in the project. An important precedent for this proposal is the NYTI MATRICA project, which aimed to extract citations from a corpus of scientific journal articles. Relevant prior work on advanced forms of scientific communication includes András Holl’s work on the technical development of the Information Bulletin on Variable Stars.

This proposal has a dual purpose: it supports scientific communication in general through text mining methods, and within this, it provides substantial support for Hungarian-language science by processing Hungarian scientific publications and making them available. Among semantic search capabilities, one example is the development of search by proper names. Our plans include developing procedures that are innovative or new in approach and that have either not been implemented at all or not yet been implemented in scientific services. Primarily, we plan to extract modern, born-digital, peer-reviewed content that is well suited as training material. Training the appropriate computational linguistic tools (which will likely proceed iteratively, involving increasingly broad content) provides an opportunity to use the developed tools to improve and enrich the MTA KIK databases. We plan to semantically segment texts, determine subject areas, extract and improve metadata, correct textual errors, extract terminology, conduct broad text mining, and create intelligent, semantic search tools. The language technology tools developed within the project will thus make it possible to improve and enrich the content of the REAL repository and to enhance the repository services provided to users (researchers). By applying language technology, we aim to make the MTA KIK material more efficiently searchable within the framework of cooperation between NYTK and MTA KIK. An important outcome of the project could be the addition to the MTMT database of currently missing references to Hungarian journals.

Participating institutions

Library and Information Centre of the Hungarian Academy of Sciences
