MONDILEX Book
MONDILEX
Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources

Institute of Mathematics and Informatics, Bulgarian Academy of Sciences

Ludmila Dimitrova, Violetta Koseska, Radovan Garabík, Tomaž Erjavec, Leonid Iomdin, Volodymyr Shyrokov

CONCEPTUAL SCHEME FOR A RESEARCH INFRASTRUCTURE SUPPORTING DIGITAL RESOURCES IN SLAVIC LEXICOGRAPHY

Sofia 2010

The volume is the outcome of the efforts of the participants of the project GA212938 MONDILEX Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources and the financial support of the European Commission: 7th Framework Programme Capacities - Research Infrastructures (Design studies for research infrastructures in all Sciences and Technologies fields).

© Authors, © Institute of Mathematics and Informatics, BAS 2010
ISBN 978-954-8986-33-5

TABLE OF CONTENTS

Foreword
Introduction
1. Language Resources in a Research Infrastructure for Slavic Lexicography
  1.1 Lexical database
    1.1.1 Slovak morphology database
    1.1.2 Multilingual Corpus Linguistics terminology database
    1.1.3 Slovak-Czech Lexical Database
    1.1.4 Paremiography database
    1.1.5 Bulgarian-Polish Lexical Database
  1.2 Dictionaries
    1.2.1 Dictionary of Slovak Collocations
    1.2.2 Bulgarian-Polish Dictionary
    1.2.3 Dictionaries of Ukraine on-line
  1.3 Corpora
    1.3.1 Monolingual corpus SynTagRus
    1.3.2 Multilingual parallel corpora
  1.4 Grammars
2. Standardisation of Slavic Lexicographic Resources and their Metadata
  2.1 Morphosyntactic Annotation in Slavic Digital Lexicography
  2.2 Corpus Encoding
  2.3 Machine readable dictionaries
  2.4 Lexical databases (on the example of Slovene)
  2.5 Universal networking language
3. Software Environments for Digital Lexicography
  3.1 Conceptual Modeling of Services for the Bilingual Lexicographic Systems
  3.2 Integration with Other Services of the Lexicographic Systems
  3.3 Software Environments for Creating Digital Dictionaries
  3.4 Software Environments for Creating Digital Corpora
4. Technological Platform for Research Infrastructure for Digital Language Resources and Research for Slavic Lexicography
  4.1 Research Infrastructure for Digital Lexicography
  4.2 Virtual lexicographic system – technological platform for research e-infrastructure for digital lexicography
  4.3 Grid Infrastructure Requirements for Supporting Research Activities in Digital Lexicography
  4.4 Case studies
  4.5 Recommendations
5. Concluding remarks
Acknowledgments
Bibliography

Foreword

This volume is the outcome of the efforts of the participants in the project GA212938 MONDILEX Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources and the financial support of the European Commission: 7th Framework Programme Capacities - Research Infrastructures (Design studies for research infrastructures in all Sciences and Technologies fields).

The MONDILEX project has six participants: (1) Institute of Mathematics and Informatics, Bulgarian Academy of Sciences (IMI-BAS), Sofia, Bulgaria, which coordinates the project; (2) Institute of Slavic Studies, Polish Academy of Sciences (ISS-PAS), Warsaw, Poland; (3) Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences (ĽŠIL), Bratislava, Slovakia; (4) Jožef Stefan Institute (JSI), Ljubljana, Slovenia; (5) Institute for Information Transmission Problems, Russian Academy of Sciences (IITP-RAS); and (6) the Ukrainian Lingua-Information Fund of the National Academy of Sciences of Ukraine (ULIF-NASU). The partners are research organisations from six European countries whose six national languages belong to the Slavic group: four EU members (Bulgaria, Poland, the Slovak Republic, Slovenia) as well as the Russian Federation and Ukraine. All partners are national centres for research in linguistics, lexicography, and natural language processing.

The main objective of the MONDILEX project was to design a conceptual scheme of a research infrastructure supporting the networking of centres for high-quality research in Slavic lexicography.
Research infrastructures in general function as sets of strategic centres of excellence for research, education and training, whose chief aim is to facilitate scientific cooperation and public partnership and to strengthen the interaction between research and applications. As such, research infrastructures contribute greatly to the development of the knowledge society.

The MONDILEX project was motivated by the need for a sustainable and scalable infrastructure for institutions involved in creating and supporting a network of multilingual resources for Slavic languages. Such an infrastructure is necessary in view of the obvious mismatch between the importance of Slavic languages, spoken by a substantial part of Europe's population, and the insufficient number and inadequate quality of digital lexical resources for these languages.

The MONDILEX project provided a venue for networking activities, such as joint management and pooling of resources, implementation of standards for products of digital lexicography, and coordination with relevant international standards and practices. It demonstrated that unified strategies should contribute to the reusability and interoperability of such resources, so that researchers in the humanities and social sciences as well as business communities could have easy access to bilingual and multilingual dictionaries of Slavic languages.

The implementation of a research infrastructure for Slavic lexicography will contribute to the development of a knowledge society, not only by carrying out research, but also by combining expertise from different backgrounds, developing communication capacities, and strengthening the interaction between research and society.
Access to and use of technologically well-equipped facilities or databases enables young researchers and students to tackle complex problems as part of high-level interdisciplinary teams, qualifies them exceptionally well for tasks in science or industry, and fosters their career mobility.

Participation in the MONDILEX consortium enables the sharing of services for data processing and data collections, as well as the coordinated extension and further development of bilingual and multilingual lexical resources, so that researchers in the humanities and social sciences, as well as education and business, will have easy access to digital bilingual and multilingual dictionaries of Slavic languages. The MONDILEX project contributes to the preservation and support of the multilingual and multicultural European heritage. It has laid foundations for further cooperation, setting up and elaborating a methodology for the interaction of remote research groups and the coordination of formats of lexicographic