Introducing Linguistic Knowledge Into Statistical Machine Translation


Introducing Linguistic Knowledge into Statistical Machine Translation

Ph.D. Thesis Dissertation

Author: Adrià de Gispert Ramis
Advisor: Prof. José B. Mariño Acebal

TALP Research Center, Speech Processing Group
Department of Signal Theory and Communications
Universitat Politècnica de Catalunya
Barcelona, October 2006

Time flies like an arrow. Fruit flies like a banana.
Groucho Marx.

Abstract

This Ph.D. thesis dissertation addresses the use of morphosyntactic information to improve the performance of Statistical Machine Translation (SMT) systems, providing them with linguistic information beyond the surface level of the words in parallel corpora. The statistical machine translation system in this work follows a tuple-based approach, modelling the joint-probability translation model via a log-linear combination of bilingual n-grams with additional feature functions. A detailed study of the approach is conducted. This includes its initial development from a speech-oriented Finite-State Transducer architecture implementing X-grams towards a large-vocabulary, text-oriented n-gram implementation, its training and decoding particularities, its portability across language pairs and tasks, and the main difficulties revealed by error analyses.

The use of linguistic knowledge to improve word alignment quality is also studied. A cooccurrence-based one-to-one word alignment algorithm is extended with verb form classification, with successful results. Additionally, we evaluate the impact of Part-of-Speech tags, base forms, verb form classification and stemming on the word alignment and translation quality obtained with state-of-the-art word alignment tools.

Furthermore, the thesis proposes a translation model that tackles verb form generation through an additional verb instance model, reporting experiments on English→Spanish tasks. Agreement problems are addressed by incorporating a target Part-of-Speech language model. Finally, we study the impact of morphological derivation on the n-gram-based SMT formulation, empirically evaluating the quality gain to be obtained via morphology reduction.
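In symbols, the tuple-based model described above can be sketched as follows. This is the standard formulation for this family of bilingual n-gram systems; the thesis's exact feature set and notation may differ:

```latex
% Translation selects the target sentence t maximising a log-linear
% combination of M feature functions h_m with weights lambda_m:
\begin{equation}
  \hat{t} = \arg\max_{t} \sum_{m=1}^{M} \lambda_m \, h_m(s, t)
\end{equation}
% The core feature is the joint-probability translation model: an
% n-gram model over the bilingual tuples (s,t)_k into which each
% sentence pair is monotonically segmented:
\begin{equation}
  p(s, t) = \prod_{k=1}^{K} p\bigl( (s,t)_k \mid (s,t)_{k-n+1}, \ldots, (s,t)_{k-1} \bigr)
\end{equation}
```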
Resum

This thesis is devoted to the study of the use of morphosyntactic information within statistical machine translation systems, with the aim of improving their quality through the incorporation of linguistic information beyond the surface level of words.

The statistical machine translation system used in this work follows a tuple-based approach: tuples are bilingual units from which a joint-probability translation model is estimated through the combination, within a log-linear framework, of n-gram chains and additional feature functions. A detailed study of this approach is presented, covering its transformation from an X-gram implementation based on finite-state automata, oriented towards speech translation, to the current n-gram solution oriented towards large-vocabulary text translation. The thesis also studies the training and decoding stages, as well as performance on different tasks (varying corpus size or language pair) and the main problems revealed by error analyses.

The thesis also investigates the incorporation of linguistic information specifically into word alignment. The extension of a cooccurrence-based one-to-one word alignment algorithm with verb form classification is proposed, with positive results. Likewise, the impact on alignment and translation quality obtained through morphological tagging, lemmatisation, verb form classification and stemming of the parallel text is evaluated empirically.

Regarding the translation model, a model for handling verb forms by means of an additional instance model is proposed, and experiments are carried out in the English→Spanish direction. The thesis also introduces a target-side language model of morphological tags in order to address agreement problems. Finally, the impact of morphological derivation on the n-gram-based statistical machine translation formulation is studied, empirically evaluating the possible gain derived from morphology reduction strategies.

Agraïments

I want to thank Pepe with all my heart, not only for helping me carry this thesis forward, but for always being an exemplary person inside and outside the professional environment, from whom I have learned many values and many admirable attitudes and ways of working. I will always be very grateful to him for everything he has done for me.

Thanks also to my colleagues in the statistical translation group, and especially to Josep Maria Crego, without whose contribution this thesis would not have gone so far, and with whom I have learned to do research seriously.

I also want to thank those who, through their work and perhaps without realising it, have been and will always be a reference of commitment and motivation for me, such as Climent Nadeu, Jaume Padrell, Xavi Pérez (the colleague everyone would wish to have), Lluís Padró and Lluís Màrquez. Thanks also to the many fellow thesis students with whom I have shared very good times, such as Alberto, Ramon, Pere, Ignasi, Marta, Javi and Joel, among many others.

Many thanks also to the great friends who have accompanied me on this journey, not only for putting up with me when needed, but for making me see clearly that they would always be there again if it were necessary (Jordi, Mari, Oriol, Núria, Paqui, Ferran, Pablo, Ana, Lluís, David, etc.).

Finally, I want to dedicate this thesis to my parents and to Aleyda, whose constant love gives me the strength I need to work and live with enthusiasm. This thesis is, in good part, also yours.

Adrià
Barcelona, October 2006
Contents

1 Introduction
  1.1 Machine Translation and the Statistical Approach
    1.1.1 A brief history of MT
    1.1.2 Approaches to MT
    1.1.3 Statistical Machine Translation
  1.2 Motivation
  1.3 Objectives of this Ph.D.
  1.4 Thesis Organisation
  1.5 Research Contributions

2 State of the art
  2.1 Word-based translation models
    2.1.1 IBM translation and alignment models
    2.1.2 Training and decoding tools
  2.2 Phrase-based translation models
    2.2.1 Alignment templates
    2.2.2 Phrase-based SMT
    2.2.3 Training and decoding tools
  2.3 Tuple-based translation model
    2.3.1 Finite-State Transducer implementation
    2.3.2 Other implementations
  2.4 Feature-based models combination
    2.4.1 Minimum-error training
    2.4.2 Re-ranking
  2.5 Statistical Word Alignment
    2.5.1 Evaluating Word Alignment
    2.5.2 Word Alignment approaches
  2.6 Use of linguistic knowledge into SMT
    2.6.1 Other approaches
  2.7 Machine Translation evaluation
    2.7.1 Automatic evaluation metrics
      2.7.1.1 BLEU score
      2.7.1.2 NIST score
      2.7.1.3 mWER
      2.7.1.4 mPER
      2.7.1.5 Other evaluation metrics
    2.7.2 Human evaluation metrics
    2.7.3 International evaluation campaigns

3 The Bilingual N-gram Translation Model
  3.1 Introduction
  3.2 X-grams FST implementation
    3.2.1 Reviewing X-grams for Language Modelling
    3.2.2 Bilingual X-grams for Speech Translation
      3.2.2.1 Training from parallel data
      3.2.2.2 Preliminary experiment
    3.2.3 Tuple definition: from one-to-many to many-to-many
    3.2.4 Monotonicity vs. word reordering
      3.2.4.1 Studying English–Spanish cross patterns
      3.2.4.2 An initial reordering strategy
      3.2.4.3 Morphology-reduced word alignment
    3.2.5 The TALP X-grams translation system
      3.2.5.1 FAME project public demonstration
      3.2.5.2 IWSLT'04 participation
  3.3 N-gram implementation
    3.3.1 Modelling issues
      3.3.1.1 History length
      3.3.1.2 Pruning strategies
      3.3.1.3 Smoothing the bilingual model
    3.3.2 Case study: the Catalan-Spanish task
  3.4 The Tuple as Translation Unit
    3.4.1 Embedded words
    3.4.2 Tuple segmentation
Recommended publications
  • A Sentiment-Based Chat Bot
    A sentiment-based chat bot Automatic Twitter replies with Python ALEXANDER BLOM SOFIE THORSEN Bachelor's essay at CSC Supervisor: Anna Hjalmarsson Examiner: Mårten Björkman E-mail: [email protected], [email protected] Abstract Natural language processing is a field in computer science which involves making computers derive meaning from human language and input as a way of interacting with the real world. Broadly speaking, sentiment analysis is the act of determining the attitude of an author or speaker with respect to a certain topic or the overall context, and is an application of the natural language processing field. This essay discusses the implementation of a Twitter chat bot that uses natural language processing and sentiment analysis to construct a believable reply. This is done in the Python programming language, using a statistical method called Naive Bayes classification supplied by the NLTK Python package. The essay concludes that applying natural language processing and sentiment analysis in this isolated fashion was simple, but achieving more complex tasks greatly increases the difficulty. Referat Natural language processing is a field in computer science that involves making computers understand human language and input so as to interact with the real world. Sentiment analysis is, generally speaking, the act of trying to determine the feeling of an author or speaker with respect to a specific topic or context, and is an application of the field of natural language processing. This report discusses the implementation of a Twitter chat bot that uses natural language processing and sentiment analysis to reply to tweets using the sentiment of the tweet.
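    As a rough illustration of the method this essay names, here is a minimal NLTK Naive Bayes sentiment classifier; the toy training data and bag-of-words features are assumptions of this sketch, not the essay's actual setup.

```python
# Minimal Naive Bayes sentiment sketch using NLTK.
from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words presence features: a common minimal choice.
    return {word.lower(): True for word in text.split()}

train = [
    (features("I love this, great job"), "positive"),
    (features("This is wonderful news"), "positive"),
    (features("I hate this, awful result"), "negative"),
    (features("This is terrible and sad"), "negative"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("what a great day")))  # -> 'positive'
```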
  • Integrating Optical Character Recognition and Machine Translation of Historical Documents
    Integrating Optical Character Recognition and Machine Translation of Historical Documents Haithem Afli and Andy Way ADAPT Centre School of Computing Dublin City University Dublin, Ireland {haithem.afli, andy.way}@adaptcentre.ie Abstract Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are not accessible to machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new correction system, based on the combination of language modeling and translation methods, outperforms the baseline system by nearly 30% relative improvement. 1 Introduction While research on improving Optical Character Recognition (OCR) algorithms is ongoing, our assessment is that Machine Translation (MT) will continue to produce unacceptable translation errors (or non-translations) based solely on the automatic output of OCR systems. The problem comes from the fact that current OCR and Machine Translation systems are commercially distinct and separate technologies. There are often mistakes in the scanned texts, as the OCR system occasionally misrecognizes letters and falsely identifies scanned text, leading to misspellings and linguistic errors in the output text (Niklas, 2010). Those involved in improving translation services purchase off-the-shelf OCR technology but have limited capability to adapt the OCR processing to improve overall machine translation performance.
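    The combination idea at the heart of the correction module (rescoring OCR output with both a channel-style similarity score and a language model score) can be sketched as follows. This is an illustration of the general noisy-channel combination, not the authors' actual modules; the toy vocabulary and weighting are assumptions.

```python
# Illustrative OCR token correction: rank candidates by a mix of
# string similarity to the OCR output and a (toy) language model.
from difflib import SequenceMatcher
from collections import Counter

vocab_counts = Counter(
    "the history of the city was written in the old city archive".split()
)
total = sum(vocab_counts.values())

def channel(candidate, observed):
    # Higher when the candidate is string-close to the OCR output.
    return SequenceMatcher(None, candidate, observed).ratio()

def lm(candidate):
    # Unigram relative frequency as a stand-in for a real LM.
    return vocab_counts[candidate] / total

def correct(token, alpha=0.7):
    if token in vocab_counts:
        return token
    return max(vocab_counts,
               key=lambda c: alpha * channel(c, token) + (1 - alpha) * lm(c))

print(correct("hist0ry"))  # -> 'history'
```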
  • Construction of English-Bodo Parallel Text Corpus for Statistical Machine Translation
    International Journal on Natural Language Computing (IJNLC) Vol.7, No.5, October 2018 CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRANSLATION Saiful Islam 1, Abhijit Paul 2, Bipul Syam Purkayastha 3 and Ismail Hussain 4 1,2,3 Department of Computer Science, Assam University, Silchar, India 4 Department of Bodo, Gauhati University, Guwahati, India ABSTRACT A corpus is a large collection of homogeneous and authentic written texts (or speech) of a particular natural language which exists in machine-readable form. The scope of corpora is endless in Computational Linguistics and Natural Language Processing (NLP). A parallel corpus is a very useful resource for most applications of NLP, especially for Statistical Machine Translation (SMT). SMT is the most popular approach to Machine Translation (MT) nowadays, and it can produce high-quality translation results based on a huge amount of aligned parallel text corpora in both the source and target languages. Although Bodo is a recognized natural language of India and a co-official language of Assam, machine-readable material for the Bodo language is still very scarce. Therefore, to expand the computerized information of the language, an English-to-Bodo SMT system has been developed. This paper mainly focuses on building English-Bodo parallel text corpora to implement the English-to-Bodo SMT system using the Phrase-Based SMT approach. We have designed an E-BPTC (English-Bodo Parallel Text Corpus) creator tool and have constructed English-Bodo parallel text corpora for the General and Newspaper domains. Finally, the quality of the constructed parallel text corpora has been tested using two evaluation techniques in the SMT system.
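    A minimal sketch of the line-aligned file format that phrase-based SMT toolkits typically train on: line i of the source file pairs with line i of the target file. The file names and placeholder target strings are assumptions of this sketch, not the E-BPTC tool's actual output format.

```python
# Write a sentence-aligned parallel corpus: one sentence per line,
# with matching line numbers across the two files.
pairs = [
    ("Where is the library?", "<Bodo translation 1>"),
    ("The weather is nice today.", "<Bodo translation 2>"),
]

with open("corpus.en", "w", encoding="utf-8") as src_f, \
     open("corpus.brx", "w", encoding="utf-8") as tgt_f:
    for src, tgt in pairs:
        src_f.write(src.strip() + "\n")
        tgt_f.write(tgt.strip() + "\n")
```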
  • Applying Ensembling Methods to BERT to Boost Model Performance
    Applying Ensembling Methods to BERT to Boost Model Performance Charlie Xu, Solomon Barth, Zoe Solis Department of Computer Science Stanford University {cxu2, sbarth, zoesolis}@stanford.edu Abstract Due to its wide range of applications and difficulty to solve, Machine Comprehension has become a significant point of focus in AI research. Using the SQuAD 2.0 dataset as a metric, BERT models have been able to achieve superior performance to most other models. Ensembling BERT with other models has yielded even greater results, since these ensemble models are able to utilize the relative strengths and adjust for the relative weaknesses of each of the individual submodels. In this project, we present various ensemble models that leveraged BiDAF and multiple different BERT models to achieve superior performance. We also studied how dataset manipulations and different ensembling methods can affect the performance of our models. Lastly, we attempted to develop the Synthetic Self-Training Question Answering model. In the end, our best performing multi-BERT ensemble obtained an EM score of 73.576 and an F1 score of 76.721. 1 Introduction Machine Comprehension (MC) is a complex task in NLP that aims to understand written language. Question Answering (QA) is one of the major tasks within MC, requiring a model to provide an answer, given a contextual text passage and question. It has a wide variety of applications, including search engines and voice assistants, making it a popular problem for NLP researchers. The Stanford Question Answering Dataset (SQuAD) is a very popular means for training, tuning, testing, and comparing question answering systems.
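    One common ensembling mechanic for extractive QA is averaging the start/end logits of several submodels before decoding the answer span. The sketch below illustrates that idea; the duplicated checkpoint name is a placeholder (substitute distinct SQuAD-fine-tuned checkpoints that share one tokenizer), and the paper's actual submodels and weighting may differ.

```python
# Logit-averaging ensemble for extractive QA with Hugging Face
# transformers. Assumes all submodels share the same tokenizer.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

names = ["bert-large-uncased-whole-word-masking-finetuned-squad"] * 2
tokenizer = AutoTokenizer.from_pretrained(names[0])
models = [AutoModelForQuestionAnswering.from_pretrained(n).eval() for n in names]

question = "Who wrote the thesis?"
context = "The thesis was written by Adria de Gispert in 2006."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outs = [m(**inputs) for m in models]
    # Average start/end logits across submodels before span decoding.
    start = torch.stack([o.start_logits for o in outs]).mean(dim=0)
    end = torch.stack([o.end_logits for o in outs]).mean(dim=0)

i = int(start[0].argmax())
j = int(end[0].argmax())
print(tokenizer.decode(inputs["input_ids"][0][i : j + 1]))
```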
  • Machine Translation for Academic Purposes
    Proceedings of the International Conference on TESOL and Translation 2009, December 2009: pp. 133-148 Machine Translation for Academic Purposes Grace Hui-chin Lin PhD, Texas A&M University College Station; Master of Science, University of Southern California Paul Shih Chieh Chien Center for General Education, Taipei Medical University Abstract Due to the globalization trend and the knowledge boost of the second millennium, multilingual translation has become a noteworthy issue. For the purpose of acquiring knowledge in academic fields, Machine Translation (MT) deserves attention not only academically but also practically. Translation learners should be informed about MT because it is a valuable approach applied by professional translators across diverse professional fields. For learning translation skills, and for learning and teaching through the bilingual/multilingual translation functions of software, machine translation is an approach that translation trainers, translation learners, and professional translators should be familiar with. Indeed, theories of machine translation and computer assistance have been highly valued by many scholars (e.g., Hutchins, 2003; Thriveni, 2002). Building on the translation of MIT's OpenCourseWare into Chinese introduced by Lee, Lin and Bonk (2007), this paper demonstrates how MT can be applied efficiently as a superior way of teaching and learning. The article predicts that translated courses utilizing MT should soon emerge in industrialized nations for residents of the global village, and it gives an overview of the current developmental status of MT, why MT should be fully applied for academic purposes such as translating a textbook or teaching and learning a course, and what types of software can be successfully applied.
  • Usage of IT Terminology in Corpus Linguistics
    International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8, Issue-7S, May 2019 Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna Abstract: Corpus linguistics is one of the newest and fastest-growing methods for teaching and learning different languages. The article provides an introduction to corpus linguistics and gives a general overview of the field. It describes corpora, recent developments, and the fields with which corpora are associated, focusing on the use of corpus linguistics in the IT field. Research makes it obvious that corpora, drawn from the internet and primary data, will be used in the field of IT. The text-corpus method derives the sets of rules governing a natural language from texts in that language, and can explore how languages interrelate with one another; corpora can thus be derived automatically from text sources. In earlier periods corpora were used in semantics and related fields; corpus linguistics was not commonly used, and no computers were involved. The corpus-linguistic approach was opposed by Noam Chomsky. Modern corpus linguistics was first formed in the context of work on English and is now used in many other languages, and it is considered a methodology [Abdumanapovna, 2018]. A landmark of modern corpus linguistics is the work published by Henry Kucera. Corpus linguistics introduces new methods and techniques for a more complete description of modern languages, and these also provide opportunities to obtain new results with
  • aConCorde: Towards a Proper Concordance for Arabic
    aConCorde: towards a proper concordance for Arabic Andrew Roberts, Dr Latifa Al-Sulaiti and Eric Atwell School of Computing University of Leeds LS2 9JT United Kingdom {andyr,latifa,eric}@comp.leeds.ac.uk July 12, 2005 Abstract Arabic corpus linguistics is currently enjoying a surge in activity. As the growth in the number of available Arabic corpora continues, there is an increased need for robust tools that can process this data, whether it be for research or teaching. One such tool that is useful for both groups is the concordancer — a simple tool for displaying a specified target word in its context. However, obtaining one that can reliably cope with the Arabic language had proved extremely difficult. Therefore, aConCorde was created to provide such a tool to the community. 1 Introduction A concordancer is a simple tool for summarising the contents of corpora based on words of interest to the user. Otherwise, manually navigating through (often very large) corpora would be a long and tedious task. Concordancers are therefore extremely useful as a time-saving tool, but also much more. By isolating keywords in their contexts, linguists can use concordance output to understand the behaviour of interesting words. Lexicographers can use the evidence to find if a word has multiple senses, and also towards defining their meaning. There is also much research and discussion about how concordance tools can be beneficial for data-driven language learning (Johns, 1990). A number of studies have demonstrated that providing access to corpora and concordancers benefited students learning a second language.
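    The core of any concordancer is the key-word-in-context (KWIC) display described above; a toy version fits in a few lines. This sketch ignores the Arabic-specific rendering and search issues that aConCorde itself addresses.

```python
# Toy KWIC concordance: print each occurrence of a keyword with a
# fixed window of surrounding tokens on either side.
def kwic(tokens, keyword, window=4):
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>35}  [{tok}]  {right}")

text = ("the concordancer shows the keyword in its context so that "
        "linguists can study how the keyword behaves").split()
kwic(text, "keyword")
```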
  • Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples
    Proceedings of eLex 2019 Enriching an Explanatory Dictionary with FrameNet and PropBank Corpus Examples Pēteris Paikens 1, Normunds Grūzītis 2, Laura Rituma 2, Gunta Nešpore 2, Viktors Lipskis 2, Lauma Pretkalniņa 2, Andrejs Spektors 2 1 Latvian Information Agency LETA, Marijas street 2, Riga, Latvia 2 Institute of Mathematics and Computer Science, University of Latvia, Raina blvd. 29, Riga, Latvia E-mail: [email protected], [email protected] Abstract This paper describes ongoing work to extend an online dictionary of Latvian – Tezaurs.lv – with representative semantically annotated corpus examples according to the FrameNet and PropBank methodologies and word sense inventories. Tezaurs.lv is one of the largest open lexical resources for Latvian, combining information from more than 300 legacy dictionaries and other sources. The corpus examples are extracted from Latvian FrameNet and PropBank corpora, which are manually annotated parallel subsets of a balanced text corpus of contemporary Latvian. The proposed approach augments traditional lexicographic information with modern cross-lingually interpretable information and enables analysis of word senses from the perspective of frame semantics, which is substantially different from (complementary to) the traditional approach applied in Latvian lexicography. In cases where FrameNet and PropBank corpus evidence aligns well with the word sense split in legacy dictionaries, the frame-semantically annotated corpus examples supplement the word sense information with clarifying usage examples and commonly used semantic valence patterns. However, the annotated corpus examples often provide evidence of a different sense split, which is often more coarse-grained and, thus, may help dictionary users to cluster and comprehend a fine-grained sense split suggested by the legacy sources.
  • Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation
    Toward a Sustainable Handling of Interlinear-Glossed Text in Language Documentation JOHANN-MATTIS LIST, Max Planck Institute for the Science of Human History, Germany NATHANIEL A. SIMS, The University of California, Santa Barbara, America ROBERT FORKEL, Max Planck Institute for the Science of Human History, Germany While the amount of digitally available data on the world's languages is steadily increasing, with more and more languages being documented, only a small proportion of the language resources produced are sustainable. Data reuse is often difficult due to idiosyncratic formats and a negligence of standards that could help to increase the comparability of linguistic data. The sustainability problem is nicely reflected in the current practice of handling interlinear-glossed text, one of the crucial resources produced in language documentation. Although large collections of glossed texts have been produced so far, the current practice of data handling makes data reuse difficult. In order to address this problem, we propose a first framework for the computer-assisted, sustainable handling of interlinear-glossed text resources. Building on recent standardization proposals for word lists and structural datasets, combined with state-of-the-art methods for automated sequence comparison in historical linguistics, we show how our workflow can be used to lift a collection of interlinear-glossed Qiang texts (an endangered language spoken in Sichuan, China), and how the lifted data can assist linguists in their research. CCS Concepts: • Applied computing → Language translation; Additional Key Words and Phrases: Sino-Tibetan, interlinear-glossed text, computer-assisted language comparison, standardization, Qiang ACM Reference format: Johann-Mattis List, Nathaniel A.
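    The central data structure here, an interlinear-glossed example, can be captured by a simple record whose key invariant is morpheme-by-morpheme alignment between the object-language line and the gloss line. The field names and the example forms below are illustrative assumptions, not the authors' CLDF-based format.

```python
# A minimal IGT record with the alignment check that standardized
# handling of interlinear-glossed text depends on.
from dataclasses import dataclass

@dataclass
class IGTExample:
    phrase: list[str]      # morpheme-segmented object-language line
    gloss: list[str]       # one gloss per morpheme, aligned by index
    translation: str       # free translation

    def validate(self):
        # Core consistency requirement: phrase and gloss must align.
        if len(self.phrase) != len(self.gloss):
            raise ValueError("phrase and gloss lines are misaligned")

ex = IGTExample(
    phrase=["na", "-tsa"],          # hypothetical forms, for shape only
    gloss=["1SG", "-DIR"],
    translation="I go (illustrative)",
)
ex.validate()
```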
  • Machine Translation for Language Preservation
    Machine translation for language preservation Steven BIRD 1,2 David CHIANG 3 (1) Department of Computing and Information Systems, University of Melbourne (2) Linguistic Data Consortium, University of Pennsylvania (3) Information Sciences Institute, University of Southern California [email protected], [email protected] ABSTRACT Statistical machine translation has been remarkably successful for the world's well-resourced languages, and much effort is focussed on creating and exploiting rich resources such as treebanks and wordnets. Machine translation can also support the urgent task of documenting the world's endangered languages. The primary object of statistical translation models, bilingual aligned text, closely coincides with interlinear text, the primary artefact collected in documentary linguistics. It ought to be possible to exploit this similarity in order to improve the quantity and quality of documentation for a language. Yet there are many technical and logistical problems to be addressed, starting with the problem that – for most of the languages in question – no texts or lexicons exist. In this position paper, we examine these challenges, and report on a data collection effort involving 15 endangered languages spoken in the highlands of Papua New Guinea. KEYWORDS: endangered languages, documentary linguistics, language resources, bilingual texts, comparative lexicons. Proceedings of COLING 2012: Posters, pages 125–134, COLING 2012, Mumbai, December 2012. 1 Introduction Most of the world's 6800 languages are relatively unstudied, even though they are no less important for scientific investigation than major world languages. For example, before Hixkaryana (Carib, Brazil) was discovered to have object-verb-subject word order, it was assumed that this word order was not possible in a human language, and that some principle of universal grammar must exist to account for this systematic gap (Derbyshire, 1977).
  • Linguistic Annotation of the Digital Papyrological Corpus: Sematia
    Marja Vierros Linguistic Annotation of the Digital Papyrological Corpus: Sematia 1 Introduction: Why annotate papyri linguistically? Linguists who study historical languages usually find the methods of corpus linguistics exceptionally helpful. When the intuitions of native speakers are lacking, as is the case for historical languages, corpora provide researchers with material that replaces the intuitions on which researchers of modern languages can rely. Using large corpora and computers to count and retrieve information also provides empirical back-up from actual language usage. In the case of ancient Greek, the corpus of literary texts (e.g. Thesaurus Linguae Graecae or the Greek and Roman Collection in the Perseus Digital Library) gives information on the Greek language as it was used in lyric poetry, epic, drama, and prose writing; all these literary genres had some artistic aims and therefore do not always describe language as it was used in normal communication. Ancient written texts rarely reflect everyday language use, let alone speech. However, the corpus of documentary papyri gets close. The writers of the papyri vary between professionally trained scribes and individuals who had only rudimentary writing skills. The text types also vary from official decrees and orders to small notes and receipts. What they have in common, though, is that they have been written for a specific, current need instead of trying to impress a specific audience. Documentary papyri represent everyday texts, utilitarian prose, and in that respect they provide us with a very valuable source of language actually used by common people in everyday circumstances.
  • Hunspell – the Free Spelling Checker
    Hunspell – The free spelling checker About Hunspell Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding. Hunspell interfaces: Ispell-like terminal interface using the Curses library, Ispell pipe interface, OpenOffice.org UNO module. Main features of the Hunspell spell checker and morphological analyzer:
    - Unicode support (affix rules work only with the first 65535 Unicode characters)
    - Morphological analysis (in custom item and arrangement style) and stemming
    - Max. 65535 affix classes and twofold affix stripping (for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish, etc.)
    - Support for complex compounding (for example, Hungarian and German)
    - Support for language-specific features (for example, special casing of the Azeri and Turkish dotted i, or the German sharp s)
    - Handling of conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms
    - Free software (LGPL, GPL, MPL tri-license)
    Usage The src/tools directory contains ten executables after compiling (some of them are in src/win_api):
    - affixcompress: dictionary generation from large (millions of words) vocabularies
    - analyze: example of spell checking, stemming and morphological analysis
    - chmorph: example of automatic morphological generation and conversion
    - example: example of spell checking and suggestion
    - hunspell: main program for spell checking and others (see manual)
    - hunzip: decompressor of hzip format
    - hzip: compressor of hzip format
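    For programmatic use, the library is also reachable from Python. The sketch below assumes the third-party pyhunspell binding and standard en_US dictionary paths; both are assumptions — adjust to your installation.

```python
# Spell checking, suggestion, and stemming via the pyhunspell binding.
import hunspell

# Paths to the .dic/.aff files are installation-specific assumptions.
h = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                      "/usr/share/hunspell/en_US.aff")

print(h.spell("colour"))    # False for an en_US dictionary
print(h.suggest("colour"))  # e.g. ['color', ...]
print(h.stem("talking"))    # morphological stemming, e.g. includes 'talk'
```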