Training Automatic Transliteration Models on DBpedia Data


Velislava Todorova, Linguistic Modeling Department, IICT-BAS
Kiril Simov, Linguistic Modeling Department, IICT-BAS

Proceedings of Recent Advances in Natural Language Processing, pages 654–662, Hissar, Bulgaria, Sep 7–9 2015.

Abstract

Our goal is to facilitate named entity recognition in Bulgarian texts by extending the coverage of DBpedia (http://www.dbpedia.org/) for Bulgarian. For this task we have trained Moses translation models to transliterate foreign names to Bulgarian. The training sets were obtained by extracting the names of all people, places and organizations from DBpedia and its extension Airpedia (http://www.airpedia.org/). Our approach is extendable to other languages with small DBpedia coverage.

1 Introduction

The DBpedia Linked Open Dataset[1] provides the wealth of Wikipedia in a formalized way via the ontological language in which DBpedia statements are represented. Still, DBpedia reflects the multilingual nature of Wikipedia. If a user needs to access the huge number of instances extracted from the English version of Wikipedia in a different language (in our case Bulgarian), he/she will not be able to do so, because the DBpedia in the other language will not provide appropriate Uniform Resource Identifiers (URIs) for many of the instances in the English DBpedia. In this paper we describe an approach to the problem. It generates appropriate names in Bulgarian from DBpedia URIs in other languages. These new names are used in two ways: (1) to form gazetteers for annotation of DBpedia instances in Bulgarian texts; and (2) they refer back to the DBpedia URIs from which they have been created and in this way provide access to all RDF statements about the original URIs.

The paper presents several transliteration models and their evaluation. The evaluation is done over 100 examples, transliterated manually by two people independently of each other. The discrepancies between the two human transliterations demonstrate the complexity of the task.

The structure of the paper is as follows: in section 2 we describe the problem in more detail; in section 3 we present some related approaches; section 4 reports on the preliminary experiments; section 5 describes how the training data is extended on the basis of the results from the preliminary models and new models are trained; section 6 provides some heuristic rules for combining transliteration and translation of names' parts; section 7 describes the evaluation of the models; the last section concludes the paper and presents some directions for future work.

[1] http://wiki.dbpedia.org/

2 Challenges

The transliteration of proper names presents many challenging difficulties, not only to automatic systems but also to humans. A lot of information is needed to perform the task: the language of origin, to determine the pronunciation; some real-world facts, like how the name was or is actually pronounced by the person it belongs or belonged to (in the case of personal names) or by the locals (in the case of toponyms); and also the tradition in transliterating names from this particular language into the given target language. Even if all of this information is gathered and if it is consistent (which is not always the case), there are still decisions to be made: the task of finding a phonetic equivalent of one language's phoneme in another is not trivial; in
some cases the name or parts of it are meaningful words in the source language and it might not be obvious which is more appropriate, translation or transliteration; and sometimes it might be better to leave the name in its original script.

In their survey on machine transliteration, Karimi et al. (2011) give a list of five main challenges for an automated approach to the task: 1) script specifications, 2) language of origin, 3) missing sounds, 4) deciding whether to translate or transliterate a name (or part of it), and 5) transliteration variants.

Fortunately, 1) does not present great difficulties when dealing with European languages only, as the direction of writing is the same and the characters do not undergo changes in shape due to the phonetic environment. The only problem related to the script (apart from the minor one of choosing an appropriate encoding) is the letter case. We decided to keep the upper and lower case characters in our training data, accepting more complicated statistical models in exchange for a result that does not need additional postprocessing.

On the other hand, 2) is a very hard challenge and we decided to leave it aside for now. When we extract names from Wikipedia, we know the language they are written in, but there is no straightforward way to figure out the language they came from.

Challenge 3) is that of finding a phonetic match for a sound that does not exist in the target language. Human translators need to decide which of the phonemes at hand is most appropriate in the particular case, and appropriate does not necessarily mean phonetically close, as orthography and etymology could also be considered important. We hope to overcome this difficulty by training Moses machine translation models on parallel lists of names where the decisions about sound mapping have already been made by humans.

Challenge 4) is related to the fact that transferring a name from one language to another can happen not only through transliteration, but also via translation (for example, until 1986 the French name Côte d'Ivoire was translated, not transliterated, in Bulgarian as Бряг на слоновата кост) or direct adoption (like many music band names).[2] To distinguish the cases where transliteration is needed from those where direct adoption or translation is more appropriate, we use some heuristics that involve the combination of several different models and are described in more detail in section 6.

Problem 5) is very peculiar. It looks similar to the variation in translation, but is different in its 'abnormality'. One would accept as normal different translations of a single sentence, and we know that even in the source language the meaning of this sentence can be expressed in other ways. Transliteration, on the other hand, is expected to produce one single result for each name, just as this name is an unvaried reference to an entity. However, there are often multiple transliterations of the same name. Our models do not attempt to generate all the acceptable variants, but we calculate how different our results are from manually generated transliterations, and we expect this estimation to be useful to determine if an expression is likely to be a transliteration of a certain name.

[2] Or it might be a combination of the three. Here we will not deal with mixed cases as such. We will consider as cases of direct adoption only those where the whole name has been directly adopted. We will treat as translation cases all cases where at least some part of the name has been translated, and we will take the rest to be transliteration ones.

3 Related Works

Matthews (2007) approaches transliteration very similarly to the way we do. Like us, he trains Moses machine translation models on parallel lists of proper names. The language pairs for which he obtains transliteration and backtransliteration models are English and Arabic, and English and Chinese. Unlike him, we are only interested in forward transliteration to Bulgarian. Our approach also differs from his in several other aspects. First, we do not lowercase our training data. Second, we explore not only unigram models, but also bigram ones. And finally, we employ heuristics to decide whether to transliterate or not.

The construction of our models was inspired by Nakov and Tiedemann (2012). They train the only transliteration models for Bulgarian known to us. They use automatic transliteration as a substitution for machine translation between very closely related languages, namely Bulgarian and Macedonian. Their models are of several different types (unigram, bigram, trigram) and their results show that bigrams perform best, because they are “a good compromise between generality and contextual specificity” (Nakov and Tiedemann, 2012, p. 302). We have also trained and compared unigram and bigram models (see Sections 4 and 5); however, we left the trigrams out because of the specificity problem (a trigram generally occurs much less often than a unigram, for example), which gets even worse with proper names coming from many different languages with very diverse letter sequence patterns.

Here we will not give a full overview of the automatic transliteration techniques; instead we refer the reader to Karimi et al. (2011), which is a detailed survey on the topic, and (Choi et […]

[…] cleaned and tidied as much as possible (details are given in section 4.1).

The second step was to apply the first models on the data to further clean and tidy it up (details are given in section 5.1), so that a second, better series of models is obtained. In this section, we describe the first models, and the next section deals with the ones that were the product of the second step.

4.1 Training Data

The parallel lists of names on which we have trained our models have been extracted from DBpedia. We have used the instance type feature to select the URLs of all people, places and organizations in seven languages: Bulgarian, English, German, French, Russian, Italian and Spanish. Then we have mapped the Bulgarian names to the corresponding foreign names via the interlanguage links in DBpedia.