A Stronger Baseline for Multilingual Word Embeddings (1 Nov 2018)

Total Pages: 16

File Type: pdf, Size: 1020 KB

A Stronger Baseline for Multilingual Word Embeddings

Philipp Dufter, Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
[email protected]

arXiv:1811.00586v1 [cs.CL] 1 Nov 2018

Abstract

Levy, Søgaard and Goldberg's (2017) S-ID (sentence ID) method applies word2vec on tuples containing a sentence ID and a word from the sentence. It has been shown to be a strong baseline for learning multilingual embeddings. Inspired by recent work on concept based embedding learning we propose SC-ID, an extension to S-ID: given a sentence aligned corpus, we use sampling to extract concepts that are then processed in the same manner as S-IDs. We perform experiments on the Parallel Bible Corpus across 1000+ languages and show that SC-ID yields up to 6% performance increase in a word translation task. In addition, we provide evidence that SC-ID is easily and widely applicable by reporting competitive results across 8 tasks on a EuroParl based corpus.

1 Introduction

Multilingual embeddings are useful because they provide meaning representations of source and target in the same space in machine translation and because they are a basis for transfer learning. In contrast to prior multilingual work (Zeman and Resnik, 2008; McDonald et al., 2011; Tsvetkov et al., 2014), automatically learned embeddings potentially perform as well but are more efficient and easier to use (Klementiev et al., 2012; Hermann and Blunsom, 2014b; Guo et al., 2016). Thus, multilingual word embedding learning is important for NLP.

The quality of multilingual embeddings is driven by the underlying feature set more than the type of algorithm used for training the embeddings (Upadhyay et al., 2016; Ruder et al., 2017). Most embedding learners build on using context information as feature. Dufter et al. (2018) recently showed that using concept information can be effective for multilingual embedding learning as well.

Here, we propose the method SC-ID. SC-ID combines the concept identification method "anymalign" (Lardilleux and Lepage, 2009) and S-ID (Levy et al., 2017) into an embedding learning method that is based on both concept (C-ID) and context (S-ID). We show below that SC-ID is effective. On a massively parallel corpus, the Parallel Bible Corpus (PBC) covering 1000+ languages, SC-ID outperforms S-ID in a word translation task. In addition, we show that SC-ID outperforms S-ID on EuroParl, a corpus with characteristics quite different from PBC.

For both corpora there are embedding learners that perform better. However, SC-ID is the only one that scales easily across 1000+ languages and at the same time exhibits a stable performance across datasets. In summary, we make the following contributions in this paper: i) We demonstrate that using concept IDs equivalently to (Levy et al., 2017)'s sentence IDs works well. ii) We show that combining C-IDs and S-IDs yields higher quality embeddings than either by itself. iii) In extensive experiments investigating hyperparameters we find that, despite the large number of languages, lower dimensional spaces work better. iv) We demonstrate that our method works on very different datasets and yields competitive performance on a EuroParl based corpus.
2 Methods

Throughout this section we describe how we identify concepts and write out text corpora that are then used as input to the embedding learning algorithm.

2.1 Concept Induction

Lardilleux and Lepage (2009) propose a word alignment algorithm which we use for concept induction. They argue that hapax legomena (i.e., words which occur exactly once in a corpus) are easy to align in a sentence-aligned corpus. If hapax legomena across multiple languages occur in the same sentence, their meanings are likely the same. Similarly, words across multiple languages that occur more than once, but strictly in the same sentences, can be considered translations. We call words that occur strictly in the same sentences perfectly aligned. Further we define a concept as a set of words that are perfectly aligned.

By this definition one expects the number of identified concepts to be low. Coverage can be increased by not only considering the original parallel corpus, but by sampling subcorpora from the parallel corpus. As the number of sentences is smaller in each sample and there is a high number of sampled subcorpora, the number of perfect alignments is much higher. In addition to words, word ngrams are also considered to increase coverage. The complement of an ngram (i.e., the sentence in one language without the perfectly aligned words) is also treated as a perfect alignment as their meaning can be assumed to be equivalent as well.

For example, if for a particular subsample the English trigram "mount of olives" occurs exactly in the same sentences as "montagne des oliviers", this gives rise to a concept, even if "olives" or "mountain" might not be perfectly aligned in this particular subsample.

Figure 1 shows Lardilleux and Lepage (2009)'s anymalign algorithm. Given a sentence aligned parallel corpus, "alingual" sentences are created by concatenating each sentence across all languages. We then consider the set of all alingual sentences V, which Lardilleux and Lepage (2009) call an alingual corpus. The core of the algorithm iterates the following loop: (i) draw a random sample of alingual sentences V′ ⊂ V; (ii) extract perfect alignments in V′. The perfect alignments are then added to the set of concepts.

Algorithm 1 Anymalign
procedure GETCONCEPTS(V, MINL, MAXN, T)
  C = ∅
  while runtime ≤ T do
    V′ = get-subsample(V)
    A = get-concepts(V′)
    A = filter-concepts(A, MINL, MAXN)
    C = C ∪ A
  end while
end procedure

Figure 1: V is an alingual corpus. get-subsample creates a subcorpus by randomly selecting lines from the alingual corpus. get-concepts extracts words and word ngrams that are perfectly aligned. filter-concepts imposes the constraint that ngrams have a specified length and concepts cover enough languages.

Anymalign's hyperparameters include the minimum number of languages a perfect alignment should cover (MINL) and the maximum ngram length (MAXN). The size of a subsample is adjusted automatically to maximize the probability that each sentence is sampled at least once. Obviously this probability depends on the number of samples drawn and thus on the runtime of the algorithm. Thus the runtime (T) is another hyperparameter. Note that T only affects the number of distinct concepts, not the quality of an individual concept. For details see (Lardilleux and Lepage, 2009).

Note that most members of a concept are neither hapax legomena nor perfect alignments in V. A word can be part of multiple concepts within V′ and obviously also within V (it can be found in multiple iterations).

One can interpret the concept identification as a form of data augmentation. From an existing parallel corpus new parallel "sentences" are extracted by considering perfectly aligned subsequences.
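To make the sampling loop in Figure 1 concrete, here is a minimal Python sketch of anymalign-style concept induction. It is not the authors' released implementation: the corpus representation (a list of dicts mapping a language code to a token list), the sample size, and the helper names are assumptions for illustration, and the complement-of-ngram heuristic described above is omitted.

```python
import random
import time
from collections import defaultdict

def get_subsample(corpus, size):
    """Randomly select line indices (alingual sentences) from the alingual corpus."""
    return random.sample(range(len(corpus)), min(size, len(corpus)))

def get_concepts(corpus, sample, max_n):
    """Group words/ngrams that are perfectly aligned, i.e. occur in exactly
    the same sampled sentences; each group is one candidate concept."""
    occurrences = defaultdict(set)              # (lang, ngram) -> sentence ids
    for sent_id in sample:
        for lang, tokens in corpus[sent_id].items():
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    occurrences[(lang, " ".join(tokens[i:i + n]))].add(sent_id)
    by_signature = defaultdict(set)             # identical occurrence sets -> one concept
    for item, sents in occurrences.items():
        by_signature[frozenset(sents)].add(item)
    return list(by_signature.values())

def filter_concepts(concepts, min_l):
    """Keep concepts that cover at least MINL distinct languages."""
    return [c for c in concepts if len({lang for lang, _ in c}) >= min_l]

def get_concepts_anymalign(corpus, min_l=2, max_n=3, runtime=60, sample_size=500):
    """Repeat sampling until the time budget T is exhausted; collect all concepts."""
    concepts = []
    start = time.time()
    while time.time() - start <= runtime:
        sample = get_subsample(corpus, sample_size)
        concepts.extend(filter_concepts(get_concepts(corpus, sample, max_n), min_l))
    return concepts
```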
2.2 Corpus Creation

Method S-ID. We adopt Levy and Goldberg (2014)'s framework; it formalizes the basic information that is passed to the embedding learner as a set of pairs. In the monolingual case each pair consists of two words that occur in the same context. A successful approach to multilingual embedding learning for parallel corpora is to use a corpus of pairs (one per line) of a word and a sentence ID (Levy et al., 2017). We refer to this method as S-ID. Note that we use S-ID for the sentence identifier, for the corpus creation method that is based on these identifiers, for the embedding learning method based on such corpora and for the embeddings produced by the method. The same applies to C-ID. Which sense is meant should be clear from context.

Method C-ID. We use the same method for writing a corpus using our identified concepts and call this method C-ID. Figure 2 gives examples of the generated corpora that are passed to the embedding learner.

Method SC-ID. We combine S-IDs and C-IDs by simply concatenating their corpora before learning embeddings. However, we apply different frequency thresholds when learning embeddings from this corpus: for S-ID we investigate a frequency threshold on a development set whereas for C-ID we always set it to 1. As argued before each word in a concept carries a strong multilingual signal, which is why we do not apply any frequency filtering here. In the implementation we simply delete words in the S-ID part of the corpus with frequency lower than our threshold and set …

2.3 Embedding learning

We use word2vec skipgram (Mikolov et al., 2013a) with default hyperparameters, except for three: number of iterations (ITER), minimum frequency of a word (MINC) and embedding dimension (DIM). For details see Table 2.

3 Data

We work on PBC, the Parallel Bible Corpus (Mayer and Cysouw, 2014), a verse-aligned corpus of 1000+ translations of the New Testament. For the sake of comparability we use the same 1664 Bible editions across 1259 languages (distinct ISO 639-3 codes) and the same 6458 training verses as in (Dufter et al., 2018). We follow their terminology and refer to "translations" as "editions". PBC is a good model for resource-poverty; e.g., the training set of KJV contains fewer than 150,000 tokens in 6458 verses.
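As an illustration of how the corpus creation of Section 2.2 and the training setup of Section 2.3 fit together, the sketch below writes the ID/word pairs one per line and trains a skipgram model on the concatenation (SC-ID). It uses gensim's Word2Vec as a stand-in for the original word2vec tool; the file format, the language-prefix convention, and the hyperparameter values are assumptions, and the S-ID-specific frequency threshold discussed above is not implemented.

```python
from gensim.models import Word2Vec

def write_sid_corpus(parallel_corpus, path):
    """One line per (sentence ID, word) pair; each pair is a tiny word2vec 'sentence'."""
    with open(path, "w", encoding="utf-8") as f:
        for sent_id, editions in enumerate(parallel_corpus):   # editions: lang -> token list
            for lang, tokens in editions.items():
                for tok in tokens:
                    f.write(f"S-ID-{sent_id} {lang}:{tok}\n")

def write_cid_corpus(concepts, path):
    """Same pair format, but the ID is a concept ID and the word a concept member."""
    with open(path, "w", encoding="utf-8") as f:
        for concept_id, members in enumerate(concepts):        # members: set of (lang, ngram)
            for lang, ngram in members:
                f.write(f"C-ID-{concept_id} {lang}:{ngram}\n")

def train_scid(sid_path, cid_path, out_path, dim=100, iters=10, minc=1):
    """SC-ID: concatenate both pair corpora, then run skipgram over the pairs."""
    pairs = []
    for path in (sid_path, cid_path):
        with open(path, encoding="utf-8") as f:
            pairs.extend(line.split() for line in f)
    model = Word2Vec(pairs, sg=1, vector_size=dim, epochs=iters, min_count=minc, window=1)
    model.wv.save(out_path)
```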
Recommended publications
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    Automatic Correction of Real-Word Errors in Spanish Clinical Texts. Daniel Bravo-Candel 1, Jésica López-Hernández 1, José Antonio García-Díaz 1, Fernando Molina-Molina 2 and Francisco García-Sánchez 1,*
    1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.)
    2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected]
    * Correspondence: [email protected]; Tel.: +34-86888-8107
    Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing with patient information.
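As a toy illustration of the rule-based error generation this abstract mentions (creating noisy/clean sentence pairs on which a Seq2seq corrector can be trained), consider the sketch below; the English confusion sets and the replacement probability are invented for the example and are not the rules used in the paper, which targets Spanish clinical text.

```python
import random

# Hypothetical confusion sets: every replacement is a real dictionary word,
# so the injected mistakes are real-word errors rather than misspellings.
CONFUSION_SETS = {
    "their": ["there", "they're"],
    "accept": ["except"],
    "dose": ["does"],
}

def inject_real_word_errors(sentence, p=0.3):
    """Return a (noisy, clean) pair by randomly swapping in confusable words."""
    noisy = []
    for word in sentence.split():
        if word.lower() in CONFUSION_SETS and random.random() < p:
            noisy.append(random.choice(CONFUSION_SETS[word.lower()]))
        else:
            noisy.append(word)
    return " ".join(noisy), sentence

# The resulting pairs serve as source/target examples for a Seq2seq corrector.
pairs = [inject_real_word_errors(s) for s in
         ["I accept the dose prescribed", "their results were normal"]]
print(pairs)
```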
  • Concepticon: A Resource for the Linking of Concept Lists
    Concepticon: A Resource for the Linking of Concept Lists. Johann-Mattis List 1, Michael Cysouw 2, Robert Forkel 3
    1 CRLAO/EHESS and AIRE/UPMC, Paris; 2 Forschungszentrum Deutscher Sprachatlas, Marburg; 3 Max Planck Institute for the Science of Human History, Jena. [email protected], [email protected], [email protected]
    Abstract: We present an attempt to link the large amount of different concept lists which are used in the linguistic literature, ranging from Swadesh lists in historical linguistics to naming tests in clinical studies and psycholinguistics. This resource, our Concepticon, links 30 222 concept labels from 160 concept lists to 2495 concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it allows researchers a quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations. Keywords: concepts, concept list, Swadesh list, naming test, word norms, cross-linguistically linked data
    1. Introduction: In 1950, Morris Swadesh (1909 – 1967) proposed the idea that certain parts of the lexicon of human languages are universal, stable over time, and rather resistant to borrowing. As a result, he claimed that this part of the lexicon, which … in the languages they were working on, or that they turned out to be not as stable and universal as Swadesh had claimed (Matisoff, 1978; Alpher and Nash, 1999). Up to today, dozens of different concept lists have been compiled for various purposes.
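A rough Python sketch of the kind of linking the Concepticon abstract describes: list-specific concept labels point to shared concept sets, each with an identifier, label and definition. The identifiers, list names and relations below are made up for illustration and are not actual Concepticon data.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptSet:
    """A concept set: unique identifier, unique label, human-readable definition."""
    id: str
    label: str
    definition: str
    relations: dict = field(default_factory=dict)   # e.g. {"broader": ["BODY PART"]}

# Shared concept sets (toy inventory).
concept_sets = {"HAND": ConceptSet("CS-0001", "HAND", "The body part at the end of the arm.")}

# Links from (concept list, list-specific label) to a concept set.
links = {
    ("Swadesh-100", "hand"): "HAND",
    ("NamingTest-A", "la main"): "HAND",
}

def lookup(concept_list, label):
    """Resolve a list-specific label to its shared concept set."""
    return concept_sets[links[(concept_list, label)]]

print(lookup("Swadesh-100", "hand").definition)
```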
  • Knowledge Graphs on the Web – an Overview
    Knowledge Graphs on the Web – an Overview. Nicolas HEIST, Sven HERTLING, Daniel RINGLER, and Heiko PAULHEIM. Data and Web Science Group, University of Mannheim, Germany. January 2020; arXiv:2003.00719v3 [cs.AI], 12 Mar 2020.
    Abstract. Knowledge Graphs are an emerging form of knowledge representation. While Google coined the term Knowledge Graph first and promoted it as a means to improve their search results, they are used in many applications today. In a knowledge graph, entities in the real world and/or a business domain (e.g., people, places, or events) are represented as nodes, which are connected by edges representing the relations between those entities. While companies such as Google, Microsoft, and Facebook have their own, non-public knowledge graphs, there is also a larger body of publicly available knowledge graphs, such as DBpedia or Wikidata. In this chapter, we provide an overview and comparison of those publicly available knowledge graphs, and give insights into their contents, size, coverage, and overlap. Keywords. Knowledge Graph, Linked Data, Semantic Web, Profiling
    1. Introduction: Knowledge Graphs are increasingly used as means to represent knowledge. Due to their versatile means of representation, they can be used to integrate different heterogeneous data sources, both within as well as across organizations. [8,9] Besides such domain-specific knowledge graphs which are typically developed for specific domains and/or use cases, there are also public, cross-domain knowledge graphs encoding common knowledge, such as DBpedia, Wikidata, or YAGO. [33] Such knowledge graphs may be used, e.g., for automatically enriching data with background knowledge to be used in knowledge-intensive downstream applications.
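A minimal sketch of the node-and-edge view described in this abstract: entities become nodes and relations become labeled edges, stored here as plain triples. The triples are toy examples; real knowledge graphs such as DBpedia or Wikidata use RDF stores and their own vocabularies.

```python
from collections import defaultdict

# Entities as nodes, relations as labeled edges, stored as (subject, relation, object) triples.
triples = [
    ("University_of_Mannheim", "locatedIn", "Mannheim"),
    ("Mannheim", "locatedIn", "Germany"),
]

graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def neighbors(entity):
    """All outgoing labeled edges of an entity node."""
    return graph[entity]

print(neighbors("Mannheim"))   # [('locatedIn', 'Germany')]
```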
  • Detecting Personal Life Events from Social Media
    Detecting Personal Life Events from Social Media. How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis, The Open University. © 2018 The Author, https://creativecommons.org/licenses/by-nc-nd/4.0/. Version of Record: http://dx.doi.org/doi:10.21954/ou.ro.00010aa9
    A thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science, The Open University, Milton Keynes, England, May 2019. Thesis advisors: Professor Harith Alani and Dr Paul Mulholland.
    Abstract: Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identity of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data is to build classifiers that can identify posts about our personal life events, allowing us to start to self reflect on what we share online.
  • Unified Language Model Pre-Training for Natural Language Understanding and Generation
    Unified Language Model Pre-training for Natural Language Understanding and Generation. Li Dong∗, Nan Yang∗, Wenhui Wang∗, Furu Wei∗†, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon. Microsoft Research. {lidong1,nanya,wenwan,fuwei}@microsoft.com, {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com
    Abstract: This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
    1 Introduction: Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1].
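To illustrate the self-attention masks this abstract refers to, here is a small numpy sketch of a sequence-to-sequence mask in which source tokens attend bidirectionally and target tokens see the whole source but only previous target positions. This only illustrates the masking idea; it is not code from the released UniLM implementation.

```python
import numpy as np

def seq2seq_mask(src_len, tgt_len):
    """1 = may attend, 0 = blocked. Source <-> source is bidirectional;
    target -> source is allowed; target -> target is causal (lower triangular)."""
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=int)
    mask[:src_len, :src_len] = 1
    mask[src_len:, :src_len] = 1
    mask[src_len:, src_len:] = np.tril(np.ones((tgt_len, tgt_len), dtype=int))
    return mask

print(seq2seq_mask(2, 3))
```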
  • 'Face' in Vietnamese
    Body part extensions with mặt ‘face’ in Vietnamese. Annika Tjuka
    1 Introduction: In many languages, body part terms have multiple meanings. The word tongue, for example, refers not only to the body part but also to a landscape or object feature, as in tongue of the ocean or tongue of flame. These examples illustrate that body part terms are commonly mapped to concrete objects. In addition, body part terms can be transferred to abstract domains, including the description of emotional states (e.g., Yu 2002). The present study investigates the mapping of the body part term face to the concrete domain of objects. Expressions like mouth of the river and foot of the bed reveal the different analogies for extending human body parts to object features. In the case of mouth of the river, the body part is extended on the basis of the function of the mouth as an opening. In contrast, the expression foot of the bed refers to the part of the bed where your feet are if you lie in it. Thus, the body part is mapped according to the spatial alignment of the body part and the object. Ambiguous words challenge our understanding of how the mental lexicon is structured. Two processes could lead to the activation of a specific meaning: i) all meanings of a word are stored in one lexical entry and the relevant one is retrieved, ii) only a core meaning is stored and the other senses are generated by lexical rules (for a description of both models, see Pustejovsky 1991). In the following qualitative analysis, I concentrate on the different meanings of the body part face and discuss the implications for its representation in the mental lexicon.
  • Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation
    Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation. Alexander Panchenko, Fide Marten, Eugen Ruppert, Stefano Faralli, Dmitry Ustalov, Simone Paolo Ponzetto, and Chris Biemann. Language Technology Group, Department of Informatics, Universität Hamburg, Germany; Web and Data Science Group, Department of Informatics, Universität Mannheim, Germany; Institute of Natural Sciences and Mathematics, Ural Federal University, Russia
    Abstract: Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a WSD system that bridges the gap between these two so far disconnected groups of methods. Namely, our system, providing access to several state-of-the-art WSD models, aims to be interpretable as a knowledge-…
    … manually in one of the underlying resources, such as Wikipedia. Unsupervised knowledge-free approaches, e.g. (Di Marco and Navigli, 2013; Bartunov et al., 2016), require no manual labor, but the resulting sense representations lack the above-mentioned features enabling interpretability. For instance, systems based on sense embeddings are based on dense uninterpretable vectors. Therefore, the meaning of a sense can be interpreted only on the basis of a list of related senses. We present a system that brings interpretability of the knowledge-based sense representations into the world of unsupervised knowledge-free WSD models. The contribution of this paper is the first system for word sense induction and disambiguation, which is unsupervised, knowledge-free, and interpretable at the same time.
  • Ten Years of BabelNet: A Survey
    Ten Years of BabelNet: A Survey. Roberto Navigli 1, Michele Bevilacqua 1, Simone Conia 1, Dario Montagnini 2 and Francesco Cecconi 2. 1 Sapienza NLP Group, Sapienza University of Rome, Italy; 2 Babelscape, Italy. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Survey Track.
    Abstract: The intelligent manipulation of symbolic knowledge has been a long-sought goal of AI. However, when it comes to Natural Language Processing (NLP), symbols have to be mapped to words and phrases, which are not only ambiguous but also language-specific: multilinguality is indeed a desirable property for NLP systems, and one which enables the generalization of tasks where multiple languages need to be dealt with, without translating text. In this paper we survey BabelNet, a popular wide-coverage lexical-semantic knowledge resource obtained by merging heterogeneous sources into a unified semantic network that helps to scale tasks and applications to hundreds of languages.
    … to integrate symbolic knowledge into neural architectures [d’Avila Garcez and Lamb, 2020]. The rationale is that the use of, and linkage to, symbolic knowledge can not only enable interpretable, explainable and accountable AI systems, but it can also increase the degree of generalization to rare patterns (e.g., infrequent meanings) and promote better use of information which is not explicit in the text. Symbolic knowledge requires that the link between form and meaning be made explicit, connecting strings to representations of concepts, entities and thoughts. Historical resources such as WordNet [Miller, 1995] are important endeavors which systematize symbolic knowledge about the words of a language, i.e., lexicographic knowledge, not only in a machine-readable format, but also in structured form, thanks to the organization of concepts into a semantic network.
  • KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents
    KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents. Paramita Mirza 1, Fariz Darari 2, Rahmad Mahendra 2. 1 Max Planck Institute for Informatics, Germany; 2 Faculty of Computer Science, Universitas Indonesia, Indonesia. paramita@mpi-inf.mpg.de, {fariz,rahmad.mahendra}@cs.ui.ac.id
    Abstract: We present KOI (Knowledge of Incidents), a system that, given news articles as input, builds a knowledge graph (KOI-KG) of incidental events. KOI-KG can then be used to efficiently answer questions such as “How many killing incidents happened in 2017 that involve Sean?” The required steps in building the KG include: (i) document preprocessing involving word sense disambiguation, named-entity recognition, temporal expression recognition and normalization, and semantic role labeling; (ii) incidental event extraction and coreference resolution via document clustering; and (iii) KG construction and population.
    Subtask S3: In order to answer questions of type (ii), participating systems are also required to identify participant roles in each identified answer incident (e.g., victim, subject-suspect), and use such information along with victim-related numerals (“three people were killed”) mentioned in the corresponding answer documents, i.e., documents that report on the answer incident, to determine the total number of victims.
    Datasets: The organizers released two datasets: (i) test data, stemming from three domains of gun violence, fire disasters and business, and (ii) trial data, covering only the gun violence domain. Each dataset …
  • Web Search Result Clustering with BabelNet
    Web Search Result Clustering with BabelNet. Marek Kozlowski (OPI-PIB, [email protected]) and Maciej Kazula (OPI-PIB, [email protected])
    Abstract: In this paper we present a well-known search result clustering method enriched with BabelNet information. The goal is to verify how BabelNet/Babelfy can improve the clustering quality in the web search results domain. During the evaluation we tested three algorithms (Bisecting K-Means, STC, Lingo). At the first stage, we performed experiments only with textual features coming from snippets. Next, we introduced new semantic features from BabelNet (as disambiguated synsets, categories and glosses describing synsets, or semantic edges) in order to verify how they influence the clustering quality of the search result clustering. The algorithms were evaluated on the AMBIENT dataset in terms of the clustering quality.
    1 Introduction: In the previous years, Web clustering engines (Carpineto, 2009) have been proposed as a solution to the issue of lexical ambiguity in Informa- …
    2 Related Work. 2.1 Search Result Clustering: The goal of text clustering in information retrieval is to discover groups of semantically related documents. Contextual descriptions (snippets) of documents returned by a search engine are short, often incomplete, and highly biased toward the query, so establishing a notion of proximity between documents is a challenging task that is called Search Result Clustering (SRC). Search Results Clustering (SRC) is a specific area of documents clustering. Approaches to search result clustering can be classified as data-centric or description-centric (Carpineto, 2009). The data-centric approach (as Bisecting K-Means) focuses more on the problem of data clustering, rather than presenting the results to the user. Other data-centric methods use hierarchical agglomerative clustering (Maarek, 2000) that replaces single terms with lexical affinities (2-grams of words) as features, or exploit link information (Zhang, 2008).
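A small, self-contained sketch of the first stage described in this abstract (clustering snippets on purely textual features) using scikit-learn. The snippets, the number of clusters and the use of plain K-Means instead of Bisecting K-Means, STC or Lingo are simplifications, and the BabelNet-based semantic features of the second stage are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy snippets returned for the ambiguous query "jaguar".
snippets = [
    "Jaguar unveils its new luxury SUV model",
    "The jaguar is the largest cat in the Americas",
    "Used Jaguar cars for sale near you",
    "Jaguar habitat loss in the Amazon rainforest",
]

# Stage 1: purely textual features from the snippets, then clustering.
X = TfidfVectorizer(stop_words="english").fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. car-related vs. animal-related snippets

# Stage 2 in the paper adds BabelNet-derived features (synsets, categories,
# glosses, semantic edges) as extra columns before clustering; omitted here.
```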
  • Analyzing Political Polarization in News Media with Natural Language Processing
    Analyzing Political Polarization in News Media with Natural Language Processing, by Brian Sharber. A thesis presented to the Honors College of Middle Tennessee State University in partial fulfillment of the requirements for graduation from the University Honors College, Fall 2020. Approved by Dr. Salvador E. Barbosa (Project Advisor, Computer Science Department), Dr. Medha Sarkar (Computer Science Department Chair) and Dr. John R. Vile (Dean, University Honors College).
    Copyright © 2020 Brian Sharber, Department of Computer Science, Middle Tennessee State University; Murfreesboro, Tennessee, USA. The software described in this work is free software; you can redistribute it and/or modify it under the terms of the MIT License. The software is posted on GitHub under the following repository: https://github.com/briansha/NLP-Political-Polarization
    Abstract: Political discourse in the United States is becoming increasingly polarized.
  • Modeling and Encoding Traditional Wordlists for Machine Applications
    Modeling and Encoding Traditional Wordlists for Machine Applications. Shakthi Poornima (Department of Linguistics, University at Buffalo, Buffalo, NY USA; [email protected]) and Jeff Good (Department of Linguistics, University at Buffalo, Buffalo, NY USA; [email protected])
    Abstract: This paper describes work being done on the modeling and encoding of a legacy resource, the traditional descriptive wordlist, in ways that make its data accessible to NLP applications. We describe an abstract model for traditional wordlist entries and then provide an instantiation of the model in RDF/XML which makes clear the relationship between our wordlist database and interlingua approaches aimed towards machine translation, and which also allows for straightforward interoperation with data from full lexicons.
    1 Introduction: When looking at the relationship between NLP and linguistics, it is typical to focus on the different approaches taken with respect to issues … Clearly, descriptive linguistic resources can be of potential value not just to traditional linguistics, but also to computational linguistics. The difficulty, however, is that the kinds of resources produced in the course of linguistic description are typically not easily exploitable in NLP applications. Nevertheless, in the last decade or so, it has become widely recognized that the development of new digital methods for encoding language data can, in principle, not only help descriptive linguists to work more effectively but also allow them, with relatively little extra effort, to produce resources which can be straightforwardly repurposed for, among other things, NLP (Simons et al., 2004; Farrar and Lewis, 2007). Despite this, it has proven difficult to create significant electronic descriptive resources due to the complex and specific problems inevitably associated with the conversion of legacy data.
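A rough sketch of what an abstract wordlist-entry model might look like before it is instantiated in RDF/XML as the paper does. The field names and values are assumptions chosen only to show how a language-specific form, its gloss and a shared comparison meaning can be kept together with provenance.

```python
import json

# A toy wordlist entry: a language-specific form paired with a gloss that
# points to a language-independent comparison meaning (interlingua-style key).
entry = {
    "language": "xyz",                  # ISO 639-3 code (placeholder)
    "form": "nofo",                     # transcribed form as found in the wordlist
    "gloss": "water",                   # collector's gloss
    "comparison_meaning": "WATER",      # shared key enabling cross-wordlist linking
    "source": "Smith 1923, item 17",    # provenance of the legacy record (invented)
}

print(json.dumps(entry, indent=2))
```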