Phrase Based Language Model for Statistical Machine Translation
Recommended publications
A Noun Phrase Parser of English (Atro Voutilainen, Helsinki)
A Noun Phrase Parser of English. Atro Voutilainen, Helsinki. (Proceedings of NODALIDA 1993, pages 301-310.)

Abstract: An accurate rule-based noun phrase parser of English is described. Special attention is given to the linguistic description. A report on a performance test concludes the paper.

1. Introduction

1.1 Motivation. A noun phrase parser is useful for several purposes, e.g. for index term generation in an information retrieval application; for the extraction of collocational knowledge from large corpora for the development of computational tools for language analysis; for providing a shallow but accurately analysed input for a more ambitious parsing system; for the discovery of translation units; and so on. Actually, the present noun phrase parser is already used in a noun phrase extractor called NPtool (Voutilainen 1993).

1.2 Constraint Grammar. The present system is based on the Constraint Grammar framework originally proposed by Karlsson (1990). A few characteristics of this framework are in order.
• The linguistic representation is based on surface-oriented morphosyntactic tags that can encode dependency-oriented functional relations between words.
• Parsing is reductionistic. All conventional analyses are provided as alternatives to each word by a context-free lookup mechanism, typically a morphological analyser. The parser itself seeks to discard all and only the contextually illegitimate alternative readings. What 'survives' is the parse.
• The system is modular and sequential. For instance, a grammar for the resolution of morphological (or part-of-speech) ambiguities is applied before a syntactic module is used to introduce and then resolve syntactic ambiguities.
• The parsing description is based on linguistic generalisations rather than probabilities.
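The reductionistic parsing style the excerpt describes can be illustrated with a toy sketch. The mini-lexicon, tagset, and single constraint below are invented for illustration; they are not Voutilainen's actual grammar.

```python
# Reductionistic disambiguation in the Constraint Grammar style: every word
# starts with all readings a morphological analyser would propose, and
# constraints discard contextually illegitimate readings. What survives is
# the parse. Lexicon, tags, and the constraint here are hypothetical.

LEXICON = {
    "the": ["DET"],
    "round": ["ADJ", "NOUN", "VERB"],
    "table": ["NOUN", "VERB"],
}

def constraint_no_verb_after_det(readings, i):
    """Discard VERB readings immediately after an unambiguous determiner."""
    if i > 0 and readings[i - 1] == ["DET"]:
        remaining = [r for r in readings[i] if r != "VERB"]
        return remaining or readings[i]  # never discard the last reading
    return readings[i]

def parse(words):
    readings = [list(LEXICON[w]) for w in words]  # context-free lookup
    for i in range(len(words)):                   # apply constraints
        readings[i] = constraint_no_verb_after_det(readings, i)
    return readings

print(parse(["the", "round", "table"]))
# [['DET'], ['ADJ', 'NOUN'], ['NOUN', 'VERB']] -- 'round' loses its VERB
# reading; a fuller grammar would resolve the remaining ambiguity.
```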
Personal Pronouns, Pronoun-Antecedent Agreement, and Vague or Unclear Pronoun References
Personal Pronouns, Pronoun-Antecedent Agreement, and Vague or Unclear Pronoun References

PERSONAL PRONOUNS
Personal pronouns are pronouns that are used to refer to specific individuals or things. Personal pronouns can be singular or plural, and can refer to someone in the first, second, or third person. First person is used when the speaker or narrator is identifying himself or herself. Second person is used when the speaker or narrator is directly addressing another person who is present. Third person is used when the speaker or narrator is referring to a person who is not present or to anything other than a person, e.g., a boat, a university, a theory. First-, second-, and third-person personal pronouns can all be singular or plural. Also, all of them can be nominative (the subject of a verb), objective (the object of a verb or preposition), or possessive. Personal pronouns tend to change form as they change number and function.

              Singular                  Plural
1st person    I, me, my, mine           we, us, our, ours
2nd person    you, you, your, yours     you, you, your, yours
3rd person    he, him, his, his         they, them, their, theirs
              she, her, her, hers
              it, it, its

Most academic writing uses third-person personal pronouns exclusively and avoids first- and second-person personal pronouns.

PRONOUN-ANTECEDENT AGREEMENT
A personal pronoun takes the place of a noun. An antecedent is the word, phrase, or clause to which a pronoun refers. In all of the following examples, the antecedent is in bold and the pronoun is italicized: The teacher forgot her book.
Evaluation of Machine Learning Algorithms for SMS Spam Filtering
EVALUATION OF MACHINE LEARNING ALGORITHMS FOR SMS SPAM FILTERING. David Bäckman. Bachelor Thesis, 15 credits, Bachelor of Science Programme in Computing Science, 2019.

Abstract: The purpose of this thesis is to evaluate different machine learning algorithms and methods for text representation in order to determine which is best suited to distinguishing spam SMS from legitimate SMS. A data set that contains 5573 real SMS has been used to train the algorithms K-Nearest Neighbor, Support Vector Machine, Naive Bayes and Logistic Regression. The different methods that have been used to represent text are Bag of Words, Bigram and Word2Vec. In particular, it has been investigated whether semantic text representations can improve the performance of classification. A total of 12 combinations have been evaluated with the help of the metrics accuracy and F1-score. The results show that Logistic Regression together with Bag of Words reaches the highest accuracy and F1-score. Bigram as text representation seems to work worse than the other methods. Word2Vec can increase the performance for K-Nearest Neighbor but not for the other algorithms.

Acknowledgements: I would like to thank my supervisor Kai-Florian Richter for all the good advice and guidance through the project. I would also like to thank all my classmates for help and support during the education; you have made it possible for me to reach this day.

Contents: 1 Introduction (1.1 Background, 1.2 Purpose and Research Questions); 2 Related Work; 3 Theoretical Background (3.1 The Choice of Algorithms, 3.2 Classification ...)
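As a concrete illustration of the winning combination the abstract reports (Bag of Words features with Logistic Regression, scored by accuracy and F1), here is a minimal scikit-learn sketch; the file name and column layout are assumptions, not the thesis's actual setup.

```python
# Minimal sketch: Bag of Words + Logistic Regression for SMS spam filtering,
# evaluated by accuracy and F1-score. "sms_spam.csv" with columns "text" and
# "label" is a hypothetical data layout.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("sms_spam.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"] == "spam", test_size=0.2, random_state=0)

vectorizer = CountVectorizer()                 # Bag of Words representation
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train_bow, y_train)
pred = clf.predict(X_test_bow)
print("accuracy:", accuracy_score(y_test, pred))
print("F1-score:", f1_score(y_test, pred))
```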
NLP - Assignment 2
NLP - Assignment 2. Week 2, December 27th, 2016.

1. A 5-gram model is a ________-order Markov Model: (a) Six (b) Five (c) Four (d) Constant. Ans: c) Four

2. For the following corpus C1 of 3 sentences, what is the total count of unique bigrams for which the likelihood will be estimated? Assume we do not perform any pre-processing, and we are using the corpus as given. (i) ice cream tastes better than any other food (ii) ice cream is generally served after the meal (iii) many of us have happy childhood memories linked to ice cream. (a) 22 (b) 27 (c) 30 (d) 34. Ans: b) 27

3. Arrange the words "curry, oil and tea" in descending order, based on the frequency of their occurrence in the Google Books n-grams. The Google Books n-gram viewer is available at https://books.google.com/ngrams: (a) tea, oil, curry (b) curry, oil, tea (c) curry, tea, oil (d) oil, tea, curry. Ans: d) oil, tea, curry

4. Given a corpus C2, the Maximum Likelihood Estimation (MLE) for the bigram "ice cream" is 0.4 and the count of occurrence of the word "ice" is 310. The likelihood of "ice cream" after applying add-one smoothing is 0.025, for the same corpus C2. What is the vocabulary size of C2? (a) 4390 (b) 4690 (c) 5270 (d) 5550. Ans: b) 4690

Questions 5 to 10 require you to analyse the data given in the corpus C3, using a programming language of your choice.
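The arithmetic behind the answer to question 4 follows from the standard add-one (Laplace) smoothing formula; a short worked derivation:

```latex
\begin{align*}
c(\text{ice cream}) &= P_{\mathrm{MLE}}(\text{cream}\mid\text{ice})\cdot c(\text{ice}) = 0.4 \times 310 = 124 \\
P_{\mathrm{add\text{-}one}}(\text{cream}\mid\text{ice}) &= \frac{c(\text{ice cream})+1}{c(\text{ice})+V} = \frac{125}{310+V} = 0.025 \\
310 + V &= \frac{125}{0.025} = 5000 \quad\Rightarrow\quad V = 4690
\end{align*}
```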
Dictionaries and Tolerant Retrieval (Chapter 3)
3 Dictionaries and tolerant retrieval

In Chapters 1 and 2 we developed the ideas underlying inverted indexes for handling Boolean and proximity queries. Here, we develop techniques that are robust to typographical errors in the query, as well as alternative spellings. In Section 3.1 we develop data structures that help the search for terms in the vocabulary in an inverted index. In Section 3.2 we study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all the five vowels in sequence. The * symbol indicates any (possibly empty) string of characters. Users pose such queries to a search engine when they are uncertain about how to spell a query term, or seek documents containing variants of a query term; for instance, the query automat* would seek documents containing any of the terms automatic, automation and automated. We then turn to other forms of imprecisely posed queries, focusing on spelling errors in Section 3.3. Users make spelling errors either by accident, or because the term they are searching for (e.g., Herman) has no unambiguous spelling in the collection. We detail a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms. Finally, in Section 3.4 we study a method for seeking vocabulary terms that are phonetically close to the query term(s).
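As a toy illustration of one tolerant-retrieval idea (not code from the book), a trailing wildcard such as automat* can be answered by binary search over a sorted vocabulary; the vocabulary below is invented.

```python
# Answering a trailing-wildcard query "prefix*" by locating the range of
# vocabulary terms that begin with the prefix in a sorted term list.
import bisect

vocabulary = sorted(["automata", "automated", "automatic",
                     "automation", "autonomy", "herman"])

def trailing_wildcard(prefix):
    """Return all vocabulary terms matching 'prefix*'."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")  # past all extensions
    return vocabulary[lo:hi]

print(trailing_wildcard("automat"))
# ['automata', 'automated', 'automatic', 'automation']
```

General wildcards like *a*e*i*o*u* need more machinery (the chapter develops permuterm and k-gram indexes for that); this sketch only covers the trailing-* case.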
TRADITIONAL GRAMMAR REVIEW
TRADITIONAL GRAMMAR REVIEW

I. Parts of Speech

Traditional grammar recognizes eight parts of speech:

noun: A noun is the name of a person, place, or thing. (John bought the book.)
verb: A verb is a word which expresses action or state of being. (Ralph hit the ball hard. Janice is pretty.)
adjective: An adjective describes or modifies a noun. (The big, red barn burned down yesterday.)
adverb: An adverb describes or modifies a verb, adjective, or another adverb. (He quickly left the room. She fell down hard.)
pronoun: A pronoun takes the place of a noun. (She picked someone up today.)
conjunction: A conjunction connects words or groups of words. (Bob and Jerry are going. Either Sam or I will win.)
preposition: A preposition is a word that introduces a phrase showing a relation between the noun or pronoun in the phrase and some other word in the sentence. (The dog with the shaggy coat. He went past the gate. He gave the book to her.)
interjection: An interjection is a word that expresses strong feeling. (Wow! Gee! Whew! And other four-letter words.)

II. Phrases

A phrase is a group of related words that does not contain a subject and a verb in combination. Generally, a phrase is used in the sentence as a single part of speech. In this section we will be concerned with prepositional phrases, gerund phrases, participial phrases, and infinitive phrases.

Prepositional Phrases: The preposition is a single (usually small) word or a cluster of words that show relationship between the object of the preposition and some other word in the sentence.
Code-Mixing as Found in Kartini Magazine
Code-Mixing as Found in Kartini Magazine. Dewi Yanti Agustina Sinaga, [email protected]

Abstract: This study deals with code-mixing in Kartini magazine. The objective of the study is to analyze the components of language used in code-mixing, namely word, phrase, clause and sentence, and to find out which component of language occurs dominantly in code-mixing in Kartini magazine. The study is limited to code-mixing between English and Indonesian only, in articles of Kartini magazine taken randomly as the sample. The study was conducted using a descriptive quantitative design. The technique for collecting the data was a documentary technique. The data were analyzed based on the components of language, which consist of word, phrase, clause and sentence. Having analyzed the data, it was found that the component of language dominantly used is the word. The percentage for words is 59.6%, followed by phrases at 38%, sentences at 2.4% and clauses at 0%.

Key words: code mixing, word, phrase, clause and sentence

1. The Background of the Study. Communication has become a basic need for human beings. The activity or process of expressing ideas and feelings or giving information is called communication (Hornby, 2008: 84). The main instrument of communication is language. By using it, humans can get information. Jendra (2010: 1) explains that "The word language is used only to refer to human's way of communicating." In addition, Alwasilah (1983: 37) also defines that "Language is an activity of human mind, and activities of human minds are wonderfully varied, often illogical, sometimes unpredictable and occasionally obscure."
PARTS OF SPEECH
PARTS OF SPEECH

ADJECTIVE: Describes a noun or pronoun; tells which one, what kind or how many.
ADVERB: Describes verbs, adjectives, or other adverbs; tells how, why, when, where, to what extent.
CONJUNCTION: A word that joins two or more structures; may be coordinating, subordinating, or correlative.
INTERJECTION: A word, usually at the beginning of a sentence, which is used to show emotion: one expressing strong emotion is followed by an exclamation point (!); mild emotion followed by a comma (,).
NOUN: Name of a person, place, or thing (tells who or what); may be concrete or abstract; common or proper, singular or plural.
PREPOSITION: A word that connects a noun or noun phrase (the object) to another word, phrase, or clause and conveys a relation between the elements.
PRONOUN: Takes the place of a person, place, or thing; can function any way a noun can function; may be nominative, objective, or possessive; may be singular or plural; may be personal (therefore, first, second or third person), demonstrative, intensive, interrogative, reflexive, relative, or indefinite.
VERB: Word that represents an action or a state of being; may be action, linking, or helping; may be past, present, or future tense; may be singular or plural; may have active or passive voice; may be indicative, imperative, or subjunctive mood.

FUNCTIONS OF WORDS WITHIN A SENTENCE:

CLAUSE: A group of words that contains a subject and complete predicate; may be independent (able to stand alone as a simple sentence) or dependent (unable to stand alone, not expressing a complete thought, acting as either a noun, adjective, or adverb).
Chapter 3: Noun Phrases
Chapter 3: Noun Phrases

Now that we have established something about the structure of verb phrases, let's move on to noun phrases (NPs). A noun phrase is a noun or pronoun head and all of its modifiers (or the coordination of more than one NP, to be discussed in Chapter 6). Some nouns require the presence of a determiner as a modifier. Most pronouns are typically not modified at all, and no pronoun requires the presence of a determiner. We'll start with pronouns because they are a relatively simple closed class.

Pronouns

English has several categories of pronouns. Pronouns differ in the contexts they appear in and in the grammatical information they contain. Pronouns in English can contrast in person, number, gender, and case. We've already discussed person and number, but to review:

1. English has three persons:
   • first person, which is the speaker or the group that includes the speaker;
   • second person, which is the addressee or the group of addressees;
   • third person, which is anybody or anything else.
2. English has two numbers:
   • singular, which refers to a singular individual or undifferentiated group or mass;
   • plural, which refers to more than one individual.

The difference between we and they is a difference in person: we is first person and they is third person. The difference between I and we is a difference in number: I is singular and we is plural. The other two categories which pronouns mark are gender and case. Gender is the system of marking nominal categories.
Parts of Speech
PARTS OF SPEECH

NOUN: Name of a person, place, thing or quality. (examples: Billy, Chicago, pencil, courage)
PRONOUN: Word used in place of a noun. (examples: he, she, it, they, we)
ADJECTIVE: Word that describes or limits a noun or pronoun. (examples: big, blue, mean, pretty)
VERB: Word expressing action or state of being. (examples: write, kiss, is, feels)
ADVERB: Word used to modify the meaning of a verb, adjective, or another adverb. (examples: always, once, quickly, too, there)
PREPOSITION: Word used to show the relation between two or more things. (examples: to, at, under, into, between)
CONJUNCTION: Word used to join a word or group of words with another word or group of words.
INTERJECTION: An exclamation. (examples: oh!, wow!)

GRAMMAR

Subject: The something or someone talked about or doing the action in a sentence. (ex.: President Johnson disliked his portrait.)
Predicate: The verb plus all its modifiers and complements. (ex.: President Johnson disliked his portrait.)
Fused Sentence: Two sentences run together with no punctuation between them. (ex.: President Johnson disliked his portrait he felt it made him look too old.)
Comma Splice: Two sentences separated only by a comma. (ex.: President Johnson disliked his portrait, he felt it made him look too old.)
Fragment: A group of words presented as a sentence but which lacks the elements of a sentence. (ex.: Once upon a time.)
Sentence: A group of words containing a subject, a verb, and expressing a complete thought. (ex.: Once upon a time, Jack and Jill tumbled down the hill.)

HELPING VERBS: be, being, been, am, is, are, was, were, do, does, did, have, has, had, having, can, could, will, would, shall, should, may, might, must

PREPOSITIONS: about, above, across, after, against, along, among, around, at, before, behind, below, beneath, besides, between, beyond, by, down, during, for, from, in, inside, into, like, near, of, off, on, onto, over, since, through, to, toward, under, underneath, until, up, upon, with, within, without
N-Gram-Based Machine Translation
N-gram-based Machine Translation. José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, Marta R. Costa-jussà. Universitat Politècnica de Catalunya.

This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).

1. Introduction

The beginnings of statistical machine translation (SMT) can be traced back to the early fifties, closely related to the ideas from which information theory arose (Shannon and Weaver 1949) and inspired by works on cryptography (Shannon 1949, 1951) during World War II. According to this view, machine translation was conceived as the problem of finding a sentence by decoding a given "encrypted" version of it (Weaver 1955). Although the idea seemed very feasible, enthusiasm faded shortly afterward because of the computational limitations of the time (Hutchins 1986). Finally, during the nineties, two factors made it possible for SMT to become an actual and practical technology: first, a significant increment in both the computational power and storage capacity of computers, and second, the availability of large volumes of bilingual data. The first SMT systems were developed in the early nineties (Brown et al. 1990, 1993). These systems were based on the so-called noisy channel approach, which models the probability of a target language sentence T given a source language sentence S as the product of a translation-model probability p(S|T), which accounts for adequacy of translation contents, times a target language probability p(T), which accounts for fluency of target constructions.
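The noisy-channel formulation the excerpt describes can be stated compactly: by Bayes' rule, decoding searches for the target sentence that maximizes the product of the translation model and the language model (a standard rendering, not copied from the article):

```latex
% Noisy-channel decoding: p(S) is constant for a given source sentence S,
% so maximizing p(T|S) reduces to maximizing p(S|T) * p(T).
\hat{T} = \arg\max_{T} \, p(T \mid S)
        = \arg\max_{T} \, p(S \mid T)\, p(T)
```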
Word Embedding Papers and Their Evaluation Methods

Title and link / description:
• Improving Distributional Similarity with Lessons Learned from Word ...: word similarity; analogy; qualitative: compare neighborhoods across embeddings
• Dependency-Based Word Embeddings: word similarity
• Multi-Granularity Chinese Word Embedding: word similarity; analogy; qualitative: local neighborhoods
• A Mixture Model for Learning Multi-Sense Word Embeddings: word similarity; analogy
• Learning Crosslingual Word Embeddings without Bilingual Corpora: multilingual; word similarities (monolingual)
• Word Embeddings with Limited Memory: word similarity; phrase similarity
• A Latent Variable Model Approach to PMI-based Word Embeddings: analogy
• Prepositional Phrase Attachment over Word Embedding Products: extrinsic
• A Simple Word Embedding Model for Lexical Substitution: lexical substitution (nearest neighbors after vector averaging)
• Segmentation-Free Word Embedding for Unsegmented Languages: extrinsic; word similarity
• Evaluation methods for unsupervised word embeddings: analogy; concept categorization; selectional preference
• Dict2vec: Learning Word Embeddings using Lexical Dictionaries: word similarity; extrinsic
• Task-Oriented Learning of Word Embeddings for Semantic Relation ...: extrinsic
• Refining Word Embeddings for Sentiment Analysis: extrinsic
• Language classification from bilingual word embedding graphs: word similarity
• Learning principled bilingual mappings of word embeddings while ...: multilingual
• A Word Embedding Approach to Identifying Verb-Noun Idiomatic ...: extrinsic
• Deep Multilingual ...