Lexical Analysis Example: Source Code


CS421 COMPILERS AND INTERPRETERS --- Copyright 1994-2017 Zhong Shao, Yale University

Lexical Analysis

• Read the source program and produce a list of tokens ("linear" analysis). The lexical analyzer sits between the source program and the parser: it hands the parser one token at a time, and the parser calls back to "get next token".
• The lexical structure is specified using regular expressions.
• Other secondary tasks: (1) get rid of white spaces (e.g., \t, \n, \sp) and comments; (2) line numbering.

Example: Source Code

A sample toy program:

    (* define valid mutually recursive procedures *)
    let
      function do_nothing1(a: int, b: string) =
        do_nothing2(a+1)
      function do_nothing2(d: int) =
        do_nothing1(d, "str")
    in
      do_nothing1(0, "str2")
    end

What do we really care about here?

The Lexical Structure

• Tokens are the atomic units of a language, and are usually specific strings or instances of classes of strings.

    Token     Sample Values               Informal Description
    LET       let                         keyword LET
    END       end                         keyword END
    PLUS      +
    LPAREN    (
    COLON     :
    STRING    "str"
    RPAREN    )
    INT       49, 48                      integer constants
    ID        do_nothing1, a,             letter followed by letters,
              int, string                 digits, and underscores
    EQ        =
    EOF                                   end of file

Tokens Output after the Lexical Analysis

The output is a list of tokens with their associated values, each followed by its character position in the source:

    LET 51   FUNCTION 56   ID(do_nothing1) 65   LPAREN 76   ID(a) 77   COLON 78
    ID(int) 80   COMMA 83   ID(b) 85   COLON 86   ID(string) 88   RPAREN 94
    EQ 95   ID(do_nothing2) 99   LPAREN 110   ID(a) 111   PLUS 112   INT(1) 113
    RPAREN 114   FUNCTION 117   ID(do_nothing2) 126   LPAREN 137   ID(d) 138
    COLON 139   ID(int) 141   RPAREN 144   EQ 146   ID(do_nothing1) 150
    LPAREN 161   ID(d) 162   COMMA 163   STRING(str) 165   RPAREN 170   IN 173
    ID(do_nothing1) 177   LPAREN 188   INT(0) 189   COMMA 190   STRING(str2) 192
    RPAREN 198   END 200   EOF 203

Lexical Analysis, How?

• First, write down the lexical specification (how is each token defined?), using regular expressions to specify the lexical structure:

    identifier = letter (letter | digit | underscore)*
    letter     = a | ... | z | A | ... | Z
    digit      = 0 | 1 | ... | 9

• Second, based on the above lexical specification, build the lexical analyzer (to recognize tokens), either by hand:

    Regular Expression Spec ==> NFA ==> DFA ==> Transition Table ==> Lexical Analyzer

  or by using lex, the lexical analyzer generator:

    Regular Expression Spec (in lex format) ==> feed to lex ==> Lexical Analyzer
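To make the hand-written route concrete, here is a minimal scanner sketch in the spirit of the token list above. It is not from the slides: the token set, the name next_token, and the driver are invented for illustration. It recognizes identifiers and integer constants by the rules just given, skips white space, and prints each token with its character position.

    /* A minimal hand-written scanner sketch.  Token set and driver are
       invented for illustration; they cover only a fragment of the toy
       language above. */
    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TK_ID, TK_INT, TK_LPAREN, TK_RPAREN,
                   TK_PLUS, TK_COMMA, TK_EOF } Token;
    static const char *name[] = { "ID", "INT", "LPAREN", "RPAREN",
                                  "PLUS", "COMMA", "EOF" };

    static const char *src = "do_nothing2(a+1)";   /* source program   */
    static int pos = 0;                            /* current position */

    /* Return the next token; *start receives its character position. */
    static Token next_token(int *start) {
        while (isspace((unsigned char)src[pos])) pos++;  /* skip white space */
        *start = pos;
        char c = src[pos];
        if (c == '\0') return TK_EOF;
        if (isalpha((unsigned char)c)) {        /* letter (letter|digit|_)* */
            do pos++; while (isalnum((unsigned char)src[pos]) || src[pos] == '_');
            return TK_ID;
        }
        if (isdigit((unsigned char)c)) {        /* digit+ */
            do pos++; while (isdigit((unsigned char)src[pos]));
            return TK_INT;
        }
        pos++;
        switch (c) {
        case '(': return TK_LPAREN;
        case ')': return TK_RPAREN;
        case '+': return TK_PLUS;
        case ',': return TK_COMMA;
        default:  return TK_EOF;   /* a real lexer would report an error here */
        }
    }

    int main(void) {
        Token t; int start;
        do {
            t = next_token(&start);
            printf("%s %d\n", name[t], start);   /* token + position */
        } while (t != TK_EOF);
        return 0;
    }

Run on the fragment do_nothing2(a+1), it prints ID 0, LPAREN 11, ID 12, PLUS 13, INT 14, RPAREN 15, EOF 16.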
Regular Expressions

• Regular expressions are a concise, linguistic characterization of regular languages (regular sets). In a specification such as

    identifier = letter (letter | digit | underscore)*

  "|" reads "or" and "*" reads "0 or more".

• Each regular expression defines a regular language --- a set of strings over some alphabet, such as the ASCII characters; each member of this set is called a sentence, or a word.
• We use regular expressions to define each category of tokens. For example, the identifier expression above specifies the set of strings that are a sequence of letters, digits, and underscores, starting with a letter.

Regular Expressions and Regular Languages

• Given an alphabet Σ, the regular expressions over Σ and their corresponding regular languages are:
  a) ε denotes the empty string; as a regular expression it denotes the language {ε}.
  b) for each a in Σ, a denotes {a} --- a language with one string.
  c) if R denotes L_R and S denotes L_S, then R | S denotes the language L_R ∪ L_S, i.e., { x | x ∈ L_R or x ∈ L_S }.
  d) if R denotes L_R and S denotes L_S, then RS denotes the language L_R L_S, that is, { xy | x ∈ L_R and y ∈ L_S }.
  e) if R denotes L_R, then R* denotes the language (L_R)*, where L* is the union of all L^i (i = 0, 1, ...) and L^i = { x1 x2 ... xi | x1 ∈ L, ..., xi ∈ L }.
  f) if R denotes L_R, then (R) denotes the same language L_R.

Example

    Regular Expression     Explanation
    a*                     0 or more a's
    a+                     1 or more a's
    (a|b)*                 all strings of a's and b's (including ε)
    (aa|ab|ba|bb)*         all strings of a's and b's of even length
    [a-zA-Z]               shorthand for "a|b|...|z|A|...|Z"
    [0-9]                  shorthand for "0|1|2|...|9"
    0([0-9])*0             numbers that start and end with 0
    (ab|aab|b)*(a|aa|ε)    ?
    ?                      all strings that contain foo as a substring

• The following is not a regular expression: a^n b^n (n > 0).

Lexical Specification

• Using regular expressions to specify tokens:

    keyword    = begin | end | if | then | else
    identifier = letter (letter | digit | underscore)*
    integer    = digit+
    relop      = < | <= | = | <> | > | >=
    letter     = a | b | ... | z | A | B | ... | Z
    digit      = 0 | 1 | 2 | ... | 9

• Ambiguity: is "begin" a keyword or an identifier? (A common resolution is sketched below.)
• Next step: construct a token recognizer for the languages given by regular expressions --- by using finite automata! Given a string x, the token recognizer says "yes" if x is a sentence of the specified language and says "no" otherwise.
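On the keyword-versus-identifier ambiguity: a common resolution (assumed here, not spelled out on the slides) is to let the identifier rule do the matching and then consult a keyword table, so a keyword is simply an identifier with a reserved spelling. A small sketch with an invented classify helper:

    #include <stdio.h>
    #include <string.h>

    typedef enum { TK_BEGIN, TK_END, TK_IF, TK_THEN, TK_ELSE, TK_ID } Token;

    /* Keywords from the specification above; everything else matching
       letter (letter | digit | underscore)* is an ordinary identifier. */
    static const struct { const char *text; Token tok; } keywords[] = {
        { "begin", TK_BEGIN }, { "end", TK_END }, { "if", TK_IF },
        { "then",  TK_THEN },  { "else", TK_ELSE },
    };

    /* Match the longest identifier first, then check the keyword table:
       "begin" becomes TK_BEGIN, but "begin1" stays TK_ID. */
    static Token classify(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i].text) == 0)
                return keywords[i].tok;
        return TK_ID;
    }

    int main(void) {
        printf("%d %d\n", classify("begin"), classify("begin1"));  /* 0 5 */
        return 0;
    }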
Transition Diagrams

• A flowchart with states and edges; each edge is labelled with characters; a certain subset of the states are marked as "final states".
• Transition from state to state proceeds along edges according to the next input character. For identifiers: from the start state 0, an edge labelled letter leads to state 1; state 1 loops on letter, digit, and underscore; an edge labelled delimiter leads to the final state 2.
• Every string that ends up at a final state is accepted.
• If we get "stuck" --- there is no transition for a given character --- it is an error.
• Transition diagrams can be easily translated to programs using case statements (in C).

Transition Diagrams (cont'd)

The token recognizer (for identifiers) based on the transition diagram:

    state0: c = getchar();
            if (isalpha(c)) goto state1;
            error();
            ...
    state1: c = getchar();
            if (isalpha(c) || isdigit(c) || isunderscore(c)) goto state1;
            if (c == ',' || ... || c == ')') goto state2;
            error();
            ...
    state2: ungetc(c, stdin);    /* retract current char */
            return(ID, ... the current identifier ...);

Next: 1. finite automata are generalized transition diagrams! 2. how to build finite automata from regular expressions?

Finite Automata

• Finite automata are similar to transition diagrams; they have states and labelled edges; there is one unique start state and one or more final states.
• Nondeterministic Finite Automata (NFA):
  a) edges can be labelled with ε (these edges are called ε-transitions);
  b) some character can label 2 or more edges out of the same state.
• Deterministic Finite Automata (DFA):
  a) no edges are labelled with ε;
  b) each character can label at most one edge out of the same state.
• An NFA or DFA accepts a string x if there exists a path from the start state to a final state labelled with the characters of x. In an NFA there may be multiple paths; in a DFA there is one unique path.

Example: NFA

An NFA that accepts (a|b)*abb: states 0, 1, 2, 3, with start state 0 and final state 3; state 0 loops on both a and b, and the path 0 --a--> 1 --b--> 2 --b--> 3 spells abb.

There are many possible moves --- to accept a string, we only need one sequence of moves that leads to a final state.

    input string:                     a a b b
    one successful sequence:        0 0 1 2 3
    another, unsuccessful sequence: 0 0 0 0 0

Example: DFA

A DFA that accepts (a|b)*abb: states 0, 1, 2, 3, with start state 0 and final state 3; a moves every state to state 1, while b takes 0 to 0, 1 to 2, 2 to 3, and 3 to 0.

There is only one possible sequence of moves --- either it leads to a final state and the string is accepted, or the input string is rejected.

    input string:              a a b b
    the successful sequence: 0 1 1 2 3

Transition Table

• Finite automata can also be represented using transition tables. For the two automata above:

  For an NFA, each entry is a set of states:

    STATE    a        b
    0        {0,1}    {0}
    1        -        {2}
    2        -        {3}
    3        -        -

  For a DFA, each entry is a unique state:

    STATE    a    b
    0        1    0
    1        1    2
    2        1    3
    3        1    0

NFA with ε-transitions

• An NFA can have ε-transitions --- edges labelled with ε. For example, an NFA whose start state has two ε-edges, one into a branch (states 1 and 2) that accepts aa* and one into a branch (states 3 and 4) that accepts bb*, accepts the regular language denoted by (aa*|bb*).

Regular Expressions -> NFA

• How do we construct an NFA (with ε-transitions) from a regular expression?
• Algorithm: apply the following construction rules, using unique names for all the states (important invariant: always one final state!).

1. Basic construction:
   • ε : an initial state i with an ε-edge to a final state f.
   • a : an initial state i with an edge labelled a to a final state f.

2. "Inductive" construction (let N1 be the NFA for R1 and N2 the NFA for R2):
   • R1 | R2 : add a new initial state i and a new, unique final state f; i has ε-edges to the initial states of N1 and N2, and the final states of N1 and N2 have ε-edges to f.
   • R1 R2 : merge the final state of N1 and the initial state of N2; the initial state is that of N1, and the final state is that of N2.

Example: RE -> NFA

Converting the regular expression (a|b)*abb.
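The transition table translates directly into code, which is the last step of the Regular Expression Spec ==> NFA ==> DFA ==> Transition Table ==> Lexical Analyzer pipeline. The sketch below is not from the slides; it hard-codes the DFA table for (a|b)*abb given above and assumes the input alphabet is just a and b.

    #include <stdio.h>

    /* Table-driven recognizer for (a|b)*abb, using the DFA transition
       table above: rows are states 0-3, columns are inputs a and b;
       state 3 is the final state. */
    static const int delta[4][2] = {
        /* a  b */
        {  1, 0 },    /* state 0 */
        {  1, 2 },    /* state 1 */
        {  1, 3 },    /* state 2 */
        {  1, 0 },    /* state 3 */
    };

    static int accepts(const char *x) {
        int state = 0;
        for (; *x; x++) {
            if (*x != 'a' && *x != 'b') return 0;  /* not in the alphabet */
            state = delta[state][*x - 'a'];        /* one unique move     */
        }
        return state == 3;
    }

    int main(void) {
        printf("%d %d\n", accepts("aabb"), accepts("abab"));  /* prints: 1 0 */
        return 0;
    }

accepts("aabb") follows the unique path 0 1 1 2 3 and returns 1; accepts("abab") ends in state 2 and returns 0.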