Optical Character Recognition Errors and Their Effects on Natural Language Processing


Optical Character Recognition Errors and Their Effects on Natural Language Processing

International Journal on Document Analysis and Recognition manuscript No. (will be inserted by the editor)

Daniel Lopresti
Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015, USA

Received December 19, 2008 / Revised August 23, 2009

Abstract. Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.

Key words: Performance evaluation – Optical character recognition – Sentence boundary detection – Tokenization – Part-of-speech tagging

1 Introduction

Despite decades of research and the existence of established commercial products, the output from optical character recognition (OCR) processes often contains errors. The more highly degraded the input, the greater the error rate. Since such systems can form the first stage in a pipeline where later stages are designed to support sophisticated information extraction and exploitation applications, it is important to understand the effects of recognition errors on downstream text analysis routines. Are all recognition errors equal in impact, or are some worse than others? Can the performance of each stage be optimized in isolation, or must the end-to-end system be considered? What are the most serious forms of degradation a page can suffer in the context of natural language processing? In balancing the tradeoff between the risk of over- and under-segmenting characters during OCR, where should the line be drawn to maximize overall performance? The answers to these questions should influence the way we design and build document analysis systems.

Researchers have already begun studying problems relating to processing text data from noisy sources. To date, this work has focused predominantly on errors that arise during speech recognition. For example, Palmer and Ostendorf describe an approach for improving named entity extraction by explicitly modeling speech recognition errors through the use of statistics annotated with confidence scores [18]. The inaugural Workshop on Analytics for Noisy Unstructured Text Data [23] and its followup workshops [24,25] have featured papers examining the problem of noise from a variety of perspectives, with most emphasizing issues that are inherent in written and spoken language.

There has been less work, however, in the case of noise induced by optical character recognition. Early papers by Taghva, Borsack, and Condit show that moderate error rates have little impact on the effectiveness of traditional information retrieval measures [21], but this conclusion is tied to certain assumptions about the IR model ("bag of words"), the OCR error rate (not too high), and the length of the documents (not too short). Miller et al. study the performance of named entity extraction under a variety of scenarios involving both ASR and OCR output [17], although speech is their primary interest. They found that by training their system on both clean and noisy input material, performance degraded linearly as a function of word error rates. Farooq and Al-Onaizan proposed an approach for improving the output of machine translation when presented with OCR'ed input by modeling the error correction process itself as a translation problem [5].

A paper by Jing, Lopresti, and Shih studied the problem of summarizing textual documents that had undergone optical character recognition and hence suffered from typical OCR errors [10]. From the standpoint of performance evaluation, this work employed a variety of indirect measures: for example, comparing the total number of sentences returned by sentence boundary detection for clean and noisy versions of the same input text, or counting the number of incomplete parse trees generated by a part-of-speech tagger.

In two later papers [12,13], we turned to the question of performance evaluation for text analysis pipelines, proposing a paradigm based on the hierarchical application of approximate string matching techniques. This flexible yet mathematically rigorous approach both quantifies the performance of a given processing stage and explicitly identifies the errors it has made. Also presented were the results of pilot studies where small sets of documents (tens of pages) were OCR'ed and then piped through standard routines for sentence boundary detection, tokenization, and part-of-speech tagging, demonstrating the utility of the approach.
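To make the flat (single-level) case of this paradigm concrete, the sketch below aligns OCR output against ground truth by edit-distance dynamic programming and reads an error classification off the backtrace. This is an illustrative reconstruction, not the paper's implementation: the hierarchical formulation applies matching of this kind at more than one level of the text, and the function name align is ours.

```python
# Minimal sketch: classify OCR errors by aligning recognized text against
# ground truth with Levenshtein dynamic programming, then recovering the
# individual error events (substitution/insertion/deletion) by backtrace.
def align(truth: str, ocr: str):
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum edit distance between truth[:i] and ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the corner to enumerate the error events.
    errors, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            if truth[i - 1] != ocr[j - 1]:
                errors.append(("substitution", truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            errors.append(("insertion", "", ocr[j - 1]))
            j -= 1
    return cost[n][m], list(reversed(errors))

distance, errors = align("tag, and then", "tap. and then")
print(distance, errors)
# 2 [('substitution', 'g', 'p'), ('substitution', ',', '.')]
```

Scoring a whole processing stage then amounts to running an alignment of this kind between its output on clean input and its output on OCR'ed input, which is what makes the errors of each stage separable.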
In the present paper, we employ this same evaluation paradigm, but using a much larger and more realistic dataset totaling over 3,000 scanned pages which we are also making available to the community to foster work in this area [14]. We study the impact of several real-world degradations on optical character recognition and the NLP processes that follow it, and plot later-stage performance as a function of the input OCR accuracy. We conclude by outlining possible topics for future research.

2 Stages in Text Analysis

In this section, we describe the prototypical stages that are common to many text analysis systems, discuss some of the problems that can arise, and then list the specific packages we use in our work. The stages, in order, are: (1) optical character recognition, (2) sentence boundary detection, (3) tokenization, and (4) part-of-speech tagging. These basic procedures are of interest because they form the basis for more sophisticated natural language applications, including named entity identification and extraction, topic detection and clustering, and summarization. In addition, the problem of identifying tabular structures that should not be parsed as sentential text is also discussed as a pre-processing step.

A brief synopsis of each stage and its potential problem areas is listed in Table 1. The interactions between errors that arise during OCR and later stages can be complex. Several common scenarios are depicted in Fig. 1. It is easy to imagine a single error propagating through the pipeline and inducing a corresponding error at each of the later steps in the process (Case 1 in the figure). However, in the best case, an OCR error could have no impact whatsoever on any of the later stages (Case 2); for example, tag and tap are both verbs, so the sentence boundary, tokenization, and part-of-speech tagging would remain unchanged if g were misrecognized as p. On the other hand, misrecognizing a comma (,) as a period (.) creates a new sentence boundary, but might not affect the stages after this (Case 3). More intriguing are latent errors which have no effect on one stage, but reappear later in the processing pipeline (Cases 4 and 5). OCR errors which change the tokenization or part-of-speech tagging while leaving sentence boundaries unchanged fall into this category (e.g., faulty word segmentations that insert or delete whitespace characters). Finally, a single OCR error can induce multiple errors in a later stage, its impact mushrooming to neighboring tokens (Case 6).

Fig. 1. Propagation of OCR errors through NLP stages (the "error cascade").
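As an illustration of these cases, the hedged sketch below pushes a clean sentence and an OCR-corrupted variant (a comma misread as a period) through a representative open-source pipeline. NLTK serves only as a stand-in here, not as the toolchain evaluated in the paper, and the resource identifiers passed to nltk.download vary across NLTK releases.

```python
# Illustrative only: NLTK stands in for the boundary detection,
# tokenization, and tagging stages discussed in the text; it is not
# the toolchain used in the paper.
import nltk

# Resource ids differ across NLTK releases; ids missing from a given
# release are reported by the downloader and skipped, not fatal.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

clean = "We tag each sample, and then we file it."
noisy = "We tag each sample. and then we file it."  # OCR misread ',' as '.'

for label, text in (("clean", clean), ("noisy", noisy)):
    sentences = nltk.sent_tokenize(text)                  # boundary detection
    tagged = [nltk.pos_tag(nltk.word_tokenize(s)) for s in sentences]
    print(label, len(sentences), tagged)

# Case 3 predicts a spurious boundary after "sample."; whether a given
# detector is actually fooled depends on its trained model, so the two
# outputs should be compared rather than assumed. By contrast, a g->p
# confusion ("tag" -> "tap", both verbs) leaves boundaries, tokens, and
# verb tags unchanged (Case 2).
```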
Table 1. Text processing stages: functions and problems.

Optical character recognition. Intended function: transcribe the input bitmap into encoded text (hopefully accurately). Potential problems: current OCR is "brittle"; errors made early on propagate to later stages.

Sentence boundary detection. Intended function: break the input into sentence-sized units, one per text line. Potential problems: missing or spurious sentence boundaries due to OCR errors on punctuation.

Tokenization. Intended function: break each sentence into word (or word-like) tokens delimited by white space. Potential problems: missing or spurious tokens due to OCR errors on whitespace and punctuation.

Part-of-speech tagging. Intended function: take tokenized text and attach a label to each token indicating its part of speech. Potential problems: bad PoS tags due to failed tokenization or OCR errors that alter orthographies.

In selecting implementations of the above stages to test, we choose to employ freely available open source software rather than proprietary, commercial packages. From the standpoint of our work, we require behavior that is representative, not necessarily "best-in-class." For sufficiently noisy inputs, the same methodology and conclusions are likely to apply no matter what algorithm is used. Comparing different techniques for realizing a given stage to determine which is most robust in the presence of OCR errors would make an interesting topic for future research. The sentence boundary detector we used in our tests is the MXTERMINATOR package by Reynar and Ratnaparkhi [20]; an example of its output for a "clean" (error-free) text fragment consisting of two sentences is shown in Fig. 4(b).

2.1 Optical character recognition

The first stage of the pipeline is optical character recognition, the conversion of the scanned input image from bitmap format to encoded text. Optical character recognition performs quite well on clean inputs in a known font.

Fig. 2. Example of a portion of a dark photocopy.

Fig. 3. OCR output for the image from Fig. 2.
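A minimal sketch of this first stage, assuming the open-source Tesseract engine (via the pytesseract wrapper) as a stand-in recognizer and a hypothetical scanned-page bitmap named page.png:

```python
# A sketch under stated assumptions: Tesseract via pytesseract stands in
# for the OCR stage; "page.png" is a hypothetical scanned-page image.
from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract (needs the Tesseract binary)

bitmap = Image.open("page.png")
text = pytesseract.image_to_string(bitmap)  # bitmap -> encoded text
print(text)
```

On a clean, well-scanned page in a common font the recovered text is typically close to the original; on degraded input such as the dark photocopy of Fig. 2, errors of the kinds catalogued above begin to appear.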