Regular Expressions

Corpus Linguistics © 2003 Anatol Stefanowitsch

1. WHAT ARE REGULAR EXPRESSIONS

Regular expressions (or ‘regex patterns’ or ‘grep patterns’) are expressions that stand for symbols or strings of symbols, or more often for a class of symbols or strings of symbols. You may be familiar with this from the SEARCH AND REPLACE function of word processors like MS Word, OpenOffice Writer, etc., where this mechanism is known as pattern matching or wildcard operators. For example, a period (.) may stand for ‘any character’, such that m.le would find male, mile, mole, mule, etc. Such a mechanism is very useful in corpus linguistics, e.g. in order to search for different tense forms of the same verb.

This handout explains the basic use of regular expressions using two regex packages that can be downloaded at no charge from the Internet: TextPad (for Windows) and BBEdit Lite (for Mac OS and OS X). Both are powerful text editors that can also be used as rudimentary concordancers. Note that this handout will only deal with searching, not with replacing. Read the documentation for the software packages to find out more about their replace functions; however, you should never use the replace function with your original corpus files.

2. TWO REGEX TOOLS

2.1 TEXTPAD (WINDOWS)

This shareware software package can be downloaded at www.textpad.com. Before you work with it, go to the CONFIGURE menu, choose the PREFERENCES command and then the EDITOR subcommand, and activate the USE POSIX control box.

TextPad has two types of search commands, both in the SEARCH menu: FIND and FIND IN FILES. The first is shown in Figure 1a.

Figure 1a: TextPad FIND dialogue box

When working with regular expressions, make sure that the REGULAR EXPRESSION control box is activated. The regex pattern is typed into the FIND WHAT box. If you click on FIND, the next occurrence of your pattern in the currently open document will be found. If you click on MARK ALL, TextPad will mark all lines containing an occurrence of your search pattern. You can then use the BOOKMARKED LINES subcommand from the COPY OTHER command in the EDIT menu to copy all occurrences and paste them into a new document. You can also search all open documents by activating the IN ALL DOCUMENTS control box. Note that you can perform case-sensitive and case-insensitive searches by activating or deactivating the appropriate control box.

TextPad can also search multiple files in a single pass if they are not currently open. To do this, you use the FIND IN FILES command, whose dialogue box is shown in Figure 1b.

Figure 1b: TextPad FIND IN FILES dialogue box

Here you have the same basic options as before, but in addition you can specify a file type (e.g. .txt) in the IN FILES box and a folder in the IN FOLDER box. TextPad will then search all files of the specified type in the specified folder and create a new document listing all lines containing an occurrence of your search pattern (to do this, make sure the radio button ALL MATCHING LINES is activated).

2.2 BBEDIT LITE (MAC)

This freeware software package can be downloaded at www.barebones.com. BBEdit’s FIND & REPLACE dialogue box is shown in Figure 2. When working with regular expressions, make sure that the USE GREP control box is activated.

Figure 2: BBEdit FIND & REPLACE dialogue box

The regex pattern is typed into the SEARCH FOR box. If you click on FIND, the next occurrence of your pattern in the currently open document will be found. If you click on FIND ALL, a new document is created, which lists all lines containing an occurrence of your search pattern. Note that you can perform case-sensitive and case-insensitive searches by activating or deactivating the appropriate control box.

Like TextPad, BBEdit can search multiple files in a single pass. To do this, you simply activate the MULTI-FILE SEARCH control box and then choose the folder containing the files you want to search using the right OTHER switch. The name of the folder which you have selected will appear in the lowest of the three text boxes. Again, by using the FIND ALL command, you can generate a document listing all lines that contain your search pattern (along with the name and path of the file in which it was found).

3. TWO DIALECTS OF REGEX

Table 1 lists the most important regex characters in TextPad and BBEdit. An empty cell means that no equivalent is listed for that editor.

Table 1: Regex characters in TextPad and BBEdit

TEXTPAD     BBEDIT LITE   EXPLANATION
.           .             Any character (including whitespace characters) except a line break
[xyz]       [xyz]         Any of the characters x, y, z
                          Example: b[aeiou]t finds bat, bet, bit, bot, and but
[a-z]       [a-z]         Any character from a to z in the ASCII table
[^xyz]      [^xyz]        Any character except x, y, z
                          Example: b[^u]t finds e.g. bat, bit, and bet, but not but
^           ^             Beginning of a line (unless used in square brackets, cf. preceding entry)
$           $             End of a line (unless used in square brackets)
\<                        Left word boundary (beginning of a word)
                          Example: \<un finds un at the beginning of a word, as in undo, unnatural, until
\>                        Right word boundary (end of a word)
                          Example: ing\> finds ing at the end of a word, as in running, thinking, and ring
\t          \t            Tab
\f          \f            Page break (Form Feed)
\n          \n (Unix)     Line break (Newline)
            \r (Mac)
*           *             Zero or more occurrences of the preceding character
                          Example: but*s finds bus, buts, and butts; f[aeiou]*l finds e.g. fail, foil, feel, fool, foul, foal, etc.
?           ?             Zero or one occurrence of the preceding character
                          Example: but?s finds bus and buts; honou?r finds honor and honour
+           +             One or more occurrences of the preceding character
                          Example: but+s finds buts and butts, but not bus
{x}                       Exactly x occurrences of the preceding character
{x,}                      At least x occurrences of the preceding character
{x,y}                     At least x, but no more than y occurrences of the preceding character
(x|y)       (x|y)         Either x or y
                          Example: f(a|i)t finds fat or fit; (a|the) finds a and the; (a|the|this) finds a, the, and this
\           \             Cancels the status of a character as a wildcard; e.g. ? finds zero or one occurrence of the preceding character, but \? finds question marks

In addition, there are some predefined expressions for whole classes of characters, as shown in Table 2:

Table 2: Regex character classes in TextPad and BBEdit

TEXTPAD     BBEDIT LITE   EXPLANATION
[:alpha:]                 Any alphabetical character
[:lower:]                 Any lowercase alphabetical character
[:upper:]                 Any uppercase alphabetical character
[:alnum:]   \w            Any alphanumeric character
[:word:]                  Any alphanumeric character, hyphen, and apostrophe
            \W            Any character (including whitespace) except alphanumeric characters
[:digit:]   \d or #       Any numerical character
            \D            Any character except numerical characters
[:blank:]                 Space or tab
[:space:]   \s            Any whitespace character
[:graph:]   \S            Any character except whitespace characters
[:punct:]                 Any character except alphanumeric and whitespace characters

4. EXERCISES

1. For each of the following adjectives, design a regex pattern that will retrieve all of its forms:
   TALL (tall, taller, tallest)
   FIT (fit, fitter, fittest)
   NICE (nice, nicer, nicest)
   SCARY (scary, scarier, scariest)

2. For each of the following nouns, design a regex pattern that will retrieve all of its forms:
   BOOK (book, books)
   CHILD (child, children)
   BUS (bus, buses)
   LEAF (leaf, leaves)
   WOMAN (woman, women)
   MOUSE (mouse, mice)

3. For each of the following verbs, design at least one regex pattern that will retrieve all of its forms:
   WALK (walk, walks, walking, walked)
   HIT (hit, hits, hitting)
   FLIP (flip, flips, flipping, flipped)
   SIT (sit, sits, sitting, sat)
   STEAL (steal, steals, stealing, stole, stolen)
   FIND (find, finds, finding, found)
   SING (sing, sings, singing, sang, sung)
   TAKE (take, takes, taking, took, taken)
   FLY (fly, flies, flying, flew, flown)
   WREAK (wreaks, wreaked, wrought, wreaking)
   Ger. SPRINGEN (spring, springe, springst, springt, springen, sprang, sprangst, sprangt, sprangen, gesprungen)

4. Use TextPad or BBEdit to search a 1-million word corpus (like BROWN, FROWN, LOB, FLOB, etc.) for some of the patterns you have designed.
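
If you want to check a pattern before running it on a corpus, you can also try it on a short word list outside the editor. The following is only a minimal sketch in Python, which this handout does not otherwise cover; it uses Python's standard re module, whose dialect differs from both TextPad and BBEdit Lite. In particular, Python has no \< and \> (the sketch uses \b for both word boundaries) and no POSIX classes such as [:alpha:]. The word list and the two helper functions are hypothetical and serve only as an illustration of the examples in Table 1.

```python
import re

# Hypothetical sample words, chosen for illustration; not taken from any corpus.
WORDS = ["bat", "bet", "bit", "bot", "but", "bus", "buts", "butts",
         "honor", "honour", "undo", "sun", "running", "ringer"]


def whole_word_matches(pattern, words=WORDS):
    """Return the words that the pattern matches from first to last character."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]


def containing_matches(pattern, words=WORDS):
    """Return the words that contain a match anywhere, as an editor search would."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.search(w)]


print(whole_word_matches(r"b[aeiou]t"))  # ['bat', 'bet', 'bit', 'bot', 'but']
print(whole_word_matches(r"b[^u]t"))     # ['bat', 'bet', 'bit', 'bot']
print(whole_word_matches(r"but*s"))      # ['bus', 'buts', 'butts']
print(whole_word_matches(r"but?s"))      # ['bus', 'buts']
print(whole_word_matches(r"but+s"))      # ['buts', 'butts']
print(whole_word_matches(r"honou?r"))    # ['honor', 'honour']

# Assumption: \b replaces TextPad's \< and \>, which Python does not support.
print(containing_matches(r"\bun"))       # ['undo']
print(containing_matches(r"ing\b"))      # ['running']
```

Comparing the output for but*s, but?s, and but+s side by side shows how the three quantifiers differ. A pattern that works here may still need adjusting for TextPad or BBEdit Lite, so treat the result as a first check only.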