Introducing Regular Expressions
Total Page:16
File Type:pdf, Size:1020Kb
Introducing Regular Expressions wnload from Wow! eBook <www.wowebook.com> o D Michael Fitzgerald Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo Introducing Regular Expressions by Michael Fitzgerald Copyright © 2012 Michael Fitzgerald. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Editor: Simon St. Laurent Indexer: Lucie Haskins Production Editor: Holly Bauer Cover Designer: Karen Montgomery Proofreader: Julie Van Keuren Interior Designer: David Futato Illustrator: Rebecca Demarest July 2012: First Edition. Revision History for the First Edition: 2012-07-10 First release See http://oreilly.com/catalog/errata.csp?isbn=9781449392680 for release details. Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Introducing Regular Expressions, the image of a fruit bat, and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information con- tained herein. ISBN: 978-1-449-39268-0 [LSI] 1341860829 Table of Contents Preface ..................................................................... vii 1. What Is a Regular Expression? ............................................. 1 Getting Started with Regexpal 2 Matching a North American Phone Number 2 Matching Digits with a Character Class 4 Using a Character Shorthand 5 Matching Any Character 5 Capturing Groups and Back References 6 Using Quantifiers 6 Quoting Literals 8 A Sample of Applications 9 What You Learned in Chapter 1 11 Technical Notes 11 2. Simple Pattern Matching ................................................ 13 Matching String Literals 15 Matching Digits 15 Matching Non-Digits 17 Matching Word and Non-Word Characters 18 Matching Whitespace 20 Matching Any Character, Once Again 22 Marking Up the Text 24 Using sed to Mark Up Text 24 Using Perl to Mark Up Text 25 What You Learned in Chapter 2 27 Technical Notes 27 3. Boundaries ........................................................... 29 The Beginning and End of a Line 29 Word and Non-word Boundaries 31 iii Other Anchors 33 Quoting a Group of Characters as Literals 34 Adding Tags 34 Adding Tags with sed 36 Adding Tags with Perl 37 What You Learned in Chapter 3 38 Technical Notes 38 4. Alternation, Groups, and Backreferences ................................... 41 Alternation 41 Subpatterns 45 Capturing Groups and Backreferences 46 Named Groups 48 Non-Capturing Groups 49 Atomic Groups 50 What You Learned in Chapter 4 50 Technical Notes 51 5. Character Classes ...................................................... 53 Negated Character Classes 55 Union and Difference 56 POSIX Character Classes 56 What You Learned in Chapter 5 59 Technical Notes 60 6. Matching Unicode and Other Characters ................................... 61 Matching a Unicode Character 62 Using vim 63 Matching Characters with Octal Numbers 64 Matching Unicode Character Properties 65 Matching Control Characters 68 What You Learned in Chapter 6 70 Technical Notes 71 7. Quantifiers ............................................................ 73 Greedy, Lazy, and Possessive 74 Matching with *, +, and ? 74 Matching a Specific Number of Times 75 Lazy Quantifiers 76 Possessive Quantifiers 77 What You Learned in Chapter 7 78 Technical Notes 79 iv | Table of Contents 8. Lookarounds .......................................................... 81 Positive Lookaheads 81 Negative Lookaheads 84 Positive Lookbehinds 85 Negative Lookbehinds 85 What You Learned in Chapter 8 86 Technical Notes 86 9. Marking Up a Document with HTML ....................................... 87 Matching Tags 87 Transforming Plain Text with sed 88 Substitution with sed 89 Handling Roman Numerals with sed 90 Handling a Specific Paragraph with sed 91 Handling the Lines of the Poem with sed 91 Appending Tags 92 Using a Command File with sed 92 Transforming Plain Text with Perl 94 Handling Roman Numerals with Perl 95 Handling a Specific Paragraph with Perl 96 Handling the Lines of the Poem with Perl 96 Using a File of Commands with Perl 97 What You Learned in Chapter 9 98 Technical Notes 98 10. The End of the Beginning .............................................. 101 Learning More 102 Notable Tools, Implementations, and Libraries 103 Perl 103 PCRE 103 Ruby (Oniguruma) 104 Python 104 RE2 105 Matching a North American Phone Number 105 Matching an Email Address 105 What You Learned in Chapter 10 106 Appendix: Regular Expression Reference ....................................... 107 Regular Expression Glossary .................................................. 123 Index ..................................................................... 129 Table of Contents | v Preface This book shows you how to write regular expressions through examples. Its goal is to make learning regular expressions as easy as possible. In fact, this book demonstrates nearly every concept it presents by way of example so you can easily imitate and try them yourself. Regular expressions help you find patterns in text strings. More precisely, they are specially encoded text strings that match patterns in sets of strings, most often strings that are found in documents or files. Regular expressions began to emerge when mathematician Stephen Kleene wrote his book Introduction to Metamathematics (New York, Van Nostrand), first published in 1952, though the concepts had been around since the early 1940s. They became more widely available to computer scientists with the advent of the Unix operating system— the work of Brian Kernighan, Dennis Ritchie, Ken Thompson, and others at AT&T Bell Labs—and its utilities, such as sed and grep, in the early 1970s. The earliest appearance that I can find of regular expressions in a computer application is in the QED editor. QED, short for Quick Editor, was written for the Berkeley Time- sharing System, which ran on the Scientific Data Systems SDS 940. Documented in 1970, it was a rewrite by Ken Thompson of a previous editor on MIT’s Compatible Time-Sharing System and yielded one of the earliest if not first practical implementa- tions of regular expressions in computing. (Table A-1 in Appendix documents the regex features of QED.) I’ll use a variety of tools to demonstrate the examples. You will, I hope, find most of them usable and useful; others won’t be usable because they are not readily available on your Windows system. You can skip the ones that aren’t practical for you or that aren’t appealing. But I recommend that anyone who is serious about a career in com- puting learn about regular expressions in a Unix-based environment. I have worked in that environment for 25 years and still learn new things every day. “Those who don’t understand Unix are condemned to reinvent it, poorly.” —Henry Spencer vii Some of the tools I’ll show you are available online via a web browser, which will be the easiest for most readers to use. Others you’ll use from a command or a shell prompt, and a few you’ll run on the desktop. The tools, if you don’t have them, will be easy to download. The majority are free or won’t cost you much money. This book also goes light on jargon. I’ll share with you what the correct terms are when necessary, but in small doses. I use this approach because over the years, I’ve found that jargon can often create barriers. In other words, I’ll try not to overwhelm you with the dry language that describes regular expressions. That is because the basic philoso- phy of this book is this: Doing useful things can come before knowing everything about a given subject. There are lots of different implementations of regular expressions. You will find regular expressions used in Unix command-line tools like vi (vim), grep, and sed, among others. You will find regular expressions in programming languages like Perl (of course), Java, JavaScript, C# or Ruby, and many more, and you will find them in declarative lan- guages like XSLT 2.0. You will also find them in applications like Notepad++, Oxygen, or TextMate, among many others. Most of these implementations have similarities and differences. I won’t cover all those differences in this book, but I will touch on a good number of them. If I attempted to document all the differences between all implementations, I’d have to be hospitalized. I won’t get bogged down in these kinds of details in this book. You’re expecting an introductory text, as advertised, and that is what you’ll get. Who Should Read This Book The audience for this book is people who haven't ever written a regular expression before. If you are new to regular expressions or programming, this book is a good place to start. In other words, I am writing for the reader who has heard of regular expressions and is interested in them but who doesn’t really understand them yet. If that is you, then this book is a good fit. The order I’ll go in to cover the features of regex is from the simple to the complex. In other words, we’ll go step by simple step. Now, if you happen to already know something about regular expressions and how to use them, or if you are an experienced programmer, this book may not be where you want to start. This is a beginner’s book, for rank beginners who need some hand- holding. If you have written some regular expressions before, and feel familiar with them, you can start here if you want, but I’m planning to take it slower than you will probably like.