<<

1 Single Characters Lightgrep Cheat Sheet 4 Grouping c the character c∗ Notes & Examples () makes any pattern S atomic \a +0007 (BEL) bell Characters: \ U+001B (ESC) escape 5 Concatenation & Alternation .*?\x00 (= null-terminated string) \f U+000C (FF) form feed ST S T \z50\z4B\z03\z04 (= ZIP signature) matches , then matches \ U+000A (NL) newline S|T matches S or T , preferring S \ U+000D (CR) carriage return \N{EURO SIGN} , \N{NO-BREAK SPACE} \t U+0009 (TAB) horizontal tab \x{042F} (= CYRILLIC CAPITAL ) 6 Repetition \ooo U+ ooo , 1–3 digits , ≤ 0377 \+12\.5% (= escaping metacharacters) S \x hh U+00 hh , 2 digits h Repeats . . . Grouping: Operators bind tightly. Use (aa)+ , S S \x{ hhhhhh } U+ hhhhhh , 1–6 hex digits h * 0 or more times (= {0,} ) not aa+ , to match pairs of a’s. S S \z hh the byte 0x hh (not the character!) † + 1 or more times (= {1,} ) Ordered alternation: a|ab matches a twice in S? 0 or 1 time (= S{0,1} ) \N{ name } the character called name aab . Left alternatives preferred to right.

Greedy S n n \N{U+ hhhhhh } same as \x{ hhhhhh } { ,} or more times S{n,m} n–m times, inclusive \c the character c‡ Repetition: Greedy operators match as much as possible. Reluctant operators match as little S*? 0 or more times (= S{0,} ) ∗ as possible. matches all of ; S S except U+0000 (NUL) and metacharacters a+a aaaa a+?a +? 1 or more times (= {1,} ) matches the first aa , then the second aa . S S †Lightgrep extension; not part of PCRE. ?? 0 or 1 time (= {0,1} ) ‡ .+ will (uselessly) match the entire input. S{n,}? n or more times except any of: adefnprstwDPSW1234567890 Reluctant Prefer reluctant operators when possible. S{n,m}? n–m times, inclusive 2 Named Character Classes Character classes: 7 Selected Properties . any character [abc] = a, b, or c Any Assigned \d [0-9] (= ASCII digits) [^a] = anything but a Alphabetic White_Space \D [^0-9] [A-Z] = A to Z Uppercase Lowercase \s [\t\n\f\r] (= ASCII whitespace) [A\-Z] ASCII Noncharacter_Code_Point \S [^\t\n\f\r] = A, Z, or hyphen (!) Name= name Default_Ignorable_Code_Point \ [0-9A-Za-z_] (= ASCII words) [A-Zaeiou] = capitals \W [^0-9A-Za-z_] or lowercase General_Category= category \p{ property } any character having property [.+*?\]] L, Letter P , \P{ property } any character lacking property = ., +, *, ?, or ] Lu , Uppercase Letter Pc , Connector Punctuation [Q\z00-\z7F] Ll , Lowercase Letter Pd , Punctuation 3 Character Classes = Q or 7-bit bytes Lt , Titlecase Letter Ps , Open Punctuation Lm , Modifier Letter Pe , Close Punctuation [stu ff] any character in stu ff [[abcd][bce]] Lo , Other Letter , Initial Punctuation [^ stu ff] any character not in stu ff = a, b, c, d, or e [[abcd]&&[bce]] M, Mark Pf , Final Punctuation where stu ff is. . . Mn , Non-Spacing Mark Po , Other Punctuation c = b or c a character Me , Enclosing Mark Z , Separator a b [[abcd]--[bce]] - a character range, inclusive N, Number Zs , Space Separator hh = a or d \z a byte Nd , Digit Number Zl , Line Separator hh hh [[abcd]~~[bce]] \z -\z a byte range, inclusive Nl , Letter Number Zp , Paragraph Separator S = a, d, or e [ ] a character class No , OtherNumber C , Other ST S T [\p{Greek}\d] ∪ (union) S, Symbol Cc , Control S T S T = Greek or digits && ∩ (intersection) Sm , MathSymbol Cf , Format S T S T [^\p{Greek}7] -- − (di fference) Sc , Cs , Surrogate S~~ T S △ T (symmetric di fference, XOR) = neither Greek nor 7 [\p{Greek}&&\p{Ll}] Sk , Modifier Symbol Co , Private Use 8 EnCase GREP Syntax = lowercase Greek So , Other Symbol Cn , Not Assigned = script c the character c (except metacharacters) Operators need not be Common Greek Cyrillic Armenian Hebrew Ara- \x hh U+00 hh , 2 hexadecimal digits h escaped inside char- bic Syraic Bengali Gu- \w hhhh U+ hhhh , 4 hexadecimal digits h acter classes. jarati Oriya Tamil Telugu Kannada Sin- \c the character c hala Thai Lao Tibetan Georgian . any character Lightgrep Search Ethiopic Cherokee Runic Khmer Mongolian # [0-9] (= ASCII digits) for EnCase !R Han Yi Old_Italic [a-b] any character in the range a–b Gothic Inherited Tagalog Hanunoo Buhid Tagbanwa [S] any character in S Fast Search for Limbu Tai_Le Linear_B Ugaritic Shavian Osmanya [^ S] any character not in S Cypriot Buginese Coptic New_Tai_Lue Glagolitic (S) grouping Forensics Syloti_Nagri Old_Persian Kharoshthi - S* repeat S 0 or more times (max 255) linese Phoenician Phags_Pa Nko Sudanese S+ repeat S 1 or more times (max 255) www.lightgrep.com Lepcha . . . See Unicode Standard for more. S? repeat S 0 or 1 or time S{n,m} repeat S n –m times (max 255) Email addresses: [a-z\d!#$%&’*+/=?^_‘{|}~-][a-z\d!#$%&’*+/=?^_‘{|}~.-]{0,63} ST matches S, then matches T @[a-z\d.-]{1,253}\.[a-z\d-]{2,22} S|T matches S or T Hostnames: ([a-z\d]([a-z\d_-]{0,61}[a-z\d])?\.){2,5}[a-z\d][a-z\d-]{1,22} N. American phone numbers: \(?\d{3}[ ).-]{0,2}\d{3}[ .-]?\d{4}\D 9 Importing from EnCase into Lightgrep Visa, MasterCard: \d{4}([ -]?\d{4}){3} American Express: 3[47]\d{2}[ -]?\d{6}[ -]?\d{5} \w hhhh −→ \x hhhh S* and S+ are limited to Diners Club: 3[08]\d{2}[ -]?\d{6}[ -]?\d{4} # −→ \d 255 repetitions by EnCase; EMF header: \z01\z00\z00\z00.{36}\z20EMF S S JPEG: \zFF\zD8\zFF[\zC4\zDB\zE0-\zEF\zFE] Footer: \zFF\zD9 * −→ {0,255} Lightgrep preserves this in GIF: GIF8[79] Footer: \z00\z3B BMP: BM.{4}\z00\z00\z00\z00.{4}\z28 S+ −→ S{1,255} imported patterns. PNG: \z89\z50\z4E\z47 Footer: \z49\z45\z4E\z44\zAE\z42\z60\z82 \w is limited to BMP characters ( ≤ U+10000) only. ZIP: \z50\z4B\z03\z04 Footer: \z50\z4B\z05\z06 RAR: \z52\z61\z72\z21\z1a\z07\z00...[\z00-\z7F] Footer: \z88\zC4\z3D\z7B\z00\z40\z07\z00 Some people, when confronted with a problem, think “ know, GZIP: \z1F\z8B\z08 MS O ffice 97–03: \zD0\zCF\z11\zE0\zA1\zB1\z1A\zE1 I’ll use regular expressions.” Now they have two problems. LNK: \z4c\z00\z00\z00\z01\z14\z02\z00 —JWZ in alt.religion.emacs, 12 August 1997 PDF: \z25\z50\z44\z46\z2D\z31 Footer: \z25\z45\z4F\z46