<<

Regular Expressions Tip Sheet

Functions and Call Routines Basic Syntax Advanced Syntax regex-id = prxparse(perl-regex) Behavior Character Behavior Compile Perl regular expression perl-regex and /…/ Starting and ending regex non-meta Match character return regex-id to be used by other PRX functions. | Alternation character () Grouping {}[]()^ , to match these pos = prxmatch(regex-id | perl-regex, source) $.|*+?\ characters, override (escape) with \ Search in source and return position of match or zero Wildcards/Character Class Shorthands \ Override (escape) next if no match is found. Character Behavior \n Match capture buffer n Match any one character . (?:…) Non-capturing group new-string = prxchange(regex-id | perl-regex, times, \w Match a word character (alphanumeric old-string) plus "_") Lazy Repetition Factors Search and replace times number of times in old- \W Match a non-word character (match minimum number of times possible) string and return modified string in new-string. \s Match a Character Behavior

\S Match a non-whitespace character *? Match 0 or more times call prxchange(regex-id, times, old-string, new- \d Match a digit character +? Match 1 or more times string, res-length, trunc-value, num-of-changes) Match a non-digit character ?? Match 0 or 1 time Same as prior example and place length of result in \D {n}? Match exactly n times res-length, if result is too long to fit into new-string, Character Classes Match at least n times trunc-value is to 1, and the number of changes is {n,}? Character Behavior Match at least n but not more than m placed in num-of-changes. {n,m}? Match a character in the times […] Match a character not in the brackets text = prxposn(regex-id, n, source) [^…] Match a character in the range a to z Look-Ahead and Look-Behind After a call to prxmatch or prxchange, prxposn [a-z] Character Behavior return the text of capture buffer n. Position Matching Zero-width positive look-ahead (?=…) Character Behavior assertion. E.g. regex1 regex2 , call prxposn(regex-id, n, pos, len) (?= ) Match beginning of line a match is found if both regex1 and After a call to prxmatch or prxchange, call prxposn ^ Match end of line regex2 match. regex2 is not sets pos and len to the position and length of capture $ included in the final match. buffer n. \b Match word boundary \B Match non-word boundary (?!…) Zero-width negative look-ahead call prxnext(regex-id, start, stop, source, pos, len) assertion. E.g. regex1(?!regex2), Search in source between positions start and stop. Repetition Factors a match is found if regex1 matches (greedy, match as many times as possible) Set pos and len to the position and length of the and regex2 does not match. regex2 Character Behavior match. Also set start to pos+len+1 so another search is not included in the final match. Match 0 or more times can easily begin where this one left off. * (?<=…) Zero-width positive look-behind + Match 1 or more times assertion. E.g. (?<=regex1)regex2, call prxdebug(on-off) ? Match 1 or 0 times a match is found if both regex1 and Pass 1 to enable debug output to the SAS Log. {n} Match exactly n times regex2 match. regex1 is not Pass 0 to disable debug output to the SAS Log. {n,} Match at least n times included in the final match. {n,m} Match at least n but not more than m (?

Basic Example Search and Replace #1 Search and Extract data _null_; data _null_; data _null_; pos=prxmatch('/world/', input; length first last phone $ 16; 'Hello world!'); _infile_ = retain re; put pos=; ('s/

x + y < 15 if area_code ^in ("828" "336" Data Validation x < 10 < y "704" "910" y < 11 "919" "252") then data phone_numbers; putlog "NOTE: Not in NC: " length first last phone $ 16; first last phone; Search and Replace #2 input first last phone & $16.; end; datalines; data reversed_names; datalines; Thomas Archer (919)319-1677 Thomas Archer (919)319-1677 Lucy Barr 800-899-2164 input name & $32.; datalines; Lucy Barr (800)899-2164 Tom Joad (508) 852-2146 Tom Joad (508) 852-2146 Laurie Gil (252)152-7583 Jones, Fred Kavich, ; Laurie Gil (252)352-7583 Turley, Ron ; data invalid; Dulix, Yolanda set phone_numbers; ; Output:

where not NOTE: Not in NC, Lucy Barr (800)899-2164 data names; prxmatch("/\([2-9]\d\d\) ?" || NOTE: Not in NC, Tom Joad (508) 852-2146 "[2-9]\d\d-\d\d\d\d/",phone); set reversed_names; name = prxchange('s/(\w+), (\w+)/$2 $1/', run; -1, name); proc sql; /* Same as prior data step */ run; create table invalid as proc sql; /* Same as prior data step */ select * from phone_numbers where not create table names as select prxmatch("/\([2-9]\d\d\) ?" || "[2-9]\d\d-\d\d\d\d/",phone); prxchange('s/(\w+), (\w+)/$2 $1/', -1, name) quit; as name from reversed_names; Output: quit; Obs first last phone

1 Lucy Barr 800-899-2164 Output:

2 Laurie Gil (252)152-7583 Obs name 1 Fred Jones 2 Kate Kavich For complete information refer to the 3 Ron Turley Base SAS documentation at 4 Yolanda Dulix support.sas.com/base