
Regular Expressions
CS 2110

What is a regular expression?

A special string for describing a pattern of characters.

Examples:

    Regular expression    Description
    [abc]                 One of three characters (a, b, or c)
    [a-z]                 A single lowercase letter
    [a-z0-9]              A single lowercase letter or digit (one character, not both)
    .                     Any one character (except a newline, by default)
    \.                    A period (".")
    *                     0 to many of the previous item
    ?                     0 or 1 of the previous item
    +                     1 or many of the previous item

Regex String

- Mark regular expressions as raw strings: they start with r"
- Use square brackets for "any one character from inside the brackets"
    r"[bce]" matches "b", or "c", or "e" (but not "be" or "bc")
- Use ranges or classes of characters
    r"[A-Z]" matches any uppercase letter
    r"[a-z]" matches any lowercase letter
    r"[0-9]" matches any digit
- Searching for hyphens: put the - right after the [ or right before the ]
    r"[-a-z]" matches a hyphen or any lowercase letter

Regex String

    r"[bce]at"    Matches "bat", "cat", "eat"
    r".at"        Matches any one character followed by "at" (for example, 3-letter words that end in "at")
    r"at\."       Matches "at."

Regex in Python

Import statement

    import re

Compiling the regex

    regex = re.compile(regular_expression_string)

regex is now a regular expression tool we can use.

Using regex

    results = regex.search(text)      # first match (a match object), or None
    results = regex.findall(text)     # list of all matching strings
    results = regex.finditer(text)    # iterator over match objects

Regular Expression Examples

- Use "^" at the start of a [] for negation:
    r"[^a-z]" matches anything except lowercase letters
    r"[^0-9]" matches anything except decimal digits
- Use ^ at the start of the expression (not inside []) to mean "the start of the string": the match must begin at the beginning of the string. For example, when searching through a list of strings, only the strings that start with the expression match.
- Use $ for the end of the string.

Predefined characters:

    Character    Meaning
    \d           Any digit; means the same as [0-9]
    \D           Anything except digits; means the same as [^0-9]
    \s           Any whitespace character (" ", "\t", "\n", etc.); for ASCII text, the same as [ \t\n\r\f\v]
    \S           Any non-whitespace character
    \\           Match a literal backslash
    \w           Any alphanumeric character or underscore: [a-zA-Z0-9_]
    \W           Any character that is not alphanumeric or underscore: [^a-zA-Z0-9_]

Note: in Python 3 string patterns, \d, \s, and \w also match non-ASCII (Unicode) digits, whitespace, and word characters; the bracket forms shown here are the ASCII equivalents.

Regular Expression Examples

    r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"

A phone number written as "123-456-7890".

Except, that's a little redundant, right? We can write the same pattern as

    r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

- {x} means the previous pattern must repeat exactly x times
    "[abn]{6}" would match "banana", for example (or "nnnaaa")
- {x,y} means the previous pattern must repeat between x and y times
    "[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.

Regex Examples

- Most English first names: r"[A-Z][a-z]+"
- Dates: [0-9]{2}[/-][0-9]{2}[/-][0-9]{4} OR [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}
- SSN: [0-9]{3}-[0-9]{2}-[0-9]{4}

Regex findall

findall returns a list of all the strings that match the regex.

For example, let's consider this pattern for emails:

    r"[a-z0-9]+@[a-z]+\.[a-z]+"

Using that, let's find all the emails at:
https://engineering.virginia.edu/departments/computer-science/faculty

Pulling down emails of CS Faculty

    import re
    import urllib.request

    url = "https://engineering.virginia.edu/departments/computer-science/faculty"
    email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

    # Download the page and decode the bytes into a string
    req = urllib.request.urlopen(url)
    html = req.read().decode("UTF-8")

    # Compile the pattern, then collect every match in the page
    regex = re.compile(email_pattern)
    emails = regex.findall(html)
    print(emails)

But wait…

We get the result: [email protected]

That doesn't seem right… shouldn't emails end in com, edu, or org?

Let's try this pattern:

    r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"

That gives us tuples like:

    ('[email protected]', 'edu')

Wait… why tuples?

Next time

- Groups
- Using them
- Getting individual groups
- The match object
- More practice
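
For practice without a network connection, the compile / search / findall / finditer calls from these slides can be tried on a small string. The following is a minimal sketch, not part of the original slides: the sample text, the address jdoe1@example.edu, and all variable names are made up for illustration. The phone pattern and the grouped email pattern are the ones shown earlier.

```python
import re

# Hypothetical sample text, invented for this sketch (not from the slides).
text = "Call 123-456-7890 or 555-867-5309. Office hours start at 10."

# Phone pattern from the slides: 3 digits, 3 digits, 4 digits.
phone_regex = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

# search() returns a match object for the first match, or None.
first = phone_regex.search(text)
first_number = first.group() if first else None

# findall() returns a list of the matching strings.
all_numbers = phone_regex.findall(text)

# finditer() yields one match object per match; start() is its position.
positions = [m.start() for m in phone_regex.finditer(text)]

# ^ anchors the match at the start of the string, $ at the end.
starts_with_call = re.search(r"^Call", text) is not None
ends_with_digit = re.search(r"[0-9]$", text) is not None

# The grouped email pattern from the slides, run on a made-up address.
# With parentheses in the pattern, findall() returns one tuple per match.
email_regex = re.compile(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))")
email_results = email_regex.findall("Email jdoe1@example.edu for help")

print(first_number)
print(all_numbers)
print(positions)
print(starts_with_call, ends_with_digit)
print(email_results)
```

Printing email_results reproduces the tuple behavior seen with the faculty page, so the question "why tuples?" can be explored without depending on the live site.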