Regular Expressions CS 2110 What Is a Regular Expression?

A regular expression is a special string for describing a pattern of characters.

Examples:

Regular expression    Description
[abc]                 One of three characters (a, b, or c)
[a-z]                 A single lowercase letter
[a-z0-9]              A single lowercase letter or a single digit (one character, not both)
.                     Any one character (except a newline, by default)
\.                    A literal period (".")
*                     Zero or more of the preceding item
?                     Zero or one of the preceding item
+                     One or more of the preceding item

Regex strings

Mark regular expressions as raw strings: start the string with r".

Use square brackets for "any one character from inside the brackets":
    r"[bce]" matches a single "b", "c", or "e" (but not "be" or "bc" as a whole)

Use ranges or classes of characters:
    r"[A-Z]" matches any uppercase letter
    r"[a-z]" matches any lowercase letter
    r"[0-9]" matches any digit

Searching for hyphens: put the - right after the [ or right before the ]:
    r"[-a-z]" matches a hyphen or any lowercase letter

More regex strings

    r"[bce]at"    matches "bat", "cat", "eat"
    r".at"        matches any one character followed by "at", such as three-letter words that end in "at"
    r"at\."       matches "at."

Regex in Python

Import statement:
    import re

Compiling the regex:
    regex = re.compile(regular_expression_string)

regex is now a compiled regular expression object we can use.

Using the regex:
    results = regex.search(text)      # the first match as a match object, or None
    results = regex.findall(text)     # a list of all matches
    results = regex.finditer(text)    # an iterator of match objects

Regular Expression Examples

Use "^" at the start of a [] for negation:
    r"[^a-z]" matches anything except lowercase letters
    r"[^0-9]" matches anything except decimal digits

Use ^ at the start of the expression (not inside []) to mean "the start of the string", i.e., the match must begin at the beginning of the string. For example, when searching through a list of strings, only strings that start with the expression match.

Use $ for the end of the string.

Predefined character classes:

Character    Meaning
\d           Any digit; the same as [0-9]
\D           Anything except digits; the same as [^0-9]
\s           Any whitespace character (" ", "\t", "\n", etc.); roughly [ \t\n]
\S           Any non-whitespace character
\\           A literal backslash
\w           Any alphanumeric character or underscore: [a-zA-Z0-9_]
\W           Any character that is not alphanumeric or underscore: [^a-zA-Z0-9_]

Regular Expression Examples

    r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"

A phone number written as "123-456-7890".

Except, that's a little redundant, right? We can write the same pattern as

    r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

{x} means the previous pattern must repeat exactly x times; {x,y} means it must repeat between x and y times.
    "[abn]{6}" would match "banana", for example (or "nnnaaa")
    "[abn]{3,6}" would match "ban", "nan", "abba", "banana", etc.

Regex Examples

Most English first names:
    r"[A-Z][a-z]+"

Dates:
    [0-9]{2}[/-][0-9]{2}[/-][0-9]{4}
    or
    [0-9]{4}[/-][0-9]{2}[/-][0-9]{2}

SSN:
    [0-9]{3}-[0-9]{2}-[0-9]{4}

Regex findall

findall returns a list of all the strings that match the regex. For example, consider this pattern for emails:

    r"[a-z0-9]+@[a-z]+\.[a-z]+"

Using that, let's find all the emails at:

    https://engineering.virginia.edu/departments/computer-science/faculty

Pulling down emails of CS Faculty

    import re
    import urllib.request

    url = "https://engineering.virginia.edu/departments/computer-science/faculty"
    email_pattern = r"[a-z0-9]+@[a-z]+\.[a-z]+"

    # Download the page and decode the raw bytes into a string.
    req = urllib.request.urlopen(url)
    html = req.read().decode("UTF-8")

    # Compile the pattern, then collect every match in the page.
    regex = re.compile(email_pattern)
    emails = regex.findall(html)
    print(emails)

But wait…

We get the result:

    [email protected]

That doesn't seem right… shouldn't emails end in com, edu, or org? Let's try this pattern:

    r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))"

That gives us tuples like:

    ('[email protected]', 'edu')

Wait… why tuples?

Next time

Groups
Using them
Getting individual groups
The match object
More practice
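
To try the pieces above without fetching a live web page, here is a minimal sketch that runs a few of the slide patterns against a short sample string. The sample text, the address in it, and the variable names are made up for illustration; only the patterns and the `re` calls come from the slides.

```python
import re

# Hypothetical sample text; the names, numbers, and address are made up.
text = "Call Ann at 434-555-0123 or Bob at 434-555-0199. Write to abc1x@example.edu."

# Phone pattern from the slides, written with {x} repetition counts.
phone_regex = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

# search() returns a match object for the first match, or None if there is none.
first = phone_regex.search(text)
first_phone = first.group() if first else None

# findall() returns a list of every matching string.
all_phones = phone_regex.findall(text)

# finditer() yields match objects, which also record where each match starts.
phone_starts = [m.start() for m in phone_regex.finditer(text)]

# With no parentheses in the pattern, findall() returns plain strings.
plain_emails = re.findall(r"[a-z0-9]+@[a-z]+\.[a-z]+", text)

# With parentheses (groups) in the pattern, findall() returns one tuple
# per match, holding one entry per group.
grouped_emails = re.findall(r"([a-z0-9]+@[a-z\.]+\.(com|edu|org))", text)

print(first_phone)
print(all_phones)
print(phone_starts)
print(plain_emails)
print(grouped_emails)
```

The last two results differ only because the second email pattern contains parentheses: each pair of parentheses defines a group, and findall reports the groups of each match instead of the whole matched string.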
