Programming Fundamentals and Python

Programming Fundamentals and Python Steven Bird Ewan Klein Edward Loper 2006-01-15 Version: 0.6 Revision: 1.21 Copyright: c 2001-2006 University of Pennsylvania License: Creative Commons Attribution-NonCommercial-ShareAlike License Note This is a draft. Please send any feedback to the authors. 1 Introduction This tutorial provides a non-technical overview of Python. It contains many working program fragments which you should try yourself. For a more detailed overview, we recommend that you consult one of the introductions listed in the further reading section below. More advanced programming material is contained in a later tutorial. 2 Processing lists and strings In writing Python programs for natural language processing we make heavy use of lists. A list is simply an ordered sequence of items. Each item could be a string, a number, or some more complex object such as a list. A Python list is represented as comma-separated items, enclosed inside brackets, e.g. ['John', 14, 'Sep', 1984] is a list consisting of four elements. We initialize a list by giving a variable name, followed by the equals sign, followed by this square bracket notation, e.g.: >>> a = ['colourless', 'green', 'ideas'] This command sets the value of variable a. To see the value of this variable, we can give the command print a. In interactive mode, we can just type the variable name: >>> a ['colourless', 'green', 'ideas'] To nd out the length of a list, we can use the len() function, which takes the list variable as its argument: >>> len(a) 3 We can access the items of a list individually by using indexes. Note that list items are counted starting from zero. 1 >>> a[0] 'colourless' >>> a[1] 'green' We can also index from the end of the list, using negative numbers. Thus, index -1 returns the last item: >>> a[-1] 'ideas' Sometimes we mistakenly try to access an index position which does not exist. If so, Python reports an error: >>> a[5] IndexError: list index out of range It is often useful to access multiple items in a list. In Python this is done using a special construct called the slice. In general, if a is a list, then a[m:n] is the slice starting at item m, and going up to (but not including) item n. >>> a[1:3] ['green', 'ideas'] If we leave out the rst number, it is assumed to be zero, the start of the list. So a[:n] nds everything up to (but not including) item n. Similarly, a[m:] nds everything from item m to the end: >>> a[1:] ['green', 'ideas'] >>> a[:2] ['colourless', 'green'] Lists can be concatenated, sorted and reversed, as we se below: >>> b = a + ['sleep', 'furiously'] >>> print b ['colourless', 'green', 'ideas', 'sleep', 'furiously'] >>> b.sort() >>> print b ['colourless', 'furiously', 'green', 'ideas', 'sleep'] >>> b.reverse() >>> print b ['sleep', 'ideas', 'green', 'furiously', 'colourless'] Note The individual elements of the above list are strings, and when we try to concatenate them using the + operator, Python performs string concatenation. E.g. b[2] + b[1] concatenates two strings to give 'greenideas'. We can use the for item in list syntax to iterate over the items of a list. In the following fragment we iterate over each word in the above b list, and print the rst character of each word: >>> for w in b: ... print w[0] ... s Natural Language Toolkit 2 nltk.sourceforge.net i g f c Observe that we used the bracket notation to access characters in a string, just as we earlier used the notation to access items in a list. We can combine these two kinds of access, e.g. to nd the third word (at index 2), and print the second character (at index 1): >>> b[2] 'green' >>> b[2][1] 'r' So far we have accessed items in a list by giving their index. We can also access them in terms of their content. The index() function returns the rst index where the specied item was found: >>> b.index('green') 2 We have already seen the string concatenation operator, +. We can multiply strings: >>> 'sleep' * 3 'sleepsleepsleep' Here is a useful idiom for reversing strings: >>> b[::-1] 'sselruoloc ylsuoifur neerg saedi peels' We can also join and split strings: >>> c = ' '.join(b) >>> c 'sleep ideas green furiously colourless' >>> c.split('r') ['sleep ideas g', 'een fu', 'iously colou', 'less'] More information on lists and strings can be found by typing help(list) and help(str) at the command prompt. To nd out about individual functions, you can type, e.g. help(list.append). 3 Dictionaries in Python Lists and strings are accessed by indexes. This turns out to be too limiting for many applications. Instead, we would like to access items by their names. For instance, we access a printed dictionary by looking up an entry by its headword, not by its number. Python provides a dictionary data type. We access it using the familiar bracket notation. Here we declare that d is a dictionary, then add three entries to it: >>> d = {} >>> d['colourless'] = 'adj' >>> d['furiously'] = 'adv' >>> d['ideas'] = 'n' The keys of the dictionary are just the headwords. However, they are not in any predened order. Natural Language Toolkit 3 nltk.sourceforge.net >>> d.keys() ['furiously', 'colourless', 'ideas'] >>> d['ideas'] 'n' We can iterate over the items in a dictionary using the for entry in dict syntax, as illustrated below: >>> for w in d: ... print "%s [%s]," % (w, d[w]), furiously [adv], colourless [adj], ideas [n], We can print an entire dictionary just by typing its name at the interactive command prompt: >>> d {'furiously': 'adv', 'colourless': 'adj', 'ideas': 'n'} If we try to access a non-existent entry, e.g. by typing d['sleep'], the Python interpreter will report an error. Two other functions are useful when we don't know if an entry exists, has_key(), and get(), as illustrated below: >>> print d.has_key('ideas') True >>> print d.get('sleep') None As a simple rule of thumb, dictionary entries are like variable names. We can create them simply by assigning to them, e.g. x = 2 (variable), d['x'] = 2 (dictionary entry). We can access them by reference, e.g. print x (variable), print d['x'] (dictionary entry). We can use dictionaries to count word occurrences. For example, the following code reads Macbeth and counts the frequency of each word: >>> from nltk_lite.corpora import gutenberg >>> count = {} # initialize a dictionary >>> for word in gutenberg.raw('shakespeare-macbeth'): # tokenize Macbeth ... word = word.lower() # normalize to lowercase ... if word not in count: # seen this word before? ... count[word] = 0 # if not, set count to zero ... count[word] += 1 Now we can inspect the dictionary: >>> print count['scotland'] 12 >>> frequencies = [(freq, word) for (word, freq) in count.items()] >>> frequencies.sort() >>> frequencies.reverse() >>> print frequencies[:20] [(1986, ','), (1245, '.'), (692, 'the'), (654, "'"), (567, 'and'), (482, ':'), (399, 'to'), (365, 'of'), (360, 'i'), (255, 'a'), (246, 'that'), (242, '?'), (224, 'd'), (218, 'you'), (213, 'in'), (207, 'my'), (198, 'is'), (170, 'not'), (165, 'it'), (156, 'with')] 4 Regular Expressions Finally, we look at Python's regular expression module re, for substituting and searching within strings. >>> import re >>> from nltk_lite.utilities import re_show >>> sent = "colourless green ideas sleep furiously" Natural Language Toolkit 4 nltk.sourceforge.net We use a utility function re_show to show how regular expressions match against substrings. First we search for all instances of a particular character or character sequence: >>> re_show('l', sent) co{l}our{l}ess green ideas s{l}eep furious{l}y >>> re_show('green', sent) colourless {green} ideas sleep furiously Now we can perform substitutions. In the rst instance we replace all instances of l with s. Note that this generates a string as output, and doesn't modify the original string. Then we replace any instances of green with red. >>> re.sub('l', 's', sent) 'cosoursess green ideas sseep furioussy' >>> re.sub('green', 'red', sent) 'colourless red ideas sleep furiously' So far we have only seen simple patterns, consisting of individual characters or sequences of characters. However, regular expressions can also contain special syntax, such as | for disjunction, e.g.: >>> re_show('(green|sleep)', sent) colourless {green} ideas {sleep} furiously >>> re.findall('(green|sleep)', sent) ['green', 'sleep'] We can also disjoin individual characters. For example, [aeiou] matches any of a, e, i, o, or u, that is, any vowel. The expression [âeiou] matches anything that is not a vowel. In the following example, we match sequences consisting of non-vowels followed by vowels. >>> re_show('[âeiou][aeiou]', sent) {co}{lo}ur{le}ss g{re}en{ i}{de}as s{le}ep {fu}{ri}ously >>> re.findall('[âeiou][aeiou]', sent) ['co', 'lo', 'le', 're', ' i', 'de', 'le', 'fu', 'ri'] We can put parentheses around parts of an expression in order to generate structured results. For example, here we see all those non-vowel characters which appear before a vowel: >>> re.findall('([âeiou])[aeiou]', sent) ['c', 'l', 'l', 'r', ' ', 'd', 'l', 'f', 'r'] We can even generate pairs (or tuples), which we may then go on and tabulate. >>> re.findall('([âeiou])([aeiou])', sent) [('c', 'o'), ('l', 'o'), ('l', 'e'), ('r', 'e'), (' ', 'i'), ('d', 'e'), ('l', 'e'), ('f', 'u'), ('r', 'i')] For an extended discussion of regular expressions, please see the regular expression tutorial. 5 Accessing Files and the Web It is easy to access local les, and web-pages in Python. Here are some examples. (You will need to create a le called corpus.txt before you can open it for reading.) >>> print open('corpus.txt').read() Hello world. This is a test file. Natural Language Toolkit 5 nltk.sourceforge.net >>> from urllib import urlopen >>> page = urlopen("http://news.bbc.co.uk/").read() >>> page = re.sub('<[^>]*>', '', page) # strip HTML markup >>> page = re.sub('\s+', ' ', page) # strip whitespace >>> print page[:60] BBC NEWS | News Front Page News Sport Weather World Service 6 Accessing NLTK NLTK consists of a set of Python modules, each of which denes classes and functions related to a single data structure or task.

Load more