
Introduction to Programming with Python

Lecture 7: Dictionaries

1 Lists - reminder

A list is an ordered sequence of elements. Create a list in Python:

>>> my_list = [2, 3, 5, 7, 11]
>>> my_list
[2, 3, 5, 7, 11]
>>> my_list[0]
2
>>> my_list[-1]
11

2 List Methods

Function                   Description
lst.append(item)           append an item to the end of the list
lst.count(val)             return the number of occurrences of value
lst.extend(another_lst)    extend list by appending items from another list
lst.index(val)             return first index of value
lst.insert(ind, item)      insert an item before the item at index ind
lst.pop(), lst.pop(ind)    remove and return the last item or the item at index ind
lst.remove(value)          remove first occurrence of a value
lst.reverse()              reverse the list
lst.sort()                 sort the list

lst.count(val) and lst.index(val) are queries that do not change the list; the other methods modify it in place.
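A short sketch of the difference between queries and modifying methods, reusing the reminder list from the first slide. The variable names occurrences, position and last are illustrative only.

```python
my_list = [2, 3, 5, 7, 11]

# Queries: the list is unchanged afterwards.
occurrences = my_list.count(5)   # how many times 5 appears
position = my_list.index(7)      # first index holding 7

# Modifying methods: each call changes my_list itself.
my_list.append(13)               # [2, 3, 5, 7, 11, 13]
my_list.insert(0, 1)             # [1, 2, 3, 5, 7, 11, 13]
last = my_list.pop()             # removes and returns 13
my_list.remove(1)                # [2, 3, 5, 7, 11]
my_list.reverse()                # [11, 7, 5, 3, 2]
my_list.sort()                   # [2, 3, 5, 7, 11]

print(occurrences, position, last, my_list)
```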

3 Tuples

A tuple is similar to a list, but it is immutable. Syntax: note the parentheses!

>>> t = ("don't", "worry", "be", "happy")  # definition
>>> t
("don't", 'worry', 'be', 'happy')
>>> t[0]  # indexing
"don't"
>>> t[-1]  # backwards indexing
'happy'
>>> t[1:3]  # slicing
('worry', 'be')

4 Tuples

>>> t[0] = 'do'  # try to change
Traceback (most recent call last):
  ...
TypeError: 'tuple' object does not support item assignment

No append / extend / remove in Tuples!

5 Tuples

• Fixed size
• Immutable (similarly to Strings)
• What are they good for (compared to list)?
  • Simpler ("light weight")
  • Stuff multiple things into a single container
  • Immutable (e.g., records in database, safe code)
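A minimal sketch of a tuple used as a small immutable record. The field layout (name, town, phone) is a hypothetical example, not something defined in the lecture.

```python
# A tuple bundles several values into one fixed-size container.
record = ("Kenny McCormick", "Southpark", "2222")

# Unpacking assigns each element to its own name.
name, town, phone = record

# Any attempt to modify the tuple raises TypeError.
try:
    record[2] = "1632"
    changed = True
except TypeError:
    changed = False

print(name, town, phone, changed)
```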

6 Dictionaries (Hash Tables)

[Figure: keys mapped to values]

• Key – Value mapping
• No order (not guaranteed before Python 3.7; newer versions keep insertion order)
• Fast!
• Usage examples:
  • Database
  • Dictionary
  • Phone book

7 Dictionaries

Access to the data in the dictionary:
• Given a key, it is easy to get the value.
• Given a value, you need to go over all the dictionary to get the key.

Intuition - Yellow Pages:
• Given a name, it's easy to find the right phone number
• Given a phone number it's difficult to match the name
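The asymmetry between the two directions can be sketched as follows. The helper find_name is a hypothetical name introduced for this illustration; the phone numbers are the ones used in the phone book example of this lecture.

```python
phonebook = {'Eric Cartman': '2020', 'Stan March': '5711', 'Kyle Broflovski': '2781'}

# Key -> value: a single direct lookup.
number = phonebook['Stan March']

def find_name(book, wanted_number):
    """Value -> key: scan the pairs one by one until the number matches."""
    for name, num in book.items():
        if num == wanted_number:
            return name
    return None  # no entry has this number

owner = find_name(phonebook, '2781')
print(number, owner)
```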

8 Dictionaries

Dictionary: a set of key-value pairs.

>>> dict_name = {key1:val1, key2:val2,…}

Keys are unique and immutable.

9 Dictionaries

Example: “144” - Map names to phone numbers:

>>> phonebook = {'Eric Cartman': '2020', 'Stan March': '5711', 'Kyle Broflovski': '2781'}
>>> phonebook
{'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Stan March': '5711'}

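Note: the order of the pairs changed! (This is the behavior of older Python versions; since Python 3.7 a dictionary keeps insertion order.)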

10 Dictionaries

Access dictionary items:
>>> phonebook['Eric Cartman']
'2020'

Add a new person:
>>> phonebook['Kenny McCormick'] = '1632'
>>> phonebook
{'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Kenny McCormick': '1632', 'Stan March': '5711'}

11 Dictionaries

What happens when we add a key that already exists?
>>> phonebook['Kenny McCormick'] = '2222'
>>> phonebook
{'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Kenny McCormick': '2222', 'Stan March': '5711'}

Kenny’s phone was previously 1632 and now changed to 2222

How can we add another Kenny McCormick in the phone book?

12 Dictionaries

Idea: add the address to the key, forming a new key

>>> phonebook = {['Kenny McCormick', 'Southpark']: '2222'}
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'

What’s the problem?

13 Dictionaries

Fix: use tuples as keys!

>>> phonebook = {('Kenny McCormick', 'Southpark'): '2222'}
>>> phonebook
{('Kenny McCormick', 'Southpark'): '2222'}
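A small sketch of why the tuple works where the list did not: a list is mutable and therefore has no hash value, while a tuple of strings does. The second address, 'Denver', is made up for the example.

```python
phonebook = {}

# Two people with the same name are told apart by the second tuple element.
phonebook[('Kenny McCormick', 'Southpark')] = '2222'
phonebook[('Kenny McCormick', 'Denver')] = '1632'   # hypothetical second Kenny

# A list cannot serve as a key: Python raises TypeError (unhashable type).
try:
    phonebook[['Kenny McCormick', 'Southpark']] = '0000'
    list_key_accepted = True
except TypeError:
    list_key_accepted = False

print(len(phonebook), list_key_accepted)
```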

14 Dictionary Methods

Function        Description
D.get(k, [d])   return D[k] for k in D, otherwise d (default: None).
k in D          True if D has a key k, False otherwise (D.has_key(k) in old Python).
D.items()       list of (key, value) pairs, as 2-tuples
D.keys()        list of D's keys
D.values()      list of D's values
D.pop(k, [d])   remove the specified key k and return its value. If k is not found, return d (or raise KeyError if d is not given).

• An argument shown in [] is optional.
• In Python 3, D.items(), D.keys() and D.values() return view objects rather than lists.

What is: v in D.values() ?
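The methods in the table can be tried on a two-entry phone book. This is only a sketch; the variable names are illustrative. It also shows that v in D.values() tests whether v appears among the values, which requires going over them rather than a direct lookup.

```python
D = {'Eric Cartman': '2020', 'Stan March': '5711'}

found = D.get('Eric Cartman')                    # key present: its value
missing = D.get('Kenny McCormick')               # key absent, no default: None
fallback = D.get('Kenny McCormick', 'unknown')   # key absent: the default

has_stan = 'Stan March' in D                     # membership tests the keys
has_number = '5711' in D.values()                # tests the values instead

pairs = sorted(D.items())                        # (key, value) 2-tuples
removed = D.pop('Stan March')                    # removes the key, returns its value
removed_again = D.pop('Stan March', None)        # key already gone: the default

print(found, missing, fallback, has_stan, has_number, pairs, removed, removed_again)
```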

15 Example: Frequency Counter

• Assume you want to learn about the frequencies of English letter usage in text.
• You find a long and representative text and start counting.
• Which data structure will you use to keep your findings?

s u p e r c a l i f r a g i l i s t i c e x p i a l i d o c i o u s

16 Frequency Counter

text = 'supercalifragilisticexpialidocious'

# count letters – build letters histogram
char_count = {}
for char in text:
    count = char_count.get(char, 0)
    count += 1
    char_count[char] = count

# a shorter version:
char_count = {}
for char in text:
    char_count[char] = char_count.get(char, 0) + 1

>>> char_count
{'f': 1, 'e': 2, 'r': 2, 'g': 1, 'i': 7, 's': 3, 'l': 3, 'd': 1, 'o': 2, 'x': 1, 'c': 3, 'a': 3, 't': 1, 'u': 2, 'p': 2}

17 Frequency Counter

# text = 'supercalifragilisticexpialidocious'

# sort alphabetically
# sorted(iterable): returns a sorted list of the objects in iterable (e.g., lists, strings)
chars = char_count.keys()
sorted_chars = sorted(chars)

# print
for char in sorted_chars:
    print(char, ':', char_count[char])

The output is:
a : 3
c : 3
d : 1
e : 2
f : 1
…

18 Application: Data Analysis

• We are living in the digital information era
  • Google
  • etc.
• To understand the information we "see" we need to
  • inspect
  • clean
  • transform
  • model
• This process is crucial for decision making

19 Data Analysis Examples

• Google
• Stock market trends
• Genome-disease association
• Face recognition
• Business intelligence
• Speech recognition
• Text categorization

20 Text Categorization / Document Classification

21 How is it Done?
• Manually
• Automatically
  • Gather document statistics
  • Measure how similar it is to documents in each category
• Today we will collect word statistics from several well-known books

22 Plan
• Find data
• Collect word statistics
• Observe results

23 Find Data
• This might be the hardest task for many applications!
• Project Gutenberg (http://www.gutenberg.org/)
  • Alice's Adventures in Wonderland (https://raw.githubusercontent.com/GITenberg/Alice-s-Adventures-in-Wonderland_11/master/11.txt)
  • The Bible, King James version, Book 1: Genesis (https://raw.githubusercontent.com/GITenberg/The-Bible-King-James-Version-Complete_30/master/30.txt)

24 Reading an online Book

import urllib.request  # urllib in old Python 2.xx
url = urllib.request.urlopen('https://raw.githubusercontent.com/GITenberg/Alice-s-Adventures-in-Wonderland_11/master/11.txt')
alice = url.read()

URL = uniform resource locator. A web-address, which is a string that constitutes a reference to a web resource.

25 Print Most Popular Words (High Level)

print_most_popular:
Input: a URL address url, an integer n, a string book_name
Output: the function reads the text, finds the top n most popular words, and prints the book name and the popular words.

26 Modular Programming

• Top-down approach: first write what you plan to do and then implement the details
  • Clear for readers
  • Easy to debug and test
  • Easy to maintain

27 PrintMostPopular

def print_most_popular(url, n, book_name):
    '''url - text
       n - num of popular words to print
       book_name - name of book
    '''
    # read the entire file; in Python 3 read() returns bytes, so decode them into a string
    text = url.read().decode('utf-8')
    # str.split() returns a list; it splits the string by whitespace
    words = text.split()

    # build word-occurrences dictionary
    word_count = {}
    for w in words:
        word_count[w] = word_count.get(w, 0) + 1

    counts = word_count.values()
    counts = sorted(counts)
    threshold = counts[-n]

    print('*****', book_name, '*****')
    for word in word_count.keys():
        if word_count[word] >= threshold:
            print(word, word_count[word])

28 Results
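As a stand-in for a run on a full book, the same counting and threshold logic can be tried on a short string. This is a sketch: the function name most_popular_words and the sample sentence are assumptions made for the illustration, and the function returns the pairs instead of printing them so that the result is easy to inspect.

```python
def most_popular_words(text, n):
    """Return (word, count) pairs whose count reaches the n-th highest count."""
    word_count = {}
    for w in text.split():
        word_count[w] = word_count.get(w, 0) + 1

    counts = sorted(word_count.values())
    threshold = counts[-n]   # assumes 1 <= n <= number of distinct words

    popular = [(w, c) for w, c in word_count.items() if c >= threshold]
    # Highest count first; ties are ordered alphabetically.
    return sorted(popular, key=lambda pair: (-pair[1], pair[0]))

sample = "the cat and the dog and the bird"
top = most_popular_words(sample, 2)
print(top)
```

Because the selection uses a threshold on the count, words tied with the n-th most popular word are all included, so more than n words may come back.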

29 How is it Really Done?

• Preprocessing (e.g., words to lower case, remove punctuation signs)
• Word count
• Enhance statistics
  • Discard stop words (e.g., and, of, a)
  • Stemming (e.g., go & went)
  • Synonyms
  • bigrams, trigrams
• Similarity measures to existing documents / categories
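A minimal sketch of the first steps listed above: lower-casing, removing punctuation, discarding stop words, and forming bigrams. The names preprocess, bigrams and STOP_WORDS are hypothetical, and the stop-word set is a tiny illustrative list rather than a standard one. Stemming and synonym handling are not attempted here.

```python
import string

STOP_WORDS = {'and', 'of', 'a', 'the'}   # illustrative only

def preprocess(text):
    """Lowercase, strip punctuation, split into words, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans('', '', string.punctuation))
    return [w for w in no_punct.split() if w not in STOP_WORDS]

def bigrams(words):
    """Pairs of consecutive words."""
    return list(zip(words, words[1:]))

words = preprocess("The Queen of Hearts, she made some tarts!")
pairs = bigrams(words)
print(words)
print(pairs)
```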

30 More about dictionaries… Hash Functions
• The type of a dictionary's keys must be
  • Immutable
  • Hashable

31 Hashing
• Hash function h: mapping from U to the slots of a hash table T[0..m–1].
  h : U → {0, 1, …, m–1}
• With arrays, key k maps to slot A[k].
• With hash tables, key k maps or "hashes" to slot T[h(k)].
• h(k) is the hash value of key k.

Hashing

[Figure: the hash function h maps the actual keys K = {k1, …, k5}, a subset of the universe of keys U, to table slots 0..m–1; h(k2) = h(k5) is a collision.]

Issues with Hashing
• Multiple keys can hash to the same slot – collisions are possible.
  • Design hash functions such that collisions are minimized.
  • But avoiding collisions entirely is impossible.
  • Design collision-resolution techniques.
• If keys are well dispersed in the table then all operations can be made to have very fast running time

Method of Resolution

• Chaining:
  • Store all elements that hash to the same slot in a linked list.
  • Store a pointer to the head of the linked list in the hash table slot.

[Figure: table slots 0..m–1, each pointing to a chain of keys (k1, k4; k5, k2, k6; k7, k3; k8).]

Collision Resolution by Chaining

[Figure: keys from the universe U hashed into slots 0..m–1; h(k1) = h(k4), h(k2) = h(k5) = h(k6), h(k3) = h(k7), and h(k8) has a slot of its own.]

Comp 122, Fall 2003

Collision Resolution by Chaining

[Figure: the same table with each slot holding its chain: k1 → k4; k5 → k2 → k6; k7 → k3; k8.]
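Hashing, collisions and chaining can be sketched together in a few lines. Everything here is an assumption made for illustration: the toy hash function h (sum of character codes modulo M), the table size M = 8, and the class name ChainedTable. Python lists stand in for the linked lists, and this is not how Python's own dict is implemented.

```python
M = 8  # number of slots in the table (arbitrary choice for the example)

def h(key):
    """Toy hash function: sum of character codes modulo M."""
    return sum(ord(ch) for ch in key) % M

class ChainedTable:
    def __init__(self):
        self.slots = [[] for _ in range(M)]   # one chain per slot

    def put(self, key, value):
        chain = self.slots[h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already stored: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))            # new key: extend the chain

    def get(self, key, default=None):
        for k, v in self.slots[h(key)]:       # search only this key's chain
            if k == key:
                return v
        return default

table = ChainedTable()
table.put('ab', 1)
table.put('ba', 2)    # same letters, same sum: collides with 'ab'
table.put('cd', 3)
table.put('ab', 10)   # existing key: value replaced, chain length unchanged

print(h('ab'), h('ba'), h('cd'), table.get('ab'), table.get('ba'))
```

A lookup only walks the chain of one slot, so it stays fast as long as the keys are well dispersed and the chains remain short.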

More about dictionaries… Sorting Dictionaries

numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}

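Dictionaries are not ordered (output shown as in older Python versions; since Python 3.7 a dictionary keeps insertion order)
>>> numbers.keys()  # no order is guaranteed
['second', 'Fourth', 'third', 'first']
>>> numbers.values()  # no order is guaranteed
[2, 4, 3, 1]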

Sorting the keys of a dictionary
# This is the same as calling sorted(numbers.keys())
# (uppercase letters sort before lowercase ones)
>>> sorted(numbers)
['Fourth', 'first', 'second', 'third']

Sorting the values of a dictionary
# We have to call numbers.values() here
>>> sorted(numbers.values())
[1, 2, 3, 4]

38 More about dictionaries… Sorting Dictionaries

numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}

Printing keys+values sorted by keys
>>> for key in sorted(numbers.keys()):
...     print("%s: %d" % (key, numbers[key]))
...
Fourth: 4
first: 1
second: 2
third: 3

39 More about dictionaries… Sorting Dictionaries

List comprehension is a natural way to generate a list:
>>> S = [x**2 for x in range(10)]
>>> S
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Useful for dictionaries too:

numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}

>>> [value for (key, value) in sorted(numbers.items(), reverse=True)]
[3, 2, 1, 4]
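A few more comprehensions over the same dictionary, as a sketch. The variable names are illustrative; key=str.lower is the standard way to ask sorted to compare strings case-insensitively.

```python
numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}

# Values in ascending order of their keys ('Fourth' sorts first: uppercase).
by_key = [numbers[key] for key in sorted(numbers)]

# Same idea, with the keys compared case-insensitively.
by_key_ignoring_case = [numbers[key] for key in sorted(numbers, key=str.lower)]

# A dictionary comprehension builds a new dictionary; here, value -> key.
inverted = {value: key for key, value in numbers.items()}

print(by_key, by_key_ignoring_case, inverted)
```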

40 Sort Dictionary Values By Key

• We wish to write a function that given a dictionary returns a list of its values, sorted by the keys (in ascending order of the keys)

• This can be done in several ways, let’s try a recursive solution.

41 Sort a Dictionary Values By Key recursively
• Base case:
  • For an empty dictionary we return an empty list (no values)
• Recursive call:
  • Select the maximum key and add its value to the result list as the first element.
  • Recursive call for the rest of the dictionary and add the results to the list of values.

42 Sort a Dictionary Values By Key

Find the max key

Find value for max key, and remove the key from the dictionary

Add max_val to the result list

Recursive call for d after max_key is removed (a smaller problem). The result list is extended with the result of the recursive call
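The code these annotations refer to did not survive in this text, so the following is only a sketch of one possible recursive solution, not the original. The function name values_sorted_by_key is an assumption. Two choices are made here: the function works on a copy so the caller's dictionary is left untouched, and the value of the maximum key is placed after the result of the recursive call so that the final list comes out in ascending order of the keys.

```python
def values_sorted_by_key(d):
    """Return the values of d ordered by ascending key, recursively."""
    if not d:                       # base case: empty dictionary, no values
        return []
    rest = dict(d)                  # copy, so the caller's dictionary is unchanged
    max_key = max(rest)             # find the max key
    max_val = rest.pop(max_key)     # get its value and remove the key
    # Recursive call on the smaller dictionary; the max key's value goes last.
    return values_sorted_by_key(rest) + [max_val]

numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}
result = values_sorted_by_key(numbers)
print(result)
```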
