Dictionaries
Introduction to Programming with Python
Lecture 7: Dictionaries

Lists - reminder

A list is an ordered sequence of elements.
Create a list in Python:

    >>> my_list = [2,3,5,7,11]
    >>> my_list
    [2, 3, 5, 7, 11]
    >>> my_list[0]
    2
    >>> my_list[-1]
    11

List Methods

    Function                   Description
    lst.append(item)           append an item to the end of the list
    lst.count(val)             return the number of occurrences of value
    lst.extend(another_lst)    extend list by appending items from another list
    lst.index(val)             return first index of value
    lst.insert(ind, item)      insert an item before item at index ind
    lst.pop(), lst.pop(ind)    remove and return the last item or item at index ind
    lst.remove(value)          remove first occurrence of a value
    lst.reverse()              reverse the list
    lst.sort()                 sort the list

Note: count and index are queries that do not change the list; the other methods modify it in place.

Tuples

A tuple is similar to a list, but it is immutable.
Syntax: note the parentheses!

    >>> t = ("don't", "worry", "be", "happy")   # definition
    >>> t
    ("don't", 'worry', 'be', 'happy')
    >>> t[0]      # indexing
    "don't"
    >>> t[-1]     # backwards indexing
    'happy'
    >>> t[1:3]    # slicing
    ('worry', 'be')

Tuples

    >>> t[0] = 'do'   # try to change
    Traceback (most recent call last):
      File "<pyshell#2>", line 1, in <module>
        t[0]='do'
    TypeError: 'tuple' object does not support item assignment

No append / extend / remove in tuples!

Tuples

• Fixed size
• Immutable (similarly to strings)
• What are they good for (compared to lists)?
  • Simpler ("lightweight")
  • Stuff multiple things into a single container
  • Immutable (e.g., records in a database, safe code)

Dictionaries (Hash Tables)

• Key – Value mapping
• No order (not guaranteed before Python 3.7)
• Fast!
• Usage examples:
  • Database
  • Dictionary
  • Phone book

Dictionaries

Access to the data in the dictionary:
• Given a key, it is easy to get the value.
• Given a value, you need to go over the whole dictionary to get the key.
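The difference between the two directions of access can be sketched in a few lines. This is only an illustration and not part of the original slides: the `ages` dictionary and the helper name `find_keys_by_value` are made up for the example.

```python
# Hypothetical example data (not from the lecture).
ages = {'alice': 31, 'bob': 27, 'carol': 31}

# Key -> value: a single direct lookup.
print(ages['bob'])                      # 27


def find_keys_by_value(d, target):
    """Return every key whose value equals target (scans all pairs)."""
    return [key for key, value in d.items() if value == target]


# Value -> key: we have to go over the whole dictionary,
# and more than one key may match.
print(find_keys_by_value(ages, 31))     # ['alice', 'carol']
```

Note that the reverse lookup returns a list, because values, unlike keys, do not have to be unique.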
Intuition - Yellow Pages:
• Given a name, it's easy to find the right phone number.
• Given a phone number, it's difficult to match the name.

Dictionaries

Dictionary: a set of key-value pairs.

    >>> dict_name = {key1:val1, key2:val2, …}

Keys are unique and immutable.

Dictionaries

Example: "144" - Map names to phone numbers:

    >>> phonebook = {'Eric Cartman': '2020', 'Stan March': '5711', 'Kyle Broflovski': '2781'}
    >>> phonebook
    {'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Stan March': '5711'}

Note: the order of the pairs changed! (This output is from an older Python; since Python 3.7, dictionaries keep insertion order.)

Dictionaries

Access dictionary items:

    >>> phonebook['Eric Cartman']
    '2020'

Add a new person:

    >>> phonebook['Kenny McCormick'] = '1632'
    >>> phonebook
    {'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Kenny McCormick': '1632', 'Stan March': '5711'}

Dictionaries

What happens when we add a key that already exists?

    >>> phonebook['Kenny McCormick'] = '2222'
    >>> phonebook
    {'Kyle Broflovski': '2781', 'Eric Cartman': '2020', 'Kenny McCormick': '2222', 'Stan March': '5711'}

Kenny's number was previously 1632 and is now 2222: assigning to an existing key overwrites its value.
How can we add another Kenny McCormick to the phone book?

Dictionaries

Idea: add the address to the key.

    >>> phonebook = {['Kenny McCormick', 'Southpark']: '2222'}
    Traceback (most recent call last):
      File "<pyshell#15>", line 1, in <module>
        phonebook= {['Kenny McCormick', 'Southpark']: '2222'}
    TypeError: unhashable type: 'list'

What's the problem?

Dictionaries

Fix: use tuples as keys! (Lists are mutable and therefore unhashable; tuples are immutable.)

    >>> phonebook = {('Kenny McCormick', 'Southpark'): '2222'}
    >>> phonebook
    {('Kenny McCormick', 'Southpark'): '2222'}

Dictionary Methods

    Function         Description
    D.get(k, [d])    return D[k] for k in D, otherwise d (default: None).
    k in D           True if D has a key k, False otherwise. (D.has_key(k) in old Python)
    D.items()        D's (key, value) pairs, as 2-tuples
    D.keys()         D's keys
    D.values()       D's values
    D.pop(k, [d])    remove the specified key k and return its value. If k is not found, return d (KeyError if d is not given).
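The methods in the table can be tried out on a small dictionary. The following is a minimal Python 3 sketch with made-up data (the `stock` dictionary is not from the lecture). In Python 3, `keys()`, `values()` and `items()` return view objects rather than lists, so they are wrapped in `list()` for display.

```python
# Hypothetical data for illustration only.
stock = {'apples': 3, 'pears': 0}

# get: returns the value, or the default when the key is missing.
print(stock.get('apples'))      # existing key
print(stock.get('plums'))       # missing key, default is None
print(stock.get('plums', 0))    # missing key, explicit default

# Membership test on keys.
print('pears' in stock)
print('plums' in stock)

# keys / values / items, wrapped in list() to display them.
print(list(stock.keys()))
print(list(stock.values()))
print(list(stock.items()))

# pop: removes the key and returns its value;
# returns the given default if the key is missing.
removed = stock.pop('pears')
missing = stock.pop('plums', -1)
print(removed, missing, stock)
```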
• An argument shown in [] has a default value.
• What is: v in D.values() ?

Example: Frequency Counter

Assume you want to learn about the frequencies of English letter usage in text. You find a long and representative text and start counting.
Which data structure will you use to keep your findings?

    s u p e r c a l i f r a g i l i s t i c e x p i a l i d o c i o u s

Frequency Counter

    text = 'supercalifragilisticexpialidocious'
    # count letters - build letters histogram
    char_count = {}
    for char in text:
        count = char_count.get(char, 0)
        count += 1
        char_count[char] = count

    # a shorter version:
    char_count = {}
    for char in text:
        char_count[char] = char_count.get(char, 0) + 1

    >>> char_count
    {'f': 1, 'e': 2, 'r': 2, 'g': 1, 'i': 7, 's': 3, 'l': 3, 'd': 1, 'o': 2, 'x': 1, 'c': 3, 'a': 3, 't': 1, 'u': 2, 'p': 2}

Frequency Counter

    # text = 'supercalifragilisticexpialidocious'
    # sort alphabetically
    chars = char_count.keys()
    # sorted(iterable): returns a sorted list of the objects
    # in iterable (e.g., lists, strings)
    sorted_chars = sorted(chars)
    # print
    for char in sorted_chars:
        print(char, ':', char_count[char])

The output is:

    a : 3
    c : 3
    d : 1
    e : 2
    f : 1
    …

Application: Data Analysis

• We are living in the digital information era
  • Google
  • Facebook etc.
• To understand the information we "see" we need to
  • inspect
  • clean
  • transform
  • model
• This process is crucial for decision making

Data Analysis Examples

• Google
• Stock market trends
• Genome-disease association
• Face recognition
• Business intelligence
• Speech recognition
• Text categorization

Text Categorization / Document Classification

How is it Done?

• Manually
• Automatically
  • Gather document statistics
  • Measure how similar the document is to documents in each category
• Today we will collect word statistics from several well-known books

Plan

• Find data
• Collect word statistics
• Observe results

Find Data

• This might be the hardest task for many applications!
• Project Gutenberg (http://www.gutenberg.org/)
  • Alice's Adventures in Wonderland (https://raw.githubusercontent.com/GITenberg/Alice-s-Adventures-in-Wonderland_11/master/11.txt)
  • The Bible, King James version, Book 1: Genesis (https://raw.githubusercontent.com/GITenberg/The-Bible-King-James-Version-Complete_30/master/30.txt)

Reading an Online Book

    import urllib.request    # urllib in old Python 2.xx
    url = urllib.request.urlopen('https://raw.githubusercontent.com/GITenberg/Alice-s-Adventures-in-Wonderland_11/master/11.txt')
    alice = url.read()

URL = uniform resource locator. A web address: a string that constitutes a reference to a web resource.

Print Most Popular Words (High Level)

print_most_popular:
Input: an opened URL url (as returned by urlopen), an integer n, a string book_name
Output: The function reads the text, finds the n most popular words, and prints the book name and the popular words.

Modular Programming

• Top-down approach: first write what you plan to do and then implement the details
  • Clear for readers
  • Easy to debug and test
  • Easy to maintain

print_most_popular: Build Word-Occurrences Dictionary

    def print_most_popular(url, n, book_name):
        '''url - opened URL to read the text from
        n - number of popular words to print
        book_name - name of the book
        '''
        # read() returns the entire file as bytes;
        # decode it into a string (assuming UTF-8 text)
        text = url.read().decode('utf-8')
        # str.split() returns a list: splits the string by whitespace
        words = text.split()
        word_count = {}
        for w in words:
            word_count[w] = word_count.get(w, 0) + 1
        counts = word_count.values()
        counts = sorted(counts)
        # the n-th largest count; words tied with it are printed too
        threshold = counts[-n]
        print('*****', book_name, '*****')
        for word in word_count.keys():
            if word_count[word] >= threshold:
                print(word, word_count[word])

Results

How is it Really Done?
• Preprocessing (e.g., words to lower case, remove punctuation signs)
• Word count
• Enhance statistics
  • Discard stop words (e.g., and, of, a)
  • Stemming (e.g., go & went)
  • Synonyms
  • Bigrams, trigrams
• Similarity measures to existing documents / categories

More about dictionaries… Hash Functions

• The type of a dictionary's keys must be
  • Immutable
  • Hashable

Hashing

• Hash function h: a mapping from U to the slots of a hash table T[0..m–1].
      h : U → {0, 1, …, m–1}
• With arrays, key k maps to slot A[k].
• With hash tables, key k maps or "hashes" to slot T[h(k)].
• h(k) is the hash value of key k.

Hashing

[Figure: the actual keys K = {k1, …, k5}, a subset of the universe of keys U, are mapped by h to slots 0..m–1 of the table; k2 and k5 collide, h(k2) = h(k5).]

Issues with Hashing

• Multiple keys can hash to the same slot – collisions are possible.
  • Design hash functions such that collisions are minimized.
  • But avoiding collisions entirely is impossible.
  • Design collision-resolution techniques.
• If the keys are well dispersed in the table, then all operations can be made to have a very fast running time.

Method of Resolution

• Chaining:
  • Store all elements that hash to the same slot in a linked list.
  • Store a pointer to the head of the linked list in the hash table slot.
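The chaining idea can be sketched as a toy hash table. This is a minimal illustration under stated assumptions, not the lecture's code and not a description of how Python's built-in dict works: the class name `ChainedHashTable` and its methods are made up, Python lists stand in for the linked lists, and the built-in `hash()` reduced modulo m plays the role of h.

```python
class ChainedHashTable:
    """Toy hash table with chaining; Python lists stand in for linked lists."""

    def __init__(self, m=8):
        self.m = m
        # T[0..m-1]: one (initially empty) chain per slot.
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Hash function h: maps a key to a slot index in 0..m-1.
        return hash(key) % self.m

    def put(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))        # new key: extend the slot's chain

    def get(self, key, default=None):
        # Only the chain of slot h(key) has to be searched.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return default


table = ChainedHashTable(m=2)             # tiny table, so collisions must occur
for name, number in [('a', '1'), ('b', '2'), ('c', '3')]:
    table.put(name, number)
table.put('a', '9')                       # existing key: value is replaced
print(table.get('a'), table.get('c'), table.get('zzz'))   # 9 3 None
```

With three keys and only two slots, at least two keys share a slot, yet every lookup still returns the right value because the whole chain of that slot is searched.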
Collision Resolution by Chaining

[Figure: keys from the universe U hash into table slots 0..m–1; colliding keys share a slot: h(k1) = h(k4), h(k2) = h(k5) = h(k6), h(k3) = h(k7), and h(k8) in a slot of its own.]

(Comp 122, Fall 2003)

Collision Resolution by Chaining

[Figure: the same table, with each occupied slot pointing to a linked list of the keys that hash to it.]

More about dictionaries… Sorting Dictionaries

    numbers = {'first': 1, 'second': 2, 'third': 3, 'Fourth': 4}

Dictionaries are not ordered (the outputs below are from an older Python; since Python 3.7, dictionaries keep insertion order, and keys() and values() return views rather than lists):

    >>> numbers.keys()      # no order is guaranteed
    ['second', 'Fourth', 'third', 'first']
    >>> numbers.values()    # no order is guaranteed
    [2, 4, 3, 1]

Sorting the keys of a dictionary:

    # This is the same as calling sorted(numbers.keys())
    >>> sorted(numbers)
    ['Fourth', 'first', 'second', 'third']

Sorting the values of a dictionary:

    # We have to call numbers.values() here
    >>> sorted(numbers.values())
    [1, 2, 3,