<<

CS 111 Green: Program Design I Lecture 25: Phylogenetic Trees; Brian Green, ShareAlike 2.0 license Speed; Trees in CS; Speed

Robert H. Sloan (CS) & Rachel Poretsky (Bio) By Nascarking - Own work, CC BY-SA 4.0, University of Illinois, Chicago https://commons.wikimedia.org/w/index.php?curid=38671041 November 28, 2017 As organisms evolve and diverge, their DNA sequences accumulate differences, also known as mutations.

http://media.hhmi.org/biointeractive/click/Phylogenetic_Trees/10.html Creating phylogenetic trees from DNA sequences Creating phylogenetic trees from DNA sequences

http://media.hhmi.org/biointeractive/click/Phylogenetic_Trees/22.html Molecular phylogeny predicts biology

SPEED 1: A CONNECTION TO DICTIONARIES Dictionary, oh how do I love thee?

n Lists in Python are wonderful, and amazing n The one thing that might be better is dictionaries n All-purpose for key-value linkages

q If you're Google and want to remember every page that has the phrase "Tree of Life" on it?

n Dictionary with string keys, and list of strings (URLS) values

q Doing any sort of translation, ranging from English to Spanish to mRNA to Amino acids?

q Etc.? n What about speed? Big speed differences

n Much of both code we've written and computer applications we use take not time at all n Open a 30 page document in Word (very roughly 90,000 characters) and search for a word in it n Color changes in Photoshop or iPhoto happen as you change the slider n But, if you wrote things in an unfortunate way, your DNA manipulation could have slowed down at 5,000 characters, and population simulation could be quite slow Where does the speed go?

n Is it that Word (and Photoshop, etc.) are so fast? n Or that Python is so slow? n Or that our data in biology is so big? n It’s some of each, not a simple problem with an obvious answer. n We’ll consider today: How fast can "ordinary" computers get

q If time later this week or next: What’s not computable, no matter how fast you go First: How fast are things, including Prof. Sloan's Computer

n Recall from start of class: Computer consists of, as key elements, a CPU and RAM (working memory) n Relevant to recent Piazza discussion of buying computers n Prof. Sloan's computer: medium-high end MacBook Air

q when it was purchased 4.5 years ago in mid-2013 n How would we determine speed of its CPU and amount of RAM? n And how fast it can execute some Python stuff? Let's create some lists of sizes

n 1 million n To create list of consecutive n 10 million integers up to 1000:

A. n 100 million range(1000)

B. list(range(1000))

C. Other

D. No clue Which came first, the dictionary or the list

n Which is faster, the dictionary or the list?

q Not spending much time (so far) in CS 111 on issues of how fast our programs run

q Find one item in list (or that it's missing). E.g., Python in for list

n Have to go through the entire list, can take lots of time if you have a really big list

q Find one item in dictionary:

n Constant time, independent of size of dictionary!

n Under the hood is the magic of hashing, discussed in CS 251 From playing around

n Even in (somewhat slow) Python on Prof. Sloan's (somewhat slow) laptop, 1 million simple operations go faster than human perception n Issues arise with:

q Really big numbers, like n = 100 million

q Medium-large numbers n where we need n**2 operations

n 10,000**2 is 100 million

q Small numbers n where we need 2**n operations

n 2**30 is around a billion Speed irrelevant for Toy Tree of Life

n Here is our toy tree of life, representing 15 species, in a dictionary, key = species, value = its parent: tax_dict = { 'Pan troglodytes' : 'Hominoidea', 'Pongo abelii' : 'Hominoidea', 'Hominoidea' : 'Simiiformes', 'Simiiformes' : 'Haplorrhini', 'Tarsius ' : 'Tarsiiformes', 'Haplorrhini' : '','Tarsiiformes' : 'Haplorrhini', ' tardigradus' : 'Lorisidae','Lorisidae' : '', 'Strepsirrhini' : 'Primates','Allocebus trichotis' : '', 'Lemuriformes' : 'Strepsirrhini',' alleni' : 'Lorisiformes', 'Lorisiformes' : 'Strepsirrhini','Galago moholi' : 'Lorisiformes'} Speed highly relevant for real Tree of Life

n Open Tree of Life has 2,637,204 descendant tips n And is too big to download through standard web services n But if you're a biologist working on evolution, you get it, and you really care about speed in accessing a node

q (Specialists who work on it may in fact be using a format just for things very much like it, called "Newick String") Is it faster?

n Is creating a 2.6 million entry dictionary dramatically faster than creating a 2.6 million item list?

A. Yes, that's the point of dictionaries

B. No, the creation part isn't where the speedup is

C. No clue TREES AND THE COMPUTER SCIENTIST Ways computer scientists organize, think about data

n Strings: Every CS 1 I've taught 1990–2017, many languages n Files: Every CS 1 I've taught 1990–2017, many languages n lists: Every CS 1 in Python n dataframes: Really useful for data science n dictionaries: Almost every CS 1 in Python n graphs (networks): Very useful abstract model, CS 251 n trees: Very useful abstract model, CS 251, first look here n matrices Trees to a computer scientist

n Nodes, connected by edges, each node has unique parent except one node that is the root (Here 2)

n Leaves: Nodes with no children (2, 4, 5, 11)

n Computer scientists always draw trees with root at top of page (unlike everybody else including biologists)

n Useful model of lots of things Another example: File/directory tree on computer

n / is the root of the tree

q (Linux or OS X)

q In Windows C:\

n Each branch or subtree is either a directory/folder or file (leaf) SPEED MORE GENERALLY What a computer really understands

n Computers really do not understand Python, nor Java, nor any other language. n The basic computer only understands one kind of language: machine language.

q Machine language consists of instructions to the computer expressed in terms of values in bytes.

q These instructions tell the computer to do very low-level activities. Machine language trips the right switches

n The computer doesn’t really understand machine language.

n The computer is just a machine, with lots of switches that make data flow this way or that way.

n Machine language is just a bunch of switch settings that cause the computer to do a bunch of other switch settings.

n We interpret those switchings to be addition, subtraction, loading, and storing.

q In the end, it’s all about encoding. A byte of switches Assembler and machine language

n Machine language looks just like a bunch of numbers. n Assembler language (Coming in CS 261) is a set of words that corresponds to the machine language.

q It’s a one-to-one relationship.

q A word of assembler equals one machine language instruction, typically.

n (Often, just a single byte.) Each kind of processor has its own machine language

n Today most laptops use some generation of Intel Core i5 or i7 CPU n There are many other processors

Each processor understands only its own machine language