Introduction to Scientific Computing for Biologists – 2013/2014 –

Stefano Allesina

Contents

1 Before we begin
  1.1 About this course
  1.2 What you need
  1.3 What you should do
  1.4 Typing it all
  1.5 Choosing an editor
  1.6 Homework, grades and final project

2 UNIX
  2.1 Why UNIX?
  2.2 Directory structure
  2.3 The shell
  2.4 Basic UNIX commands
  2.5 Command arguments
  2.6 Redirection and pipes
  2.7 Wildcards
  2.8 Selecting lines using grep
  2.9 Finding files with find
  2.10 Creating complex flows with xargs
  2.11 One-liners
  2.12 Installing software
  2.13 Exercises
    2.13.1 What does this do?
    2.13.2 FASTA
    2.13.3 Darwin
    2.13.4 Temperatures
  2.14 References and readings

3 Version control
  3.1 What is version control?
  3.2 git
  3.3 Installing git
  3.4 The three states
  3.5 First-time setup
  3.6 Your first repository
  3.7 Files to ignore
  3.8 Removing a file
  3.9 Accessing the history of the repository
  3.10 Reverting to a previous version
  3.11 Branching
  3.12 Remote repositories
  3.13 Trying it out
  3.14 Summary of basic commands
    3.14.1 Creating repositories
    3.14.2 Changes
    3.14.3 History
    3.14.4 Branches
    3.14.5 Remote repositories
    3.14.6 Discarding changes
  3.15 References and readings

4 Basic programming
  4.1 Programming in science
  4.2 Installing python
  4.3 Basic python
  4.4 Writing python code
  4.5 Control statements
  4.6 Loops in python
    4.6.1 Exercises
  4.7 Executing and importing your code
  4.8 Basic unit testing
  4.9 Variable scope
  4.10 Copying mutable objects
  4.11 String manipulation
  4.12 Input and output
  4.13 Exercises
    4.13.1 Similarity of DNA sequences
    4.13.2 Using docstrings
    4.13.3 Counting self-incompatibility
    4.13.4 C. elegans mortality
  4.14 References and readings

5 Scientific computing
  5.1 Using IPython
    5.1.1 Debugging with pdb
    5.1.2 Profiling in IPython
    5.1.3 Other magic commands
  5.2 Regular expressions
    5.2.1 Regular expressions in python
    5.2.2 Groups in regular expressions
  5.3 Exercises
    5.3.1 Translate
    5.3.2 Years
    5.3.3 Blackbirds
    5.3.4 Nature
    5.3.5 Average temperatures
  5.4 The scientific library SciPy
    5.4.1 Linear algebra
    5.4.2 Numerical integration
    5.4.3 Random number distributions
  5.5 Exercises
    5.5.1 Integration
    5.5.2 Political blogs
  5.6 References and readings

6 Scientific typesetting
  6.1 Why LATEX?
  6.2 Installing LATEX
  6.3 A basic example
  6.4 A brief tour of LATEX
    6.4.1 Spaces, new lines and special characters
    6.4.2 Document structure
    6.4.3 Typesetting math
    6.4.4 Comments
    6.4.5 Long documents
    6.4.6 Justification and alignment
    6.4.7 Sections and special environments
    6.4.8 Typesetting tables
    6.4.9 Typesetting matrices
    6.4.10 Figures
    6.4.11 Itemized and numbered lists
    6.4.12 Typefaces
    6.4.13 Bibliography
  6.5 Exercises
    6.5.1 Equations
    6.5.2 CV
    6.5.3 Reformatting a paper
  6.6 References and readings
    6.6.1 Great online resources
    6.6.2 Tutorials & Essays

7 Statistical computing
  7.1 What is R?
  7.2 Installing R
  7.3 Useful commands
  7.4 Basic operations
  7.5 Operators
  7.6 Data types
  7.7 Importing and Exporting Data
  7.8 Useful Functions
  7.9 Control Functions
  7.10 Type Conversion and Special Values
  7.11 Subsetting Data
  7.12 Basic Statistics in R
  7.13 Basic Graphs in R
  7.14 Writing Your Functions
  7.15 Packages
  7.16 Vectorize it!
  7.17 Launching R in a Terminal
  7.18 Analyzing Nepotism in Italian Academia
    7.18.1 Exploring the Data
    7.18.2 Core Functions
    7.18.3 Formatting and Multiple Hypothesis Testing
  7.19 Homework
  7.20 References

8 Drawing figures
  8.1 Publication-quality figures in R
  8.2 Basic graphs with qplot
  8.3 Various geom
  8.4 Exercise
  8.5 Advanced plotting: ggplot
  8.6 Case study 1: plotting a matrix
  8.7 Case study 2: plotting two dataframes
  8.8 Case study 3: annotating the plot
  8.9 Case study 4: mathematical display
  8.10 Readings

9 Databases
  9.1 What is a database?
  9.2 Types of databases
  9.3 Designing a database
  9.4 SQL
  9.5 Installing SQLite
  9.6 Your first database
    9.6.1 SELECT
  9.7 Creating tables
    9.7.1 Data types
    9.7.2 Importing csv data
  9.8 Joining tables
  9.9 Creating views
  9.10 Exercises
  9.11 Use a GUI
  9.12 Inserting and deleting records and tables
  9.13 Accessing databases programmatically
  9.14 Readings

10 Scripting and HPC
  10.1 What is scripting?
  10.2 Two types of scripts
  10.3 Tabs to commas, commas to tabs
  10.4 Variables
  10.5 Computing on beast
    10.5.1 Interactive use
    10.5.2 Batch use
    10.5.3 Writing a script that does it
    10.5.4 Useful commands

11 Final thoughts

1 — Before we begin

“Science is what we understand well enough to explain to a computer. Art is everything else we do.” — Donald Knuth

1.1 About this course

This course is something I have been thinking about for a long time. Before I started my Ph.D., I worked for a year as a computer programmer for a small telephone company. This experience dramatically changed the way I approached scientific research. Learning how to program efficiently, how to organize code and data, how to automate analysis and how to collaborate with others has greatly improved my abilities, and has led me to attempt projects that would have been impossible to complete without these skills. Throughout my career, I have taught my fellow students and postdocs programming tricks, encouraged them to further their computational skills, and constantly pursued new and better ways of carrying out my research. When I started my laboratory at the University of Chicago, this interest became a necessity: managing many projects involving many collaborators is very difficult if one does not organize the data, code, figures and papers in a logical and effective way.

Computing is a challenge for scientists, especially for those not trained in the so-called “hard sciences”. By definition, as scientists we are trying to do something no one has attempted before. As such, no off-the-shelf software is typically available for the analysis we want to perform. Hence the need for at least rudimentary programming skills. Moreover, data is growing exponentially in size and quality. This data deluge requires better data organization and flow. Agencies and publishers are more and more often requiring scientists to deposit the data and the code for analysis prior to papers' acceptance. Thus, the organization of code and data should be approached with the same effort we put into papers.

The choice of the material to cover reflects my own preferences and biases. We are going to use a Linux environment (or Mac OSX), and I will cover only the use of free software (i.e., open source software, which users can adapt and modify without charge). Throughout the text, I try to present possible alternatives, along with their advantages and disadvantages.

I will show how to automate repetitive tasks. Ideally, each project should be fully automated. This means that once we have gathered the data, the statistical analysis, the production of tables and figures, etc. should be automatic. This has three main advantages: a) repeating the same analysis for a different dataset is straightforward, b) the analysis is repeatable, c) small modifications of the workflow should be easy to implement. This third point is especially important for scientific work: reviewers will always ask you to implement slightly different approaches, and having the whole analysis automated will save you enormous amounts of time.

What follows is a series of semi-independent chapters covering various aspects of scientific computing, from programming to data organization. The emphasis is always on the practice, rather than the theory. The examples and exercises are, whenever possible, biologically motivated. I would greatly appreciate any feedback you might provide, e.g., What is missing? What is not clear? What should be explained in more detail? What could be removed? Also, if you have data or analyses that could be suitable for the examples and exercises, please come and see me.

1.2 What you need

All that you need is a laptop running Linux (preferred) or Mac OSX (acceptable) and an Internet connection. Throughout the class, I will illustrate the examples and exercises on a machine running Ubuntu (a popular version of Linux). If you have a Mac, using Mac OSX is fine: most commands will give you exactly the desired result, and I'll try to mention the subtle differences between the two. Windows is not an option, as the required commands and approaches would be too different from those presented in class. If you have a Windows laptop, there are several options:

Dual-boot This is the preferred option. Basically, you will install Ubuntu along with Windows in two different partitions of the hard disk. When you turn your computer on, you will be asked to choose which system to launch. To transform your laptop into a dual-boot machine, Google “dual boot Ubuntu + Windows 7/XP” and follow the instructions that seem most recent and easiest to follow. Many YouTube videos illustrate the procedure.

USB key You can install Ubuntu on a USB stick and launch it at startup. This solution is quite slow, though, and I am not sure you can actually install software.

Virtual machine You can run Ubuntu as a normal Windows program. I have never tried this, but it seems quite straightforward. VMware and Wubi seem quite popular.

Resuscitate an old laptop If you have an old laptop sitting around, you can probably resuscitate it and run Linux on it.

Borrow a Mac The Department of Ecology & Evolution has three Mac laptops that can be used for the class.

Remark: Using Unix-like systems has a steep learning curve. However, the effort pays off really quickly. Try asking someone in my lab about their experience.

1.3 What you should do

Before the first class, please open a terminal and run the following commands, which will download the files containing the exercises and the homework. To open a terminal in Ubuntu, press CTRL+ALT+T. Then type:

$ cd ~
$ sudo apt-get install git
$ git clone https://bitbucket.org/AllesinaLab/introscicomp2014.git

In Mac, first go to git-scm.com/downloads and install git. Then open a terminal (in Spotlight, type Terminal) and type:

$ cd ~
$ git clone https://bitbucket.org/AllesinaLab/introscicomp2014.git

1.4 Typing it all

I will provide printed notes for each chapter, containing several snippets of code. You will type it all during the class. I could have provided you with the source code in electronic format, and simply asked you to copy and paste it. However, typing the code makes you more aware of what we are doing, and certain things are better learned through your fingers than through your eyes.

1.5 Choosing an editor

You will need to edit code throughout the class. We will do so in a text editor (i.e., a program to edit text files and source code). If you already have an editor of choice, then use whatever you're comfortable with. If you don't, you have plenty of choices. The simplest to use are gedit (in Ubuntu) or jedit (Ubuntu and Mac). Geany and Sublime Text 2 also seem very popular. If you want a more professional light-weight code editor, choose vim or emacs. These have much steeper learning curves, but they pay off in the long run. I use emacs for most of my code and paper writing, and I will use emacs in class. If you want a full-blown IDE (Integrated Development Environment), which provides many advanced tools, use Eclipse in Ubuntu or Mac OSX, or XCode in Mac OSX. These are excellent programs, but a bit of an overkill for our class. The internet is filled with geeks arguing over which editor is “the best”. Even if such a thing existed, its advantages over the other editors would be pretty minor. So, pick one and stick with it!

1.6 Homework, grades and final project

Several exercises and readings conclude each chapter. You can either do these by yourself or collectively in the optional weekly session of the class. Because you are in graduate school, you should be self-motivated, and approach this homework seriously. Homework should be a great occasion to test and further your skills, rather than something you have to do because I said so. Contact me or the TA in case you have problems with the homework. Because this is the second time I am teaching this class, I will not grade your homework or give you a final grade. The class is going to be pass/fail unless you absolutely need a grade (because of fellowships etc.).

During the Quarter, you should work on your Final Project. The idea is to take a real problem in the lab where you are doing rotations, and solve it using the methods showcased in class. The problem can be anything, provided it satisfies at least three of the following:

• Requires a large data set or a large number of simulations (n > 5000).
• Requires programming in python or R.

• Requires running jobs on the computer cluster beast.uchicago.edu.
• Makes extensive use of regular expressions.
• Makes extensive use of scripting.
• Requires complex plots in ggplot2.

If you prefer, you can reproduce the analysis of a recent paper in your lab. If you run out of ideas, come and see me. Each project must be approved before you actually put time into it, and has to be completed by the end of the Quarter. A final report, written in LATEX and detailing the problem, the solution and the code, must accompany the project.

2 — UNIX

[Chapter opening comic: http://xkcd.com/934/]

2.1 Why UNIX?

UNIX is an operating system developed in the 1970s by a group of AT&T programmers (notably Brian Kernighan and Dennis Ritchie, the fathers of the C programming language). It is multi-user and network-oriented by design. It is characterized by the use of plain text files for storing data and a strictly hierarchical file system. There are many advantages to using a *nix operating system (UNIX, Linux, Mac OSX). First, UNIX is an operating system written by programmers for programmers. This means that UNIX is an ideal environment for developing your code and storing your data. Second, in UNIX hundreds of small programs are available to perform simple tasks. Third, these small programs can be strung together efficiently: you do not need to write a “monolithic” program to perform menial tasks; simply string together a sequence of small programs. Fourth, text is the rule: almost anything in UNIX is a file, and a text file at that. This means that all of your projects will be portable to other machines, and can be read and written without the need for sophisticated (and expensive) software (how many times did your .doc or .ppt files not show up properly because of a different software version?). Fifth, it is easy to automate tasks in UNIX: simply write the commands you would type in a text file and run it! Sixth, it is very stable, freely available (Linux) and robust.

There are disadvantages to UNIX too: most of the commands are given through a “shell” (yes, the black screen seen in so many movies about hackers and computer geeks). This means a steeper learning curve. Also, some commercially available software does not run on Linux (but most does run on Mac OSX). Because you're in the driving seat, typing a few wrong commands can literally wipe out your hard disk. Because most of the software is open source, there is a great deal of documentation on-line; if you're off-line, however, it could be difficult to find answers.

Doug McIlroy summarized the philosophy behind UNIX as “Write programs that do one thing and do it well.
Write programs to work together. Write programs to handle text streams, because that is a universal interface.” Basically, UNIX is a catalog of small programs that do one thing each, but that can be strung together to create complex flows. Most of the data is stored in flat text files, as this is the common currency among all programs. Another UNIX developer, Richard Gabriel, posited that in UNIX “worse is better”, meaning that simplicity and portability are preferred over efficiency.

Figure 2.1: A possible UNIX directory structure, showing the “directory tree”. At the top we find the / directory. In the home directory we have the users its and scs. Image taken from www.ee.surrey.ac.uk/Teaching/Unix/unixintro.html.

2.2 Directory structure

The directory structure of UNIX is organized hierarchically in a tree. Because we are all biologists, we can think of it as a phylogenetic tree. The common ancestor of all directories in UNIX is the / (slash) directory. From this node, several important directories branch:

/bin Contains several basic programs.
/etc Contains configuration files.
/dev Contains the files connecting to devices (keyboard, mouse, screen, etc.).
/home Contains the home directory of each user. This is where you will typically be working.
/tmp Temporary files.

Each directory can contain sub-directories and files. When you navigate the system, you are in one directory, and can move deeper in the tree or upward by typing commands. Figure 2.1 illustrates these concepts.
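As a quick illustration of moving through the tree (these commands are safe to run from anywhere, since /tmp exists on every standard UNIX system):

```shell
cd /        # jump to the root of the tree
pwd         # the shell reports /
ls          # list the directories branching from the root
cd /tmp     # move into the directory for temporary files
pwd         # now we are in /tmp
cd ~        # and back to your home directory
```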

2.3 The shell

In UNIX, you typically interface with the kernel (the core of the operating system) through a “shell”: a text command processor (also called a CLI, command line interface). You type commands in the shell, and the shell translates them for the kernel and shows you the results of your operations. We are going to use the bash shell, one of the most popular. To launch a shell, in Ubuntu press Ctrl + Alt + T, or open the dash (hold the Meta key) and type “Terminal”. In Mac OSX, search for “Terminal” in Spotlight. When you launch a shell, it automatically starts in your home directory /home/yourname/, also called ∼. For Mac users, the home directory is /Users/yourname/.

An important trick is the use of the Tab key. When you press Tab in a shell, it will try to complete your command. Using this trick can save you a lot of typing! You can navigate the commands you typed using the up/down arrows. Other useful keyboard shortcuts are:

Ctrl + A  Go to the beginning of the line
Ctrl + E  Go to the end of the line
Ctrl + L  Clear the screen
Ctrl + U  Clear the line before the cursor
Ctrl + K  Clear the line after the cursor
Ctrl + C  Kill whatever you are running
Ctrl + D  Exit the current shell
Alt + F  Move cursor forward one word (in Mac, Esc + F)
Alt + B  Move cursor backward one word (in Mac, Esc + B)

Mastering these and other keyboard shortcuts will save you a lot of time, so write them down and try to use them whenever you move around the shell: eventually, you will remember and automatically use most of them.

2.4 Basic UNIX commands

Here are some of the basic UNIX commands, along with a brief description.

whoami  Display your user name.
pwd  Show the current directory.
ls  List the files in the directory.
cd [NAMEOFDIR]  Change directory. cd .. moves one directory up, cd / goes to the root, cd ∼ to your home directory.
cp [FROM] [TO]  Copy a file.
mv [FROM] [TO]  Move or rename a file.
touch [FILENAME]  Create an empty file.
echo "string to print"  Print a string.
rm [TOREMOVE]  Remove a file.
mkdir [DIRECT]  Create a directory.
rmdir [DIRECT]  Remove an empty directory.
wc [FILENAME]  Count the number of lines, words and bytes in a file.
sort [FILENAME]  Sort the lines of a file and print the result.
uniq  Show only the unique elements of a list.
cat [FILENAME]  Print the file on the screen.
less [FILENAME]  Progressively print a file on the screen (hit q to exit).[1]
head [FILENAME]  Print the first few lines of a file.
tail [FILENAME]  Print the last few lines of a file.
history  Show the last commands you typed.
date  Print the current date.
file [FILENAME]  Determine the type of a file.
man [COMMAND]  Show the help page of a command.
passwd  Change your user password.
chmod [FILENAME]  Change the permissions (who can read/write) of a file.

[1] Funny fact: there is a command called more that does the same thing, but with less flexibility. Clearly, less is more.

Now we are going to type a few examples:

$ cd IntroSciComp2014
$ cd UNIX
$ cd Sandbox
$ pwd
$ ls
$ touch TestFile
$ mv TestFile TestFile2
$ rm TestFile2
$ mkdir TestDirectory
$ touch TestDirectory/Test.txt
$ rmdir TestDirectory
$ rm TestDirectory/Test.txt
$ rmdir TestDirectory

2.5 Command arguments

Most programs in UNIX accept arguments that modify the program's behavior. For example, ls -l (ls “minus” l) lists the files in a longer format. Here are some of the most useful arguments.

cp -r [DIR1] [DIR2]  Copy a directory recursively (i.e., including all the sub-directories and files).
rm -i [FILENAME]  Remove a file, but ask first (for safety).
rm -r [DIR]  Remove a directory recursively (i.e., including all the sub-directories and files).
ls -a  List all files, including hidden ones.
ls -lh  List all files, long format with human-readable sizes (Mb, Gb).
ls -l  List all files, long format.
ls -S  List all files, ordered by size.
ls -t  List all files, ordered by modification time.
ls -1  List all files, one file per line.
mkdir -p Dir1/Dir2/Dir3  Create the directory Dir3, and Dir1 and Dir2 if they do not already exist.
sort -n  Sort all the lines, but use numeric values instead of dictionary order (i.e., 11 follows 2).
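The difference between dictionary and numeric sorting is easy to see on a throw-away file (the name numbers.txt is just for illustration):

```shell
printf "11\n2\n100\n" > numbers.txt
sort numbers.txt      # dictionary order: 100, 11, 2
sort -n numbers.txt   # numeric order: 2, 11, 100
rm numbers.txt
```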

2.6 Redirection and pipes

So far, we have sent the output of each program (e.g., ls) to the screen. However, it is easy to send the output to a file instead (“redirect” it). This can be accomplished using the following operators:

>  Redirect the output of a command to a file on disk. If the file already exists, it will be overwritten.
>>  Append the output of a command to a file on disk. If the file does not exist, it will be created.

Some examples:

$ echo "My first line." > test.txt
$ cat test.txt
$ echo "My second line" >> test.txt
$ ls / >> ListRootDir.txt
$ cat ListRootDir.txt

Moreover, we can concatenate several commands using “pipes” (the symbol |). This is one of the most useful features of UNIX, and can do wonders. For example, say we want to count how many files are in the root directory. We can do so by typing:

$ ls / | wc -l

We will see many examples of pipes throughout the class.
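Pipes can string together more than two programs. For example, combining sort and uniq (both introduced above; the -c flag makes uniq count) tallies how many times each line occurs in a file. A sketch on a throw-away file:

```shell
printf "cat\ndog\ncat\nbird\ncat\n" > animals.txt
sort animals.txt | uniq -c | sort -rn   # counts each animal, most frequent first
rm animals.txt
```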

2.7 Wildcards

Many times, we want to select particular files for manipulation. Instead of listing all the files, we can use wildcards to select several files based on their names. First, let's create a directory and populate it with some files:

$ mkdir TestWild
$ cd TestWild
$ touch File1.txt
$ touch File2.txt
$ touch File3.txt
$ touch File4.txt
$ touch File1.csv
$ touch File2.csv
$ touch File3.csv
$ touch File4.csv
$ touch Anotherfile.csv
$ touch Anotherfile.txt
$ ls
$ ls | wc -l

We are going to use the following wildcards:

?  Any single character, except a leading dot (hidden files).
*  Zero or more characters, except a leading dot (hidden files).
[A-Z]  Define a class of characters (e.g., upper-case letters).

$ ls *
$ ls File*
$ ls *.txt
$ ls File?.txt
$ ls File[1-2].txt
$ ls File[!3].*

2.8 Selecting lines using grep

grep is a command that matches strings in a file. It is based on regular expressions (much more on this later). We will now see some basic usage of grep. To have a test file, we are going to download a list of protected species from a website of the UN.

$ cd ~/IntroSciComp2014/UNIX/Sandbox/
$ wget http://www.cep.unep.org/pubs/legislation/spawannxs.txt
$ head -n 50 spawannxs.txt

In Mac, the wget command is not available by default (it can be installed, though). Thus, I provided the file in the folder Data as well. To copy the file, go to the Sandbox directory and type:

$ cp ../Data/spawannxs.txt .
$ head -n 50 spawannxs.txt

We want to see how many orchids are to be protected. To do so, type:

$ grep Orchidaceae spawannxs.txt
$ grep -c Orchidaceae spawannxs.txt

What about falcons? Typing Falco, we find:

$ grep Falco spawannxs.txt
Falconidae Falco femoralis septentrionalis
Falconidae Falco peregrinus
Falconidae Polyborus plancus
Falconidae Falco columbarius

Using the argument -i we make the search case-insensitive:

$ grep -i Falco spawannxs.txt
Order: FALCONIFORMES
Falconidae Falco femoralis septentrionalis
Falconidae Falco peregrinus
Falconidae Polyborus plancus
Order: FALCONIFORMES
Order: FALCONIFORMES
Order: FALCONIFORMES
Falconidae Falco columbarius

Now let's find the beautiful macaws of the genus “Ara”. We have a problem:

$ grep -i ara spawannxs.txt
Flacourtiaceae Banaras vanderbiltii
Order: CHARADRIIFORMES
Charadriidae Charadrius melodus
Psittacidae Amazona arausica
Psittacidae Ara macao
Dasyproctidae Dasyprocta guamara
Palmae Syagrus (= Rhyticocos) amara
Psittacidae Ara ararauna
Psittacidae Ara chloroptera
Psittacidae Arao manilata
Mustelidae Eira barbara
Order: CHARADRIIFORMES

We can solve the problem by specifying -w so that grep will match only full words:

$ grep -i -w ara spawannxs.txt
Psittacidae Ara macao
Psittacidae Ara ararauna
Psittacidae Ara chloroptera

If we also want to show the line(s) after the matched one, use -A x, where x is the number of lines to display:

$ grep -i -w -A 1 ara spawannxs.txt
Psittacidae Ara macao
--
Psittacidae Ara ararauna
Psittacidae Ara chloroptera
Psittacidae Arao manilata

Similarly, -B shows the lines before:

$ grep -i -w -B 1 ara spawannxs.txt
Psittacidae Amazona vittata
Psittacidae Ara macao
--
Psittacidae Amazona ochrocephala
Psittacidae Ara ararauna
Psittacidae Ara chloroptera

Use -n to show the line number of the match:

$ grep -i -w -n ara spawannxs.txt
216:Psittacidae Ara macao
461:Psittacidae Ara ararauna
462:Psittacidae Ara chloroptera

To print all the lines that do not match a given pattern, use -v.

$ grep -i -w -v ara spawannxs.txt

To match one of several strings, use grep "string1\|string2" file. We can also use grep on multiple files at a time: simply use wildcards to specify the file names.
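A minimal sketch of both tricks, on made-up files (the file names and species are only examples, not part of the spawannxs.txt data):

```shell
printf "Falco peregrinus\nAra macao\nCorvus corax\n" > birds1.txt
printf "Falco columbarius\nTurdus merula\n" > birds2.txt
grep "Falco\|Ara" birds1.txt   # lines containing either string
grep Falco birds?.txt          # searches both files; each match is prefixed by its file name
rm birds1.txt birds2.txt
```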

2.9 Finding files with find

UNIX provides a command-line program for finding files, called, unsurprisingly, find. To test it, we are going to create some files.

$ mkdir TestFind
$ cd TestFind
$ mkdir -p Dir1/Dir11/Dir111
$ mkdir Dir2
$ mkdir Dir3
$ touch Dir1/File1.txt
$ touch Dir1/File1.csv
$ touch Dir1/File1.tex
$ touch Dir2/File2.txt
$ touch Dir2/File2.csv
$ touch Dir2/File2.tex
$ touch Dir1/Dir11/Dir111/File111.txt
$ touch Dir3/File3.txt

Now we can use find to match particular files.

$ find . -name "File1.txt"
./Dir1/File1.txt

Using -iname ignores the case. You can use wildcards:

$ find . -name "*.txt"
./Dir1/File1.txt
./Dir1/Dir11/Dir111/File111.txt
./Dir3/File3.txt
./Dir2/File2.txt

You can limit the depth of the search to exclude deeper sub-directories:

$ find . -maxdepth 2 -name "*.txt"
./Dir1/File1.txt
./Dir3/File3.txt
./Dir2/File2.txt

You can exclude certain files:

$ find . -maxdepth 2 -not -name "*.txt"
.
./Dir1
./Dir1/File1.tex
./Dir1/File1.csv
./Dir1/Dir11
./Dir3
./Dir2
./Dir2/File2.tex
./Dir2/File2.csv

Finding only directories:

$ find . -type d
.
./Dir1
./Dir1/Dir11
./Dir1/Dir11/Dir111
./Dir3
./Dir2

You can use find to create complicated commands. For example, this command uses two pipes, sort and head, to find the 10 largest files:

$ find . -type f -exec ls -s {} \; | sort -n | head -10

2.10 Creating complex flows with xargs

Many times you want to execute a command for each line returned by a program. For example, you have a list of webpages in the file trout-web.txt, and you'd like to download all those with ID = 2xx in the URL. Finding which ones you should download can be easily done with grep:

$ cat ../Data/trout-web.txt
$ grep "ID\=2..$" ../Data/trout-web.txt

The second command, as we will see later, simply selects all the lines terminating in ID=2 followed by any two characters. We've seen that you can download a page with the command wget. However, the command does not accept a list: we would have to type each address separately. xargs solves the problem:

$ grep "ID\=2..$" ../Data/trout-web.txt | xargs wget

xargs is mostly used in combination with ls, find or grep. For example, add the extension .backup to all .tex files:

$ ls *.tex | xargs -I {} mv {} {}.backup

In this case, the flag -I {} states that the input lines should be taken one by one, and substituted wherever {} appears in the command. The first pair of braces thus stands for the actual file name passed by xargs, while the second builds the desired final file name. Remove temporary files (useful when the list is very long):

$ ls *.tmp | xargs rm

Display the first two lines of files matching a pattern:

$ ls | grep "PaTteRn" | xargs head -n 2

Although the use of xargs is quite advanced, many common problems can be easily solved by using this command.
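For instance, here is a sketch (the directory and file names are made up) that combines find and xargs to list every .txt file below a directory containing a given string:

```shell
mkdir -p XargsDemo/Sub
printf "alpha\nbeta\n" > XargsDemo/a.txt
printf "gamma\n" > XargsDemo/Sub/b.txt
find XargsDemo -name "*.txt" | xargs grep -l beta   # -l prints only the names of matching files
rm -r XargsDemo
```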

2.11 One-liners

• Append a string to an existing file:

$ echo "my string" >> myfile

• Make a copy of a file. These commands do the same thing:

$ cp myfile myfile_backup
$ cp myfile{,_backup}

• Generate the alphabet:

$ echo {a..z}
# without spaces
$ printf "%c" {a..z}
# upper case
$ echo {A..Z}

• All integers in a range:

$ echo {1..100}
$ seq 1 100

• Copy the same string several times:

$ echo mystring{,,,,}

• Print files in the current directory and subdirectories in order of size:

$ du -a ./ | sort -n -r

• Erase the command history:

$ rm ~/.bash_history

• Convert a number to any base:

$ bc <<< 'obase = 2; 256'
$ bc <<< 'obase = 3; 256'
$ bc <<< 'obase = 16; 256'
$ bc <<< 'obase = 16; 255'

• Rename all file names to lower case:

$ for i in `ls -1`; do mv $i "${i,,}"; done

• Extract and sort unique lines from several files:

$ uniq -u <(sort file1 file2 file3)

• Top 20 commands by use (!!!):

$ history | sed 's/^ \+//;s/ \+/ /' | cut -d' ' -f2- | awk '{ count[$0]++ } END { for (i in count) print count[i], i }' | sort -rn | head -20

2.12 Installing software

As a user, you can install software in your home directory. However, only the computer administrator (typically, you) can install software such that all users can see it. In the old days, this meant logging in as root (the administrator) and doing whatever was needed. In Ubuntu, it is sufficient to add sudo (Super User DO) in front of a command, and it will be run as if it were coming from root.

2.13 Exercises

2.13.1 What does this do?

1. ls [A-Z]*
2. cp *.sh /tmp
3. man pwd
4. cd ∼
5. mkdir -p /test1/test2/test3
6. cp -r /tmp .
7. ls | less
8. head -n 24 test.txt >> abc.txt
9. echo "aaa" > aaa.txt
10. ps -u sallesina
11. top
12. ls | grep .sh | xargs mv /tmp
13. du -sk /home/* | sort -r -n | head -10
14. sort -r names.txt

2.13.2 FASTA

In the directory UNIX/Data/fasta you find some FASTA files. These files have a header starting with >, followed by the name of the sequence and other metadata. Starting from the second line, we have the sequence data. For example, the first few lines of the file 407228326.fasta are:

$ head -n3 407228326.fasta
>gi|407228326|dbj|AB690564.1| Homo sapiens x Mus musculus hybrid cell line ...
AAAAAAACAAAAAGATACATATATATGATATATCTGATATATGATATATATATGATATATCTGATATATG
ATATATATATGATATATCTGATATATGATATATATATGATATATCTGATATATGATATATATATGATATA

First, we are going to count how many lines are in each file:

$ wc -l *.fasta
    28 407228326.fasta
   127 407228412.fasta
 78104 E.coli.fasta
 78259 total

Thus, the E. coli genome is much larger than the other files. What if we want to have just the sequence? We need to skip the first line. To do so, we use the command tail -n+2, which prints everything starting from the second line.

$ tail -n+2 407228326.fasta
AAAAAAACAAAAAGATACATATATATGATATATCTGATATATGATATATATATGATATATCTGATATATG
ATATATATATGATATATCTGATATATGATATATATATGATATATCTGATATATGATATATATATGATATA
TCTGATATATATACATATGGTATACATGAGATACATCATATGTATATATGGTATACATGAGATACATCAT
ATGTATATATGGTATACATGAGATACATCATATGTATATATGGTATACATGAGATACATCATATGTATAC
ATGAGATACATCATATGTATACATGAGATACATCATATGTATACATGAGATACATCATATGTATACATGA
GATACATCATATGTATACATGAGATACATCATATGTATACATGAGATACATCATATGTATACATGAGATA
CATCATATGTATACATGACATACATCATATGTATACATGACATACATCATATGTATACATGACATACATA
TGTATATATGATACATATATATCATGTATATATGATACATATGTACATATATATGTATATATGATACATA
TGTATCATATATATCAGGTATATATGATACATATGTATCATATATATCAGGTATATATGATACATATGTA
TCATATATATCAGGTATATATGATACATATGTATCATATATATCAGGTATATATGATACATATGTATCAT
ATATATCAGGTATATATGATACATATGTACATATATATCAGGTATATATGTACATATGTACATATATCAG
GTATATATGTACATATGTACATATATATCAGGTATATATGTACATATGTACATATATATCAGGTATMTAT
GTACATATGWTCATATATATCAGGTATATATGTACATATGTTCATATATATCATGTATATATGAACATAT
GTACGTATATATCATGTATATATGATACATATATACGTATATATCATGTATATATGATACATGTATACGT
ATATATCATGTATATATGATACATATATACGTATATCATGTATATATGACGTACGTCATATATATCATGT
GTATATATGATGTACGTCATATATATCAGATATCATATATACACATATATCCTATATGTGTATATATGAT
ATATGTATATATGATCTATATACATATATATCATATGTTCTATGATATATATGATATATATCATATCTGA
TATATATGATATATATCATATCTGATATATATGATATATATCATATCTGATATATATGATATATATCATA
TCTGATATATATGATATATATCATATCTGATATATATGATATATATCATATCTGATATATGATATATATC
ATATCTGATATATGATATATATCATATCTGATATATATGATATATATCATATCTGATATATATGATATAT
ATCATATCTGATATATATGATATATATCATATCTGATATATATGATATATATCATATCTGATATATATGA
TATATATCATATCTGATATATATGATATATATCATATCTGATATATATGATATATATCATATCTGATATA
TATGATATATATCATATCATATATGATATATGATATATATCATATCATATATGATATATGATATATATCA
TATCATATATGATATATGATATATATGATATATCATATCATATATGATATATGATATATATGATATATCA
TATCATATAATATATGATATGATATGATATGATATATATGATATATCATATCATATAATATATGATATAT
ATGATATATATCATATATCATATATATAATATATGATATATATGATATATATCATATATCATATCATATA
TCTTTTTTTTTTTTTT

Now we would like to count the sequence length. To do so, we count the number of characters in the file (skipping the first line) once we have removed the character “newline” that divides the lines. To remove the newline, we use the command tr -d, which deletes a given character.

$ tail -n+2 407228326.fasta | tr -d "\n"
AAAAAAACAAAAAGATACATATATATGATATATCTGATATATGATAT....

Finally, using the command wc -c we return the number of characters:

$ tail -n+2 407228326.fasta | tr -d "\n" | wc -c
1836

$ tail -n+2 407228412.fasta | tr -d "\n" | wc -c
8849

$ tail -n+2 E.coli.fasta | tr -d "\n" | wc -c
4686137
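The same pipeline can be wrapped in a small shell function for reuse; seqlen is a name of my own, and the file below is a tiny made-up FASTA file to try it on:

```shell
# seqlen: count the sequence characters in a FASTA file (hypothetical helper name)
seqlen() { tail -n+2 "$1" | tr -d "\n" | wc -c; }

# a toy two-line sequence: 4 + 3 = 7 characters
printf ">toy sequence\nACGT\nACG\n" > /tmp/toy.fasta
seqlen /tmp/toy.fasta
```

Once defined, the function can be applied to any of the FASTA files in the exercise directory.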

Now we want to count the matches of a particular sequence, “ATGC”, in the genome of E. coli. We start by removing the first line and the newline characters. Then, we use the command grep to print out all occurrences of the sequence. Finally, we count the number of occurrences.

# This shows the behavior of grep -o

$ tail -n+2 407228326.fasta | tr -d "\n" | grep -o "ATGTACATA"
ATGTACATA
ATGTACATA
ATGTACATA
ATGTACATA
ATGTACATA
ATGTACATA
ATGTACATA

$ tail -n+2 E.coli.fasta | tr -d "\n" | grep -o "ATGC" | wc -l
21968

Similarly, if we want to compute the AT/GC ratio, we need the following:

$ tail -n+2 E.coli.fasta | tr -d "\n" | grep -o "[AT]" | wc -l
2306454
$ tail -n+2 E.coli.fasta | tr -d "\n" | grep -o "[GC]" | wc -l
2379681
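The two counts can be combined into a single ratio using bc; here is a sketch on a tiny made-up file (the real E.coli.fasta would work the same way):

```shell
# toy FASTA file: 4 A/T characters and 2 G/C characters
printf ">toy\nAATT\nGC\n" > /tmp/ratio.fasta
AT=$(tail -n+2 /tmp/ratio.fasta | tr -d "\n" | grep -o "[AT]" | wc -l)
GC=$(tail -n+2 /tmp/ratio.fasta | tr -d "\n" | grep -o "[GC]" | wc -l)
echo "scale=3; $AT / $GC" | bc   # the AT/GC ratio, with three decimals
```

Storing the counts in shell variables avoids running the long pipeline twice by hand.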

2.13.3 Darwin

A copy of the book “On the Origin of Species” by Charles Darwin, taken from Project Gutenberg (http://www.gutenberg.org/files/1228/old/otoos10.txt), is in your UNIX/Data directory.

• Remove the first 264 lines and the lines from 13900 onward: these lines contain Project Gutenberg data and the subject index. You can accomplish this by redirecting the output of the commands head and tail. Make sure to read the documentation for these commands (man head). Save the results in the file darwin.txt.
• Count the occurrences in darwin.txt of the terms “evolution” or “evolved”, appearing as full words and ignoring case.
• Count the occurrences of “natural selection” and “Wallace”.

Save all the commands, in the right order, in a text file.

2.13.4 Temperatures

In the directory Data/Temperatures you find the daily maximum and minimum temperatures recorded by several stations in 1800-1803.

• Write a one-line command that extracts the maximum temperatures of each station for Jan 1st from all the files. The result should look like this:

1800.csv:EZE00100082,18000101,TMAX,-86,,,E,
1800.csv:ITE00100554,18000101,TMAX,-75,,,E,
1801.csv:EZE00100082,18010101,TMAX,11,,,E,
1801.csv:ITE00100554,18010101,TMAX,84,,,E,
1802.csv:EZE00100082,18020101,TMAX,-20,,,E,
1802.csv:ITE00100554,18020101,TMAX,25,,,E,
1803.csv:EZE00100082,18030101,TMAX,32,,,E,
1803.csv:ITE00100554,18030101,TMAX,38,,,E,

To write the command, you must use “wildcards” in grep. For example, 180[0-3] matches 1800, 1801, 1802 and 1803.

• Write a one-line command that finds the warmest temperature for a given day of 1800. The command should return 79 for February 5th and 225 for May 5th. To write the command, you will concatenate grep (to extract the max temperatures for each station on the day of interest), cut (to extract the right column from the result of grep), sort (to sort the results in ascending order), and tail (to return only the maximum temperature). It’s going to be quite a long string of commands! Check the documentation for cut online or using man cut.
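As a quick illustration of how the bracket expression 180[0-3] behaves, on made-up input:

```shell
# [0-3] matches a single character in the range 0-3,
# so 1899.csv is not matched while the first three lines are.
printf "1800.csv\n1801.csv\n1803.csv\n1899.csv\n" | grep "180[0-3]"
```

The same pattern works identically when grep scans file contents instead of a pipe.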

2.14 References and readings

There are very many UNIX tutorials out there. I really like the lectures you can find at http://software-carpentry.org/v4/shell/. Either watch the video tutorials or read the pdfs of the presentations. An impressive list of 687 UNIX commands can be found here (along with their man pages): http://www.linuxdevcenter.com/cmd. The library provides you with access to several e-books on UNIX, some specific to MacOSX or Ubuntu, and some more general. Browse the lens.lib.uchicago.edu website and find a good introductory book.

3 — Version control

http://xkcd.com/149/

3.1 What is version control?

A version control software records all the changes you make to a set of files and directories, so that you can access any previous version of the files. In this way, for each project you know who changed what, when and (possibly) why.

Because we are scientists, I will talk about research projects. Typically, you want to have a Projects directory in your home directory. Each directory inside the Projects directory is a research project, and is also a separate repository. In this way, it is easy to collaborate on a specific project without the need to search for the files to share: the whole project is contained in the directory (data, analysis, results and manuscripts). Also, this makes it very simple to back up important files. In my laboratory, the typical project directory contains the following sub-directories:

Data     The data required for the project.
Code     Code for the project.
PDFs     Articles, manuals, etc.
Figures  Scripts and pdfs of all figures.
Paper    Manuscript(s).

Each directory contains other directories. For example, the Paper directory may contain a directory Nature, a directory Science, and a directory PNAS: one directory for each submission. In this way, you can access all the different versions of the manuscript, and this can greatly speed up resubmissions (which are unfortunately quite common in my laboratory). In the same way, the Figures directory contains a directory for each figure.

Keeping projects organized is very important. In fact, there will be a time when you will have to juggle many projects at the same time, and knowing where to find each piece of a project will come in extremely handy. The main advantage of a version control system is that you can access all versions of the data, code, manuscript, etc. Hence, you can roll back to a previous version when you discover a bug, find out when the bug was introduced, etc. Moreover, you can back up all the versions very easily.

3.2 git

In this class, we are going to use git, which is probably the most popular version control system available today. Other possibilities are Concurrent Versions System (cvs, quite old-fashioned) and Subversion (svn, a very nice centralized system). While svn and cvs typically require you to set up a web server, git does not. Also, the sites github.com and bitbucket.org will host your repositories for free. Another popular choice (very similar to git) is Mercurial.

Note: the website bitbucket.org will give you free private repositories for all your projects if you register with an academic email. That’s good news, as you would otherwise have to pay good money for this service.

git was developed by Linus Torvalds, the “Linus” in Linux. In git, each user stores a complete local copy of the project, including the history and all versions. This means that you do not rely as much on a centralized server, as each person working on a project has a complete copy of the whole history of the project. We will explore the use of git for managing personal projects. This is not the most common use of a version control system, which is designed for large-scale collaborations, but it is going to be useful to you even though we are going to explore just a few of its features.

3.3 Installing git

If you followed the instructions in Chapter 1, you should already have git installed on your machine. If not, in Ubuntu type this in a terminal:

$ sudo apt-get install git

In Mac OS X:

http://code.google.com/p/git-osx-installer or:

http://git-scm.com/downloads

I would also strongly suggest installing gitk, which is a nice visual interface for examining the status of a repository.

$ sudo apt-get install gitk

3.4 The three states

Each file in git can reside in one of three states: committed, modified and staged. A committed file resides in the repository, and thus is “saved” in your local repository. A modified file is a file you have changed but have not committed to the repository. Staged files are in a limbo: they have been marked to be committed, but they haven’t been committed yet. All the files, metadata and versions are stored in the .git directory. The Working Directory contains a version of the project: the files are extracted from the .git directory for you to modify. The Staging Area is a file containing information on what you marked to commit. The typical work flow is:

1. Create the repository, add the files.
2. Modify the files.
3. Stage the files, adding a snapshot to the Staging Area.
4. Commit the changes, moving the files from the Staging Area to the repository.
5. Go back to 2.

The motto of git is “commit early, commit often”. Basically, each time you add a (possibly working) functionality to your project, or remove chunks of code, you want to commit the changes. Ideally, commits should be a “logical unit” that you can describe in a few words (e.g., “rewritten the code to use MCMC”). In fact, choosing a good moment to commit changes is quite important. Say, for example, that your code has two bugs. Then, you should have two separate commits, one for each fixed bug. In this way, you can access the version without bugs, the one with one bug fixed, and the original. You should do this because sometimes fixing a bug creates another bug, and thus it’s important to be able to access each bug fix independently.
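The “one commit per logical unit” idea can be sketched in a throwaway repository (this assumes git is installed; the repository, file and commit names are all made up):

```shell
set -e
repo=$(mktemp -d)            # throwaway repository in a temporary directory
cd "$repo"
git init -q .
echo "fix for bug 1" > analysis.txt
git add analysis.txt
git -c user.name=Me -c user.email=me@example.org commit -q -m "Fixed bug 1"
echo "fix for bug 2" >> analysis.txt
git -c user.name=Me -c user.email=me@example.org commit -q -am "Fixed bug 2"
git log --oneline            # two commits, one per bug fix
```

Because each fix is its own commit, you can later check out the version with only the first fix applied.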

3.5 First-time setup The first time you use git, you should do the following to customize your environment.

$ git config --global user.name "Stefano Allesina"
$ git config --global user.email "[email protected]"

If you want to set up an editor to be used in git, type:

$ git config --global core.editor emacs

or whatever editor you want to use. Now check that the changes went through:

$ git config --list

3.6 Your first repository We are going to create a new repository to test basic commands.

$ cd /tmp/
$ mkdir GitProjects
$ cd GitProjects
$ mkdir MyFirstRepo
$ cd MyFirstRepo
$ git init
Initialized empty Git repository in /tmp/GitProjects/MyFirstRepo/.git/

That’s it. Now you have your first repository. We can thus start modifying the repository. First, we create a file README.txt containing a brief description of the project. You can create the file using your text editor or simply echo.

$ echo "First repository for git" > README.txt

$ ls -all
total 16
drwxrwxr-x 3 sallesina sallesina 4096 2012-06-23 16:43 .
drwxrwxr-x 3 sallesina sallesina 4096 2012-06-23 16:39 ..
drwxrwxr-x 7 sallesina sallesina 4096 2012-06-23 16:44 .git
-rw-rw-r-- 1 sallesina sallesina 25 2012-06-23 16:43 README.txt

We have created the file, but we haven’t added it to the repository yet. To do so, type:

$ git add README.txt

$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#   new file: README.txt
#

Note that on Mac filenames are case-insensitive, while in Linux they are case-sensitive. The README file has been staged. We can now commit the file to the repository:

$ git commit -am "Added README file."
[master (root-commit) bdaaeb4] Added README file.
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 README.txt

As you can see, each commit reports which operations have been performed, as well as the message associated with the commit. Basically, once you add a file you make it “tracked”: any modification will be recorded to be committed the next time you commit the changes to the repository. A file is not tracked until it has been added. To see what I mean, modify the file README.txt:

$ echo "Another line of text" >> README.txt

$ cat README.txt First repository for git Another line of text

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified: README.txt
#
no changes added to commit (use "git add" and/or "git commit -a")

3.7 Files to ignore

Typically, you have some files you don’t want to track. These might be log files, temporary files, or executables for the code. You can ignore entire classes of files by adding them to a special file called .gitignore in the main directory of your repository:

$ echo -e "*~ \n*.tmp" > .gitignore

$ cat .gitignore
*~
*.tmp

$ git add .gitignore

$ touch temporary.tmp

$ git add *
The following paths are ignored by one of your .gitignore files:
temporary.tmp
Use -f if you really want to add them.
fatal: no files added

3.8 Removing a file Sometimes we want to remove a file. To do so, we use the command git rm. For example,

$ echo "Text in a file to remove" > FileToRem.txt

$ git add FileToRem.txt

$ git commit -am "added a new file that we'll remove later"
[master 5df9e96] added a new file that we'll remove later
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 FileToRem.txt

$ git rm FileToRem.txt
rm 'FileToRem.txt'

$ git commit -am "removed the file"
[master b9f0b1a] removed the file
 1 files changed, 0 insertions(+), 1 deletions(-)
 delete mode 100644 FileToRem.txt

3.9 Accessing the history of the repository

You often want to see when a particular change was introduced. You can do so by reading the log of the repository:

$ git log
commit 08b5c1c78c8181d4606d37594681fdcfca3149ec
Author: Stefano Allesina
Date:   Wed Sep 5 16:41:51 2012 -0500

    removed the file

commit 13f701775bce71998abe4dd1c48a4df8ed76c08b
Author: Stefano Allesina
Date:   Wed Sep 5 16:41:16 2012 -0500

    added a new file that we'll remove later

commit a228dd3d5b1921ef18c5efd926ef11ca47306ed5
Author: Stefano Allesina
Date:   Wed Sep 5 10:03:40 2012 -0500

    Added README file

For a much more detailed version, add -p at the end.

3.10 Reverting to a previous version

Things can go horribly wrong with the new modifications, and you may not want to commit them, but rather return to the pristine state before the changes. You can do so by:

$ git reset --hard
$ git commit -am "returned to previous state"

If instead you want to move back in time (temporarily), first find the “hash” for the commit you want to revert to, and then check-out the version:

$ git status
# On branch master
nothing to commit (working directory clean)

$ git log
commit c797824c9acbc59767a3931473aa3c53b6834aae
Author: Stefano Allesina
Date:   Wed Aug 22 16:59:02 2012 -0500

    modified the file

commit 8e99f2784fd0f90d21246685c36871ff2c65163e
Author: Stefano Allesina
Date:   Wed Aug 22 16:40:19 2012 -0500

    Added a file

$ git checkout 8e99f2
Note: checking out '8e99f2'.

You are in ’detached HEAD’ state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

HEAD is now at 8e99f27... Added a file

Now you can look around. However, if you commit changes you create some sort of “alternate reality” (i.e., a branch). To go back to the future, type git checkout master.

3.11 Branching

Say you want to try something out, but you’re not sure it will work well. For example, you want to rewrite the Introduction of your paper using a different angle, or you want to see whether switching to a library for a piece of code improves the speed. You don’t want to ruin your current document/code/data, though. What you need is branching. Branching creates an “alternate reality” in which you can experiment. If you like what you see, you can “merge” the branch into the project. If not, you can delete the branch and forget about it. To branch, type:

$ git branch anexperiment

$ git branch
  anexperiment
* master

$ git checkout anexperiment
Switched to branch 'anexperiment'

$ git branch
* anexperiment
  master

$ echo "Do I like this better?" >> README.txt 34 Version control

$ git commit -am "Testing experimental branch" [anexperiment 9f17dc1] Testing experimental branch 1 files changed, 2 insertions(+), 0 deletions(-) Now that you have your experimental branch, and that you did what you had to do, you can decide to merge the branch, by typing:

$ git checkout master

$ git merge anexperiment
Updating 08b5c1c..9f17dc1
Fast-forward
 README.txt | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

$ cat README.txt
First repository for git
Another line of text
Do I like this better?

If there are no conflicts (i.e., no files that you changed were also changed in master in the meantime), you are done, and you can delete the branch:

$ git branch -d anexperiment
Deleted branch anexperiment (was 9f17dc1).

If instead you are not satisfied with the result, and you want to abandon the branch:

$ git branch -D anexperiment

When you want to test something out, always branch! (Reverting the changes, especially in the code, is quite painful.)
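What a conflict looks like can be seen in a throwaway repository (a sketch, assuming git is installed; all names are made up): both branches change the same line, and the merge stops so that you can choose which version to keep.

```shell
set -e
repo=$(mktemp -d); cd "$repo"         # throwaway repository
git init -q .
c() { git -c user.name=Me -c user.email=me@example.org commit -q "$@"; }
echo "original line" > draft.txt
git add draft.txt
c -m "start"
main=$(git symbolic-ref --short HEAD)  # default branch name varies by git version
git checkout -q -b anexperiment
echo "experimental version" > draft.txt
c -am "edit on the branch"
git checkout -q "$main"
echo "main version" > draft.txt
c -am "edit on the main branch"
git merge anexperiment || true         # both branches changed the same line: conflict
grep "<<<<<<<" draft.txt               # git has marked the conflicting region
```

After a conflicted merge, the file contains both versions between conflict markers; edit it (or run git mergetool), then add and commit to conclude the merge.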

3.12 Remote repositories

We have seen how to deal with local repositories and only one user. Typically, repositories are hosted on a server, and each collaborator checks out a copy, modifies it and uploads the changes to the server. git will promptly report any conflicts (i.e., two users modifying the same file in different ways) so that you can decide what to do and resolve the conflict. If you have a server available, then it is quite easy to set up git for the whole laboratory. If you don’t, the websites github and bitbucket will host your repositories for free or for a fee (read the restrictions carefully).

3.13 Trying it out

Your introscicomp2014 directory is a git repository. For your homework and exercises, try to commit changes, branch and test out the functionality of git. If using repositories becomes a habit, your scientific productivity will be greatly improved. I know it seems a bit weird at first, but I have to say I switched to version control in 2009 and I don’t know how I managed before.

3.14 Summary of basic commands

3.14.1 Creating repositories

Creating a new repository for a given project:

$ cd NameOfTheDirectory
$ git init .

Cloning an existing repository:

$ git clone AddressOfTheRepository

3.14.2 Changes

Which files have changed?

$ git status

Stage all current changes for the next commit:

$ git add .

Commit all changes in tracked files:

$ git commit -am "A descriptive message"

3.14.3 History

Show all commits:

$ git log

Show all changes to a specific file:

$ git log -p MyFile

Who changed the file:

$ git blame -p MyFile

3.14.4 Branches

Show all existing branches:

$ git branch

N.B.: the asterisk marks which branch you’re currently working on (i.e., the current HEAD). Start working on a branch:

$ git checkout OneOfTheBranches

Create a branch:

$ git branch NameOfNewBranch

N.B.: The new branch will be rooted in the current HEAD, i.e., the branch you’re working on when you create the branch. Delete a branch:

$ git branch -d BranchToDelete

Merge a branch into the current HEAD:

$ git merge NameOfTheBranchToMerge

Manage conflicts:

$ git mergetool

See all changes, branches and logs using a visual interface:

$ gitk

3.14.5 Remote repositories

Download all changes in the remote repository and merge them into the current HEAD:

$ git pull

Send local changes to remote repository:

$ git push

3.14.6 Discarding changes

Discard all local changes:

$ git reset --hard HEAD

Discard changes to a specific file:

$ git checkout HEAD NameOfTheFile

Revert a commit and produce new commit with opposite changes:

$ git revert NameOfCommit

3.15 References and readings

The website git-scm.com/book hosts a wonderful book on git. There are several tutorials on the internet, some of which are very well done. I really like the approach taken by www.sbf5.com/~cduan/technical/git/. Other good resources:

https://www.atlassian.com/git/tutorial
https://confluence.atlassian.com/display/BITBUCKET/Bitbucket+101

maktoons.blogspot.com/2009/06/if-dont-use-version-control-system.html

http://xkcd.com/1296/

4 — Basic programming

4.1 Programming in science

As scientists, we are often attempting new analyses that have never been carried out before. As such, no software is typically available to perform exactly the analysis we are planning. Knowing how to program eases this problem, as we can write our own software.

Several hundred programming languages are currently available (more than 2,000 historically!). Which ones should a biologist choose? Note the use of the plural: as for natural languages, knowing more than one language brings great advantages. In fact, each programming language has been developed with a certain intended audience and particular tasks in mind. For example, COBOL, a very old language, is still used in business applications, while FORTRAN, one of the very first programming languages, is a popular choice for writing numerical libraries for scientific analysis. On the other hand, programming an individual-based model in FORTRAN or integrating differential equations in COBOL is like trying to empty Lake Michigan with a spoon. The debate on which programming language is better for what application is long and very heated. For example, in 1975 E.W. Dijkstra (a great computer scientist, who invented very clever algorithms) stated that “The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence.” My suggestion is to learn at least three languages:

• A fast, compiled (or semi-compiled) procedural language. Procedural means that the main building blocks of each program are functions (or procedures), taking some input and returning some output. My preference is for C, especially because fantastic libraries for scientific computing are available (e.g., the GNU Scientific Library). Other possibilities are FORTRAN, Go (a new programming language launched by Google), etc. If you need objects, use Objective-C, Java, or C++. You can use this first language to do the heavy lifting for your analysis (i.e., programs that require millions or billions of operations). In this class, we are not going to deal with compiled languages.

• A modern, easy-to-write, interpreted (or semi-compiled) language. My choice is python, as many packages are available, including many targeting the development of scientific applications. python emphasizes readability, so that it is easy to write and to read the code (esp. compared to C). Other options are Perl and Ruby. You can use this second programming language for prototyping, double-checking the results, data manipulation, and quick programming tasks. In this chapter we are going to see some of the basic functionalities of python.

• A mathematical/statistical software with programming capabilities, like R, Mathematica or MATLAB. A new language is Julia, which aims at combining the simplicity of R with the speed of C. You can use this software to perform mathematical and statistical analysis, data manipulation, and to draw figures for your papers. In this class we are going to use R as it is quite popular among ecologists and evolutionary biologists.

The website shootout.alioth.debian.org shows that if a program takes about 1 second to run in FORTRAN, it takes about 1.3s in C and 49.3s in python. Does that mean that everybody should just use FORTRAN? No, unless they need to repeat the same operations many, many times. Otherwise, always consider the total time elapsed before the results are ready: programming + debugging + running. The total time could be much shorter in python.

In this and the following chapters, I will introduce the basic syntax of python along with simple examples. I chose python because a) it’s easy to teach how to program using python; b) python is a very well-thought-out language, with great string manipulation capabilities (e.g., for bioinformatics), web interfaces (download data from the internet), and useful types (e.g., sets, dictionaries); c) it is freely available and well documented; and d) it is easy to adopt a good programming style while writing python code.

4.2 Installing python To install python and ipython in Ubuntu:

$ sudo apt-get install ipython python-scipy python-matplotlib

Installation on Mac can be tricky: XCode ships its own version of python, but this could lead to problems when trying to install ipython and scipy. Please install everything from the official pages:

Python (choose version 2.7.3): http://www.python.org/download/

Scipy: http://www.scipy.org/Installing_SciPy/Mac_OS_X

IPython: http://ipython.org/download.html (warning: install also readline)

To test that everything is working, in a terminal type:

$ ipython
In [1]: import scipy

Check that there are no error messages, and then press CTRL-D (twice) to exit.

4.3 Basic python First, we are going to use python in “interactive mode” (i.e., as a shell). To do so, open a terminal and type python. Even better, if you installed it, type ipython (an enhanced shell for python). For example, type:

In [1]: 2 + 2
Out[1]: 4

In [2]: 2 * 2
Out[2]: 4

In [3]: 2 / 2
Out[3]: 1

In [4]: 2 / 2.0
Out[4]: 1.0

In [5]: 2 / 2.
Out[5]: 1.0

In [6]: 2 > 3
Out[6]: False

In [7]: 2 >= 2
Out[7]: True

In [8]: 2 == 2
Out[8]: True

In [9]: 2 != 2
Out[9]: False

In [10]: 'hi ' + 'how are you?'
Out[10]: 'hi how are you?'

In [11]: "I'm fine, " + 'thank you'
Out[11]: "I'm fine, thank you"

In [12]: """Arnold's 6'2" tall."""
Out[12]: 'Arnold\'s 6\'2" tall.'

As we just saw, python has different types of variables, for example integers, floats (real numbers) and strings. In python, the following operators are defined:

+        Addition
-        Subtraction
*        Multiplication
/        Division
**       Power
%        Modulo
//       Integer division
==       Equals
!=       Differs
>        Greater
>=       Greater or equal
&, and   Logical and
|, or    Logical or
!, not   Logical not

In any programming language, you typically manipulate variables. A variable is a box that can contain a value.

In [12]: x = 5

In [13]: x
Out[13]: 5

In [14]: x + 3
Out[14]: 8

In [15]: y = 8

In [16]: x + y
Out[16]: 13

In [17]: x = 'My string'

In [18]: x + ' is now longer'
Out[18]: 'My string is now longer'

In [19]: x + y
TypeError: cannot concatenate 'str' and 'int' objects

If sensible, we can convert a type into another:

In [20]: x
Out[20]: 'My string'

In [21]: y
Out[21]: 8

In [22]: x + str(y)
Out[22]: 'My string8'

In [23]: z = '88'

In [24]: x + z
Out[24]: 'My string88'

In [25]: y + int(z)
Out[25]: 96

In python, the type of a variable is determined while the program is running (dynamic typing), while this is not true for, say, C or FORTRAN. You can initialize multiple variables at once:

In [1]: x, y, z = 3, 16, "abc"

In [2]: x
Out[2]: 3

In [3]: y
Out[3]: 16

In [4]: z
Out[4]: 'abc'

python provides several types of variables that can contain multiple values (or even variables). A list is simply a collection of ordered values:

# Anything starting with # is a comment

In [26]: MyList = [3, 2.44, 'green', True]

In [27]: MyList[1]
Out[27]: 2.44

In [28]: MyList[0]  # NOTE: FIRST ELEMENT -> 0
Out[28]: 3

In [29]: MyList[3]
Out[29]: True

In [30]: MyList[4]
IndexError: list index out of range

In [31]: MyList[2] = 'blue'

In [32]: MyList
Out[32]: [3, 2.44, 'blue', True]

In [33]: MyList[0] = 'blue'

In [34]: MyList
Out[34]: ['blue', 2.44, 'blue', True]

In [35]: MyList.append('a new item')

In [36]: MyList
Out[36]: ['blue', 2.44, 'blue', True, 'a new item']

In [37]: MyList.sort()

In [38]: MyList
Out[38]: [True, 2.44, 'a new item', 'blue', 'blue']

In [39]: MyList.count('blue')
Out[39]: 2

This example shows that python is an object-oriented language: each list “inherits” some methods (the functions you call, for example MyList.count()) from an “ideal” list object. You can decide to use python as an object-oriented language (i.e., write objects) or as a procedural language (i.e., write functions) or as a mixture of the two. This flexibility is very convenient, as python can adapt to your project and not the other way around. An interesting type in python is the dictionary: a set of values indexed by keys (in a vector, they would be indexed by numbers). Dictionaries are very useful when the variables do not have a natural order.

In [1]: GenomeSize = {'Homo sapiens': 3200.0, 'Escherichia coli': 4.6, 'Arabidopsis thaliana': 157.0}

In [2]: GenomeSize
Out[2]:
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3200.0}

In [3]: GenomeSize['Arabidopsis thaliana']
Out[3]: 157.0

In [4]: GenomeSize['Saccharomyces cerevisiae'] = 12.1

In [5]: GenomeSize
Out[5]:
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3200.0,
 'Saccharomyces cerevisiae': 12.1}

# ALREADY IN DICTIONARY!
In [6]: GenomeSize['Escherichia coli'] = 4.6

In [7]: GenomeSize
Out[7]:
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3200.0,
 'Saccharomyces cerevisiae': 12.1}

In [8]: GenomeSize['Homo sapiens'] = 3201.1

In [9]: GenomeSize
Out[9]:
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3201.1,
 'Saccharomyces cerevisiae': 12.1}

Note that if we assign a value to a new key, the key is added to the dictionary, while if we assign to a key that is already present, its value is simply overwritten (assigning the same value, as for Escherichia coli above, leaves the dictionary unchanged). This makes dictionaries very useful when we are reading data that can contain duplicates. Another interesting type is the tuple: an immutable sequence of values. For example, the connections in a food web could be stored as a list of tuples. Suppose we have three species, a, b and c: a is eaten by b and c, b is consumed by c, and c is cannibalistic. We can represent each feeding relation as a 2-tuple; for example, ('a', 'b') means that b eats a.
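Because assigning to an existing key simply updates its value, dictionaries are handy for tallying records that may contain duplicates. A minimal sketch (the species list is made up for the example):

```python
# tally possibly-duplicated records with a dictionary
records = ["Homo sapiens", "Escherichia coli", "Homo sapiens"]
counts = {}
for sp in records:
    if sp in counts:
        counts[sp] = counts[sp] + 1  # key already present: update its value
    else:
        counts[sp] = 1  # new key: added to the dictionary
assert counts == {"Homo sapiens": 2, "Escherichia coli": 1}
```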

In [12]: FoodWeb = [('a','b'), ('a','c'), ('b','c'), ('c','c')]

In [13]: FoodWeb[0]
Out[13]: ('a', 'b')

In [14]: FoodWeb[0][0]
Out[14]: 'a'

# Note that tuples are "immutable"
In [15]: FoodWeb[0][0] = "bbb"
TypeError: 'tuple' object does not support item assignment

# but we can replace a whole element of the (mutable) list
In [16]: FoodWeb[0] = ("bbb", "ccc")

In [17]: FoodWeb[0]
Out[17]: ('bbb', 'ccc')

In this case, a list of tuples is more appropriate than a list of lists: a list implicitly means that the order of its elements does not matter (and in fact, we can sort the list, etc.), while within a tuple the order matters (here, prey before predator). If you are dealing with sets, you can transform a list into a set: a collection with no duplicate elements, supporting union, intersection and difference operations.

In [1]: a = [5, 6, 7, 7, 7, 8, 9, 9]

In [2]: b = set(a)

In [3]: b
Out[3]: set([8, 9, 5, 6, 7])

In [4]: c = set([3, 4, 5, 6])

In [5]: b & c
Out[5]: set([5, 6])

In [6]: b | c
Out[6]: set([3, 4, 5, 6, 7, 8, 9])

In [7]: b ^ c
Out[7]: set([3, 4, 7, 8, 9])

The operations are: union | (or); intersection & (and); symmetric difference ^ (elements in b but not in c, and in c but not in b); there is also the difference b - c (elements in b but not in c). To summarize: if your elements are indexed by numbers, use lists; if they are indexed by keys, use dictionaries; if they are fixed, ordered sequences, use tuples; and if they are sets, use sets. Try concatenating objects using +:

In [1]: a = [1, 2, 3]

In [2]: b = [4, 5]

In [3]: a + b
Out[3]: [1, 2, 3, 4, 5]

In [4]: a = (1, 2)

In [5]: b = (4, 6)

In [6]: a + b
Out[6]: (1, 2, 4, 6)

In [7]: z1 = {1: 'AAA', 2: 'BBB'}

In [8]: z2 = {3: 'CCC', 4: 'DDD'}

In [9]: z1 + z2
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Similarly, try using *:

In [1]: "a" * 3
Out[1]: 'aaa'

In [2]: (1, 2, 4) * 2
Out[2]: (1, 2, 4, 1, 2, 4)

In [3]: [1, 3, 5] * 4
Out[3]: [1, 3, 5, 1, 3, 5, 1, 3, 5, 1, 3, 5]

4.4 Writing python code

For the rest of the chapter, we will write the code in an editor and test it in python. Before we do so, however, let's take a slight detour on best programming practices. We all learn to program calling variables MickeyMouse, tmp3, RealPart, etc. Moreover, in the excitement of wanting to see the results, we make undocumented changes to the code, rewrite functions or resort to “hacks”. Learning a new language is a great occasion to leave these bad habits behind and start with best practices. Although initially painful, after a while they become second nature, saving us hundreds of hours in the long run, and making our code easier to read, share and debug. As such, I try to adhere to the PEP 8 python guideline (python.org/dev/peps/pep-0008/) and to present well-commented, easy to read python code. The main points are the following:
• Wrap lines so that they are less than 80 characters long. You can use parentheses () or signal that the line continues using a “backslash” \.
• Use 4 spaces for indentation, no tabs.
• Separate functions using a blank line.
• When possible, write comments on separate lines.
• Use docstrings to document how to use the code, and comments to explain why and how the code works.
• Naming conventions:
– _internal_global_variable
– a_variable
– SOME_CONSTANT
– a_function
– Never call a variable l or O or o (they are easily confused with 1 and 0).
• Use spaces around operators and after commas: a = func(x, y) + other(3, 4).
We start from a boilerplate for our code:

#!/usr/bin/python

"""Description of this program
you can use several lines"""

__author__ = 'Stefano Allesina ([email protected])'
__version__ = '0.0.1'

# imports
import sys

# constants

# functions
def main(argv):
    print 'This is a boilerplate'
    return 0

if (__name__ == "__main__"):
    status = main(sys.argv)
    sys.exit(status)

The first line (with the “shebang” #!) tells the computer where to look for python. The triple quotes start a “docstring”, i.e., a comment that will be recognized by an automatic documentation program (more on that later). Then the file contains information on the author and the version (note that each of these names starts and ends with two underscores, signaling that they are “internal” variables; never choose names starting with two underscores for your own variables!). This is good in case you want to share your code, or to remember the state of a file (e.g., a version 1.X.X would be a working, complete version). We then import the package sys, containing functions to interface our program with the operating system. This is followed by the main function. Note that all the content of the function is indented using 4 spaces. This is very important! In fact, that's how python determines which lines belong to which function. The same goes for the code under the if statement. The last few lines are a bit esoteric: the main idea is to set up the code such that, when the program is executed from the command line, the function main is called. Each file should be named using lowercase text, divided by underscores if necessary. Go to python/Sandbox. Type in the file and save it as boilerplate.py. In a terminal, type:

$ python boilerplate.py
$ python -i boilerplate.py

The first command launches python and executes the file. The second command launches python and “imports” your script (more on this later). You can use ipython instead of python (strongly recommended).

4.5 Control statements

A program is a sequence of operations. We can control the flow of operations using control statements. We are going to explore some of these commands in this more complex program:

#!/usr/bin/python

"""Some functions exemplifying the use of control statements"""

__author__ = 'Stefano Allesina ([email protected])'
__version__ = '0.0.1'

import sys

def even_or_odd(x=0):
    """Find whether a number x is even or odd."""
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x

def largest_divisor_five(x=120):
    """Find which is the largest divisor of x among 2, 3, 4, 5."""
    largest = 0
    if x % 5 == 0:
        largest = 5
    elif x % 4 == 0:
        largest = 4
    elif x % 3 == 0:
        largest = 3
    elif x % 2 == 0:
        largest = 2
    else:
        return "No divisor found for %d!" % x
    return "The largest divisor of %d is %d" % (x, largest)

def is_prime(x=70):
    """Find whether an integer is prime."""
    for i in range(2, x):
        if x % i == 0:
            print "%d is not a prime: %d is a divisor" % (x, i)
            return False
    print "%d is a prime!" % x
    return True

def find_all_primes(x=22):
    """Find all the primes up to x."""
    allprimes = []
    for i in range(2, x + 1):
        if is_prime(i):
            allprimes.append(i)
    print "There are %d primes between 2 and %d" % (len(allprimes), x)
    return allprimes

def main(argv):
    print even_or_odd(22)
    print even_or_odd(33)
    print largest_divisor_five(120)
    print largest_divisor_five(121)
    print is_prime(60)
    print is_prime(59)
    print find_all_primes(100)
    return 0

if (__name__ == "__main__"):
    status = main(sys.argv)

To begin typing the program, copy the boilerplate:

$ cp boilerplate.py control_flow.py

And open it with an editor. We are going to discuss the following:
• What is a function? A function is a piece of code that takes an (optional) input and returns an (optional) output obtained after performing some operations.
• Default values. By setting x=0 we say that, in case x is not specified, it should take the value 0.
• Return values. Each function can return a value or a variable.
• Docstrings. The comments flanked by triple quotes are called docstrings, and they are considered a part of the running code (normal comments, on the other hand, are stripped). Hence, you can access your docstrings at run time.
• The conditional if. It is the easiest way to produce a branch in the code. elif means “else, if”, while else is executed when all the other conditions are not met.
• Printing formatted text. print "%d %s %f %e" % (20, "30", 0.0003, 0.00003) results in 20 30 0.000300 3.000000e-05.
• range(). Returns a sequence of integers: range(3) → 0 1 2, while range(1, 3) → 1 2.
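A few of the points above (default values, run-time docstrings, and formatted strings) can be checked directly; this is a small sketch reusing the even_or_odd function:

```python
def even_or_odd(x=0):
    """Find whether a number x is even or odd."""
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x

# the docstring is part of the running code, accessible at run time
assert even_or_odd.__doc__ == "Find whether a number x is even or odd."
# the default value x=0 is used when no argument is given
assert even_or_odd() == "0 is Even!"
# %d, %s, %f and %e format integers, strings, floats and scientific notation
assert "%d %s %f %e" % (20, "30", 0.0003, 0.00003) == "20 30 0.000300 3.000000e-05"
```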

4.6 Loops in python

There are two main ways of looping in python: a for loop repeats operations stepping through a list of values; a while loop repeats the code as long as a condition is true.

# for loops
for i in range(5):
    print i

my_list = [0, 2, "abc", 3.0, True, False]
for k in my_list:
    print k

total = 0
summands = [0, 1, 11, 111, 1111]
for s in summands:
    total = total + s
print total

# while loops
z = 0
while z < 100:
    z = z + 1
print (z)

b = True
while b:
    print "infinite loop"

4.6.1 Exercises

A few exercises to make sure you understood these concepts:

# How many times will 'hello' be printed?
# 1)
for i in range(3, 17):
    print 'hello'

# 2)
for j in range(12):
    if j % 3 == 0:
        print 'hello'

# 3)
for j in range(15):
    if j % 5 == 3:
        print 'hello'
    elif j % 4 == 3:
        print 'hello'

# 4)
z = 0
while z != 15:
    print 'hello'
    z = z + 3

# 5)
z = 12
while z < 100:
    if z == 31:
        for k in range(7):
            print 'hello'
    elif z == 18:
        print 'hello'
    z = z + 1

# What does fooXX do?
def foo1(x):
    return x ** 0.5

def foo2(x, y):
    if x > y:
        return x
    return y

def foo3(x, y, z):
    if x > y:
        tmp = y
        y = x
        x = tmp
    if y > z:
        tmp = z
        z = y
        y = tmp
    if x > y:
        tmp = y
        y = x
        x = tmp
    return [x, y, z]

def foo4(x):
    result = 1
    for i in range(1, x + 1):
        result = result * i
    return result

# This is a recursive function,
# meaning that the function calls itself;
# read about it at
# en.wikipedia.org/wiki/Recursion_(computer_science)
def foo5(x):
    if x == 1:
        return 1
    return x * foo5(x - 1)

4.7 Executing and importing your code

If you want to execute a python script file from within a python shell, you can call

>>> execfile("filetoload.py")

In ipython, type %run filetoload.py. Most of the time, however, you will write a collection of functions in a .py file, and use them in other scripts. The file containing the functions is called a module. There are different ways of importing a module:
• import my_module. Import the module my_module. The functions in the module can be accessed as my_module.one_of_my_functions().
• from my_module import my_function. In this way, you import only the function my_function contained in module my_module. The function can be accessed as if it were part of the main file: my_function().
• import my_module as mm. Import the module my_module and call it mm. This is convenient only when the name of the module is very long. The functions in the module can be accessed as mm.one_of_my_functions().
• from my_module import *. Import all the functions in module my_module. You should never use this.
The main problem to avoid is namespace pollution: by importing functions and variables you may overwrite current functions and variables that have the same name. Using the first type of import avoids the problem entirely. For the same reason, use an initial _ in front of global variables: this avoids importing their names when loading the package with *. If possible, avoid global variables entirely.
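A minimal, self-contained sketch of the first two import styles; the module my_module.py is created on the fly here, purely for illustration:

```python
import os

# create a toy module on disk (hypothetical file, just for this demo)
with open("my_module.py", "w") as f:
    f.write("def my_function():\n    return 42\n")

# style 1: import the module, access functions through its name
import my_module
assert my_module.my_function() == 42

# style 2: import a single function into the current namespace
from my_module import my_function
assert my_function() == 42

os.remove("my_module.py")  # clean up the toy module
```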

4.8 Basic unit testing

In a complex programming task, most of the time is spent debugging, not writing the code. Unit testing is a way of writing code that catches common mistakes, and encourages writing solid, reliable and well-documented code. The idea is to write independent tests for the smallest units of code: whenever you write a function, you also write a small piece of code to test it. Testing is then done automatically, so that whenever you modify the code, you can make sure all tests still return the right value. We will use the simplest testing tool provided by python: doctest. Basically, simple tests for each function are embedded in the docstring describing the function. Besides testing the code, this also provides a great way of documenting the functions. Let's copy the file control_flow.py into the file test_control_flow.py, and edit the first function.

#!/usr/bin/python

"""Some functions exemplifying the use of control statements"""

__author__ = 'Stefano Allesina ([email protected])'
__version__ = '0.0.1'

import sys

def even_or_odd(x=0):
    """Find whether a number x is even or odd.

    >>> even_or_odd(10)
    '10 is Even!'

    >>> even_or_odd(5)
    '5 is Odd!'

    note that whenever a float is provided,
    it is truncated in the printed message

    >>> even_or_odd(3.2)
    '3 is Odd!'

    negative numbers are handled correctly

    >>> even_or_odd(-2)
    '-2 is Even!'

    """
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x

def main(argv):
    print even_or_odd(22)
    print even_or_odd(33)
    return 0

if (__name__ == "__main__"):
    status = main(sys.argv)

To run the tests, type the following in a terminal; you should get:

$ python -m doctest -v test_control_flow.py

Trying:
    even_or_odd(10)
Expecting:
    '10 is Even!'
ok
Trying:
    even_or_odd(5)
Expecting:
    '5 is Odd!'
ok
Trying:
    even_or_odd(3.2)
Expecting:
    '3 is Odd!'
ok
Trying:
    even_or_odd(-2)
Expecting:
    '-2 is Even!'
ok
2 items had no tests:
    test_control_flow
    test_control_flow.main
1 items passed all tests:
    4 tests in test_control_flow.even_or_odd
4 tests in 3 items.
4 passed and 0 failed.

For more complex testing, see the documentation of doctest, the package nose and the package unittest.
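For a flavour of unittest, here is a sketch (not from the original text) of how the first two doctests above could be written as a test case and run programmatically:

```python
import unittest

def even_or_odd(x=0):
    """Find whether a number x is even or odd."""
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x

class TestEvenOrOdd(unittest.TestCase):
    def test_even(self):
        self.assertEqual(even_or_odd(10), "10 is Even!")

    def test_odd(self):
        self.assertEqual(even_or_odd(5), "5 is Odd!")

# load and run the tests without the command-line runner
suite = unittest.TestLoader().loadTestsFromTestCase(TestEvenOrOdd)
result = unittest.TextTestRunner(verbosity=0).run(suite)
assert result.wasSuccessful()
```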

4.9 Variable scope

By scope we mean which variables are visible at a given point in the code. Global variables are visible inside and outside of functions. Local variables, on the other hand, are only accessible inside a function.

## Try this first
_a_global = 10

def a_function():
    _a_global = 5
    _a_local = 4
    print "Inside the function", _a_global
    print "Inside the function", _a_local
    return None

a_function()

print "Outside the function", _a_global

## Now try this

_a_global = 10

def a_function():
    global _a_global
    _a_global = 5
    _a_local = 4
    print "Inside the function", _a_global
    print "Inside the function", _a_local
    return None

a_function()
print "Outside the function", _a_global

## IF POSSIBLE, AVOID GLOBALS ALTOGETHER!

4.10 Copying mutable objects

You have to be especially careful when copying objects. In fact, when you type b = a, you are just creating a new “tag” (b) for the value already tagged by a. When dealing with immutable types (int, float, tuples), this is not a problem; however, copying lists can be tricky.

# First, try this:
a = [1, 2, 3]
b = a
a.append(4)
print b
# this will print [1, 2, 3, 4]!!

# Now, try:
a = [1, 2, 3]
b = a[:]  # This is a "shallow" copy
a.append(4)
print b
# this will print [1, 2, 3].

# What about more complex lists?
a = [[1, 2], [3, 4]]
b = a[:]
a[0][1] = 22
print b
# this will print [[1, 22], [3, 4]]

# the solution is to do a "deep" copy:
import copy

a = [[1, 2], [3, 4]]
b = copy.deepcopy(a)
a[0][1] = 22
print b
# this will print [[1, 2], [3, 4]]

4.11 String manipulation

python has great methods to manipulate strings. The most powerful methods are based on regular expressions, which we will introduce later on. The simplest functions are the following:

s = " this is a string "
len(s)
# length of s -> 18

print s.replace(" ", "-")
# Substitute spaces " " with dashes -> -this-is-a-string-

print s.find("s")
# First occurrence of "s" -> 4 (start at 0)

print s.count("s")
# Count the number of "s" -> 3

t = s.split()
print t
# Split the string using spaces and make
# a list -> ['this', 'is', 'a', 'string']

t = s.split(" is")
print t
# Split the string using " is" and make
# a list -> [' this', ' a string ']

t = s.strip()
print t
# remove leading and trailing spaces

print s.upper()
# ' THIS IS A STRING '

'WORD'.lower()
# 'word'

4.12 Input and output

#############################
# USER INPUT
#############################
# To capture input from the user:
input_user = raw_input()
print input_user

# Note that it is always a string:
x = raw_input("Enter an integer: ")
print "The square of %d is %d" % (x, x**2)  # -> TypeError!

# We need to interpret it:
x = int(raw_input("Enter an integer: "))
print "The square of %d is %d" % (x, x**2)

#############################
# FILE INPUT
#############################
# Open a file for reading
f = open('test.txt', 'r')
# use "implicit" for loop:
# if the object is a file, python will cycle over lines
for line in f:
    print line,  # the "," prevents adding a new line

# close the file
f.close()

# Same example, skip blank lines
f = open('test.txt', 'r')
for line in f:
    if len(line.strip()) > 0:
        print line,
f.close()

#############################
# FILE OUTPUT
#############################
# Save the elements of a list to a file
list_to_save = range(100)

f = open('testout.txt', 'w')
for i in list_to_save:
    f.write(str(i) + '\n')  # Add a new line at the end
f.close()

#############################
# STORING OBJECTS
#############################
# To save an object (even a complex one) for later use
my_dictionary = {"a key": 10, "another key": 11}

import pickle

f = open('testp.p', 'wb')  # note the b: accept binary files
pickle.dump(my_dictionary, f)
f.close()

## Load the data again
f = open('testp.p', 'rb')
another_dictionary = pickle.load(f)
f.close()

print another_dictionary

The csv package makes it easy to manipulate comma-separated files:

import csv

# Read a file containing:
# 'Species','Infraorder','Family','Distribution','Body mass male (Kg)'
f = open('../Data/testcsv.csv', 'rb')
csvread = csv.reader(f)
for row in csvread:
    print row
    print "The species is", row[0]

f.close()

# write a file containing only species name and body mass
f = open('../Data/testcsv.csv', 'rb')
g = open('bodymass.csv', 'wb')

csvread = csv.reader(f)
csvwrite = csv.writer(g)
for row in csvread:
    print row
    csvwrite.writerow([row[0], row[4]])

f.close()
g.close()

4.13 Exercises

4.13.1 Similarity of DNA sequences

We want to take two sequences of DNA, and align them such that they are as similar as possible. To do so, we take the longest string as the reference, and try to position the shorter string at all possible starting points. For each position, we compute a “score” (i.e., the number of bases matched perfectly over the number of bases tried). First, we analyze the following code:

# These are the two sequences to match
seq2 = "ATCGCCGGATTACGGG"
seq1 = "CAATTCGGAT"

# assign the longest sequence
# to s1, and the shortest to s2
# l1 is the length of the longest,
# l2 that of the shortest
l1 = len(seq1)
l2 = len(seq2)
if l1 >= l2:
    s1 = seq1
    s2 = seq2
else:
    s1 = seq2
    s2 = seq1
    l1, l2 = l2, l1  # swap the two lengths

# function that computes a score
# by returning the number of matches
# starting from an arbitrary startpoint
def calculate_score(s1, s2, l1, l2, startpoint):
    # startpoint is the point at which we want to start
    matched = ""  # contains string for alignment
    score = 0
    for i in range(l2):
        if (i + startpoint) < l1:
            # if it's matching the character
            if s1[i + startpoint] == s2[i]:
                matched = matched + "*"
                score = score + 1
            else:
                matched = matched + "-"

    # build some formatted output
    print "." * startpoint + matched
    print "." * startpoint + s2
    print s1
    print score
    print ""

    return score

calculate_score(s1, s2, l1, l2, 0)
calculate_score(s1, s2, l1, l2, 1)
calculate_score(s1, s2, l1, l2, 5)

# now try to find the best match (highest score)
my_best_align = None
my_best_score = -1
for i in range(l1):
    z = calculate_score(s1, s2, l1, l2, i)
    if z > my_best_score:
        my_best_align = "." * i + s2
        my_best_score = z

print my_best_align
print s1
print "Best score:", my_best_score

Now, modify the function such that, in case of the same number of matches, it returns the alignment in which the highest proportion of pairs is matched. That is, if alignment 1 has 5 matches out of 9 bases tried, while alignment 2 has 5 matches out of 6 bases tried, alignment 2 is to be preferred.

4.13.2 Using docstrings

Write docstring tests for all the functions in control_flow.py.

4.13.3 Counting self-incompatibility

The directory python/Data/GoldbergEtAl contains the data from Goldberg et al. (Science, 2010). The csv file lists 356 plant species along with a Status describing whether the species is self-incompatible (0), self-compatible (1) or a more complex situation (2-5).
• Write a program that counts how many species are in each category of Status.
• Write another program that builds a table specifying how many species are in each Status for each genus (note that genus and species name are separated by an underscore in the first column).

4.13.4 C. elegans mortality

Read the paper by Morran et al., “Running with the Red Queen: Host-Parasite Coevolution Selects for Biparental Sex” (Science 333: 216–218, 2011). In the directory python/Data/MorranEtAl you find the data necessary to replicate Fig 2. Take the data and compute the average mortality rate for the groups of organisms labeled from a to r in the graph. The README.txt file contains the description of the data. For example, to compute the mean mortality rate for case a you would average among all replicates having i) Ancestor marcescens (Serratia = A), ii) obligately selfing worms (Mating type = COX), and iii) Ancestral populations (Worm Gen = An).

4.14 References and readings

• The Zen of python: open a python shell and type import this.
• Code Like a Pythonista: Idiomatic Python. Google it.
• Also good: the Google Python Style Guide.
• Browse the python tutorial at docs.python.org/tutorial/.
• Chapters 7-13 of “Practical Computing for Biologists” by Haddock & Dunn teach how to use python for biological problems.
• software-carpentry.org has great lectures on python.
• docs.python.org contains the official documentation, which is great and very detailed.
• “Python Testing: Beginner's Guide”, by D. Arbuckle, is available online through the library.
• “Python Cookbook”, by A. Martelli et al., contains hundreds of solved problems (available online through the library).

Figure 4.1: Morran et al. Figure 2. From their paper: “Coevolutionary dynamics of hosts and pathogens. We exposed hosts evolved under the coevolution treatment and their ancestral populations (before coevolution) to three pathogen populations: (i) an ancestor strain (ancestral to all S. marcescens used in this study), (ii) a noncoevolving strain (evolved without selection), and (iii) their respective coevolving strain (coevolving with the host population). We evaluated host mortality after 24 hours of exposure to the pathogens and present the means across the replicate host populations. (A) Three obligately selfing C. elegans populations persisted beyond 10 host generations in the coevolution treatment. These populations were assayed before extinction. (B) All five wild-type C. elegans populations in the coevolution treatment and their ancestors were assayed at the endpoint of the experiment (30 host generations). (C) All five obligately outcrossing C. elegans populations in the coevolution treatment and their ancestors were assayed at the endpoint of the experiment.”

http://xkcd.com/353/

5 — Scientific computing

5.1 Using IPython

We're going to explore some of the nicest features of IPython. Mastering these features will make you a better and faster programmer. First, try to TAB everything! Variable names, directories, etc. will be automatically completed, or you will see a list of the possibilities. Also, if you have an object (say a list) and hit TAB after typing a dot, you can see the object's functions and attributes.

In [1]: alist = ['a', 'b', 'c']

In [2]: alist.
alist.append   alist.extend   alist.insert   alist.remove   alist.sort
alist.count    alist.index    alist.pop      alist.reverse

In [2]: adict = {'mickey': 100, 'mouse': 3.14}

In [3]: adict.
adict.clear       adict.items       adict.pop         adict.viewitems
adict.copy        adict.iteritems   adict.popitem     adict.viewkeys
adict.fromkeys    adict.iterkeys    adict.setdefault  adict.viewvalues
adict.get         adict.itervalues  adict.update
adict.has_key     adict.keys        adict.values

Another useful feature is the question mark:

In [3]: ?adict
Type:        dict
Base Class:  <type 'dict'>
String Form: {'mickey': 100, 'mouse': 3.14}
Namespace:   Interactive
Length:      2
Docstring:
dict() -> new empty dictionary
dict(mapping) -> new dictionary initialized from a mapping object's
    (key, value) pairs
dict(iterable) -> new dictionary initialized as if via:
    d = {}
    for k, v in iterable:
        d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs
    in the keyword argument list. For example: dict(one=1, two=2)

IPython features several “magic commands”. These commands start with the percent sign %. The most important, when developing your code, is %run myscript.py. The command reads from the disk the python file myscript.py, loads the namespace and applies any change you might have made. Thus, when you’re writing the code, you can test its behavior piece by piece. Moreover, if you type %run -t myscript.py you can measure the time it takes to run the script.

5.1.1 Debugging with pdb

pdb is the python debugger. If you call the magic command %pdb, the debugger is turned on. At any error in the code (technically, any uncaught exception), the debugger will start, letting you print variable values, examine the code and see what went wrong. Using a debugger is the best way to deal with bugs; please do not litter your code with print statements trying to zero in on the bug. For example, let's write a simple function that contains a division by zero:

def createabug(x):
    y = x**4
    z = 0.
    y = y/z
    return y

createabug(25)

Save the file as debugme.py in python/Sandbox. Now let's magic-run it!

In [1]: %run debugme.py
[lots of text]
      createabug(x)
      2     y = x**4
      3     z = 0.
----> 4     y = y/z
      5     return y

ZeroDivisionError: float division by zero

Typically, we’d like to stop the code just before the error, and check why things went wrong. In this case, it is fairly obvious, but this is not the typical case. The debugger lets us see what each variable was doing when the bug happened. Let’s turn the debugger on:

In [18]: %pdb
Automatic pdb calling has been turned ON

In [19]: %run debugme.py
[lots of text]
ZeroDivisionError: float division by zero
> createabug()
      3     z = 0.
----> 4     y = y/z
      5     return y

ipdb>

Note that now we're in the debugger shell, and not in IPython. To move within the debugger, we can use the following commands:

n      move to the next line.
ENTER  repeat the previous command.
s      “step” into a function or procedure (i.e., continue the debugging inside the function, as opposed to simply running it).
p x    print the variable x.
c      continue until the next break-point.
q      quit.
l      print the code surrounding the current position.
r      continue until the end of the function.

To continue our example:

ipdb> p x
25
ipdb> p y
390625
ipdb> p z
0.0
ipdb> p y/z
*** ZeroDivisionError: ZeroDivisionError('float division by zero',)
ipdb> list
      1 def createabug(x):
      2     y = x**4
      3     z = 0.
----> 4     y = y/z
      5     return y
      6
      7 createabug(25)

ipdb> q

In [21]: %pdb
Automatic pdb calling has been turned OFF

Often, you want to launch the debugger at a given point of the code (to make sure everything is going well). Simply put this snippet of code wherever you want to start your debugging session, and then call the %run magic function.

import pdb; pdb.set_trace()

Similarly, running the code with the -d flag (%run -d) automatically starts a debugging session from the first line of your code. We don't have time for a complete session on debugging (maybe in a follow-up class?), but if you are serious about programming, please start using a debugger!

5.1.2 Profiling in IPython

Premature optimization is the root of all evil – Donald Knuth

If you're programming in python, speed is not your main concern, so do not let speed drive the design of your code. If you have a clean design, you can speed things up later, for example by rewriting your code (or parts of it) in C. However, there are times in which knowing where your code spends most of its time is useful: maybe “optimizing” just a few functions can yield much better performance. The key is to know where your code is spending its time, and measuring this is called “profiling”. IPython lets you profile your code by simply typing the magic command %run -p. Let's write a simple (and quite stupid) program:

def a_useless_function(x):
    y = 0
    # eight zeros!
    for i in range(100000000):
        y = y + i
    return 0

def a_less_useless_function(x):
    y = 0
    # five zeros!
    for i in range(100000):
        y = y + i
    return 0

def some_function(x):
    print x
    a_useless_function(x)
    a_less_useless_function(x)
    return 0

some_function(1000)

Now running the magic command we find:

52 function calls in 5.940 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    4.117    4.117    5.934    5.934  profileme.py:1(a_useless_function) <======
     2    1.819    0.910    1.819    0.910  {range} <== *****
     1    0.004    0.004    0.005    0.005  profileme.py:8(a_less_useless_function) <=
     1    0.000    0.000    5.940    5.940  {execfile}
     1    0.000    0.000    5.940    5.940  profileme.py:15(some_function) <======

When you have to iterate over a large number of values, use xrange instead of range. The difference is that xrange does not create all the values before iteration, but rather creates them one at a time.
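The lazy behaviour of xrange can be sketched with a generator (this sketch is not in the original text; note that in python 3, range itself behaves lazily):

```python
# a sketch of the idea behind xrange: yield values one at a time
# instead of building the whole list in memory first
def lazy_range(n):
    i = 0
    while i < n:
        yield i
        i = i + 1

total = 0
for i in lazy_range(100000):
    total = total + i

# same sum as iterating over the full list 0..99999
assert total == 99999 * 100000 // 2
```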

def a_useless_function(x):
    y = 0
    # eight zeros!
    for i in xrange(100000000):
        y = y + i
    return 0

def a_less_useless_function(x):
    y = 0
    # five zeros!
    for i in xrange(100000):
        y = y + i
    return 0

def some_function(x):
    print x
    a_useless_function(x)
    a_less_useless_function(x)
    return 0

some_function(1000)

50 function calls in 3.211 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    3.207    3.207    3.207    3.207  profileme2.py:1(a_useless_function)
     1    0.003    0.003    0.003    0.003  profileme2.py:8(a_less_useless_function)
     1    0.000    0.000    3.211    3.211  {execfile}
     1    0.000    0.000    3.211    3.211  interactiveshell.py:2270(safe_execfile)
     1    0.000    0.000    3.210    3.210  profileme2.py:15(some_function)

5.1.3 Other magic commands %who Shows current namespace (all variables, modules and functions) %whos Also display the type of each variable. Typing %whos function only displays functions etc. %pwd Print working directory. %history Print recent commands. %cpaste Paste indented code into IPython. This is very useful when you want to paste just part of a function (that could have extra indentation). Let’s try the %cpaste function. First, create a file with the following code:

def PrintNumbers(x):
    for i in range(x):
        print x
    return 0

Now launch IPython and type:

In [1]: x = 11

# Now copy the for loop from the file.
# Note that there is extra indentation!

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:    for i in range(x):
:        print x
:--
11
11
11
11
11
11
11
11
11
11
11

5.2 Regular expressions

Regular expressions are tools to find matches of a particular pattern in a string of text. They can be used to search for particular patterns (e.g., DNA motifs in sequence data), navigate through sets of files, parse text files, and extract information from html and other files. We are going to explore regular expressions in python, but packages are available for most programming languages. Also, by the end of this section it should be clear that the program grep we saw in the first chapter is in fact using regular expressions (grep: global regular expression print).

5.2.1 Regular expressions in python

The regular expression functions in python are in the module re. Thus, at the top of our program we should write import re. Typically, we want to build a scaffolding for the text we want to match. To do so, we will use particular "place-holders". The main ones are:

a        match the character a
3        match the number 3
\n       match a newline
\t       match a tab
\s       match a whitespace
.        match any character
\w       match a "word character" (alphanumeric + underscore)
\d       match a numeric character
[atgc]   match any character listed: a, t, g, c
at|gc    match at or gc
[^atgc]  match any character not listed: any character but a, t, g, c
?        match zero or one time
*        match zero or more times
+        match one or more times
{n}      match exactly n times
{n,}     match at least n times
{n,m}    match at least n but not more than m times
^        match the beginning of a line
$        match the end of a line

In python, there are several regular expression functions. The simplest one is re.search, which searches for a match of a given pattern in the string. It returns None if the pattern is not found, and a match object if the pattern is found. For example, try typing:

import re

my_string = "a given string"

# find a space in the string
match = re.search(r'\s', my_string)
print match

# this should print something like
# <_sre.SRE_Match at 0x3223098>

# now we can see what has matched
match.group()

# now search for the pattern: "s" followed by some letters
match = re.search(r's\w*', my_string)

# this should return "string"
match.group()

# Now an example of no match:
# find a digit in the string
match = re.search(r'\d', my_string)

# this should print "None"
print match

If we just want to know whether a pattern is matched, we can use an if:

my_string = 'an example'

match = re.search(r'\w*\s', my_string)

if match:
    print 'found a match:', match.group()
else:
    print 'did not find a match'

Note that we always put r in front of our regular expression string! This is very important, as it tells python to consider the string in its "raw" form.
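A quick way to see what the r prefix does (a small sketch; the strings here are made up):

```python
import re

# '\t' is a single tab character; r'\t' is two characters: a backslash and a 't'
print(len('\t'), len(r'\t'))  # 1 2

# without the r prefix, python itself would first interpret the backslashes,
# which can silently change the pattern you hand to re
match = re.search(r'\d', 'room 101')
print(match.group())  # '1'
```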

import re

# Some Basic Examples
match = re.search(r'\d', "it takes 2 to tango")
print match.group()  # prints 2

match = re.search(r'\w*\s\d.*\d', "take 2 grams of H2O")
match.group()  # 'take 2 grams of H2'

match = re.search(r'\s\w*\s', 'once upon a time')
match.group()  # ' upon '

match = re.search(r'\s\w{1,3}\s', 'once upon a time')
match.group()  # ' a '

match = re.search(r'\s\w*$', 'once upon a time')
match.group()  # ' time'

match = re.search(r'^\w*.*\s', 'once upon a time')
match.group()  # 'once upon a '

## NOTE THAT *, ? and + are all "greedy":
## they try to match as much text as possible

## To make it non-greedy, append a ?:
match = re.search(r'^\w*.*?\s', 'once upon a time')
match.group()  # 'once '

match = re.search(r'\d*\.?\d*', '1432.75+60.22i')
match.group()  # '1432.75'

match = re.search(r'\d*\.?\d*', '1432+60.22i')
match.group()  # '1432'

match = re.search(r'[ACGT]+', "the motif ATTCGT")
match.group()  # 'ATTCGT'

5.2.2 Groups in regular expressions

One powerful feature of regular expressions is the possibility of defining groups (i.e., pieces of the string we want to extract). Simply surround the part of interest with round parentheses.

import re

"""
Goal:
scrape emails from the file
../Data/email_addresses_e_and_e.txt
containing the email addresses of all the E&E faculty.

Note the following:

Multiple dots in the domain name:
justin.borevitz(at)anu(dot)edu(dot)au
reinitz(at)galton(dot)uchicago(dot)edu

Hyphen in the local part:
j-coyne(at)uchicago(dot)edu
"""

# read the file with email addresses
f = open('../Data/email_addresses_e_and_e.txt')
my_file_email = f.read()
f.close()

# substitute (at) with @
my_text = my_file_email.replace('(at)', '@')

# substitute (dot) with .
my_text = my_text.replace('(dot)', '.')

# now build a regular expression to grab emails
my_reg = r'email:\s?([A-Za-z\.\-_0-9]*@[A-Za-z\.\-_0-9]*\.\w{2,3})'
emails = re.findall(my_reg, my_text)

# A little more difficult: save the last names
# Note that each is followed by Ph.D.
my_reg2 = r'([A-Za-z-\']+),\sPh\.D\.'
lastnames = re.findall(my_reg2, my_text)

# now print them together
for i in range(len(emails)):
    print lastnames[i], emails[i]

# Exercise: now add the phone number!

Some more on grouping:

str = "Stefano Allesina, [email protected], Ecology & Evolution"

# without groups match = re.search(r"[\w\s]*,\s[\w\.@]*,\s[\w\s&]*",str)

match.group()
’Stefano Allesina, [email protected], Ecology & Evolution’

match.group(0)
’Stefano Allesina, [email protected], Ecology & Evolution’

# now add groups using ( ) match = re.search(r"([\w\s]*),\s([\w\.@]*),\s([\w\s&]*)",str)

match.group(0) ’Stefano Allesina, [email protected], Ecology & Evolution’

match.group(1) ’Stefano Allesina’

match.group(2) ’[email protected]

match.group(3) ’Ecology & Evolution’

Important re functions:

re.compile(reg)          Compile a regular expression. In this way the pattern is stored for repeated use, improving speed.
re.search(reg, text)     Scan the string and find the first match of the pattern. Return a match object if successful and None otherwise.
re.match(reg, text)      As re.search, but only match at the beginning of the string.
re.split(reg, text)      Split the text at the occurrences of the pattern described by the regular expression.
re.findall(reg, text)    As re.search, but return a list of all the matches. If groups are present, return a list of groups.
re.finditer(reg, text)   As re.findall, but return an iterator over the matches.
re.sub(reg, repl, text)  Substitute each non-overlapping occurrence of the match with the text in repl (or a function!).
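A quick tour of these functions on a toy string (a sketch; the string is made up):

```python
import re

text = "A1 B22 C333"

print(re.split(r'\s+', text))        # ['A1', 'B22', 'C333']
print(re.sub(r'\d+', '#', text))     # 'A# B# C#'
print(re.findall(r'\d+', text))      # ['1', '22', '333']
print([m.group() for m in re.finditer(r'[A-Z]', text)])  # ['A', 'B', 'C']

# re.compile stores the pattern for repeated use
digits = re.compile(r'\d+')
print(digits.match(text))            # None: match is anchored at the start
print(digits.search(text).group())   # '1'
```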

5.3 Exercises 5.3.1 Translate Translate the following regular expressions into regular English!

r'^abc[ab]+\s\t\d'

r'^\d{1,2}\/\d{1,2}\/\d{4}$'

r'\s*[a-zA-Z,\s]+\s*'

5.3.2 Years Write a regular expression to match a date in format YYYYMMDD, making sure that only seemingly valid dates match (i.e., year greater than 1900, first digit in month either 0 or 1, first digit in day ≤ 3).

5.3.3 Blackbirds Complete the code blackbirds.py that you find in the sandbox.

import re

# Read the file
f = open('../Data/blackbirds.txt', 'r')
text = f.read()
f.close()

# remove \t and \n and put a space in
text = text.replace('\t', ' ')
text = text.replace('\n', ' ')

# note that there are "strange characters" (these are accents and
# non-ascii symbols); because we don't care for them, let's transform
# to ASCII
text = text.decode('ascii', 'ignore')

## Now write a regular expression that captures
## the Kingdom, Phylum and Species name for each species.

my_reg = ??????

# such that re.findall(my_reg, text) should return
# [(u'Animalia', u'Chordata', u'Euphagus carolinus'),
#  (u'Animalia', u'Chordata', u'Euphagus cyanocephalus'),
#  (u'Animalia', u'Chordata', u'Turdus boulboul'),
#  (u'Animalia', u'Chordata', u'Agelaius assimilis')]

5.3.4 Nature papers In the data folder you find a csv file with the 1,334 Letters and Articles published by Nature in 2011. Use the csv and re modules to count the number of authors for each paper. The minimum should be 1, the maximum 373, the median 6 and the mean about 10. Use re.compile to compile your regular expression before using it (this will speed up performance). Use re.findall to find all the authors of a paper. Remember to decode the authors list before applying the regular expression (so you don't have to deal with accents, etc.)

5.3.5 Average temperatures Find the average temperature in January from 1995 for each city in the folder Data/AvgTemperature. A template of the code is in the Sandbox:

import re

## Goal: find the average temperature
## in January for all the cities in
## Data/AvgTemperature

# The records look like:
# 1 1 1995 58.3\r\n   <- \r\n used for new line

# First: build a regular expression that
# extracts the day, year and average temperature
myreg = re.compile(r'')  # <- write the regular expression!

# Now let's test it on the Rome temperatures
filename = "../Data/AvgTemperature/IYROME.txt"
f = open(filename, 'r')
mytext = f.read()
f.close()

print re.findall(myreg, mytext)
# This should give you a list containing tuples:
# ('1', '1995', '58.3'),
# ('2', '1995', '46.5'),
# ('3', '1995', '41.6'),
# ....

# Now elaborate the list to get the average for each year

# finally, repeat for each city

5.4 The scientific library SciPy SciPy provides many useful routines for scientific computing. The main functions deal with integration (integrals or differential equations), Fourier transforms, interpolation, linear algebra, special functions (such as Incomplete Gamma, Bessel functions and many others), and statistical functions. Numpy is a package that provides functions for numerical linear algebra. It has been embedded in SciPy and the two now use the same syntax. We are going to explore some of the main features.

5.4.1 Linear algebra

The main class for vectors and matrices is called array. However, there is also a class matrix that provides matrices and vectors for linear algebra (quite confusing! We are going to use arrays only).

In [1]: import scipy

In [2]: a = scipy.array(range(5))

In [3]: a Out[3]: array([0, 1, 2, 3, 4])

In [4]: a = scipy.array(range(5), dtype = float)

In [5]: a Out[5]: array([ 0., 1., 2., 3., 4.])

In [6]: a.dtype Out[6]: dtype(’float64’)

In [7]: x = scipy.arange(5)

In [8]: x Out[8]: array([0, 1, 2, 3, 4])

In [9]: x = scipy.arange(5.)

In [10]: x Out[10]: array([ 0., 1., 2., 3., 4.])

In [11]: x.<TAB>
x.T          x.conj        x.fill       x.nbytes        x.round         x.take
x.all        x.conjugate   x.flags      x.ndim          x.searchsorted  x.tofile
x.any        x.copy        x.flat       x.newbyteorder  x.setfield      x.tolist
x.argmax     x.ctypes      x.flatten    x.nonzero       x.setflags      x.tostring
x.argmin     x.cumprod     x.getfield   x.prod          x.shape         x.trace
x.argsort    x.cumsum      x.imag       x.ptp           x.size          x.transpose
x.astype     x.data        x.item       x.put           x.sort          x.var
x.base       x.diagonal    x.itemset    x.ravel         x.squeeze       x.byteswap
x.dot        x.itemsize    x.real       x.std           x.choose        x.dtype
x.max        x.repeat      x.strides    x.dump          x.mean          x.reshape
x.sum        x.compress    x.dumps      x.min           x.resize        x.swapaxes

In [11]: x.tolist() Out[11]: [0.0, 1.0, 2.0, 3.0, 4.0]

In [12]: x.shape Out[12]: (5,)

In [14]: mat = scipy.array([[0, 1], [2, 3]])

In [16]: mat.shape Out[16]: (2, 2)

In [17]: mat[1] Out[17]: array([2, 3])

In [18]: mat[0,0] Out[18]: 0

In [19]: mat.ravel() # flatten! Out[19]: array([0, 1, 2, 3])

In [20]: scipy.ones((4,2))
Out[20]:
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

In [21]: scipy.zeros((4,2))
Out[21]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

In [22]: scipy.identity(4)
Out[22]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [23]: m = scipy.identity(4)

In [24]: m.reshape((8, 2))
Out[24]:
array([[ 1.,  0.],
       [ 0.,  0.],
       [ 0.,  1.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 1.,  0.],
       [ 0.,  0.],
       [ 0.,  1.]])

In [25]: m.fill(16)

In [26]: m
Out[26]:
array([[ 16.,  16.,  16.,  16.],
       [ 16.,  16.,  16.,  16.],
       [ 16.,  16.,  16.,  16.],
       [ 16.,  16.,  16.,  16.]])

We can perform the usual operations on arrays:

In [1]: import scipy

In [2]: mm = scipy.arange(16)

In [3]: mm = mm.reshape(4,4)

In [4]: mm
Out[4]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

# NOTE: rows first

In [5]: mm.transpose()
Out[5]:
array([[ 0,  4,  8, 12],
       [ 1,  5,  9, 13],
       [ 2,  6, 10, 14],
       [ 3,  7, 11, 15]])

In [6]: mm + mm.transpose()
Out[6]:
array([[ 0,  5, 10, 15],
       [ 5, 10, 15, 20],
       [10, 15, 20, 25],
       [15, 20, 25, 30]])

In [7]: mm - mm.transpose()
Out[7]:
array([[ 0, -3, -6, -9],
       [ 3,  0, -3, -6],
       [ 6,  3,  0, -3],
       [ 9,  6,  3,  0]])

In [8]: mm * mm.transpose()  ## Elementwise!
Out[8]:
array([[  0,   4,  16,  36],
       [  4,  25,  54,  91],
       [ 16,  54, 100, 154],
       [ 36,  91, 154, 225]])

In [9]: mm / mm.transpose()
Warning: divide by zero encountered in divide

# Note the integer division
Out[9]:
array([[0, 0, 0, 0],
       [4, 1, 0, 0],
       [4, 1, 1, 0],
       [4, 1, 1, 1]])

In [10]: mm * scipy.pi
Out[10]:
array([[  0.        ,   3.14159265,   6.28318531,   9.42477796],
       [ 12.56637061,  15.70796327,  18.84955592,  21.99114858],
       [ 25.13274123,  28.27433388,  31.41592654,  34.55751919],
       [ 37.69911184,  40.84070449,  43.98229715,  47.1238898 ]])

In [11]: mm.dot(mm)  # MATRIX MULTIPLICATION
Out[11]:
array([[ 56,  62,  68,  74],
       [152, 174, 196, 218],
       [248, 286, 324, 362],
       [344, 398, 452, 506]])

To make full use of the linear algebra capabilities of scipy, we have to import the subpackage linalg.

In [1]: import scipy

In [2]: import scipy.linalg

In [3]: m = scipy.array([[4, 3], [8, 7]])

In [4]: b = scipy.array([1, 2])

In [5]: scipy.linalg.inv(m)  # INVERSE
Out[5]:
array([[ 1.75, -0.75],
       [-2.  ,  1.  ]])

In [6]: scipy.linalg.det(m) # DETERMINANT Out[6]: 4.0

In [7]: scipy.linalg.eig(m)  # EIGENVALUES/VECTORS
Out[7]:
(array([  0.37652462+0.j,  10.62347538+0.j]),
 array([[-0.63772689, -0.41258634],
        [ 0.77026256, -0.9109185 ]]))

In [9]: scipy.linalg.eigvals(m) # ONLY VALUES Out[9]: array([ 0.37652462+0.j, 10.62347538+0.j])

In [10]: (scipy.linalg.eigvals(m)[0]).real # REAL PART Out[10]: 0.37652461702020101

In [15]: scipy.linalg.norm(m) # FROBENIUS NORM Out[15]: 11.74734012447073

In [17]: scipy.linalg.cholesky(m)  # CHOLESKY DECOMP.
Out[17]:
array([[ 2.        ,  1.5       ],
       [ 0.        ,  2.17944947]])

In [18]: scipy.linalg.svd(m)  # SINGULAR VALUES
Out[18]:
(array([[-0.42499684, -0.90519483],
        [-0.90519483,  0.42499684]]),
 array([ 11.74240011,   0.34064586]),
 array([[-0.76147516, -0.64819409],
        [-0.64819409,  0.76147516]]))

# Solve linear systems

In [33]: scipy.linalg.solve(m, b)  ## m x = b
Out[33]: array([ 0.25,  0.  ])

In [35]: scipy.dot(m, scipy.linalg.solve(m, b)) == b
Out[35]: array([ True,  True], dtype=bool)

# Building special matrices
# Block diagonal
In [19]: A = [[1, 2], [3, 4]]
In [20]: B = [[5, 6, 7], [8, 9, 10], [11, 12, 13]]
In [22]: scipy.linalg.block_diag(A, B)
Out[22]:
array([[ 1,  2,  0,  0,  0],
       [ 3,  4,  0,  0,  0],
       [ 0,  0,  5,  6,  7],
       [ 0,  0,  8,  9, 10],
       [ 0,  0, 11, 12, 13]])

# Leslie matrix
In [23]: Fecundity = [0.1, 0.3, 7.2]
In [24]: Transition = [0.1, 0.2]
In [25]: scipy.linalg.leslie(Fecundity, Transition)
Out[25]:
array([[ 0.1,  0.3,  7.2],
       [ 0.1,  0. ,  0. ],
       [ 0. ,  0.2,  0. ]])

# Triangular matrix
In [30]: scipy.linalg.tri(6, dtype = float)
Out[30]:
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.,  1.,  1.]])

5.4.2 Numerical integration

Say we want to integrate numerically:

∫_0^∞ 8/(x² + 16) dx = π   (5.1)

We can perform the integration numerically using the function quad:

In [20]: import scipy.integrate

In [21]: def integrand(x):
   ....:     return 8./(x**2 + 16.)
   ....:

In [22]: scipy.integrate.quad(integrand, 0, scipy.inf) Out[22]: (3.1415926535897936, 2.8707080344128833e-09)
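The value quad reports can be cross-checked without scipy: below is a crude midpoint-rule sketch in plain python (the cutoff 10000 and the number of intervals are arbitrary choices; the truncated tail of the infinite domain contributes roughly 8/10000):

```python
import math

def integrand(x):
    return 8.0 / (x**2 + 16.0)

# midpoint rule on [0, 10000]; the neglected tail beyond the cutoff
# contributes roughly 8/10000 = 0.0008
a, b, n = 0.0, 10000.0, 200000
h = (b - a) / n
estimate = h * sum(integrand(a + (i + 0.5) * h) for i in range(n))

print(abs(estimate - math.pi) < 0.01)  # True
```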

The first number is the result, the second the estimated error. Now, let’s integrate a differential equation. We take the simple Lotka-Volterra system:

dx/dt = x(1 − x) − a x y
dy/dt = −d y + a x y

which yields a single, globally stable equilibrium (x* = d/a, y* = (a − d)/a²).

import scipy
import scipy.integrate

def dX_dt(Z, t):
    # Z contains [x, y]
    x = Z[0]
    y = Z[1]
    dxdt = x * (1. - x) - a * x * y
    dydt = - d * y + a * x * y
    return scipy.array([dxdt, dydt])

# the parameters
a = 0.5
d = 0.3

# now define time:
# integrate from 0 to 100 using 1000 points
t = scipy.linspace(0, 100, 1000)

# initial conditions
x0 = 0.2
y0 = 0.7
Z0 = [x0, y0]

Z, infodict = scipy.integrate.odeint(dX_dt, Z0, t, full_output=True)
print infodict['message']
print 'Final state', Z[-1]

# repeat with different initial conditions
x0 = 1.2
y0 = 9.7
Z0 = [x0, y0]
Z, infodict = scipy.integrate.odeint(dX_dt, Z0, t, full_output=True)
print infodict['message']
print 'Final state', Z[-1]
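As a cross-check without scipy, here is a minimal forward-Euler sketch of the same system (Euler is far cruder than odeint, but since the equilibrium is globally stable it lands in the same place):

```python
# Forward Euler for dx/dt = x(1-x) - a*x*y, dy/dt = -d*y + a*x*y,
# checking convergence to x* = d/a = 0.6 and y* = (a-d)/a**2 = 0.8.
a, d = 0.5, 0.3
x, y = 0.2, 0.7
dt = 0.01
for _ in range(50000):  # integrate up to t = 500
    dxdt = x * (1.0 - x) - a * x * y
    dydt = -d * y + a * x * y
    x, y = x + dt * dxdt, y + dt * dydt

print(round(x, 2), round(y, 2))  # 0.6 0.8
```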

5.4.3 Random number distributions For simulations, you often need to draw random numbers from specific distributions. Scipy lets you choose among many distributions with discrete or continuous support.

In [18]: import scipy.stats

In [19]: scipy.stats.<TAB>
scipy.stats.arcsine      scipy.stats.lognorm
scipy.stats.bernoulli    scipy.stats.mannwhitneyu
scipy.stats.beta         scipy.stats.maxwell
scipy.stats.binom        scipy.stats.moment
scipy.stats.chi2         scipy.stats.nanstd
scipy.stats.chisqprob    scipy.stats.nbinom
scipy.stats.circvar      scipy.stats.norm
scipy.stats.expon        scipy.stats.powerlaw
scipy.stats.gompertz     scipy.stats.t
scipy.stats.kruskal      scipy.stats.uniform

In [19]: scipy.stats.norm.rvs(size = 10) # 10 samples from N(0,1) Out[19]: array([-0.951319, -1.997693, 1.518519, -0.975607, 0.8903, -0.171347, -0.964987, -0.192849, 1.303369, 0.6728])

In [20]: scipy.stats.norm.rvs(5, size = 10) # change mean to 5 Out[20]: array([ 6.079362, 4.736106, 3.127175, 5.620740, 5.98831, 6.657388, 5.899766, 5.754475, 5.353463, 3.24320])

In [21]: scipy.stats.norm.rvs(5, 100, size = 10) # change sd to 100 Out[21]: array([ -57.886247, 12.620516, 104.654729, -30.979751, 41.775710, -31.423377, -31.003134, 80.537090, 3.835486, 103.462095])

# Random integers between 0 and 9 (the high endpoint is excluded)
In [23]: scipy.stats.randint.rvs(0, 10, size = 7)
Out[23]: array([6, 6, 2, 0, 9, 8, 5])
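If scipy is not at hand, python's built-in random module can sketch the same kind of draws (scipy.stats offers many more distributions and the convenient rvs interface):

```python
import random
import statistics

random.seed(0)  # make the draws reproducible
draws = [random.gauss(5, 100) for _ in range(20000)]  # mean 5, sd 100

# the sample statistics should sit close to the chosen parameters
print(statistics.mean(draws), statistics.stdev(draws))
```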

5.5 Exercises

5.5.1 Integration

Integrate numerically:

∫_0^1 x^10 (1 − x)² dx = 10! 2! / 13! ≈ 0.0011655   (5.2)

Starting from x = 0.1 and x = 120, and for time in (0, 500), integrate:

dx/dt = 0.1 x (9 − x)   (5.3)

which should converge to 9.
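Since the exercise states the answer, here is a quick forward-Euler sketch confirming that equation (5.3) converges to 9 from both initial conditions (the step size is an arbitrary choice):

```python
# Forward Euler for dx/dt = 0.1 * x * (9 - x)
dt = 0.01
finals = []
for x in (0.1, 120.0):
    for _ in range(50000):  # t from 0 to 500
        x += dt * 0.1 * x * (9.0 - x)
    finals.append(x)

print([round(v, 4) for v in finals])  # [9.0, 9.0]
```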

5.5.2 Political blogs This exercise is quite difficult: start it soon! This “political exercise” was prepared for election day in 2012. In the directory Data/PoliticalBlogs2004 you find a file describing the hyper-links between political blogs during the election campaign of 2004. The file is in graphml format. Basically, each blog is identified by an id:

node [
  id 10
  label "ackackack.com"
  value 0
  source "BlogCatalog"
]
node [
  id 11
  label "adamtalib.blogspot.com"
  value 0
  source "Blogarama"
]

First, using regular expressions, create a dictionary mapping ids to names of blogs (e.g., 11: adamtalib.blogspot.com). Once all the blogs have been introduced, the information on the links between blogs is encoded. For example,

edge [
  source 440
  target 55
]

means that a link goes from the blog with id 440 to the blog with id 55. Use regular expressions to build a list containing all the connections. Which blogs are the most influential? Find the PageRank of each blog in the network by implementing the power method and applying it to the political blog network. A good introduction

to PageRank is here: http://nlp.stanford.edu/IR-book/pdf/21link.pdf Use the “teleport probability” α = 0.05, and implement the power method to compute the dominant eigenvector. Also, try to get the dominant eigenvector using the linalg function.
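To fix ideas, here is a sketch of the power method on a made-up 3-node graph, using the teleport probability from the text (applying it to the blog network is the exercise):

```python
# Power method for PageRank on a toy graph; alpha is the teleport probability.
def pagerank(links, n, alpha=0.05, iters=200):
    """links: dict source -> list of targets; returns a list of ranks."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [alpha / n] * n  # teleportation mass, spread uniformly
        for src in range(n):
            targets = links.get(src, [])
            if targets:  # spread rank along outgoing links
                share = (1.0 - alpha) * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in range(n):
                    new[t] += (1.0 - alpha) * rank[src] / n
        rank = new
    return rank

# toy graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
r = pagerank({0: [1], 1: [2], 2: [0, 1]}, 3)
print(r, sum(r))  # the ranks sum to 1
```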

5.6 References and readings

For IPython, check out:
http://ipython.org/ipython-doc/stable/interactive/tips.html
http://wiki.ipython.org/Cookbook

Google code has a short class on regular expressions in python:
http://code.google.com/edu/languages/google-python-class/regular-expressions.html

A beautiful quote from Jamie Zawinski: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

The website http://www.regular-expressions.info is dedicated to the subject. It contains a really good list of canned solutions. The website codinghorror.com contains an interesting discussion on regular expressions (and hot sauce).

For SciPy, the best documentation is the official one:
http://docs.scipy.org/doc/scipy/reference/

6 Scientific typesetting

(Chapter illustration: http://xkcd.com/208/)

6.1 Why LaTeX?

In your research, you will produce a number of documents: papers, reports and – most importantly – your thesis. These documents can be written using a WYSIWYG (What You See Is What You Get) editor (e.g., MS Word). However, an alternative – especially suited for scientific publications – is to use LaTeX. In LaTeX, the document is simply a text file (.tex) containing your text formatted using markups (as in an HTML document). The file is then "compiled" (like the source code of a programming language) into an output file – typically a .pdf. The main advantages of using LaTeX are the following:

• The input is a text file. Hence, it is small and very portable. LaTeX compilers are freely available for all architectures. Therefore, you will obtain exactly the same result on any computer (this is not true for Word). Also, text files are great if you're using version control!
• LaTeX produces beautifully typeset documents. For example, these notes have been written in LaTeX. The mathematical formulas look much nicer than in other systems, and it is quite easy to create highly complex mathematical expressions. In general, documents produced in LaTeX have a "professional look" that is difficult to obtain otherwise.
• LaTeX is very stable. The current version of LaTeX has basically not changed since 1994! This means that you will be able to access your text forever. In comparison, Microsoft has issued 9 major versions of Word since 1994.
• LaTeX is free.
• LaTeX is a great choice for lengthy and complex documents (like your thesis).
• Many journals provide LaTeX templates, making the formatting of your manuscripts much quicker. It is very easy to move between two completely different styles. Bibliographic styles are also available: no need for EndNote or other bibliographic software. You can export bibliographies in BibTeX (the bibliographic format of LaTeX) from Google Scholar, CiteULike, Scopus, and Web of Science.
• On-line, freely available books and manuals are available. The number of users is such that any problem you might have has already been solved, and the solution can be found with a Google search.

Of course, there are also a number of disadvantages:

• It has a steeper learning curve.
• It is quite difficult to manage revisions with multiple authors (possibly the most annoying problem) – but now there are cloud-based solutions!
• Typesetting tables is quite complex.
• Floating objects (figures, tables) are placed automatically. Sometimes it is difficult to force them where they should belong.
• It is sometimes difficult to follow precisely the instructions of publishers if they are thought out for Word (e.g., LaTeX by default adjusts the number of lines on a page to look pretty, but sometimes you need an exact number of lines per page).

6.2 Installing LaTeX

In Ubuntu, to install LaTeX along with many useful packages, type the following:

$ sudo apt-get install texlive-full texlive-fonts-recommended \
    texlive-pictures texlive-latex-extra latex-beamer \
    texpower texpower-examples gedit-latex-plugin imagemagick

For Mac users, download from the site tug.org/mactex/ a very large package containing LaTeX as well as any possible extension. There are a number of WYSIWYG frontends to LaTeX, for example LyX (www.lyx.org) and TeXmacs (www.texmacs.org). In Mac OSX, TeXShop works wonderfully, while in Windows one can install MiKTeX. In the rest of the chapter, we are going to use a text editor to compose simple .tex files.

6.3 A basic example Open an editor and type the following in the file FirstExample.tex:

\documentclass[12pt]{article}
\title{A Simple Document}
\author{Stefano Allesina}
\date{}
\begin{document}
\maketitle

\begin{abstract}
Here we type the abstract of our paper.
\end{abstract}

\section{Introduction}
And here the introduction.

\section{Materials \& Methods}
One of the most famous equations is:
\begin{equation}
E = mc^2
\end{equation}
This equation was first proposed by Einstein in 1905 \cite{einstein1905does}.

\bibliographystyle{plain}
\bibliography{FirstBiblio}
\end{document}

Now, let's get a BibTeX reference for Einstein's paper. In Google Scholar, type "does the energy of a body einstein 1905". The paper should be the one at the top. Clicking "Cite" → "Import into BibTeX" should show the following text, which you will save in the file FirstBiblio.bib.

@article{einstein1905does,
  title={Does the inertia of a body depend upon its energy-content?},
  author={Einstein, A.},
  journal={Annalen der Physik},
  volume={18},
  pages={639--641},
  year={1905}
}

Now we can create a .pdf of the article. In the terminal, go to the right directory and type:

$ pdflatex FirstExample.tex
$ pdflatex FirstExample.tex
$ bibtex FirstExample
$ pdflatex FirstExample.tex
$ pdflatex FirstExample.tex

This should produce the file FirstExample.pdf that you can open using, for example, Adobe Acrobat Reader (evince is available in Ubuntu, Preview in Mac OSX).

A Simple Document

Stefano Allesina

Abstract Here we type the abstract of our paper.

1 Introduction

And here the introduction.

2 Materials & Methods

One of the most famous equations is:

E = mc²   (1)

This equation was first proposed by Einstein in 1905 [1].

References

[1] A. Einstein. Does the inertia of a body depend upon its energy-content? Annalen der Physik, 18:639–641, 1905.


6.4 A brief tour of LaTeX

6.4.1 Spaces, new lines and special characters

Several spaces in your text editor are treated as one space in the typeset document. Several empty lines are treated as one empty line. One empty line defines a new paragraph. Some characters are "special": # $ % ^ & _ { } ~ \. To type these characters, you have to add a "backslash" in front, e.g., \$ produces $.
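For instance, this (made-up) sentence escapes several of them:

```latex
We obtained 95\% coverage at a cost of \$100 \& saved it in
\texttt{data\_file}. Braces are typed as \{ and \}; for the tilde,
caret and backslash use \textasciitilde{}, \textasciicircum{} and
\textbackslash{}.
```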

6.4.2 Document structure

Each LaTeX command starts with \. For example, to typeset LaTeX you have to enter \LaTeX. The first command is always \documentclass, defining the type of document you want to write. Examples

are article, book, report, letter. You can set several options. For example, to set the size of the text to 10 points and the letter paper size, type \documentclass[10pt,letterpaper]{article}. After having declared the type of document, you can specify the packages you want to use. The most useful are:

• \usepackage{color}: use colors for text in your document.
• \usepackage{amsmath,amssymb}: American Mathematical Society formats and commands for typesetting mathematics.
• \usepackage{fancyhdr}: fancy headers and footers.
• \usepackage{graphicx}: include figures in pdf, ps, eps, gif, png, and jpeg.
• \usepackage{listings}: typeset source code for various programming languages.
• \usepackage{rotating}: rotate tables and figures.
• \usepackage{lineno}: line numbers.

Once you have selected the packages, you can start your document with \begin{document}, and end it with \end{document}. As an example of the structure of a document, take the article template provided by the journal Proceedings of the National Academy of Sciences USA (PNAS).

\documentclass{pnastwo}
\usepackage{amssymb,amsfonts,amsmath}
%% For PNAS Only:
\contributor{Submitted to Proceedings of the National Academy of Sciences
of the United States of America}
\url{www.pnas.org/cgi/doi/10.1073/pnas.0709640104}
\copyrightyear{2008}
\issuedate{Issue Date}
\volume{Volume}
\issuenumber{Issue Number}

\begin{document}
\title{My Title}
\author{Stefano Allesina\affil{1}{University of Chicago, Chicago, IL}
\and Luca E. Allesina\affil{2}{University of Chicago Laboratory School --
Nursery 3, Chicago, IL}}
\maketitle
\begin{article}
\begin{abstract}
Here goes our wonderful abstract.
\end{abstract}
\keywords{term1 | term2 | term3}

%% Main text of the paper
\dropcap{I}n this work, we show how \LaTeX\ can be used to typeset a PNAS
paper. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus
sodales consectetur lobortis. Proin tincidunt eros dapibus ipsum faucibus
sed rhoncus augue mollis. In lectus velit, interdum at adipiscing quis,
imperdiet sed justo. Praesent commodo, mi iaculis tincidunt mollis, sapien
lectus aliquam neque, ac faucibus arcu est eu sem. Ut non lacus lacus, eu
suscipit odio. Aliquam erat volutpat. Vivamus dapibus pretium nunc, et
placerat turpis bibendum mollis. Fusce eu mi ut nulla accumsan viverra. In
nulla tellus, ultrices ut venenatis nec, laoreet eget diam. Pellentesque
aliquam facilisis ultricies. Vestibulum sollicitudin leo non neque vehicula
a volutpat eros faucibus. Vestibulum nec lorem dui.

\begin{materials}
These are the materials and methods.
\end{materials}

\begin{acknowledgments}
-- text of acknowledgments here, including grant info --
\end{acknowledgments}

\end{article}
\end{document}

6.4.3 Typesetting math There are two ways to display mathematical contents. First, one can produce inline mathematics (i.e., within the text). Second, one can produce stand-alone, numbered equations and formulae. For inline mathematics, the “dollar” sign flanks the mathematics to be typeset. For example, the code:

The integral $\int_0^1 p^x (1-p)^y dp$ corresponds to
the definition of a Beta function
$\mathcal B(x+1, y+1) = \Gamma(x+1)\Gamma(y+1)/\Gamma(x+y+2)$
which, in case of integer $x$ and $y$, assumes the
familiar form $x!y!/(x+y+1)!$.

Becomes: The integral ∫_0^1 p^x (1 − p)^y dp corresponds to the definition of a Beta function B(x + 1, y + 1) = Γ(x + 1)Γ(y + 1)/Γ(x + y + 2) which, in case of integer x and y, assumes the familiar form x!y!/(x + y + 1)!. For standalone formulas, there are different options. First, one can use the special symbols \[ and \]. For example, this code:

The most beautiful mathematical result ever produced is Euler’s identity:

\[
e^{\pi i} + 1 = 0
\]

Becomes: The most beautiful mathematical result ever produced is Euler’s identity:

eπi + 1 = 0

If we want to have numbered equations (almost always a great idea), LATEX provides the equation environment:

A tricky integral

\begin{equation} \int_0^1 \left(\ln \left( \frac{1}{x} \right) \right)^y dx = y! \end{equation}

Becomes: A tricky integral

∫_0^1 (ln(1/x))^y dx = y!   (6.1)

LATEX has a full set of mathematical symbols and operators:

You can typeset limits and summations, multiple subscripts and superscripts

\begin{equation} \lim_{x \to \infty}, \sum_{i=1}^{\infty}, x^{y^{z^{2k}}}, x_{y_{z^2}} \end{equation}

\ldots Greek letters

$\alpha, \beta, \rho, \Phi, \lambda^\Gamma, \Sigma, \Theta$

\ldots fractions

\[a = \frac{b c^2}{\frac{x}{y \frac{k}{\ln x}}}\]

\ldots square roots, arrows etc. \[ \sqrt{x},\, \sqrt[3]{x},\, \overline{abcd}\underline{xy},\, \widehat{abc}\overrightarrow{ky},\, \downarrow \uparrow \rightarrow \Leftarrow,\,

\]

\ldots summations, integrals \ldots \[\sum, \prod, \int, \iint, \iiiint, \bigcup\]

\ldots special functions \[\cos, \exp, \min, \log, \tanh\]

\ldots and operators \[\times, \pm, \neq, \leq, \supset, \in, \propto\]

\ldots and so on. Yielding: You can typeset limits and summations, multiple subscripts and superscripts

lim_{x→∞}, ∑_{i=1}^{∞}, x^{y^{z^{2k}}}, x_{y_{z²}}   (6.2)

… Greek letters

α, β, ρ, Φ, λ^Γ, Σ, Θ

… fractions

a = bc² / (x / (y · (k / ln x)))

… square roots, arrows etc.

√x, ∛x, abcd (overlined), xy (underlined), abc (with a hat), ky (with an arrow), ↓ ↑ → ⇐

… summations, integrals …

∑, ∏, ∫, ∬, ⨌, ⋃

… special functions

cos, exp, min, log, tanh

… and operators

×, ±, ≠, ≤, ⊃, ∈, ∝

… and so on.

6.4.4 Comments Anything following a percentage sign % is considered a comment. To typeset the percentage sign, use the backslash \%.
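A short sketch of both uses (the sentence itself is just placeholder text):

```latex
% This whole line is a comment and produces no output.
The response improved by 20\% in the treatment group. % note to self: verify
```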

6.4.5 Long documents If you are planning to write a long document, it makes sense to split it into semi-independent files. Build a master file containing the document type, the basic setting etc. and then for example create a LATEX file for each chapter. Finally, include (or exclude) each chapter using the \input command:

\documentclass{book}

\begin{document}

\title{\textbf{A simple book}}
\author{My Name}

\maketitle

\tableofcontents

\input{chapter1}
\input{chapter2}
\input{chapter3}
\input{chapter4}

\end{document}

6.4.6 Justification and alignment To break a line, use \\ (this will not start a new paragraph). To start a new page use \newpage. To include all floating objects (e.g., figures and tables) before starting the new page use \clearpage. To make sure the new page starts on an odd-numbered page (e.g., chapters), use \cleardoublepage. To align the text on the left, use \begin{flushleft} and \end{flushleft}. Similarly, \begin{flushright} and \end{flushright} produce right-justified text, while \begin{center} and \end{center} will center the text on the page.
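A quick sketch of these environments (the text is placeholder):

```latex
\begin{flushleft}
This text is left-aligned.
\end{flushleft}

\begin{center}
This text is centered.\\ % \\ breaks the line without starting a new paragraph
And this continues, centered, on a new line.
\end{center}

\begin{flushright}
This text is right-aligned.
\end{flushright}
```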

6.4.7 Sections and special environments

Most \LaTeX documents respond to the following:

\section{A Section}
\subsection{aaa}
\subsubsection{bbb}
\paragraph{ccc}
\subparagraph{ddd}

Additionally, \texttt{book} and \texttt{report} use

\chapter{my chapter}

To produce a table of contents, type

\tableofcontents

And to produce the title page

\maketitle

To produce a footnote, use \footnote{text in footnote}

The abstract is typically

\begin{abstract}
here's the abstract
\end{abstract}

6.4.8 Typesetting tables The basic environment to typeset tables is called tabular:

\begin{tabular}{|l|r|c|}
l & stands for & left justified \\
r & stands for & right justified \\
c & stands for & centered \\
$\vert$ & produces & vertical lines\\
\hline
\textbackslash hline & produces & horizontal lines
\end{tabular}
\vspace{24pt}

How to typeset multiple-column cells:

\vspace{0.1cm}
\begin{tabular}{|c|c|}
\hline
element one & element two \\
\hline
\multicolumn{2}{|c|}{spans multiple columns}\\
\hline
\end{tabular}
\vspace{0.1in}

And multiple-line cells:

\begin{tabular}{|c|c|p{3cm}|}
\hline
short one & short one & very very very very very long one\\
\hline
\end{tabular}
\vspace{0.5cm}

Produces:

l       stands for  left justified
r       stands for  right justified
c       stands for  centered
|       produces    vertical lines
\hline  produces    horizontal lines

How to typeset multiple-column cells:

element one | element two
 spans multiple columns

And multiple-line cells:

short one | short one | very very very very very long one

In papers and other documents, however, you will typically include tables. These are “floating” bodies. Floating means that the object cannot be broken across pages, and thus LATEX will try to place it where it looks “pretty”. Each table starts with the command \begin{table} followed by a specifier that determines its position. You can use:

Specifier   Place the floating table
h           Here. Try to place the table where specified.
t           At the Top of a page.
b           At the Bottom of a page.
p           In a separate page.
!           Try to force the positioning (e.g., !h).

Each table can contain a caption specified by \caption{my caption for the table}. For example, the following:

\begin{table}[h]
\begin{center}
\begin{tabular}{lc}
a small & table \\
\hline
1 & 2 \\
3 & 4\\
\end{tabular}
\caption{The caption of my small table.}
\end{center}
\end{table}

Produces a small table and tries to place it here. Note that LATEX automatically numbers the table according to the document specification (e.g., using the chapter number etc.).

a small  table
1        2
3        4

Table 6.1: The caption of my small table.

6.4.9 Typesetting matrices Matrices use the \begin{array} environment, which is similar to tabular:

\begin{equation}
A = \left[
\begin{array}{ccc}
\alpha&\beta&\gamma\\
a & b & c\\
\mathfrak a & \mathfrak b & \mathfrak c
\end{array}
\right]
\end{equation}

\begin{equation*}
B = \left|
\begin{array}{cc}
a^2 & bac \\
\frac{g}{f} & fa^3\sqrt{b}
\end{array}
\right|
\end{equation*}

\begin{equation}
C = \left.
\left(
\begin{array}{ccc}
\alpha&\beta&\gamma\\
a & b & c\\
\mathfrak a & \mathfrak b & \mathfrak c
\end{array}
\right)
\right|_{a = a_0}
\end{equation}

    ⎡ α  β  γ ⎤
A = ⎢ a  b  c ⎥   (6.3)
    ⎣ 𝔞  𝔟  𝔠 ⎦

    │ a²   bac     │
B = │ g/f  f a³ √b │

    ⎛ α  β  γ ⎞│
C = ⎜ a  b  c ⎟│          (6.4)
    ⎝ 𝔞  𝔟  𝔠 ⎠│ a = a₀

6.4.10 Figures Figures are also floating bodies. They can be included using the graphicx package. Depending on the installation, you might be able to include only some formats. If you followed the instructions above, you should be able to use .pdf, .ps, .eps, .jpg, .png among others. If you used a more restrictive installation, you might be allowed to include only .pdf, .ps and .eps.

\begin{center} \includegraphics[width = 0.5\linewidth]{xkcd/kerning.png} \end{center}

will import the picture into the document:

Figures also support captions and are automatically numbered:

Figure 6.1: The caption of the figure.

\begin{figure}
\begin{center}
\includegraphics[width=0.5\linewidth]{xkcd/kerning.png}
\end{center}
\caption{The caption of the figure.}
\end{figure}

Besides width, other parameters that can be set are: height, scale, angle. More options let you clip and trim the figure (see the manual). For example:

\begin{center} \includegraphics[scale=0.4, angle=-45]{xkcd/kerning.png} \end{center}

6.4.11 Itemized and numbered lists The following code:

\begin{itemize}
\item First item.
\item Second item.
\item[$\star$] Third item.
\item[a] Fourth item.
\end{itemize}

\begin{enumerate}
\item First.
\item Second.
\item Third.
\end{enumerate}

\begin{description}
\item[First] A short description of First.
\item[Second] A short description of Second.
\item[Third] A short description of Third.
\end{description}

Produces:

• First item.
• Second item.
★ Third item.
a Fourth item.

1. First.
2. Second.
3. Third.

First A short description of First.
Second A short description of Second.
Third A short description of Third.

6.4.12 Typefaces The following code:

\begin{itemize}
\item\texttt{Typewriter like} also {\tt (but obsolete)}.
\item\textbf{Bold face} also {\bf (but obsolete)}.
\item\textit{Italics} also {\it (but obsolete)}.
\item\textsc{Small Caps} also {\sc (but obsolete)}.
\item\tiny tiny.
\item\footnotesize footnotesize.
\item\small small.
\item\large large.
\item\Large Large.
\item\normalsize normalsize.
\end{itemize}

Produces:

• Typewriter like also (but obsolete).
• Bold face also (but obsolete).
• Italics also (but obsolete).
• SMALL CAPS also (BUT OBSOLETE).
• tiny.
• footnotesize.
• small.
• large.
• Large.
• normalsize.

Note that the size of the text (i.e., what is considered the normalsize) is governed by the initial choice in documentclass. The other sizes are changed proportionally.

6.4.13 Bibliography There are many ways to include references in your document. The best option is to use BibTeX to manage bibliographies. This requires three steps: • Build a flat-file database containing your bibliography. The file is typically generated automatically from Web of Science, Scopus, Mendeley, Papers, EndNote, etc. • Choose a BibTeX style for your references. Files specifying the citation style are available for most journals. • Cite the references in your text. Because this is easier done than said, we're going to explore the features using the templates provided by PLoS and Elsevier.
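As a minimal sketch of the three steps (the file names, entry key, and style name are all hypothetical; the entry details are sketched from the paper cited later in these notes):

```latex
% refs.bib -- the flat-file database; one entry shown
% @article{allesina2011,
%   author  = {Allesina, Stefano},
%   title   = {Measuring Nepotism through Shared Last Names},
%   journal = {PLoS ONE},
%   year    = {2011}
% }

% In the main .tex file:
As shown previously \cite{allesina2011}, last names carry a signal.
\bibliographystyle{plain}  % or a journal-specific .bst file
\bibliography{refs}        % points to refs.bib
```

Compiling then takes the sequence latex, bibtex, latex, latex to resolve all citations.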

6.5 Exercises 6.5.1 Equations Open a new LATEX document, and typeset the following equations (make sure to include the packages amsmath and amssymb):

a = σ √(S C (1 − E²(|X|)/σ²))

∑_{j=1}^{S} M_{ij} ≈ −d + (S−1) E(M_{ij})_{i≠j} = −d + (S−1) · C · E(|X|)

L(A|q,G) = ∏_{i=1}^{S} ∏_{j=1}^{S} p_{i,j}^{A_{i,j}} (1−p_{i,j})^{1−A_{i,j}} = ∏_{k=1}^{γ} ∏_{l=1}^{γ} q_{k,l}^{L_{k,l}} (1−q_{k,l})^{Z_{k,l}}   (6.5)

P(M₁|A) = P(A|M₁) P(M₁) / P(A)   (6.6)

P(A|Mᵢ) = Mᵢ = ∫ P(A|Mᵢ, Θᵢ) P(Θᵢ|Mᵢ) dΘᵢ   (6.7)

M_{Group|G} = ∫₀¹ ∫₀¹ ⋯ ∫₀¹ ∏ᵢ^γ ∏ⱼ^γ p_{i,j}^{L_{i,j}} (1−p_{i,j})^{Z_{i,j}} dp_{i,j}

6.5.2 CV Build your CV in LaTeX using the moderncv style provided in the repository.

6.5.3 Reformatting a paper In the folder NestednessEigenvalues you find a manuscript of Staniczenko et al. formatted for Nature Communications. Reformat it in the style of PLoS Computational Biology. (Optional: download the style files and format it for PNAS).

6.6 References and readings 6.6.1 Great online resources

The Visual LATEX FAQ: sometimes it is difficult to describe what you want to do!
http://get-software.net/info/visualFAQ/visualFAQ.pdf

Drawing formulas by hand and converting them into LATEX:
http://webdemo.visionobjects.com

Online equation editors:
http://www.sciweavers.org/free-online-latex-equation-editor
http://rogercortesi.com/eqn/

From (semi-)natural language to LATEX:
http://www.math.missouri.edu/~stephen/naturalmath/

University of Chicago Dissertation Template:
https://wiki.uchicago.edu/display/DissertationTemplate

Other dissertation templates:
http://latexforhumans.wordpress.com/

Cloud-based collaboration using LATEX:
https://www.writelatex.com/
https://www.sharelatex.com

6.6.2 Tutorials & Essays

Minimal introduction to LATEX:
orion.math.iastate.edu/burkardt/latex_intro/latex_intro.html

A (Not So) Short Introduction to LATEX:
www.ctan.org/tex-archive/info/lshort/english/

The Beauty of LATEX:
http://nitens.org/taraborelli/latex

Word vs. LATEX:
http://www.streamtoolsonline.com/word-v-latex.html
http://3monththesis.com/writing-a-thesis-word-or-latex/
http://www.andy-roberts.net/writing/latex/benefits

Beautiful presentations in LATEX:
www.stat.berkeley.edu/~luis/seminar/BeamerPres110128_Miki.pdf

Bibliographies for biological journals:
www.lecb.ncifcrf.gov/~toms/latex.html

http://www.pinteric.com/miktex.html

7 — Statistical computing

7.1 What is R? R is a freely available statistical software package with strong programming capabilities. R is becoming very popular in ecology and other branches of biology. This is due to several factors: i) many packages are available to perform all sorts of statistical and mathematical analyses; ii) it can produce beautiful graphics; iii) it has very good support for matrix algebra. Although R can be programmed, the programming environment is not the greatest: in particular, the way types are managed is problematic (e.g., often your matrix will silently become a vector if it has only one column/row...). Nevertheless, being able to program in R means you can automate the statistical analysis and the generation of figures.
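A minimal sketch of the type gotcha just mentioned: extracting a single row or column from a matrix silently returns a vector, unless you explicitly ask R to keep the dimensions.

```r
m <- matrix(1:6, 2, 3)       # a 2x3 matrix
v <- m[, 1]                  # one column: result is demoted to a vector
is.matrix(v)                 # FALSE
m2 <- m[, 1, drop = FALSE]   # drop = FALSE preserves the matrix type
is.matrix(m2)                # TRUE
```

The `drop = FALSE` argument is worth remembering whenever your code must work for matrices with an arbitrary number of columns.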

7.2 Installing R Ubuntu:

$ sudo apt-get install r-base r-base-dev

Mac OS X: download from

http://cran.r-project.org/bin/macosx/

7.3 Useful commands

ls()        list all the variables in the workspace
rm()        remove variable(s), e.g., rm(list=ls())
getwd()     get the current working directory
setwd()     set the working directory
q()         quit R
?Command    show the documentation of Command
??Keyword   fuzzy search all packages/functions for Keyword

7.4 Basic operations We can use R in interactive mode, as a calculator. To launch R, simply type R in a terminal.

> a <- 4 ## assignment
> a
[1] 4
> a*a ## product
[1] 16
> a_squared <- a*a
> sqrt(a_squared) ## square root
[1] 4
> v <- c(0, 1, 2, 3, 4) ## c: concatenate
> v
[1] 0 1 2 3 4
> is.vector(v) ## check if it's a vector
[1] TRUE
> mean(v) ## mean
[1] 2
> var(v) ## variance
[1] 2.5
> median(v) ## median
[1] 2
> sum(v) ## sum all elements
[1] 10
> prod(v + 1) ## multiply
[1] 120
> length(v) ## length of vector
[1] 5
> v[3] ## access one element
[1] 2
> v[1:3] ## access sequential elements
[1] 0 1 2
> v[-3] ## remove elements
[1] 0 1 3 4
> v[c(1, 4)] ## access non-sequential
[1] 0 3
> v2 = v
> v2 = v2*2 ## whole-vector operation
> v2
[1] 0 2 4 6 8
> v * v2 ## product element-wise
[1] 0 2 8 18 32

> t(v) ## transpose the vector
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    2    3    4
> v %*% t(v) ## matrix/vector product
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    1    2    3    4
[3,]    0    2    4    6    8
[4,]    0    3    6    9   12
[5,]    0    4    8   12   16
> v3 <- 1:7 ## assign using sequence
> v3
[1] 1 2 3 4 5 6 7
> v4 <- c(v2, v3) ## concatenate vectors
> v4
[1] 0 2 4 6 8 1 2 3 4 5 6 7
> q() ## quit

7.5 Operators The usual operators are available (with slight differences from python):

+    Addition
-    Subtraction
*    Multiplication
/    Division
^    Power
%%   Modulo
%/%  Integer division
==   Equals
!=   Differs
>    Greater
>=   Greater or equal
&    Logical and
|    Logical or
!    Logical not
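The operators that most often trip up python users, tried at the prompt (a quick sketch):

```r
7 %% 3               # modulo: remainder of 7/3, i.e. 1
7 %/% 3              # integer division: 2
2 ^ 10               # power uses ^ (python uses **): 1024
(3 > 2) & (1 == 1)   # logical and: TRUE
!TRUE                # logical not: FALSE
```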

7.6 Data types Like python, R comes with interesting data types. Mastering these will help you write better and more compact programs.

Vectors

> a <- 5
> is.vector(a)
[1] TRUE
> v1 <- c(0.02, 0.5, 1)
> v2 <- c("a", "bc", "def", "ghij")
> v3 <- c(TRUE, TRUE, FALSE)

Matrices and arrays

R has great functions to manipulate matrices (two-dimensional vectors) and arrays (multi-dimensional vectors).

> m1 <- matrix(1:25, 5, 5)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> m1 <- matrix(1:25, 5, 5, byrow=TRUE)
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25
> dim(m1)
[1] 5 5
> m1[1,2]
[1] 2
> m1[1,2:4]
[1] 2 3 4
> m1[1:2,2:4]
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    7    8    9
> arr1 <- array(1:50, c(5, 5, 2))
> arr1
, , 1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

, , 2

     [,1] [,2] [,3] [,4] [,5]
[1,]   26   31   36   41   46
[2,]   27   32   37   42   47
[3,]   28   33   38   43   48
[4,]   29   34   39   44   49
[5,]   30   35   40   45   50

Data frames

This is a very important data type that is peculiar to R. It is great for storing your data. Basically, it's a two-dimensional table in which each column can contain a different data type (e.g., numbers, strings, booleans). You can think of a dataframe as a spreadsheet. Data frames are great for plotting your data, performing regressions and such.

> Col1 <- 1:10
> Col1
 [1]  1  2  3  4  5  6  7  8  9 10
> Col2 <- LETTERS[1:10]
> Col2
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
> Col3 <- runif(10) ## 10 random num. from uniform
> Col3
 [1] 0.29109 0.91495 0.64962 0.95503 0.26589 0.02482 0.59718
 [8] 0.99134 0.98786 0.86168
> MyDF <- data.frame(Col1, Col2, Col3)
> MyDF
   Col1 Col2      Col3
1     1    A 0.2910981
2     2    B 0.9149558
3     3    C 0.6496248
4     4    D 0.9550331
5     5    E 0.2658936
6     6    F 0.0248217
7     7    G 0.5971868
8     8    H 0.9913407
9     9    I 0.9878679
10   10    J 0.8616854
> names(MyDF) <- c("A.name", "another", "another.one")
> MyDF
   A.name another another.one
1       1       A   0.2910981
2       2       B   0.9149558
3       3       C   0.6496248
4       4       D   0.9550331
5       5       E   0.2658936
6       6       F   0.0248217
7       7       G   0.5971868
8       8       H   0.9913407
9       9       I   0.9878679
10     10       J   0.8616854
> MyDF$A.name
 [1]  1  2  3  4  5  6  7  8  9 10
> MyDF[,1]
 [1]  1  2  3  4  5  6  7  8  9 10
> MyDF[c("A.name","another")]
   A.name another
1       1       A
2       2       B
3       3       C
4       4       D
5       5       E
6       6       F
7       7       G
8       8       H
9       9       I
10     10       J
> class(MyDF)
[1] "data.frame"
> str(MyDF)
'data.frame': 10 obs. of 3 variables:
 $ A.name     : int 1 2 3 4 5 6 7 8 9 10
 $ another    : Factor w/ 10 levels "A","B","C","D",..
 $ another.one: num 0.291 0.915 0.65 0.955 0.266 ...

Lists

A list is simply an ordered collection of objects (that can be other variables). You can build lists of lists too.

> l1 <- list(names=c("Fred","Bob"), ages=c(42, 77, 13, 91))
> l1
$names
[1] "Fred" "Bob"

$ages
[1] 42 77 13 91

> l1[[1]]
[1] "Fred" "Bob"
> l1[[2]]
[1] 42 77 13 91
> l1[["ages"]]
[1] 42 77 13 91
> l1$ages
[1] 42 77 13 91

7.7 Importing and Exporting Data R provides several functions for importing data. By far, the best option is to have your data in a comma-separated text file or in a tab-separated file. Then, you can use the function read.csv (or read.table) to import your data.

> read.csv("MyFile.csv")
> read.csv("MyFile.csv", header = TRUE) # file has header
> read.csv("MyFile.csv", sep = ';') # separated by ;
> read.csv("MyFile.csv", skip = 5) # skip the first 5 lines

Similarly, you can save your data frames using write.table or write.csv. Suppose you want to save the dataframe MyDF:

> write.csv(MyDF, "MyFile.csv")
> write.table(MyDF, "MyFile.csv", append=TRUE) # append at the end
> write.csv(MyDF, "MyFile.csv", row.names=TRUE) # write row names
> write.table(MyDF, "MyFile.csv", col.names=FALSE) # do not write col names

7.8 Useful Functions

Mathematical functions
log(x)       Natural logarithm
log10(x)     Logarithm in base 10
exp(x)       e^x
abs(x)       Absolute value
floor(x)     Largest integer ≤ x
ceiling(x)   Smallest integer ≥ x
pi           π
sqrt(x)      √x
sin(x)       Sine function

String functions
strsplit(x,';')        Split the string according to ';'
nchar(x)               Number of characters
toupper(x)             Set to upper case
tolower(x)             Set to lower case
paste(x1,x2,sep=';')   Join the strings inserting ';'

Statistical functions
mean(x)            Compute mean (of a vector or matrix)
sd(x)              Standard deviation
var(x)             Variance
median(x)          Median
quantile(x,0.05)   Compute the 0.05 quantile
range(x)           Range of the data
min(x)             Minimum
max(x)             Maximum
sum(x)             Sum all elements

Random number distributions
rnorm(10, m=0, sd=1)      Draw 10 random numbers from a Gaussian distribution with mean 0 and s.d. 1
dnorm(x, m=0, sd=1)       Density function
pnorm(x, m=0, sd=1)       Cumulative distribution function
qnorm(p, m=0, sd=1)       Quantile function
runif(20, min=0, max=2)   Twenty random numbers from uniform [0,2]
rpois(20, lambda=10)      Twenty random numbers from Poisson(λ)

7.9 Control Functions

Also in R, you can write “if, then, else” statements, “for” and “while” loops. However, loops are painfully slow in R, so use them sparingly.

## If statement
if (a == TRUE){
    print("a is TRUE")
} else {
    print("a is FALSE")
}

## On the same line
if (z <= 0.25) print("Less than a quarter")

## Basic for loop using a sequence
for (i in 1:100){
    j <- i*i
    print(paste(i, "squared is", j))
}

## for loop using a vector
v1 <- c("a", "bc", "def")
for (element in v1){
    print(element)
}

## Basic while loop
i <- 0
while (i < 100){
    print(i^2)
    i <- i + 1  ## without this the loop would never end
}

7.10 Type Conversion and Special Values Beware of the difference between NA (Not Available) and NaN (Not a Number). Also, a quite annoying feature: there are different functions to check whether a number is finite, infinite, or not a number. These can be confusing, so double-check your code!

> as.integer(3.1)
[1] 3
> as.real(4)
[1] 4
> as.roman(155)
[1] CLV
> as.character(155)
[1] "155"
> as.logical(5)
[1] TRUE
> as.logical(0)
[1] FALSE
> b <- NA
> is.na(b)
[1] TRUE
> b <- 0./0.
> b
[1] NaN
> is.nan(b)
[1] TRUE
> b <- 5/0
> b
[1] Inf
> is.nan(b)
[1] FALSE
> is.infinite(b)
[1] TRUE
> is.finite(b)
[1] FALSE
> is.finite(0/0)
[1] FALSE

7.11 Subsetting Data Often, you will need to get only part of a matrix, dataframe, etc. Here are ways of subsetting your data:

> M <- matrix(1:9, 3, 3)
> M
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> M[1,1]
[1] 1
> M[1, ]
[1] 1 4 7
> M[1:2, ]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
> M[c(1, 3), 2]
[1] 4 6
> M[-3,]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
> M[-3,-2]
     [,1] [,2]
[1,]    1    7
[2,]    2    8
> attach(iris)
> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
> iris$Sepal.Width[1:5]
[1] 3.5 3.0 3.2 3.1 3.6
> iris[1:3, c("Sepal.Width","Petal.Width")]
  Sepal.Width Petal.Width
1         3.5         0.2
2         3.0         0.2
3         3.2         0.2
> iris[(Sepal.Width < 3.5) & (Sepal.Length > 7.2),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
106          7.6         3.0          6.6         2.1 virginica
108          7.3         2.9          6.3         1.8 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
131          7.4         2.8          6.1         1.9 virginica
136          7.7         3.0          6.1         2.3 virginica

7.12 Basic Statistics in R R contains several functions to perform basic (and advanced) statistics. I am not going to cover any statistics here, but rather list some basic commands to show you that statistical analysis in R is easy to perform.

> attach(iris)
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
> table(Species)
Species
    setosa versicolor  virginica
        50         50         50

> table(Species, Petal.Width)
            Petal.Width
Species      0.1 0.2 0.3 0.4 0.5 0.6   1 1.1 1.2 1.3 1.4 ...
  setosa       5  29   7   7   1   1   0   0   0   0   0 ...
  versicolor   0   0   0   0   0   0   7   3   5  13   7 ...
  virginica    0   0   0   0   0   0   0   0   0   0   1 ...
            Petal.Width
Species      2.1 2.2 2.3 2.4 2.5
  setosa       0   0   0   0   0
  versicolor   0   0   0   0   0
  virginica    6   3   8   3   3
> t.test(Sepal.Width[which(Species == "setosa")],
         Sepal.Width[which(Species == "versicolor")])

Welch Two Sample t-test

data:  Sepal.Width[which(Species == "setosa")] and Sepal.Width[which(Species == "versicolor")]
t = 9.455, df = 94.698, p-value = 2.484e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5198348 0.7961652
sample estimates:
mean of x mean of y
    3.428     2.770
> summary(lm(Sepal.Width ~ Sepal.Length))

Call:
lm(formula = Sepal.Width ~ Sepal.Length)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1095 -0.2454 -0.0167  0.2763  1.3338

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.41895    0.25356   13.48   <2e-16 ***
Sepal.Length -0.06188    0.04297   -1.44    0.152
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382,  Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

7.13 Basic Graphs in R

R can produce beautiful graphics. The syntax, however, is not great, and sometimes producing exactly the graph you have in mind can be very laborious. We will see in the next Chapter how to use ggplot2, which makes the syntax more consistent and allows the rapid creation of publication-grade graphics. However, in many cases you just want to quickly plot the data for exploratory analysis and such. Here are the basic plotting commands:

hist(mydata)     Histogram.
barplot(mydata)  Bar plot.
plot(y ~ x)      Scatterplot.
points(y1 ~ x1)  Add another series of points.
boxplot(y ~ x)   Boxplot.
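For example, a quick exploratory sketch using the built-in iris data (any data frame with numeric columns would do):

```r
data(iris)
hist(iris$Sepal.Length)                        # distribution of one variable
plot(Sepal.Width ~ Sepal.Length, data = iris)  # scatterplot of two columns
boxplot(Sepal.Length ~ Species, data = iris)   # one box per species
```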

7.14 Writing Your Functions

In R you can write your own functions. The syntax is quite simple, with each function accepting arguments and returning a value.

MyFunction <- function(Arg1, Arg2){
    ## statements
    return (ReturnValue)
}

## Computes the Position-th Fibonacci
## number: 1 1 2 3 5 8 13 ...
Fibonacci <- function(Position){
    if (Position < 3){
        return (1)
    }
    i <- 1
    j <- 1
    s <- 0
    for (k in (Position - 2):1){
        s <- i + j
        i <- j
        j <- s
    }
    return (j)
}

## Computes the Factorial of a number
## using recursion
Factorial <- function(N){
    if (N == 1){
        return (N)
    }
    return(N * Factorial(N - 1))
}

## Computes the Coefficient of Variation
CoefVar <- function(vec){
    return(sd(vec) / mean(vec))
}

7.15 Packages

The main strength of R is that users can easily build packages and share them through the website CRAN. There are packages for most statistical and mathematical analyses you might conceive, so check there before re-inventing the wheel! Visit cran.r-project.org and go to Packages to see a list and brief descriptions. To install a package, within R type install.packages() and choose the package to install.
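For example, to install and load ggplot2 (used in the next Chapter; the package name is just an example):

```r
install.packages("ggplot2")  # downloads from CRAN; may prompt for a mirror
library(ggplot2)             # load the installed package for this session
```

You only install a package once, but you must load it with library() in every session that uses it.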

7.16 Vectorize it! R is very slow at running cycles (for and while loops). This is because R is an interpreted language: at execution time R does not know what you are going to perform until it "reads" the code to perform it. Compiled languages such as C know exactly what the flow of the program is, as the code is compiled before execution. As a metaphor, C is a musician playing a score she has seen before, optimizing each passage, while R is playing it "a prima vista" (i.e., at first sight). Hence, in R you should try to avoid cycles as you would avoid bad diseases. In practical terms, sometimes it is much easier to throw in a for loop, and then optimize the code to avoid the loop if the running time is not satisfactory. R has several functions that can operate on entire vectors and matrices. For example, consider this code for summing all elements of a matrix:

M <- matrix(runif(1000000), 1000, 1000)

SumAllElements <- function(M){
    Dimensions <- dim(M)
    Tot <- 0
    for (i in 1:Dimensions[1]){
        for (j in 1:Dimensions[2]){
            Tot <- Tot + M[i,j]
        }
    }
    return (Tot)
}

## This on my computer takes about 1 sec
print(system.time(SumAllElements(M)))
## While this takes about 0.01 sec
print(system.time(sum(M)))

Both approaches are correct, and will give you the right answer. However, one is 100 times faster than the other! Fortunately, R offers several ways of avoiding loops. Here are the main ones:

## apply:
## applying the same function to rows/columns of a matrix

## Build a random matrix
M <- matrix(rnorm(100), 10, 10)
## Take the mean of each row
RowMeans <- apply(M, 1, mean)
print(RowMeans)
## Now the variance
RowVars <- apply(M, 1, var)
print(RowVars)
## By column
ColMeans <- apply(M, 2, mean)
print(ColMeans)

## You can use it to apply your own functions
## (What does this function do?)
SomeOperation <- function(v){
    if (sum(v) > 0){
        return (v * 100)
    }
    return (v)
}
print(apply(M, 1, SomeOperation))

## by:
## apply the function to a dataframe, using some factor
## to define the subsets

## import some data
attach(iris)
print(iris)
## use colMeans (as it is better for dataframes)
by(iris[,1:2], iris$Species, colMeans)
by(iris[,1:2], iris$Petal.Width, colMeans)

## There are many other methods: lapply, sapply, eapply, etc.
## Each is best for a given data type (lapply -> lists)

## replicate:
## this is quite useful to avoid a loop for functions that typically
## involve random number generation
print(replicate(10, runif(5)))

7.17 Launching R in a Terminal Most times, you want to run the final analysis without opening R in interactive mode. You can do so by typing

R CMD BATCH MyFile.R

This will create an .Rout file containing all the output.

7.18 Analyzing Nepotism in Italian Academia To see how easy it is to perform analysis in R, we are going to replicate the results of my recent paper on nepotism in Italian academia (Allesina, PLoS One, 2011). I chose this example because it requires a large dataset, but the analysis is simple and can be used to illustrate many R commands.

Cases of nepotism have been documented in Italian universities: professors were found to engage in illegal practices to have relatives hired as academics in their institution or within the same discipline at other universities. Because the prevalence of nepotism in Italian academia is largely unknown, I recently proposed a statistical method to estimate rates of nepotism based on the distribution of last names of Italian academics. My results confirmed the findings obtained with more complex techniques, namely that a) some disciplines are more likely to display nepotistic tendencies, and b) a latitudinal pattern exists, showing a higher probability of nepotism in the south of the country and in Sicily. The method is simple and requires minimal data: only a list of the names of all professors in Italy, and their corresponding discipline. For a given discipline, count the number of academics N and the number of unique last names L. Then draw a random sample from the list of all Italian professors, taking N professors without repetition. Finally, compute the probability p that a number of unique last names L′ ≤ L is found in a sample. If the probability is very low, i.e., it is difficult to find a sample with a number of last names smaller than that empirically observed for the discipline, then the scarcity of last names in the discipline cannot be explained by unbiased (i.e., random) hiring processes.

7.18.1 Exploring the Data

## Read the csv file
MyData <- read.csv("../Data/ITA-Names.csv")

## Show the first few records
MyData[1:5,]

## Number of records
dim(MyData)[1]

## Institutions
sort(unique(MyData[,"Institution"]))
length(unique(MyData[,"Institution"]))

## Last Names
sort(unique(MyData[,"Last"]))
length(unique(MyData[,"Last"]))

## Most Common Last Names
TableLast <- table(MyData[,"Last"])
sort(TableLast, decreasing = TRUE)[1:15]

## Disciplines
sort(unique(MyData[,"Discipline"]))
length(unique(MyData[,"Discipline"]))
sort(table(MyData[,"Discipline"]))

## Males and Females
sort(table(MyData[,"Gender"]))

7.18.2 Core Functions

## Sample Size names from NamesSet and
## return the number of unique names
## found in the sample
GetOneSample <- function(NamesSet, Size){
    return(length(unique(sample(NamesSet, Size))))
}

## Takes a set of names in a discipline,
## and returns:
## 1) Number of professors in the discipline
## 2) Number of unique last names
## 3) Expected number of last names
## 4) p-value
## Input:
## AllNames - Set of all the names
## DiscNames - Names in the discipline
## NReps - Number of replicates for
##         computing the p-value
GetResultsNames <- function(AllNames, DiscNames, NReps){
    NProfs <- length(DiscNames)
    NNames <- length(unique(DiscNames))
    Expected <- 0
    p.value <- 0.0
    ## Using for loop!
    for (i in 1:NReps){
        ## Get one sample
        MyRes <- GetOneSample(AllNames, NProfs)
        ## Add to the expectation
        Expected <- Expected + MyRes
        ## Add to p-value
        if (MyRes <= NNames){
            p.value <- p.value + 1.0
        }
    }
    ## Normalize
    Expected <- Expected / NReps
    p.value <- p.value / NReps
    return(c(NProfs, NNames, Expected, p.value))
}

## Test
MyData <- read.csv("../Data/ITA-Names.csv")
## Get only Medicine (MED)
MyDiscNames <- MyData[MyData[,"Discipline"] == "MED", "Last"]
## Compute stats
print(system.time(z <- GetResultsNames(MyData[,"Last"], MyDiscNames, 100)))
print(system.time(z <- GetResultsNames(MyData[,"Last"], MyDiscNames, 500)))
print(system.time(z <- GetResultsNames(MyData[,"Last"], MyDiscNames, 1000)))

But wait, we’re using a for loop! Is it better to use replicate in this case?

## Sample Size names from NamesSet and
## return the number of unique names
## found in the sample
GetOneSample <- function(NamesSet, Size){
  return(length(unique(sample(NamesSet, Size))))
}

## Takes a set of names in a discipline,
## and returns:
## 1) Number of professors in the discipline
## 2) Number of unique last names
## 3) Expected number of last names
## 4) p-value
## Input:
## AllNames - Set of all the names
## DiscNames - Names in the discipline
## NReps - Number of replicates for
##         computing the p-value
GetResultsNames2 <- function(AllNames, DiscNames, NReps){
  NProfs <- length(DiscNames)
  NNames <- length(unique(DiscNames))
  ## Using replicate
  MyRes <- replicate(NReps, GetOneSample(AllNames, NProfs))
  ## Normalize
  Expected <- sum(MyRes) / NReps
  p.value <- sum(MyRes <= NNames) / NReps
  return(c(NProfs, NNames, Expected, p.value))
}

## Test

MyData <- read.csv("../Data/ITA-Names.csv")
## Get only Medicine (MED)
MyDiscNames <- MyData[MyData[,"Discipline"] == "MED", "Last"]
## Compute stats
print(system.time(z <- GetResultsNames2(MyData[,"Last"], MyDiscNames, 100)))
print(system.time(z <- GetResultsNames2(MyData[,"Last"], MyDiscNames, 500)))
print(system.time(z <- GetResultsNames2(MyData[,"Last"], MyDiscNames, 1000)))

Using replicate is no better here: R has to reserve a large chunk of memory when the number of replicates is high, and most of the time is spent in the sample step anyway. We can thus keep the for loop and write the complete program:

## Sample Size names from NamesSet and
## return the number of unique names
## found in the sample
GetOneSample <- function(NamesSet, Size){
  return(length(unique(sample(NamesSet, Size))))
}

## Takes a set of names in a discipline,
## and returns:
## 1) Number of professors in the discipline
## 2) Number of unique last names
## 3) Expected number of last names
## 4) p-value
## Input:
## AllNames - Set of all the names
## DiscNames - Names in the discipline
## NReps - Number of replicates for
##         computing the p-value
GetResultsNames <- function(AllNames, DiscNames, NReps){
  NProfs <- length(DiscNames)
  NNames <- length(unique(DiscNames))
  Expected <- 0
  p.value <- 0.0
  ## Using a for loop!
  for (i in 1:NReps){
    ## Get one sample
    MyRes <- GetOneSample(AllNames, NProfs)
    ## Add to the expectation
    Expected <- Expected + MyRes
    ## Add to the p-value

    if (MyRes <= NNames){
      p.value <- p.value + 1.0
    }
  }
  ## Normalize
  Expected <- Expected / NReps
  p.value <- p.value / NReps
  return(c(NProfs, NNames, Expected, p.value))
}

NReps <- 2500
## Get the data
MyData <- read.csv("../Data/ITA-Names.csv")

## Perform the analysis
AllDisc <- sort(unique(MyData[,"Discipline"]))
NDisc <- length(AllDisc)
Results <- matrix(0, 0, 4) ## matrix with 0 rows and 4 columns
colnames(Results) <- c("People", "Last Names", "Expected", "p.value")
AllNames <- MyData[,"Last"]
for (DD in AllDisc){
  DiscNames <- MyData[MyData[,"Discipline"] == DD, "Last"]
  z <- GetResultsNames(AllNames, DiscNames, NReps)
  print(c(DD, z))
  ## attach to the results
  Results <- rbind(Results, z)
}
## set row names
rownames(Results) <- AllDisc

## Save the results in a csv file
write.table(Results, "Results-LastNames.csv")

7.18.3 Formatting and Multiple Hypothesis Testing
When we perform Monte Carlo simulations, we only approximate the true p-value. Clearly, the more simulations we run, the better the approximation. Typically, the number of significant digits scales with the logarithm of the number of simulations: with 10,000 simulations we get two significant digits, for three digits we should perform a million simulations, and so on. We want to report the results with the right number of digits, writing < 0.01 whenever our approximation is 0 and > 0.99 whenever it is exactly 1. Moreover, we are performing multiple tests on the same data, and should correct for this using Bonferroni or similar corrections. Instead of doing that, we compute a q-value, expressing the probability that a result is a false positive. To do so, we install the package qvalue.
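As an aside, the classical corrections mentioned above are available in base R through p.adjust. A minimal sketch (the p-values are invented for illustration), useful for comparison with the q-value approach used below:

```r
## Classical multiple-testing corrections in base R.
## The p-values here are made up for illustration.
pvals <- c(0.001, 0.01, 0.04, 0.20)

## Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(pvals, method = "bonferroni")   ## 0.004 0.04 0.16 0.80

## Benjamini-Hochberg: controls the false discovery rate,
## closer in spirit to the q-value
p.adjust(pvals, method = "BH")
```

With four tests, Bonferroni simply multiplies each p-value by four; BH is less conservative, and the q-values computed by the qvalue package follow the same false-discovery-rate logic.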

> require(qvalue)

> qvalue(runif(10), pi0.method = "bootstrap")
$call
qvalue(p = runif(10), pi0.method = "bootstrap")

$pi0
[1] 0.75

$qvalues
 [1] 0.7003129 0.70031 0.70031 0.70422 0.329 0.15700 0.15700
 [8] 0.7003129 0.3291876 0.7003129

$pvalues
 [1] 0.77749086 0.819268 0.672216 0.938960 0.175566 0.041868
 [7] 0.02756145 0.84037544 0.15549211 0.58071018

$lambda
 [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45
[11] 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90

attr(,"class")
[1] "qvalue"

We want to add "**" to the results with q-value < 0.05 and p-value < 0.05. That is, the highly significant results have low probability of being obtained at random (low p-value) and low probability of being false positives (low q-value). Finally, we want to sort the results by p-value. We can accomplish all this with the following code:

require(qvalue)

## Round to two significant digits
## Add ** if highly significant
GetSignif <- function(pvalues){
  ## compute q-values
  z <- qvalue(pvalues, pi0.method = "bootstrap")
  ## Highly significant: p-value <= 0.05 and q-value <= 0.05
  HighlySig <- (z$qvalues <= 0.05) & (z$pvalues <= 0.05)
  ## Round to two digits
  newpv <- round(pvalues, 2)
  NumPV <- length(newpv)
  ## Transform to strings
  MyStrings <- format(newpv, digits = 2, scientific = FALSE)
  ## Set small and large values
  MyStrings[newpv < 0.01] <- "<0.01"
  MyStrings[newpv > 0.99] <- ">0.99"
  ## Add the significance marks

  for (i in 1:NumPV){
    if (HighlySig[i]){
      MyStrings[i] <- paste(MyStrings[i], "**", sep = "")
    }
  }
  return(MyStrings)
}

MyTab <- read.table("Results-LastNames.csv", header = TRUE)
## Sort by p-value
MyTab <- MyTab[sort(MyTab[,"p.value"], index.return = TRUE)$ix, ]
## Format to two significant digits
MyTab[, 4] <- GetSignif(MyTab[, 4])
## Write the table
write.table(MyTab, "FormattedTable.csv")

7.19 Homework
• Repeat the analysis using first names instead of last names. Are there disciplines with fewer first names than expected? Why would that be?
• Write a program that computes the fraction of women in each discipline: do disciplines with few women also have too few last names?
• Write a program that analyzes the last names using only the male professors. Does this account for the results?
• Write a python program that analyzes the last names. Is the program faster? (You can test the time it takes using time [command to execute]. On my computer, R takes about 19 minutes to run the analysis of last names using 5000 replicates; the same is accomplished in python in 3.5 minutes.)

7.20 References
There are many good references for R.
• Springer publishes the Use R! series (the yellow books), which is really good. I would suggest “A Beginner’s Guide to R”, “R by Example”, “Numerical Ecology With R”, “ggplot2” (we’ll see this in another chapter), “A Primer of Ecology with R”, “Nonlinear Regression with R”, “Analysis of Phylogenetics and Evolution with R”. They are all excellent.
• Ben Bolker recently published “Ecological Models and Data in R”.
• For more focus on dynamical models, Soetaert & Herman (2009) published “A practical guide to ecological modelling: using R as a simulation platform”. I reviewed this book for Ecology (“Learning R the Practical Way”, Ecology 90: 2335-2336).
• There are excellent websites: besides cran (containing all sorts of guides and manuals), you should check out www.statmethods.net, gallery.r-enthusiasts.com and rwiki.sciviews.org.

8 — Drawing figures

8.1 Publication-quality figures in R
R can produce beautiful graphics, but it takes a lot of work to obtain the desired result. This is because the starting point is pretty much a “bare” plot, and adding features commonly required for publication-grade figures (legends, statistics, regressions, etc.) can be quite involved. Moreover, it is very difficult to switch from one representation of the data to another (i.e., from boxplots to scatterplots), or to plot several datasets together. In this chapter we introduce the R package ggplot2, which can be used to produce high-quality graphics for papers, theses and reports. ggplot2 differs from other approaches in that it attempts to provide a “grammar” for graphics, in which each layer is the equivalent of a verb, subject, etc., and a plot is the equivalent of a sentence. All graphs start with a layer showing the data; other layers and commands are added to modify the plot. For a complete reference, please see the book “ggplot2: Elegant Graphics for Data Analysis”, by H. Wickham, electronically available from the library. Another great resource is the website ggplot2.org. To install ggplot2 in Linux, open a session of R as root (sudo R) and type:

> install.packages("ggplot2") > install.packages("reshape")

8.2 Basic graphs with qplot
qplot stands for quick plot, and is the basic plotting function provided by ggplot2. It can be used to quickly produce graphics for exploratory data analysis, and as a base for more complex graphics. In ggplot2, it is very important to use data frames to store the data. As an example, we will use a dataset on predator-prey body mass ratio taken from the Ecological Archives of the ESA (Barnes et al., Ecology 89:881 - 2008). I simplified the dataset for your convenience. Go to the R/Sandbox directory and launch R.

> MyDF <- read.csv("../Data/EcolArchives-E089-51-D1.csv")

> dim(MyDF)
[1] 34931 15
> MyDF$
MyDF$Record.number             MyDF$Predator.mass
MyDF$In.refID                  MyDF$Prey
MyDF$IndividualID              MyDF$Prey.common.name
MyDF$Predator                  MyDF$Prey.taxon
MyDF$Predator.common.name      MyDF$Prey.mass
MyDF$Predator.taxon            MyDF$Prey.mass.unit
MyDF$Predator.lifestage        MyDF$Location
MyDF$Type.of.feeding.interaction

We can start plotting the Predator.mass vs Prey.mass.

> require(ggplot2) ## Load the package
Loading required package: ggplot2
> qplot(Prey.mass, Predator.mass, data = MyDF)

This graph is very basic. Let’s transform everything in logarithms:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF)

Now, color the points according to the type of feeding interaction:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = Type.of.feeding.interaction)

The same as above, but changing the shape:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, shape = Type.of.feeding.interaction)

To manually set a color or a shape, you have to use I() (meaning “Identity”):

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = I("red"))
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, shape = I(3))

Because there are so many points, we can make them semi-transparent:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, colour = Type.of.feeding.interaction, alpha = I(1/5))

Now add a smoother to the points.

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"))

If we want to have a linear regression, we can specify the method lm:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"), method = "lm")

We can add a smoother for each type of interaction:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"), method = "lm", colour = Type.of.feeding.interaction)

To extend the lines to the full range, use fullrange = TRUE:

> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, geom = c("point", "smooth"), method = "lm", colour = Type.of.feeding.interaction, fullrange = TRUE)

Now we want to see how the ratio between prey and predator mass changes according to the type of interaction:

> qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF)

Because there are so many points, we can “jitter” them to get a better idea of the spread:

> qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF, geom = "jitter")

Or we can draw a boxplot of the data:

> qplot(Type.of.feeding.interaction, log(Prey.mass/Predator.mass), data = MyDF, geom = "boxplot")

Let’s draw a histogram of predator-prey mass ratios:

> qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "histogram")

Color the histogram according to the interaction type:

> qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "histogram", fill = Type.of.feeding.interaction)

To make it easier to read, we can plot the smoothed density of the data:

> qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "density", fill = Type.of.feeding.interaction)

Using colour instead of fill draws only the edge of the curve:

> qplot(log(Prey.mass/Predator.mass), data = MyDF, geom = "density", colour = Type.of.feeding.interaction)

Similarly, geom = "bar" produces a barplot, geom = "line" a series of points joined by a line, etc. An alternative way of displaying data belonging to different classes is the so-called “faceting”. A simple example:

> qplot(log(Prey.mass/Predator.mass), facets = Type.of.feeding.interaction ~ ., data = MyDF, geom = "density")

A more elegant way of drawing logarithmic quantities is to set the logarithm for the axes:

> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy")

Let’s add a title and labels:

> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy", main = "Relation between predator and prey mass", xlab = "Prey mass (g)", ylab = "Predator mass (g)")

Adding + theme_bw() makes it suitable for black and white printing.

> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy", main = "Relation between predator and prey mass", xlab = "Prey mass (g)", ylab = "Predator mass (g)") + theme_bw()

Finally, let’s save a pdf file of the figure.

> pdf("MyFirst-ggplot2-Figure.pdf")
> print(qplot(Prey.mass, Predator.mass, data = MyDF, log = "xy",
    main = "Relation between predator and prey mass",
    xlab = "Prey mass (g)",
    ylab = "Predator mass (g)") + theme_bw())
> dev.off()

Wrapping the call in print ensures that the plot is actually rendered, so that you can use the command inside a script. Other important options to keep in mind:

xlim limits for the x axis: xlim = c(0, 12).

ylim limits for the y axis.
log log-transform a variable: log = "x", log = "y", log = "xy".
main title of the plot: main = "My Graph".
xlab x-axis label.
ylab y-axis label.
asp aspect ratio: asp = 2, asp = 0.5.
margins whether or not margins are displayed.

8.3 Various geom Try the following:

# load the package
require(ggplot2)

# load the data
MyDF <- as.data.frame(read.csv("../Data/EcolArchives-E089-51-D1.csv"))

# barplot
qplot(Predator.lifestage, data = MyDF, geom = "bar")

# boxplot
qplot(Predator.lifestage, log(Prey.mass), data = MyDF, geom = "boxplot")

# density
qplot(log(Predator.mass), data = MyDF, geom = "density")

# histogram
qplot(log(Predator.mass), data = MyDF, geom = "histogram")

# scatterplot
qplot(log(Predator.mass), log(Prey.mass), data = MyDF, geom = "point")

# smooth
qplot(log(Predator.mass), log(Prey.mass), data = MyDF, geom = "smooth")

# smooth using a linear regression
qplot(log(Predator.mass), log(Prey.mass),
      data = MyDF, geom = "smooth", method = "lm")

8.4 Exercise Write a script called Prob-9-1.R that draws a pdf file of Figure 8.4.

8.5 Advanced plotting: ggplot
The command qplot allows you to use only a single dataset and a single set of “aesthetics” (x, y, etc.). To make full use of ggplot2, we need to use the command ggplot. We need:
• The data to be plotted, in a data frame;
• Aesthetics mappings, specifying which variables we want to plot, and how;
• The geom, defining how to draw the data;
• (Optionally) some stat that transforms the data or performs statistics using the data.
To start a graph, we can specify the data and the aesthetics:

> p <- ggplot(MyDF, aes(x = log(Predator.mass), y = log(Prey.mass), colour = Type.of.feeding.interaction ))

Now try to plot the graph:

> p
Error: No layers in plot

In fact, we have to specify a geometry in order to see the graph:

> p + geom_point()

We can use the “plus” sign to concatenate different commands:

> p <- ggplot(MyDF, aes(x = log(Predator.mass), y = log(Prey.mass), colour = Type.of.feeding.interaction )) > q <- p + geom_point(size=I(2), shape=I(10)) + theme_bw() > q

Let’s remove the legend:

> q + theme(legend.position = "none")

8.6 Case study 1: plotting a matrix
In this section we will plot a matrix of random values sampled from a uniform distribution U[0,1]. Our goal is to produce the plot in Figure 8.2. Because we want to plot a matrix, and ggplot2 accepts only dataframes, we use the package reshape, which can “melt” a matrix into a dataframe:

[Figure: faceted plot of Predator mass in grams vs. Prey Mass in grams (log axes, 1e-06 to 1e+06 and 1e-07 to 1e+01), one panel per feeding interaction type (insectivorous, piscivorous, planktivorous, predacious, predacious/piscivorous), points coloured by Predator.lifestage (adult, juvenile, larva, larva / juvenile, postlarva, postlarva/juvenile).]

Figure 8.1: Write a script that generates this figure.

Figure 8.2: Random matrix with values sampled from a uniform distribution.

require(ggplot2) require(reshape)

GenerateMatrix <- function(N){ M <- matrix(runif(N * N), N, N) return(M) }

> M <- GenerateMatrix(10)

> M[1:3, 1:3]
          [,1]      [,2]      [,3]
[1,] 0.2700254 0.8686728 0.7365857
[2,] 0.1744879 0.8488169 0.4165879
[3,] 0.3980783 0.7727821 0.4271121

> Melt <- melt(M)

> Melt[1:4, ]
  X1 X2     value
1  1  1 0.2700254
2  2  1 0.1744879
3  3  1 0.3980783
4  4  1 0.3196671

> ggplot(Melt, aes(X1, X2, fill = value)) + geom_tile()

# adding a black line dividing the cells
> p <- ggplot(Melt, aes(X1, X2, fill = value))
> p <- p + geom_tile(colour = "black")

# removing the legend > q <- p + theme(legend.position = "none")

# removing all the rest > q <- q + theme(plot.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank() )

# exploring the colors > q + scale_fill_continuous(low = "yellow", high = "darkgreen") > q + scale_fill_gradient2() > q + scale_fill_gradientn(colours = grey.colors(10)) > q + scale_fill_gradientn(colours = rainbow(10)) > q + scale_fill_gradientn(colours = c("red", "white", "blue"))

8.7 Case study 2: plotting two dataframes

According to Girko’s circular law, the eigenvalues of a matrix M of size N × N are approximately contained in a circle in the complex plane with radius √N. We are going to draw a simulation displaying this result (Figure 8.3).

require(ggplot2)

# function that returns an ellipse
build_ellipse <- function(hradius, vradius){
  npoints <- 250
  a <- seq(0, 2 * pi, length = npoints + 1)
  x <- hradius * cos(a)

[Figure: the eigenvalues plotted in the complex plane, axes Real and Imaginary (roughly -10 to 10), enclosed by the circle of radius √N.]

Figure 8.3: Girko’s circular law.

  y <- vradius * sin(a)
  return(data.frame(x = x, y = y))
}

# Size of the matrix
N <- 250
# Build the matrix
M <- matrix(rnorm(N * N), N, N)
# Find the eigenvalues
eigvals <- eigen(M)$values
# Build a dataframe
eigDF <- data.frame("Real" = Re(eigvals), "Imaginary" = Im(eigvals))

# The radius of the circle is sqrt(N)
my_radius <- sqrt(N)
# Ellipse dataframe
ellDF <- build_ellipse(my_radius, my_radius)
# rename the columns
names(ellDF) <- c("Real", "Imaginary")

# Now the plotting:

[Figure: three overlaid linerange series on "My x axis" (3.50 to 4.05) vs "My y axis" (0 to 15000), with text labels A to F along the baseline.]

Figure 8.4: Overlay of three lineranges and a text geometry.

# plot the eigenvalues
p <- ggplot(eigDF, aes(x = Real, y = Imaginary))
p <- p +
  geom_point(shape = I(3)) +
  theme(legend.position = "none")

# now add the vertical and horizontal lines
p <- p + geom_hline(aes(yintercept = 0))
p <- p + geom_vline(aes(xintercept = 0))

# finally, add the ellipse
p <- p + geom_polygon(data = ellDF,
                      aes(x = Real, y = Imaginary,
                          alpha = 1/20, fill = "red"))

pdf("Girko.pdf")
print(p)
dev.off()

8.8 Case study 3: annotating the plot
In the plot in Figure 8.4, we use the geometry “text” to annotate the plot.

require(ggplot2)

filename <- "../Data/Results.txt"
a <- read.table(filename, header = TRUE)
# here is what the data look like
print(a[1:3, ])
print(a[90:95, ])

# append a column of zeros
a$ymin <- rep(0, dim(a)[1])

# print the first linerange
p <- ggplot(a)
p <- p + geom_linerange(data = a, aes(
                          x = x,
                          ymin = ymin,
                          ymax = y1,
                          size = (0.5)
                          ),
                        colour = "#E69F00",
                        alpha = 1/2,
                        show_guide = FALSE)

# print the second linerange
p <- p + geom_linerange(data = a, aes(
                          x = x,
                          ymin = ymin,
                          ymax = y2,
                          size = (0.5)
                          ),
                        colour = "#56B4E9",
                        alpha = 1/2,
                        show_guide = FALSE)

# print the third linerange
p <- p + geom_linerange(data = a, aes(
                          x = x,
                          ymin = ymin,
                          ymax = y3,
                          size = (0.5)
                          ),
                        colour = "#D55E00",
                        alpha = 1/2,
                        show_guide = FALSE)

# annotate the plot with labels
p <- p + geom_text(data = a, aes(x = x, y = -500, label = Label))

# now set the axis labels,
# remove the legend, prepare for bw printing
p <- p +
  scale_x_continuous("My x axis",
                     breaks = seq(3, 5, by = 0.05)) +
  scale_y_continuous("My y axis") +
  theme_bw() +
  theme(legend.position = "none")

# Finally, print in a pdf
pdf("MyBars.pdf", width = 12, height = 6)
print(p)
dev.off()

8.9 Case study 4: mathematical display
In Figure 8.5, you can see mathematical annotations both on the axis and on the plot itself.

require(ggplot2)

# create "ideal" linear regression data
x <- seq(0, 100, by = 0.1)
y <- -4 + 0.25 * x +
  rnorm(length(x), mean = 0, sd = 2.5)

# now a dataframe
my_data <- data.frame(x = x, y = y)

# perform a linear regression
my_lm <- summary(lm(y ~ x, data = my_data))

# plot the data
p <- ggplot(my_data, aes(x = x, y = y,
                         colour = abs(my_lm$residual))) +
  geom_point() +
  scale_colour_gradient(low = "black", high = "red") +
  theme(legend.position = "none") +
  scale_x_continuous(
    expression(alpha^2 * pi / beta * sqrt(Theta)))

# add the regression line
p <- p + geom_abline(
  intercept = my_lm$coefficients[1][1],

[Figure: scatterplot of y vs x (0 to 100), points coloured by the size of the residual, with the fitted regression line, the x axis labelled by the expression α²·π/β·√Θ, and a mathematical annotation on the plot.]

Figure 8.5: Linear regression with colors expressing residuals and mathematical annotations.

  slope = my_lm$coefficients[2][1],
  colour = "red")

# throw some math on the plot
p <- p + annotate("text", x = 60, y = 0,
                  label = "sqrt(alpha) * 2 * pi",
                  parse = TRUE,
                  size = 6, colour = "red")

# print in a pdf
pdf("MyLinReg.pdf")
print(p)
dev.off()

8.10 Readings
• The classic Tufte: edwardtufte.com/tufte/books_vdqi (btw, check out what Tufte thinks of PowerPoint!)
• Rolandi et al. “A Brief Guide to Designing Effective Figures for the Scientific Paper”, doi:10.1002/adma.201102518
• Lauren et al. “Graphs, Tables, and Figures in Scientific Publications: The Good, the Bad, and How Not to Be the Latter”, doi:10.1016/j.jhsa.2011.12.041
• nicefigure.org
• labtimes.org/labtimes/issues/lt2008/lt05/lt_2008_05_52_53.pdf
• gallery.r-enthusiasts.com
• stackoverflow.com/questions/12675147/how-can-we-make-xkcd-style-graphs-in-r

9 — Databases

9.1 What is a database?
If you have lots of data, you should consider storing them in a database. Throughout the class, we stuck to the notion that text is king. However, consider that when you save your data in a text format, each letter/digit typically requires 1 byte (8 bits). As such, the string “128” requires 3 bytes of storage, whereas in a binary format three bytes can store any number between 0 and 16777215 (2^24 - 1). This means that storing the data in a binary format can save a lot of space. Moreover, for real numbers, the binary format avoids the loss of precision you incur when “printing” the values to a text file. Another advantage is that you can divide the data into multiple tables, avoiding redundancies. For example, say that your data is collected at three sampling sites. In a single text file, each row would have to specify the sampling site in full; in a database you can store just a site number, and keep the description of each site in another table. Finally, if you have many rows in your text file, the type of sequential access we used for our python code is not going to be efficient: you need to be able to instantly access any row regardless of its position.

A database is a collection of interrelated tables. The columns are usually called fields, while the rows are the records. Each field typically contains only one data type (e.g., integers, floats, strings), while each record is a “data point” composed of different values, one for each field. You can think of a record as a python tuple. Some fields are special, and are called keys. The primary key uniquely identifies a record in a table (e.g., each row is identified by a unique number). Some fields (and typically all the keys) are indexed, to allow fast retrieval. Foreign keys are keys in a table that are primary keys in another table. They define relationships between the tables.
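The text-vs-binary point can be checked directly in R: writeBin returns the raw bytes of the binary encoding, while nchar counts the characters (hence bytes) of the text representation. A minimal sketch (note that R integers occupy a fixed 4 bytes rather than the minimal 3 of the example above, but the saving over text is the same idea):

```r
## Text vs binary storage of the same number
x <- 16777215L   ## 2^24 - 1

## As text, each character costs one byte:
nchar(as.character(x))       ## 8 bytes for "16777215"

## In binary, an R integer always occupies 4 bytes:
length(writeBin(x, raw()))   ## 4

## A double occupies 8 bytes and keeps full precision;
## printing it with a few digits and reading it back does not:
y <- 1/3
length(writeBin(y, raw()))               ## 8
as.numeric(format(y, digits = 7)) == y   ## FALSE: the text round-trip lost precision
```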

9.2 Types of databases
There are many software packages to manage databases. The most common open-source choices for large databases are PostgreSQL and MySQL. The most popular commercial database is Oracle. These are highly complex and powerful systems. They typically store the data on a server, and multiple users can interact with them (with different permissions) through clients. They can be set up so that the data can be accessed through the internet.

Although the logic is the same for all databases, the installation of these systems is quite complex (unless you want to install them on your machine for personal use). Therefore, we are going to explore SQLite, one of the simplest databases available. It is typically designed to store application data: for example, it is used in Mac OS X, Acrobat Reader, iTunes, and the iPhone, among many others. It is a self-contained, serverless, zero-configuration database engine.

9.3 Designing a database
Database design is a complex task. We want to split the data into logically consistent tables. The guiding principle should be normalization: minimize redundancy and dependency. The easiest way to grasp the concept is to show a bad example and then fix it. Imagine you are collecting body sizes of species across different field sites. You might be tempted to organize everything in a single table with the following fields:

One table:
ID Unique ID for the record.
SiteName Name of the site.
SiteLong Longitude of the site.
SiteLat Latitude of the site.
SamplingDate Date of the sample.
SamplingHour Hour of the sampling.
SamplingAvgTemp Average air temperature on the day of the sampling.
SamplingWaterTemp Temperature of the water.
SamplingPH pH of the water.
SpeciesCommonName Common name of the sampled species.
SpeciesLatinBinom Latin binomial of the species.
BodySize Width of the individual.
BodyWeight Weight of the individual.

However, it would be better to divide the data into four tables:

Site table:
SiteID ID for the site.
SiteName Name of the site.
SiteLong Longitude of the site.
SiteLat Latitude of the site.

Sample table:
SamplingID ID for the sampling date.
SamplingDate Date of the sample.
SamplingHour Hour of the sample.
SamplingAvgTemp Average air temperature.
SamplingWaterTemp Temperature of the water.
SamplingPH pH of the water.

Species table:
SpeciesID ID for the species.
SpeciesCommonName Species name.
SpeciesLatinBinom Latin binomial of the species.

Individual table:
IndividualID ID for the individual sampled.

SpeciesID ID for the species.
SamplingID ID for the sampling day.
SiteID ID for the site.
BodySize Width of the individual.
BodyWeight Weight of the individual.

In each table, the first field is the primary key. Note that the last table contains three foreign keys, signaling that each individual is associated with one species, one sampling day and one sampling site. This creates one-to-many relationships (one site, many individuals). The structure of a table is called its schema. Here are three rules of thumb for creating normalized databases:
1. Do not define duplicate fields within a table: each field should contain unique information.
2. Each field in a table should depend on the primary key. This means that a table should contain a logically consistent set of data (e.g., do not mix the sampling site and the sampling date).
3. Tables cannot contain duplicate information. If two tables need a common field, the field should go in a third, separate table.
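The foreign-key relationship can also be sketched with plain R data frames (the species and measurements below are invented for illustration): merge joins two tables on their common key, which is exactly what an SQL join does.

```r
## Toy versions of the Species and Individual tables; data invented
Species <- data.frame(SpeciesID = c(1, 2),
                      SpeciesLatinBinom = c("Gadus morhua", "Salmo salar"))
Individual <- data.frame(IndividualID = 1:3,
                         SpeciesID = c(1, 1, 2),   ## foreign key
                         BodyWeight = c(2.4, 3.1, 1.7))

## merge() joins the two tables on the shared key (one-to-many:
## one species row can match many individuals)
Joined <- merge(Individual, Species, by = "SpeciesID")
print(Joined)
```

Each individual row is matched to its species row through SpeciesID, without ever duplicating the species description in the Individual table.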

9.4 SQL Most modern relational databases let you interface with tables, indices etc. using a common language: SQL, or Structured Query Language. In fact, there are many SQL “dialects”, but the simple commands we are going to use will work with all databases.

9.5 Installing SQLite In Ubuntu, type in a shell:

$ sudo apt-get install sqlite3 libsqlite3-dev

Also, make sure that you have the right package for python:

$ ipython In [1]: import sqlite3

Finally, install a GUI for SQLite3:

$ sudo apt-get install sqliteman

In Mac OS X, SQLite3 should be pre-installed. Type “sqlite3” in a terminal to check that SQLite successfully launches.

9.6 Your first database
For our first test, we are going to create a database of wood density. The data come from the Ecology Letters paper by Chave et al. (2009). Go to your SQL/Sandbox directory and in a terminal type:

Note for Ubuntu: many programs install a version of sqlite3. In case you don’t have readline support (i.e., the arrow keys are not working), try launching /usr/bin/sqlite3.

$ sqlite3 wood.db
SQLite version 3.7.5
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> CREATE TABLE woodtb (Number integer primary key,
   ...> Family text,
   ...> Binomial text,
   ...> WoodDensity real,
   ...> Region text,
   ...> ReferenceNumber integer);
sqlite> .mode csv
sqlite> .import ../Data/WoodDatabase/wood.csv woodtb

Now we have created the table woodtb, which has the following fields:
• Number: primary key, a sequential integer number for each record.
• Family: family of the species, as text (string).
• Binomial: Latin binomial of the species.
• WoodDensity: wood density in grams per cubic centimeter. A real number.
• Region: name of the region.
• ReferenceNumber: ID of the corresponding paper.
We then imported the data from a csv file into the table. We can list all the tables:

sqlite> .tables
woodtb

We are now going to run our first query:

sqlite> SELECT * FROM woodtb LIMIT 5;

1,Fabaceae,"Abarema jupunba",0.78,"South America...
2,Fabaceae,"Abarema jupunba",0.66,"South America...
3,Fabaceae,"Abarema jupunba",0.551,"South America...
4,Fabaceae,"Abarema jupunba",0.534,"South America...
5,Fabaceae,"Abarema jupunba",0.551,"South America...

To turn on some nicer formatting:

sqlite> .mode column
sqlite> .header ON

sqlite> SELECT * FROM woodtb LIMIT 5;

Number  Family    Binomial         WoodDensity ...
------  --------  ---------------  ----------- ...
1       Fabaceae  Abarema jupunba  0.78
2       Fabaceae  Abarema jupunba  0.66
3       Fabaceae  Abarema jupunba  0.551
4       Fabaceae  Abarema jupunba  0.534
5       Fabaceae  Abarema jupunba  0.551

9.6.1 SELECT The main statement to select records from a table is SELECT.

Control the column width with .width:

sqlite> .width 40

sqlite> SELECT DISTINCT Region FROM woodtb;

Region ------Africa (extratropical) Africa (tropical) Australia Australia/PNG (tropical) ....

sqlite> SELECT Binomial FROM woodtb
   ...> WHERE Region = "Africa (tropical)";

Binomial ------Acacia holosericea Adansonia digitata Afzelia africana Afzelia africana Afzelia africana ....

sqlite> SELECT COUNT (*) FROM woodtb;

COUNT (*)
---------
16468

sqlite> SELECT Region, COUNT(Binomial)
   ...> FROM woodtb GROUP BY Region;

Region                COUNT(Binomial)
--------------------  ---------------
Africa (extratropica  351
Africa (tropical)     2482
Australia             678
Australia/PNG (tropi  1560
Central America (tro  420
....

sqlite> SELECT COUNT(DISTINCT Family)
   ...> FROM woodtb;

COUNT(DISTINCT Famil
--------------------
191

sqlite> SELECT COUNT(DISTINCT Family)
   ...> AS FamCount
   ...> FROM woodtb;

FamCount
--------
191

sqlite> SELECT Region,
   ...> COUNT(DISTINCT Family) AS FC
   ...> FROM woodtb GROUP BY Region;

Region                FC
--------------------  --
Africa (extratropica  74
Africa (tropical)     66
Australia             49
Australia/PNG (tropi  91
Central America (tro  68
....

sqlite> SELECT *                -- what to select
   ...> FROM woodtb             -- from which table
   ...> WHERE Region = "India"  -- conditions
   ...> AND Family = "Pinaceae";

Number  Family    Binomial       WoodDensity
------  --------  -------------  -----------
29      Pinaceae  Abies pindrow  0.38
3287    Pinaceae  Cedrus deodar  0.47
12057   Pinaceae  Picea morinda  0.4
12126   Pinaceae  Pinus longifo  0.48
12167   Pinaceae  Pinus wallich  0.43
15858   Pinaceae  Tsuga brunoni  0.38

The structure of SELECT (note: the keywords are case insensitive):
• SELECT [DISTINCT] field
• FROM table
• WHERE predicate
• GROUP BY field
• HAVING predicate
• ORDER BY field
• LIMIT number
• ; (the terminating semicolon)

sqlite> SELECT Number FROM woodtb LIMIT 5;

Number ------1 2 3 4 5 sqlite> SELECT Number ...> FROM woodtb ...> WHERE Number > 100 ...> AND Number < 105;

Number ------101 102 103 104 sqlite> SELECT Number ...> FROM woodtb ...> WHERE Region = "India" ...> AND Number > 700 ...> AND Number < 800; 152 Databases

Number ------716

Often we want to match records using something similar to regular expressions. In SQL, when we use the operator LIKE, the percent symbol % matches any sequence of zero or more characters, and the underscore _ matches any single character. Similarly, GLOB uses the asterisk * and the question mark ?.

sqlite> SELECT DISTINCT Region
   ...> FROM woodtb
   ...> WHERE Region LIKE "_ndia";

Region ------India sqlite> SELECT DISTINCT Region ...> FROM woodtb ...> WHERE Region LIKE "Africa%";

Region ------Africa (extratropica Africa (tropical) sqlite> SELECT DISTINCT Region ...> FROM woodtb ...> WHERE Region GLOB "Africa*";

Region ------Africa (extratropica Africa (tropical)

Note that GLOB is case sensitive, while LIKE is not:

sqlite> SELECT DISTINCT Region
   ...> FROM woodtb
   ...> WHERE Region LIKE "africa%";

Region ------Africa (extratropica Africa (tropical) 9.6 Your first database 153 sqlite> SELECT DISTINCT Region ...> FROM woodtb ...> WHERE Region GLOB "africa*"; sqlite>

We can order by any column:

sqlite> SELECT Binomial, Family FROM
   ...> woodtb LIMIT 5;

Binomial         Family
---------------  --------
Abarema jupunba  Fabaceae
Abarema jupunba  Fabaceae
Abarema jupunba  Fabaceae
Abarema jupunba  Fabaceae
Abarema jupunba  Fabaceae

sqlite> SELECT Binomial, Family FROM
   ...> woodtb ORDER BY Family LIMIT 5;

Binomial             Family
-------------------  -----------
Avicennia alba       Acanthaceae
Avicennia alba       Acanthaceae
Avicennia alba       Acanthaceae
Avicennia germinans  Acanthaceae
Avicennia germinans  Acanthaceae

We can use functions within our queries:
• abs(X): absolute value.
• length(X): number of characters.
• lower(X): lowercase.
• max(X, Y, ...): maximum.
• min(X, Y, ...): minimum.
• round(X, Y): round to Y digits.
• upper(X): uppercase.
Other functions work on aggregate data:
• avg(X): average.
• count(X): count.
• sum(X): sum.
Some examples with functions:

sqlite> SELECT ROUND(WoodDensity) FROM woodtb LIMIT 3;

ROUND(WoodDensity)
------------------
1.0
1.0
1.0

sqlite> SELECT Family, AVG(WoodDensity)
   ...> FROM woodtb
   ...> GROUP BY Family;

Family         AVG(WoodDensity)
-------------  -----------------
Acanthaceae    0.670722222222222
Achariaceae    0.605288888888889
Actinidiaceae  0.398666666666667
Adoxaceae      0.446333333333333
Akaniaceae     0.561
...

sqlite> SELECT COUNT(DISTINCT Region) AS Regs FROM woodtb;

Regs
----
17

sqlite> SELECT COUNT(DISTINCT Binomial) AS Spp FROM woodtb;

Spp
----
8412

sqlite> SELECT Region, COUNT(DISTINCT Binomial) AS Spp
   ...> FROM woodtb GROUP BY Region;

Region                Spp
--------------------  ---
Africa (extratropica  320
Africa (tropical)     619
Australia             446
Australia/PNG (tropi  915
Central America (tro  329
....

sqlite> SELECT MAX(LENGTH(Binomial)) FROM woodtb;

MAX(LENGTH(Binomial)
--------------------
33

sqlite> SELECT MIN(LENGTH(Binomial)) FROM woodtb;

MIN(LENGTH(Binomial)
--------------------
9

sqlite> SELECT AVG(LENGTH(Binomial)) FROM woodtb;

AVG(LENGTH(Binomial)
--------------------
18.7308719941705

You can filter the results of a GROUP BY using the clause HAVING:

sqlite> SELECT Region, AVG(WoodDensity)
   ...> FROM woodtb
   ...> GROUP BY Region
   ...> ORDER BY AVG(WoodDensity);

Region           AVG(WoodDensity)
---------------  -----------------
Europe           0.524896103896104
NorthAmerica     0.537050691244239
China            0.54089801980198
South-East Asia  0.559223744292238
....

sqlite> SELECT Region, AVG(WoodDensity)
   ...> FROM woodtb
   ...> GROUP BY Region
   ...> HAVING AVG(WoodDensity) > 0.6
   ...> ORDER BY AVG(WoodDensity);

Region                AVG(WoodDensity)
--------------------  ----------------
Oceania               0.6044
South America (tropi  0.63174844571975
Australia/PNG (tropi  0.63592179487179
Africa (extratropica  0.64845584045584
India                 0.65175432525951
....

You can use AS to give the aggregate column a shorter alias:

sqlite> SELECT Region, AVG(WoodDensity) AS AWD
   ...> FROM woodtb
   ...> GROUP BY Region
   ...> HAVING AWD > 0.7
   ...> ORDER BY AWD;

Region                AWD
--------------------  -----------------
South America (extra  0.714744623655914
Australia             0.724740412979351

9.7 Creating tables

In order to build some examples in which we can JOIN tables, we are going to build a larger database of ecological data. The data, on species composition and abundance of mammalian communities, come from the paper by Thibault et al. (Ecology, 2011).

9.7.1 Data types

SQLite has very few data types (and lacks a boolean and a date type!):
• NULL. The value is a NULL value.
• INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
• REAL. The value is a floating point value, stored as an 8-byte IEEE floating point number.
• TEXT. The value is a text string.
• BLOB. The value is a blob of data, stored exactly as it was input (useful for binary types, such as bitmaps or pdfs).
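Since there is no date type, a common workaround (a sketch, not from the text; python 3 syntax) is to store dates as ISO-8601 strings in a TEXT field: they sort chronologically even as plain text, and SQLite's built-in date functions understand them:

```python
import sqlite3

# Store sampling days as ISO-8601 text in a throwaway table
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE samplings (SamplingID INTEGER PRIMARY KEY, Day TEXT)")
c.executemany("INSERT INTO samplings (Day) VALUES (?)",
              [("2013-05-02",), ("2012-11-30",)])
# ISO strings sort chronologically as plain text
ordered = c.execute("SELECT Day FROM samplings ORDER BY Day").fetchall()
print(ordered)
# and SQLite's date functions can manipulate them
next_day = c.execute("SELECT date(Day, '+1 day') FROM samplings "
                     "WHERE Day = '2012-11-30'").fetchone()[0]
print(next_day)
```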

9.7.2 Importing csv data

SQLite3 has limited functions to import csv data. The basic syntax is the following:

sqlite> create table test (id integer, name type, ...);

sqlite> .separator ","

sqlite> .import no_yes.csv test

However, beware of the following:
• Headers: the csv should have no header row.
• Separators: if the comma is the separator, no record should contain any other commas.
• Quotes: there should be no quotes in the data.
• Newlines: there should be no newlines within records.
We are going to run a script that creates the tables.
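When a csv does violate these constraints (a header row, quoted fields containing commas), one option is to parse it with python's csv module and insert the rows yourself; a minimal sketch with made-up data (python 3 syntax):

```python
import csv
import io
import sqlite3

# A csv with a header and a quoted field containing a comma:
# exactly what .import cannot digest
raw = 'id,name,density\n1,"Abarema jupunba",0.78\n2,"Abies, sp.",0.38\n'
rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE test (id INTEGER, name TEXT, density REAL)")
c.executemany("INSERT INTO test VALUES (?, ?, ?)", records)
conn.commit()
name2 = c.execute("SELECT name FROM test WHERE id = 2").fetchone()[0]
print(name2)  # the embedded comma survived
```

With a real file you would replace io.StringIO(raw) with open('myfile.csv').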

/*
Create the table communities
The fields are:
  IDcommunity: integer, primary key
  IDsite: integer
  year: real
  IDspecies: text
  presence: integer
  abundance: real
  mass: real
*/

CREATE TABLE communities (
  IDcommunity INTEGER PRIMARY KEY,
  IDsite INTEGER,
  year REAL,
  IDspecies TEXT,
  presence INTEGER,
  abundance REAL,
  mass REAL);

/* Now populate the table from the csv */
.separator ","
.import ../Data/MCDB/communities.csv communities

/*
Create the table reference
The fields are:
  IDref: text, primary key
  refer: text
  authors: text
  pubyear: integer
  title: text
  source: text
*/

CREATE TABLE reference (
  IDref TEXT,
  refer TEXT,
  authors TEXT,
  pubyear INTEGER,
  title TEXT,
  source TEXT);

/* Now populate the table from the csv */
.separator ","
.import ../Data/MCDB/references.csv reference

/*
Create the table species
The fields are:
  IDspecies: text, primary key
  family: text
  genus: text
  sp: text
  splevel: integer
*/

CREATE TABLE species (
  IDspecies TEXT PRIMARY KEY,
  family TEXT,
  genus TEXT,
  sp TEXT,
  splevel INTEGER);

/* Now populate the table from the csv */
.separator ","
.import ../Data/MCDB/species.csv species

/*
Create the table sites
The fields are:
  IDsite: integer, primary key
  IDreference: text
  location: text
  country: text
  state: text
  latitude: real
  longitude: real
*/

CREATE TABLE sites (
  IDsite INTEGER PRIMARY KEY,
  IDreference TEXT,
  location TEXT,
  country TEXT,
  state TEXT,
  latitude REAL,
  longitude REAL);

/* Now populate the table from the csv */
.separator ","
.import ../Data/MCDB/sites.csv sites

I have created the script such that if you run in the Sandbox:

$ sqlite3 MCDB.db < ../Data/MCDB/script_to_build_database.sql

the database will be automatically created and populated.

9.8 Joining tables

When you have multiple tables, you typically want to join them using a common field (a foreign key in one table, the primary key in the other). For example, in the communities table we have a field IDspecies, which identifies the species whose information is stored in the species table.

$ /usr/bin/sqlite3 MCDB.db
SQLite version 3.7.7 2011-06-23 19:49:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> .mode column

sqlite> .header ON

sqlite> SELECT * FROM communities LIMIT 2;

IDcommunity  IDsite  year    IDspecies  presence
-----------  ------  ------  ---------  --------
1            1008    2002.0  CHPE       0
2            1008    2002.0  CHSX       0

sqlite> SELECT * FROM species LIMIT 2;

IDspecies  family       genus      sp          splevel
---------  -----------  ---------  ----------  -------
ABBE       Abrocomidae  Abrocoma   bennettii   1
ABLO       Cricetidae   Abrothrix  longipilis  1

sqlite> SELECT IDcommunity, year, species.IDspecies,
   ...> family FROM
   ...> communities INNER JOIN species
   ...> ON communities.IDspecies = species.IDspecies
   ...> LIMIT 2;

IDcommunity  year    IDspecies  family
-----------  ------  ---------  ------------
1            2002.0  CHPE       Heteromyidae
2            2002.0  CHSX       Heteromyidae

The INNER JOIN, the most common and useful type of join between tables, works as the intersection between two sets: we take all the records from communities whose IDspecies is present in the species table, and pair them.
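The intersection behavior can be sketched with two toy tables: a community record whose IDspecies has no match in the species table simply disappears from the result. The field names mirror the MCDB tables, but the data are made up (python 3 syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE communities (IDcommunity INTEGER, IDspecies TEXT)")
c.execute("CREATE TABLE species (IDspecies TEXT, family TEXT)")
# 'XXXX' has no counterpart in the species table
c.executemany("INSERT INTO communities VALUES (?, ?)",
              [(1, "CHPE"), (2, "XXXX")])
c.execute("INSERT INTO species VALUES ('CHPE', 'Heteromyidae')")
joined = c.execute("""SELECT IDcommunity, family
                      FROM communities INNER JOIN species
                      ON communities.IDspecies = species.IDspecies""").fetchall()
print(joined)  # only the matching record survives
```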

sqlite> SELECT IDcommunity, refer, country FROM
   ...> communities INNER JOIN sites ON
   ...> communities.IDsite = sites.IDsite
   ...> INNER JOIN reference ON
   ...> sites.IDreference = reference.IDref
   ...> ORDER BY country LIMIT 25;

IDcommunity  refer   country
-----------  ------  ---------
2738         BioOne  Argentina
2739         BioOne  Argentina
2740         BioOne  Argentina
2741         BioOne  Argentina
2742         BioOne  Argentina
2743         BioOne  Argentina
2744         BioOne  Argentina

9.9 Creating views

Because we do not want to type all of the queries above more than once, we can create a view: a virtual table that is generated at runtime. In this way, we can keep tables joined in the view, while having them materially divided in the database (and thus removing redundancy). We can query the view as we would query a table. For example, this code would connect all the tables using inner joins:

SELECT * FROM
communities cc
INNER JOIN sites ss
ON cc.IDsite = ss.IDsite
INNER JOIN species sp
ON cc.IDspecies = sp.IDspecies
INNER JOIN reference rr
ON ss.IDreference = rr.IDref;

To create a view, simply precede the query with CREATE VIEW bigtb AS:

CREATE VIEW bigtb AS
SELECT * FROM
communities cc
INNER JOIN sites ss
ON cc.IDsite = ss.IDsite
INNER JOIN species sp
ON cc.IDspecies = sp.IDspecies
INNER JOIN reference rr
ON ss.IDreference = rr.IDref;

sqlite> .tables

bigtb communities reference sites species

sqlite> SELECT COUNT(DISTINCT country) FROM bigtb;

COUNT(DISTINCT country)
-----------------------
40

sqlite> SELECT country, COUNT(DISTINCT family)
   ...> FROM bigtb GROUP BY
   ...> country ORDER BY country;

country     COUNT(DISTINCT family)
----------  ----------------------
Argentina   6
Australia   7
Benin       3
Bolivia     3
Brazil      18
Canada      10
Central Af  2
....

9.10 Exercises

Count the number of genera for each degree of longitude (use round). Notice that in doing so we found an error in the data!

LON          COUNT(DISTINCT genus)
-----------  ---------------------
114968494.0  9
-134.0       8
-133.0       4
-132.0       8
-131.0       5
-128.0       4

....

Count the number of species for each country and state. Note the descending order!

country  state  NUMSP
-------  -----  -----
USA      CA     77
USA      NM     75
Brazil   NULL   62
USA      AZ     55
USA      MN     54
....

9.11 Use a GUI

By now, you should know that I have a very nasty allergy to Graphical User Interfaces. However, to build complex queries they can help quite a bit. There are very many GUIs for SQLite. Some popular, cross-platform ones are Sqliteman, the Firefox extension SQLite Manager, and SQLite Studio.

9.12 Inserting and deleting records and tables

Insert a record into a table:

INSERT INTO table_name VALUES (value1, value2, value3,...)

Insert a record and set only some columns:

INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...)

To change the value of one or more records:

UPDATE table_name
SET column1=value1, column2=value2, ...
WHERE some_column=some_value

To delete one or more records:

DELETE FROM table_name WHERE some_column=some_value

To completely remove a table:

DROP TABLE table_name
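A quick sketch exercising the four statements above on a throwaway in-memory table (python 3 syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE tmp (id INTEGER, name TEXT)")
c.execute("INSERT INTO tmp VALUES (1, 'mickey')")     # all columns
c.execute("INSERT INTO tmp (id) VALUES (2)")          # only some columns
c.execute("UPDATE tmp SET name = 'mouse' WHERE id = 2")
rows = c.execute("SELECT * FROM tmp ORDER BY id").fetchall()
print(rows)
c.execute("DELETE FROM tmp WHERE id = 1")
remaining = c.execute("SELECT COUNT(*) FROM tmp").fetchone()[0]
print(remaining)
c.execute("DROP TABLE tmp")   # the table is gone for good
```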

9.13 Accessing databases programmatically

It is very easy to access, update and manage a database from python. In fact, this is what you should do to manage complex databases:

# import the sqlite3 library
import sqlite3

# create a connection to the database
conn = sqlite3.connect('example.db')

# to execute commands, create a "cursor"
c = conn.cursor()

# use the cursor to execute the queries
# use the triple single quote to write
# queries on several lines
c.execute('''CREATE TABLE test
             (ID INTEGER PRIMARY KEY,
             MyVal1 INTEGER,
             MyVal2 TEXT)''')

# insert the records. note that because
# we set the primary key, it will auto-increment
# therefore, set it to NULL
c.execute('''INSERT INTO test VALUES
             (NULL, 3, 'mickey')''')

c.execute('''INSERT INTO test VALUES
             (NULL, 4, 'mouse')''')

# when you "commit", all the commands will
# be executed
conn.commit()

# now we select the records
c.execute("SELECT * FROM test")

# access the next record:
print c.fetchone()
print c.fetchone()

# let's get all the records at once
c.execute("SELECT * FROM test")
print c.fetchall()

# insert many records at once:
# create a list of tuples
manyrecs = [(5, 'goofy'),
            (6, 'donald'),
            (7, 'duck')]

# now call executemany
c.executemany('''INSERT INTO test
                 VALUES (NULL, ?, ?)''', manyrecs)

# and commit
conn.commit()

# now let's fetch the records
# we can use the query as an iterator!
for row in c.execute('SELECT * FROM test'):
    print 'Val', row[1], 'Name', row[2]

# close the connection before exiting
conn.close()

A cool trick is that you can create a database in memory, i.e. without accessing the disk. In this way you can use SQL within your programs!

import sqlite3

conn = sqlite3.connect(":memory:")

c = conn.cursor()

c.execute("CREATE TABLE tt (Val TEXT)")

conn.commit()

z = [('a',), ('ab',), ('abc',), ('b',), ('c',)]

c.executemany("INSERT INTO tt VALUES (?)", z)

conn.commit()

c.execute("SELECT * FROM tt WHERE Val LIKE 'a%'").fetchall()

conn.close()

9.14 Readings

The library has an electronic copy of “The Definitive Guide to SQLite”, which, as the name implies, is a pretty complete guide to SQLite.

http://xkcd.com/327/


10 — Scripting and HPC

10.1 What is scripting?

Instead of typing all the commands we need to perform one after the other, we can save them all in a file (a “script”) and execute them all at once. The bash shell we are using provides a proper syntax that can be used to build complex flows. In fact, most data manipulation can be handled by scripts, without the need of writing an actual program. Scripts can be used to automate repetitive tasks, to do simple data manipulation or to perform maintenance of your computer (e.g., backups).

10.2 Two types of scripts

There are two ways of running a script, say myscript.sh. The first is to call the interpreter bash to run the file:

$ bash myscript.sh

Otherwise, we can make the script executable, and execute the script:

$ chmod +x myscript.sh
$ myscript.sh

I typically choose one or the other, depending on the task the script performs:
• A script that does something specific in a given project: I give the script an informative name and leave it in the Code directory of the project. The script is not executable, and is called using bash followed by the name of the script; i.e., the script is a program like the others I wrote for the project.
• A script that does something generic (e.g., a shortcut for commands I have to perform several times), whose existence is not essential for a project (i.e., you can share the project without sharing the script). I use an informative name without the final .sh, save the script in sallesina/bin/, and make it executable.

Let’s prepare the bin directory to house our generic scripts:

$ mkdir ~/bin
$ PATH=$PATH:$HOME/bin; export PATH

The last command tells UNIX to look for commands in the directory bin as well, by adding it to the list of directories that are searched.

10.3 Tabs to commas, commas to tabs

As an example, I often have to transform comma-separated files to tab-separated files and vice versa (in C, it is much easier to read tab- or space-separated files than csv). There is a program called tr that can be used to delete or substitute characters. For example:

$ echo "Remove excess spaces." | tr -s "\b" " "
Remove excess spaces.

$ echo "remove all the as" | tr -d "a"
remove ll the s

$ echo "set to uppercase" | tr [:lower:] [:upper:]
SET TO UPPERCASE

$ echo "10.00 only numbers 1.33" | tr -d [:alpha:] | tr -s "\b" ","
10.00,1.33

We write small scripts to handle the conversion of files. We start from a boilerplate:

#!/bin/bash
# Author: Stefano Allesina [email protected]
# Script: boilerplate.sh
# Desc: simple boilerplate for scripts
# Arguments: none
# Date: Sept 2012

echo "This is a script"

exit

A script to substitute all tabs with commas:

#!/bin/bash
# Author: Stefano Allesina [email protected]
# Script: tabtocsv.sh
# Desc: substitute the tabs in the files with commas
#       saves the output into a .csv file
# Arguments: 1 -> tab delimited file
# Date: Dec 2012

echo "Creating a comma delimited version of $1"

cat $1 | tr -s "\t" "," >> $1.csv

exit

And, similarly, a script to convert csv into space delimited:

#!/bin/bash
# Author: Stefano Allesina [email protected]
# Script: csvtospace.sh
# Desc: substitute the commas in the file with spaces
#       saves the output into a .space file
# Arguments: 1 -> comma delimited file
# Date: Dec 2012

echo "Creating a space delimited version of $1"

cat $1 | tr -s "," " " >> $1.space

exit

10.4 Variables

There are three ways to assign values to variables:
• Explicit declaration: MYVAR=myvalue
• Reading from the user: read MYVAR
• Command substitution: MYVAR=$(ls | wc -l)

Note that in the first case, no spaces can surround the “=” sign! This is quite annoying if you're used to other languages. In general, the syntax of bash scripting is quite clunky. Here are some examples of assignments:

#!/bin/bash
# Shows the use of variables
MyVar='some string'
echo 'the current value of the variable is' $MyVar
echo 'Please enter a new string'
read MyVar
echo 'the current value of the variable is' $MyVar

## Reading multiple values
echo 'Enter two numbers separated by space(s)'
read a b
echo 'you entered' $a 'and' $b '. Their sum is:'
mysum=`expr $a + $b`
echo $mysum

Note the hideous quote marks (backticks) before expr. I find the fastidious syntax of bash really annoying, and thus I rarely use it to do serious work (it is much faster to write the script in python!). Anyway, as an example, open an editor and save a new file MyFirstScript.sh containing the following:

#!/bin/bash
echo "Hello $USER!"
echo

To run it, go to the right directory and type bash MyFirstScript.sh. Now we are going to write a script that counts the lines in a file (much like wc -l). To do so, we will pass a command-line argument to the script.

#!/bin/bash
NumLines=`wc -l < $1`
echo "The file $1 has $NumLines lines"
echo

To launch it, type bash CountLines.sh AFileName. We stored the result of the command wc -l in a variable: a container that we can use to store values (in this case, the number of lines). To print the value of a variable, we use the $ sign. The special variable $1 is the first argument we passed to the script. Similarly, the second argument would be $2, etc. To understand this better, let's write a program that concatenates two files.

#!/bin/bash
cat $1 > $3
cat $2 >> $3
echo "Merged File"
cat $3

And here is a program that compiles our LaTeX files and cleans up the temporary files (this is a script I really use often!).

#!/bin/bash
pdflatex $1.tex
pdflatex $1.tex
bibtex $1
pdflatex $1.tex
pdflatex $1.tex
evince $1.pdf &

## Cleanup
rm *~
rm *.aux
rm *.dvi
rm *.log
rm *.nav
rm *.out
rm *.snm
rm *.toc

10.5 Computing on beast

E&E and OBA students can get access to beast.uchicago.edu. Students in other departments can access other computer clusters. If you have a username and a password, you can get a terminal by connecting to the machine:

$ ssh [email protected]
#####################################################################
#####################################################################

Welcome to the Ecology & Evolution Grid

Unauthorized use of this machine is prohibited.

This is a University machine intended for University purposes.

The University reserves the right to monitor its use as necessary to ensure its stability, availability, and security.
#####################################################################
#####################################################################

sallesina@beast$

A computer cluster is often organized as follows: there are several machines (nodes), each with several processors (cores), connected to each other with high-throughput wiring (e.g., InfiniBand). Users can access only one of the nodes, the so-called head node. The head node runs specific software to queue programs and schedule their execution. The idea is that users submit their programs as jobs, which are then queued. When its turn arrives, a job is sent from the head node to one of the computing nodes and executed, and the output is then made available to the user. Contrary to what you experience when you launch a program locally, execution is not immediate, but can be deferred (sometimes for several days). Some users find this very annoying, but in fact it is a sign that the cluster is operating at capacity, i.e., that

it is working as it should. The scheduler enforces fair use policies, guaranteeing that no one can take over the grid. Some documentation is available at beast.uchicago.edu.

10.5.1 Interactive use

The simplest use of the cluster is to ask for an interactive node. This basically gives you access to a computer to run your programs interactively, as you would do on your PC. To launch an interactive session, type:

sallesina@beast$ qsub -I
qsub: waiting for job 268469.beast-net to start

The wait can be pretty long, depending on how busy the cluster is. However, this is the only acceptable way to use the cluster interactively: Never run your programs on the head node! In fact, this could bring the cluster down for everybody. The only programs you should run on the head node are scripts to submit jobs.

10.5.2 Batch use

You typically want to submit your programs as jobs and queue them. Eventually, they will all run. For each job, you have to write a bash script with this structure:

#!/bin/bash
#PBS -N myname
#PBS -q batch
#PBS -d .
#PBS -j oe
#PBS -S /bin/bash
python myprogram.py 123 mickey 456 mouse

The meaning is the following:
• #PBS -N myname: choose a name for your job. The shorter and more informative, the better.
• #PBS -q batch: choose a queue for your job. Beast has two queues: batch, which has access to more resources and can run very long jobs, and short, which has fewer nodes and can run shorter jobs only (up to a week of computing).
• #PBS -d .: use the current directory as the working directory.
• #PBS -j oe: concatenate error and output messages for you to read.
• #PBS -S /bin/bash: use bash for running the script.
• python myprogram.py ...: the actual command to be run. For R programs use Rscript MyFile.R, for bash scripts use bash MyScript.sh, etc.
Once you have written your job files, you can submit them using the qsub command:

sallesina@beast$ qsub jobname.sh

You can set the priority: set low priorities and be nice when you are submitting many jobs:

sallesina@beast$ qsub -p -100 jobname.sh

Priority goes roughly from -1000 to +1000. The default is zero, which works fine if you are submitting few jobs. You can also set the queue when submitting:

sallesina@beast$ qsub -q batch jobname.sh

sallesina@beast$ qsub -q short jobname.sh

10.5.3 Writing a script that does it

I rarely write the job scripts or submit jobs by hand. Rather, I write a small python, R, or bash program that does it for me.

#!/usr/bin/python
#######################################################
## THINGS TO MODIFY
DataFile = "Test35_household_12364_21.txt"
NR = 12364 # Number of rows
NC = 21 # Number of columns
MinLinks = 56 # Minimum number of links to use
MaxLinks = 70 # Maximum number of links to use

NumberSteps = '1000000' # Note: use a string!

AndOr = 'both' # (possible values: 'and', 'or', 'both')
MinSeed = 1 # smallest seed to use
MaxSeed = 5 # largest seed to use
## END THINGS TO MODIFY
#########################################################
import os
import sys
import time

# Build scripts
for L in range(MinLinks, MaxLinks + 1):
    for S in range(MinSeed, MaxSeed + 1):
        # Choose a name for the job
        JNAME = 'H' + DataFile[0:5] + '-' + str(L) + '-' + str(S)
        # Open the file for writing
        f = open(JNAME + ".sh", "w")
        # Build the line to run the program
        launchstrand = "python FindHierarchy-Dictionary.py %s %d %s And %d MaxLik None" % (DataFile, L, NumberSteps, S)
        launchstror = "python FindHierarchy-Dictionary.py %s %d %s Or %d MaxLik None" % (DataFile, L, NumberSteps, S)
        if AndOr == 'both':
            f.write("#!/bin/bash\n#PBS -N %s\n#PBS -d .\n#PBS -j oe\n#PBS -S /bin/bash\n%s\n%s\n" % (JNAME, launchstrand, launchstror))
        elif AndOr == 'and':
            f.write("#!/bin/bash\n#PBS -N %s\n#PBS -d .\n#PBS -j oe\n#PBS -S /bin/bash\n%s\n" % (JNAME, launchstrand))
        elif AndOr == 'or':
            f.write("#!/bin/bash\n#PBS -N %s\n#PBS -d .\n#PBS -j oe\n#PBS -S /bin/bash\n%s\n" % (JNAME, launchstror))
        # Close the file
        f.close()

        # Now submit the script
        os.system('qsub %s.sh' % JNAME)
        # And wait a second to let the queue handle the submission
        time.sleep(1)

10.5.4 Useful commands

To check the status of all jobs, use qstat:

sallesina@beast$ qstat | head
Job id            Name           User
----------------  -------------  --------
200164.beast-net  STDIN          ccchen
200347.beast-net  script1.sh     xuzhang
201696.beast-net  mymaxbatch.sh  egoldwyn
201930.beast-net  STDIN          ccchen
201965.beast-net  STDIN          ccchen
236303.beast-net  mymaxbatch.sh  egoldwyn
257614.beast-net  STDIN          cgmeyer
.....

Only your jobs:

sallesina@beast$ qstat -u mjsmith | head

beast-net:

Job ID            Username  Queue  Jobname
----------------  --------  -----  ------------
268142.beast-net  mjsmith   batch  HCarne-176-3
268143.beast-net  mjsmith   batch  HCarne-177-1
268144.beast-net  mjsmith   batch  HCarne-177-2
268145.beast-net  mjsmith   batch  HCarne-177-3
268146.beast-net  mjsmith   batch  HCarne-178-1

Count how many jobs are running or queued:

sallesina@beast$ qstat | wc -l
697
sallesina@beast$ qstat -u mjsmith | wc -l
159
sallesina@beast$ qstat -r | wc -l
205
sallesina@beast$ qstat -ru mjsmith | wc -l
27

To delete specific jobs from the queue, use qdel:

sallesina@beast$ qdel 268142 # This is the job ID from qstat
sallesina@beast$ qdel all # This will delete all your jobs

To check the status of the nodes:

sallesina@beast$ pbsnodes
minion0
  state = free
  np = 8
  properties = short
  ntype = cluster
  jobs = 0/268459.beast-net, 1/268460.beast-net
....

Check the status of the queues:

sallesina@beast$ qstat -q
server: beast

Queue    Memory  CPU Time  Walltime  Node  Run  Que  Lm  State
-------  ------  --------  --------  ----  ---  ---  --  -----
short    --      --        --        --     13    0  --  E R
asreml   --      --        --        --      0    0  --  E R
batch    --      --        --        --    420  549  --  E R
OBA      --      --        --        --      9   48  --  E R
test     --      --        --        --      0    0  --  E R
                                            ---  ---
                                            442  597

11 — Final thoughts

We have seen many tools and techniques that can make your life as a scientist easier (or more complicated!). You might have fallen in love with some of the topics, and found others dull and boring. All of them respond to specific needs, and have been developed with certain problems in mind. The most powerful tool you now have is the ability to integrate all these techniques to solve your scientific problems in a robust, reliable and easy-to-automate way. Some advice:
• Divide a project into sub-projects that are easy to solve. Never rely on a single program to carry out the whole analysis from start to finish. Rather, divide it into logical steps, and tackle each step separately. This will help you recycle lots of code, besides being a more efficient way of working.
• Choose the right tool. You could be tempted to do everything in python, or C, or R. Don't! Think about what is the best way of doing something. Do not be afraid of learning new things. Think medium-to-long term (while you're in graduate school, you can afford it!).
• Automate. Computers err in systematic ways, while you don't. It is easier to fix systematic mistakes than random ones. If it takes you X hours to do something by hand, and 3X hours to automate it, then automate it.
• Document and organize. Keep your pipeline streamlined, well-documented and well-organized. It can take months to get reviews back, and you don't want to forget what you did, how, or why.
• Do not hack/fix your code. Do not make extemporaneous changes: you are almost surely introducing bugs.
• Plan ahead, do a little at a time. Coding, and especially debugging, is quite boring. Find a good time to do it, and don't do it for too many consecutive hours.
• Find a friend. If you can convince someone, have them sit next to you while programming, and explain to them step-by-step what you are doing and why.
Please provide any feedback you might have: this will make the class a better experience for your colleagues next year.
What did you find useful? What was too difficult or boring? Also, if you have some data/programs which would make nice exercises, let me know and I will include them in the next version.

Thank you all for the wonderful time in class, and good computing!

Chicago, IL Stefano Allesina January 2014 [email protected]

http://xkcd.com/722/