Modern Perl for Biologists II | Deeper Concepts M-BIM / INBTBI
Denis BAURAIN / ULiège
Edition 2020–2021

Acknowledgment

This course has been developed over eight years (2013–2020), primarily as teaching material for the Modern Perl module of the Bioinformatics Training (INBTBI) organized by the Biotechnology Training Centre in the GIGA Tower of the University of Liège (Belgium). Until 2016, the Biotechnology Training Centre was directed by Laurent Corbesier, who is warmly thanked for his support. The training itself was funded through the Bioinformatics Training Program, 6th call 2010–2014, BioWin (the Health Cluster of Wallonia), Le Forem (the Walloon Office for Employment and Training), and the Biotechnology Training Centre (Forem-GIGA).

This two-part document benefited from the feedback of about 150 students, trainees and colleagues, among whom Arnaud Di Franco, Agnieszka Misztak, Loïc Meunier and Sinaeda Anderssen are especially acknowledged. If you spot typos, grammatical or technical errors, or have any suggestions on how to improve the next edition of this course, let me know ([email protected]). I will answer all messages.

Finally, I thank Pierre (Pit) Tocquin for his help with the mighty triad Markdown / Pandoc / LaTeX.

How to read this course?

This document is written in American English. The first time they are introduced, programming terms and biological terms are typeset in bold and italic, respectively, whereas computer keywords are always typeset in a monospaced font (Consolas). All these terms are indexed at most once per section in which they appear (see Index).

While the main text is typeset in Palatino, user statements (either at the shell or as small code excerpts) and computer answers are typeset in the same monospaced font as computer keywords.

    $ echo "Hello world!"    # user statement (note the shell prompt)
    Hello world!             # computer answer

Example BOX
Advanced material that can be skipped on first reading is enclosed in boxes with a light blue background. A List of Boxes is available for convenience.

General programming advice and good practices that apply beyond the Perl language are typeset in a smaller size and introduced by a yellow bar surmounted by an icon.

    1 say <<'EOT';
    2 Complete program listings are typeset
    3 in the same monospaced font as keywords.
    4 Their lines are numbered.
    5 EOT

Contents

Acknowledgment
How to read this course?

I Lesson 6

1 Thinking in list context
  1.1 Old-school plots in the terminal
  1.2 Perl values: Lists
  1.3 Storing lists in arrays
  1.4 Storing lists in hashes
  1.5 Working with lists
2 Formatting output
  2.1 check_overlap.pl output sample
  2.2 How to round numbers?
    2.2.1 floor and ceil
    2.2.2 sprintf
  2.3 The repetition operator
  2.4 More on Smart::Comments
3 Functions
  3.1 What are functions?
  3.2 Defining and using functions
  3.3 Function parameters and return values
    3.3.1 The default array @_
    3.3.2 return
    3.3.3 Argument aliasing
    3.3.4 Lexical and file variables
    3.3.5 Argument slurping
    3.3.6 Argument currying
    3.3.7 Bare return statements
    3.3.8 Argument flattening
  3.4 Bonus: Computing anagrams using recursive function calls
Homework

II Lesson 7

4 Sorted codon usage
  4.1 A gentle introduction to references
    4.1.1 Motivation
    4.1.2 Defining references
    4.1.3 Using references
    4.1.4 Anonymous arrays
    4.1.5 Peeking into nested data structures
  4.2 Sorting tables
    4.2.1 sort
    4.2.2 Sort blocks and sort comparison operators
    4.2.3 Sorting contexts
    4.2.4 The dereferencing arrow
    4.2.5 Short-circuiting and multiple sorting criteria
5 In silico restriction mapping
  5.1 Your very personal cutter
    5.1.1 How to build an enzyme database?
    5.1.2 How to fetch sequences from the NCBI website?
    5.1.3 How to try the program?
  5.2 The code for cutter.pl
Homework

III Lesson 8

6 The innards of cutter.pl
  6.1 Plain Old Documentation
  6.2 Getopt::Euclid
  6.3 Overview of cutter.pl
    6.3.1 Command-line interface
    6.3.2 Architecture
  6.4 More on references
    6.4.1 Tabular file parsing
      6.4.1.1 read_enzymes
    6.4.2 Defining and using nested data structures
      6.4.2.1 compute_cuts
      6.4.2.2 collect_sites
      6.4.2.3 infer_fragments
7 Parsing BLAST reports
  7.1 BLAST tabular format
  7.2 How to write a Perl module?
    7.2.1 The code for blast_table.t
    7.2.2 The code for parser.pl
    7.2.3 The code for BlastTable.pm
Homework

IV Lesson 9

8 More on Perl modules
  8.1 Structure of a Perl distribution
  8.2 Installing Perl modules
    8.2.1 Fine-tuning the installation process
    8.2.2 Specifying module dependencies
  8.3 Anatomy of a Perl module
    8.3.1 packages and namespaces
    8.3.2 Exporting symbols
  8.4 Automated tests
9 Our ultimate killer app
  9.1 Annotating sequences with reference sequences
  9.2 Packaging read_fasta and friends
    9.2.1 The code for fasta_file.t
    9.2.2 The code for FastaFile.pm
  9.3 Building our application
    9.3.1 The code for annotate.pl
Homework

V Lesson 10

10 Idiomatic Perl
  10.1 More on closures
    10.1.1 Function references
    10.1.2 subs with a memory
  10.2 The default variables
    10.2.1 The topic variable as the default iterator
  10.3 Implicit loops
    10.3.1 grep
    10.3.2 map
  10.4 Bonus: Higher-Order Perl
11 Dissecting our annotation app
  11.1 More on automated tests
  11.2 Interesting bits in Forem::FastaFile
  11.3 A mighty modular script: annotate.pl
    11.3.1 Conciseness, readability and maintainability by code reuse
    11.3.2 Building programs with programs
    11.3.3 Calling external programs
    11.3.4 Wrapping it up
Homework
Index

List of Boxes

1 Operator precedence in list definitions
2 Array slices and the amount context
3 Accessing individual values in lists
4 Reference count and the garbage collector
5 Sorting in natural order
6 Short-circuiting and assignments
7 Named arguments
8 Closures and BEGIN code blocks