Programming 2

Bioinformatics: Issues and Algorithms CSE 308-408 • Fall 2007 • Lecture 5

CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 5 - 1 - Administrative notes

• Homework #1 is due on Tuesday, Sept. 11 at 5:00 pm. Submit your work using the Blackboard Assignment function.

• Homework #2 will be available on Blackboard on Thursday, Sept. 13 at 9:00 am.

CSE Department Ice Cream Social (yum!)
Location: Packard Lab 360
Date: Tues., Sept. 11, 4:10 pm – 5:00 pm

Arrays

As we know, in bioinformatics, much of the data we care about consists of collections of genetic sequences. Simple scalar variables won't suffice ...

#! /usr/bin/perl -w

# The 'arrays1' program.

# A Perl list data structure. Perl array variables start with "@".
@list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

print "$list_of_sequences[1]\n";

metis:~/CSE308/Chapter4% arrays1
GCTCAGTTCT
metis:~/CSE308/Chapter4%

Why did this print GCTCAGTTCT and not TTATTATGTT?

Arrays

Arrays in Perl (and many other languages) start at index [0]:

#! /usr/bin/perl -w

# The 'arrays1' program.

@list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

print "$list_of_sequences[1]\n";

TTATTATGTT [0]

GCTCAGTTCT [1]

GACCTCTTAA [2]

metis:~/CSE308/Chapter4% arrays1
GCTCAGTTCT
metis:~/CSE308/Chapter4%
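Zero-based indexing can be checked directly. The sketch below is illustrative rather than part of the course programs: the scalar names are made up for the example, and the negative index goes slightly beyond what the slides cover.

```perl
use strict;
use warnings;

my @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

my $first      = $list_of_sequences[0];     # the first element lives at index 0
my $second     = $list_of_sequences[1];     # index 1 is the second element
my $last       = $list_of_sequences[-1];    # negative indices count back from the end
my $last_index = $#list_of_sequences;       # largest valid index (2 for three elements)

print "first:  $first\n";
print "second: $second\n";
print "last:   $last (index $last_index)\n";
```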

Manipulating arrays

#! /usr/bin/perl -w

# The 'arrays2' program.

@list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

print "$list_of_sequences[1]\n";

$list_of_sequences[1] = 'CTATGCGGTA';
$list_of_sequences[3] = 'GGTCCATGAA';

print "$list_of_sequences[1]\n";

The array before and after the two assignments:

Before               After
TTATTATGTT [0]       TTATTATGTT [0]
GCTCAGTTCT [1]       CTATGCGGTA [1]
GACCTCTTAA [2]       GACCTCTTAA [2]
                     GGTCCATGAA [3]

What does this do when it runs?

metis:~/CSE308/Chapter4% arrays2
GCTCAGTTCT
CTATGCGGTA
metis:~/CSE308/Chapter4%
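Assigning to index [3] works even though the array only had three elements, because Perl grows arrays on demand. A small sketch of what happens when the new index is not the next free slot; the index 6 and the scalar names are chosen only for illustration:

```perl
use strict;
use warnings;

my @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

$list_of_sequences[1] = 'CTATGCGGTA';    # overwrites an existing element
$list_of_sequences[3] = 'GGTCCATGAA';    # one past the end: the array grows by one
my $size_after_append = scalar @list_of_sequences;

$list_of_sequences[6] = 'ATCTGACCTC';    # skips indices 4 and 5
my $size_after_gap   = scalar @list_of_sequences;
my $gap_is_undefined = defined $list_of_sequences[4] ? 0 : 1;

print "size after append: $size_after_append\n";
print "size after gap:    $size_after_gap\n";
print "index 4 undefined: $gap_is_undefined\n";
```

The skipped slots hold no value, so printing them under -w would trigger an "uninitialized value" warning.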

How big is an array?

#! /usr/bin/perl -w

# The 'arrays3' program.

@list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

# $#list_of_sequences returns the largest array index, so add 1 to get the size.
print "The array size is: ", $#list_of_sequences+1, ".\n";

# Perl's scalar function evaluates the array in scalar context,
# which yields the number of list elements.
print "The array size is: ", scalar @list_of_sequences, ".\n";

metis:~/CSE308/Chapter4% arrays3
The array size is: 3.
The array size is: 3.
metis:~/CSE308/Chapter4%
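Both size idioms depend on where the array is evaluated. The following sketch contrasts scalar and list context for the same array; the variable names are illustrative and not taken from the course programs.

```perl
use strict;
use warnings;

my @list_of_sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

my $count      = @list_of_sequences;       # scalar context: the number of elements
my ( $first )  = @list_of_sequences;       # list context: the first element
my $last_index = $#list_of_sequences;      # largest index, always $count - 1
my $is_empty   = @list_of_sequences ? 0 : 1;   # a condition also imposes scalar context

print "count: $count, first: $first, last index: $last_index, empty: $is_empty\n";
```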

Adding elements to an array

#! /usr/bin/perl -w

# The 'arrays4' program.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

print "The array size is: ", $#sequences+1, ".\n";

# Perl combines these two lists.
@sequences = ( @sequences, 'CTATGCGGTA' );

print "The array size is: ", scalar @sequences, ".\n";

metis:~/CSE308/Chapter4% arrays4
The array size is: 3.
The array size is: 4.
metis:~/CSE308/Chapter4%

But be careful

Notice the effect of this code:

#! /usr/bin/perl -w

# The 'arrays6' program.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );

print "The array size is: ", $#sequences+1, ".\n";
print "@sequences\n";

# Overwrites the array.
@sequences = ( 'CTATGCGGTA' );

print "The array size is: ", scalar @sequences, ".\n";
print "@sequences\n";

metis:~/CSE308/Chapter4% arrays6
The array size is: 3.
TTATTATGTT GCTCAGTTCT GACCTCTTAA
The array size is: 1.
CTATGCGGTA
metis:~/CSE308/Chapter4%

Adding elements to an array

An obvious extension:

metis:~/CSE308/Chapter4% more arrays8
#! /usr/bin/perl -w

# The 'arrays8' program.

@sequence_1 = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA' );
@sequence_2 = ( 'GCTCAGTTCT', 'GACCTCTTAA' );
@combined_sequences = ( @sequence_1, @sequence_2 );

print "@combined_sequences\n";
metis:~/CSE308/Chapter4%

metis:~/CSE308/Chapter4% arrays8
TTATTATGTT GCTCAGTTCT GACCTCTTAA GCTCAGTTCT GACCTCTTAA
metis:~/CSE308/Chapter4%

Removing elements from an array: splicing

Perl provides a function for "surgically removing" part of an array:

#! /usr/bin/perl -w

# The 'remove1' program.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'TTATTATGTT' );

# Remove two array elements starting at index [1].
@removed_elements = splice @sequences, 1, 2;

print "@removed_elements\n";
print "@sequences\n";

metis:~/CSE308/Chapter4% splice1
GCTCAGTTCT GACCTCTTAA
TTATTATGTT TTATTATGTT
metis:~/CSE308/Chapter4%

The first line of output shows the removed elements; the second shows the new array.

Removing elements from an array: splicing

splice @sequences, OFFSET, LENGTH

OFFSET: start removing at this array index.
LENGTH: remove this many elements.

Notes:
• Splice returns removed elements.
• If no value for LENGTH provided, every element from OFFSET onward is removed.
• If no value for OFFSET provided, every element is removed.
• In latter case, more efficient to write @sequences = ();
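The three calling forms in the notes above can be tried one after another on the same array. This is a minimal sketch; the array contents reuse the course sequences, while @middle, @tail, and @rest are names invented for the example.

```perl
use strict;
use warnings;

my @sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA',
                  'CTATGCGGTA', 'ATCTGACCTC' );

# OFFSET and LENGTH: remove two elements starting at index 1.
my @middle = splice @sequences, 1, 2;

# OFFSET only: remove everything from index 1 onward.
my @tail = splice @sequences, 1;

# Neither OFFSET nor LENGTH: remove every remaining element.
my @rest = splice @sequences;

print "middle: @middle\n";
print "tail:   @tail\n";
print "rest:   @rest\n";
print "left:   ", scalar @sequences, " elements\n";
```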

Accessing elements in an array: slicing

To access array elements without removing them, use slice:

#! /usr/bin/perl -w

# The 'slices' program - slicing arrays.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
print "@sequences\n";

# Slice to access elements 1-3.
@seq_slice = @sequences[ 1 .. 3 ];
print "@seq_slice\n";
print "@sequences\n";

# Splice to remove elements 1-3.
@removed = splice @sequences, 1, 3;
print "@sequences\n";
print "@removed\n";

europa:~/CSE308/Chapter4% slices
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
GCTCAGTTCT GACCTCTTAA CTATGCGGTA
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
TTATTATGTT ATCTGACCTC
GCTCAGTTCT GACCTCTTAA CTATGCGGTA
europa:~/CSE308/Chapter4%

The second line of output is the slice, and the third shows that the array is unchanged by it. The last two lines come from the splice: the shortened array, then the removed elements.

Accessing elements in an array: slicing

@dnas[ 1 .. 9 ]      Access 2nd through 10th elements in array (".." is the Perl range operator)

@dnas[ 1, 4, 9 ]     Access 2nd, 5th, and 10th elements in array

Notes:
• To access list of elements from array, use a slice.
• To remove list of elements from array, use splice.
• Both return the elements in question.
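A short sketch of both slice forms on a five-element array. The contents of @dnas are placeholders invented for the example, and the final slice assignment goes a step beyond what the slides show.

```perl
use strict;
use warnings;

my @dnas = ( 'AAAA', 'CCCC', 'GGGG', 'TTTT', 'ACGT' );

my @range  = @dnas[ 1 .. 3 ];      # range operator: 2nd through 4th elements
my @picked = @dnas[ 0, 2, 4 ];     # explicit list of indices
my $size_after_slicing = scalar @dnas;    # slicing removes nothing

# A slice can also appear on the left of an assignment.
# Here it swaps the first two elements in place.
@dnas[ 0, 1 ] = @dnas[ 1, 0 ];

print "range:  @range\n";
print "picked: @picked\n";
print "array:  @dnas\n";
```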

Pushing, popping, shifting, and unshifting

Often, manipulation of arrays involves single elements, so Perl provides special functions to make this easier:

shift      Removes and returns first element from array
pop        Removes and returns last element from array
unshift    Adds element (or list) onto start of array
push       Adds element (or list) onto end of array

In the list below, 'TTATTATGTT' is at the start of the array and 'ATCTGACCTC' is at the end:

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );

Pushing, popping, shifting, and unshifting

#! /usr/bin/perl -w

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );
print "@sequences\n";

$last = pop @sequences;                 # 1: removes last element
print "@sequences\n";

$first = shift @sequences;              # 2: removes first element
print "@sequences\n";

unshift @sequences, $last;              # 3: places element at start
print "@sequences\n";

push @sequences, ( $first, $last );     # 4: places elements at end
print "@sequences\n";

europa:~/CSE308/Chapter4% pushpop
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA
GCTCAGTTCT GACCTCTTAA CTATGCGGTA
ATCTGACCTC GCTCAGTTCT GACCTCTTAA CTATGCGGTA
ATCTGACCTC GCTCAGTTCT GACCTCTTAA CTATGCGGTA TTATTATGTT ATCTGACCTC
europa:~/CSE308/Chapter4%

The first line of output is the original array; the next four lines show the array after steps 1 through 4.

Pushing, popping, shifting, and unshifting

"pop": pop last element (ATCTGACCTC) into $last
Before: TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC
After:  TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA

"shift": shift first element (TTATTATGTT) into $first
Before: TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA
After:  GCTCAGTTCT GACCTCTTAA CTATGCGGTA

"unshift": unshift one new element, $last (ATCTGACCTC)
Before: GCTCAGTTCT GACCTCTTAA CTATGCGGTA
After:  ATCTGACCTC GCTCAGTTCT GACCTCTTAA CTATGCGGTA

"push": push on two new elements, $first and $last (TTATTATGTT ATCTGACCTC)
Before: ATCTGACCTC GCTCAGTTCT GACCTCTTAA CTATGCGGTA
After:  ATCTGACCTC GCTCAGTTCT GACCTCTTAA CTATGCGGTA TTATTATGTT ATCTGACCTC
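These four functions are commonly paired to treat an array as a queue or a stack. The sketch below is one illustrative way to do that; the array name @pending and the scalar names are assumptions made for the example.

```perl
use strict;
use warnings;

my @pending = ( 'TTATTATGTT', 'GCTCAGTTCT' );

# Queue (first in, first out): add at the end, remove from the start.
push @pending, 'GACCTCTTAA';
my $next_in_queue = shift @pending;

# Stack (last in, first out): add at the end, remove from the end.
push @pending, 'CTATGCGGTA';
my $top_of_stack = pop @pending;

# unshift puts an element back at the start of the array.
unshift @pending, $next_in_queue;

print "next in queue: $next_in_queue\n";
print "top of stack:  $top_of_stack\n";
print "pending:       @pending\n";
```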

Iterating over all elements of an array

Perl makes it easy to iterate over all the elements of an array:

#! /usr/bin/perl -w
# The 'iterateW' program - iterate over an entire array.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );

$index = 0;
$last_index = $#sequences;

while ( $index <= $last_index )
{
    print "$sequences[ $index ]\n";
    ++$index;
}

phoebe:~/CSE308/Chapter4% iterateW
TTATTATGTT
GCTCAGTTCT
GACCTCTTAA
CTATGCGGTA
ATCTGACCTC
phoebe:~/CSE308/Chapter4%

Iterating over all elements of an array, take 2

Perl also provides an even easier way to do this:

#! /usr/bin/perl -w

# The 'iterateF' program - iterate over an entire array
# with 'foreach'.

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );

# Step through all elements.
foreach $value ( @sequences )
{
    print "$value\n";
}

Note: changing scalar $value also changes array!

phoebe:~/CSE308/Chapter4% iterateF
TTATTATGTT
GCTCAGTTCT
GACCTCTTAA
CTATGCGGTA
ATCTGACCTC
phoebe:~/CSE308/Chapter4%
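The note about $value is worth seeing in action: inside foreach, the loop variable is an alias for the current element, not a copy. In this sketch the lowercase input data and the use of uc are assumptions chosen only to make the change visible.

```perl
use strict;
use warnings;

my @sequences = ( 'ttattatgtt', 'gctcagttct' );

# Assigning to $value rewrites the array element it aliases.
foreach my $value ( @sequences )
{
    $value = uc $value;
}

# To leave the original untouched, loop over a copy instead.
my @originals = ( 'gacctcttaa' );
my @copies    = @originals;
foreach my $value ( @copies )
{
    $value = uc $value;
}

print "sequences: @sequences\n";
print "originals: @originals\n";
print "copies:    @copies\n";
```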

Easier list representations

Lists in Perl are comma-separated collections of scalars. They can be represented in a number of ways, however:

@sequences = ( 'TTATTATGTT', 'GCTCAGTTCT', 'GACCTCTTAA', 'CTATGCGGTA', 'ATCTGACCTC' );

You don't need quotes if there aren't any spaces. (These unquoted "barewords" are rejected under use strict, so this form only works in scripts that do not enable it.)

@sequences = ( TTATTATGTT, GCTCAGTTCT, GACCTCTTAA, CTATGCGGTA, ATCTGACCTC );

You can eliminate the commas as well by using "qw" ("quote words"):

@sequences = qw( TTATTATGTT GCTCAGTTCT GACCTCTTAA CTATGCGGTA ATCTGACCTC );

Hashes

In addition to arrays, Perl provides hashes, another powerful data structure that will come in handy on many occasions.

Perl array (indexing is implicit):
[0]  TTATTATGTT
[1]  GCTCAGTTCT
[2]  GACCTCTTAA

Perl hash (element accessed by specifying its name):
seqA  TTATTATGTT
seqZ  GCTCAGTTCT
seqC  GACCTCTTAA

Use "%" to indicate a hash ("associative array"):

%sequence_hash = ( seqA, TTATTATGTT, seqZ, GCTCAGTTCT, seqC, GACCTCTTAA );

Working with hashes

#! /usr/bin/perl -w
# The 'hash1' program.

%nucleotide_bases = ( A, Adenine, T, Thymine );

# To access a hash element, refer to its name.
print "The expanded name for 'A' is $nucleotide_bases{ 'A' }\n";

phoebe:~/CSE308/Chapter4% hash1
The expanded name for 'A' is Adenine
phoebe:~/CSE308/Chapter4%

#! /usr/bin/perl -w
# The 'hash2' program.

%nucleotide_bases = ( A, Adenine, T, Thymine );

# To determine names for a hash, use the keys function.
@hash_names = keys %nucleotide_bases;

print "The names in the %nucleotide_bases hash are: @hash_names\n";

phoebe:~/CSE308/Chapter4% hash2
The names in the %nucleotide_bases hash are: A T
phoebe:~/CSE308/Chapter4%

Working with hashes

#! /usr/bin/perl -w
# The 'hash3' program.

%nucleotide_bases = ( A, Adenine, T, Thymine );

# To determine the size of a hash, use keys in a scalar context.
$hash_size = keys %nucleotide_bases;

print "The size of the %nucleotide_bases hash is: $hash_size\n";

phoebe:~/CSE308/Chapter4% hash3
The size of the %nucleotide_bases hash is: 2
phoebe:~/CSE308/Chapter4%

To add entries to an existing hash, do this:

%nucleotide_bases = ( A, Adenine, T, Thymine );
...
$nucleotide_bases{ 'G' } = 'Guanine';
$nucleotide_bases{ 'C' } = 'Cytosine';

Working with hashes

#! /usr/bin/perl -w

# The 'hash4' program.

%nucleotide_bases = ( A, Adenine, T, Thymine );

$nucleotide_bases{ 'G' } = 'Guanine';
$nucleotide_bases{ 'C' } = 'Cytosine';

@hash_keys = keys %nucleotide_bases;
$hash_size = keys %nucleotide_bases;

print "The keys of the %nucleotide_bases hash are @hash_keys\n";
print "The size of the %nucleotide_bases hash is: $hash_size\n";

phoebe:~/CSE308/Chapter4% hash4
The keys of the %nucleotide_bases hash are A T C G
The size of the %nucleotide_bases hash is: 4
phoebe:~/CSE308/Chapter4%

Note: Perl does not store hashes in insertion order!

Moral: don't count on internal ordering of hash elements.
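When a predictable order matters, for example in output that will be compared between runs, one common approach is to sort the names returned by keys. A minimal sketch, with @sorted_names as an illustrative variable name:

```perl
use strict;
use warnings;

my %nucleotide_bases = ( 'A', 'Adenine', 'T', 'Thymine',
                         'G', 'Guanine', 'C', 'Cytosine' );

# keys returns the names in no particular order; sort makes the order repeatable.
my @sorted_names = sort keys %nucleotide_bases;

foreach my $name ( @sorted_names )
{
    print "$name is $nucleotide_bases{ $name }\n";
}
```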

Working with hashes

As a more readable shorthand notation for this:

%nucleotide_bases = ( A, Adenine, T, Thymine, G, Guanine, C, Cytosine );

Perl lets you do this:

%nucleotide_bases = ( A => Adenine, T => Thymine, G => Guanine, C => Cytosine );

You may use "=>" wherever you'd use a comma, although some places are obviously better than others ...

Removing entries from a hash

#! /usr/bin/perl -w
# The 'hash5' program.

%nucleotide_bases = ( A => Adenine, T => Thymine, G => Guanine, C => Cytosine );

@hash_keys = keys %nucleotide_bases;

print "The keys of the %nucleotide_bases hash are @hash_keys\n";

# delete removes both name and value.
delete $nucleotide_bases{ 'G' };

@hash_keys = keys %nucleotide_bases;
print "The keys of the %nucleotide_bases hash are @hash_keys\n";

phoebe:~/CSE308/Chapter4% hash5
The keys of the %nucleotide_bases hash are A T C G
The keys of the %nucleotide_bases hash are A T C
phoebe:~/CSE308/Chapter4%

Undefining variables

#! /usr/bin/perl -w
# The 'hash6' program.

%nucleotide_bases = ( A => Adenine, T => Thymine, G => Guanine, C => Cytosine );

@hash_keys = keys %nucleotide_bases;

print "The keys of the %nucleotide_bases hash are @hash_keys\n";

# This hash entry still exists, but its value is undefined.
$nucleotide_bases{ 'G' } = undef;

@hash_keys = keys %nucleotide_bases;
print "The keys of the %nucleotide_bases hash are @hash_keys\n";
print "The expanded name for 'G' is $nucleotide_bases{ 'G' }\n";

phoebe:~/CSE308/Chapter4% hash6
The keys of the %nucleotide_bases hash are A T C G
The keys of the %nucleotide_bases hash are A T C G
Use of uninitialized value in concatenation (.) or string at ./hash6 line 17.
The expanded name for 'G' is
phoebe:~/CSE308/Chapter4%

Perl complains when you try to use an undefined variable.
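The difference between deleting an entry and undefining its value can be tested with Perl's built-in exists and defined functions, which the slides do not cover. A small sketch under that assumption; the flag variable names are illustrative.

```perl
use strict;
use warnings;

my %nucleotide_bases = ( A => 'Adenine', T => 'Thymine',
                         G => 'Guanine', C => 'Cytosine' );

# Assigning undef keeps the entry but discards its value.
$nucleotide_bases{ 'G' } = undef;
my $g_exists  = exists  $nucleotide_bases{ 'G' } ? 1 : 0;
my $g_defined = defined $nucleotide_bases{ 'G' } ? 1 : 0;

# delete removes the name and the value together.
delete $nucleotide_bases{ 'T' };
my $t_exists = exists $nucleotide_bases{ 'T' } ? 1 : 0;

my $hash_size = keys %nucleotide_bases;

print "G exists: $g_exists, G defined: $g_defined\n";
print "T exists: $t_exists, size: $hash_size\n";
```

Checking with defined before printing a value is one way to avoid the "uninitialized value" warning.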

Slicing hashes

#! /usr/bin/perl -w
# The 'hash7' program.

# Note selective use of the single quote character.
# Good formatting makes this easy to read.
%gene_counts = ( Human                  => 31000,
                 'Thale cress'          => 26000,
                 'Nematode worm'        => 18000,
                 'Fruit fly'            => 13000,
                 Yeast                  => 6000,
                 'Tuberculosis microbe' => 4000 );

# Hash slice. Note this is an array of values.
@counts = @gene_counts{ Human, 'Fruit fly', 'Tuberculosis microbe' };

print "@counts\n";

phoebe:~/CSE308/Chapter4% hash7
31000 13000 4000
phoebe:~/CSE308/Chapter4%
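A hash slice can also be used on the left of an assignment to fill in several entries at once, which is not shown on the slide. A minimal sketch; @genomes and @selected are names introduced only for this example.

```perl
use strict;
use warnings;

my %gene_counts;
my @genomes = ( 'Human', 'Fruit fly', 'Yeast' );

# Assign through a slice: names on the left, values on the right, matched by position.
@gene_counts{ @genomes } = ( 31000, 13000, 6000 );

# Read through a slice: values come back in the order the names are listed.
my @selected = @gene_counts{ 'Yeast', 'Human' };

my $hash_size = keys %gene_counts;

print "selected: @selected\n";
print "size:     $hash_size\n";
```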

A complete example

#! /usr/bin/perl -w
# The 'genes' program - a hash of gene counts.

use constant LINE_LENGTH => 60;

%gene_counts = ( Human                  => 31000,
                 'Thale cress'          => 26000,
                 'Nematode worm'        => 18000,
                 'Fruit fly'            => 13000,
                 Yeast                  => 6000,
                 'Tuberculosis microbe' => 4000 );

# "x" is the Perl repetition operator.
print '-' x LINE_LENGTH, "\n";

# each returns successive name/value pairings.
while ( ( $genome, $count ) = each %gene_counts )
{
    print "'$genome' has a gene count of $count\n";
}

print '-' x LINE_LENGTH, "\n";

# Steps through sorted keys.
foreach $genome ( sort keys %gene_counts )
{
    print "'$genome' has a gene count of $gene_counts{ $genome }\n";
}

print '-' x LINE_LENGTH, "\n";

A complete example

phoebe:~/CSE308/Chapter4% genes
------------------------------------------------------------
'Human' has a gene count of 31000
'Tuberculosis microbe' has a gene count of 4000
'Fruit fly' has a gene count of 13000
'Nematode worm' has a gene count of 18000
'Yeast' has a gene count of 6000
'Thale cress' has a gene count of 26000
------------------------------------------------------------
'Fruit fly' has a gene count of 13000
'Human' has a gene count of 31000
'Nematode worm' has a gene count of 18000
'Thale cress' has a gene count of 26000
'Tuberculosis microbe' has a gene count of 4000
'Yeast' has a gene count of 6000
------------------------------------------------------------
phoebe:~/CSE308/Chapter4%

Note that keys are sorted in the second listing.
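The genes program sorts by name. Sorting by gene count instead needs a comparison block, which goes beyond what the slides cover; the sketch below shows one way, with @by_count as an illustrative name.

```perl
use strict;
use warnings;

my %gene_counts = ( Human                  => 31000,
                    'Thale cress'          => 26000,
                    'Nematode worm'        => 18000,
                    'Fruit fly'            => 13000,
                    Yeast                  => 6000,
                    'Tuberculosis microbe' => 4000 );

# Compare the values behind each pair of names; $b before $a gives largest first.
my @by_count = sort { $gene_counts{ $b } <=> $gene_counts{ $a } } keys %gene_counts;

foreach my $genome ( @by_count )
{
    print "'$genome' has a gene count of $gene_counts{ $genome }\n";
}
```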

Maxims from BBP Chapter 4

More key points to ponder as you start to program in Perl:
• Lists in Perl are comma-separated collections of scalars.
• Perl starts counting from zero, not one.
• Three main contexts in Perl: numeric, list, and scalar.
• To access list of values from array, use a slice.
• To remove list of values from array, use splice.
• Use foreach to process every element in an array.
• A hash is a collection of name / value pairings.
• Hash name parts must be unique.

Intro to subroutines

Following line was repeated 3 times in our complete example:

print '-' x LINE_LENGTH, "\n";

"Print a dash character LINE_LENGTH times and then follow this by printing a newline."

Seems kind of cryptic and not very general ... wouldn't it be nice to replace it by something more like this:

drawline "-", LINE_LENGTH;

Or this:

drawline( "-", LINE_LENGTH );

"Draw a line of dashes LINE_LENGTH long."

Intro to subroutines

#! /usr/bin/perl -w

# first_drawline - the first demonstration program for "drawline".

use constant REPEAT_COUNT => 60;

# Subroutine "drawline" specified here.
sub drawline {
    print "-" x REPEAT_COUNT, "\n";
}

# Subroutine invoked twice below.
print "This is the first_drawline program.\n";
drawline;
print "Its purpose is to demonstrate the first version of drawline.\n";
drawline;
print "Sorry, but it is not very exciting.\n";

phoebe:~/CSE308/Chapter5% first_drawline
This is the first_drawline program.
------------------------------------------------------------
Its purpose is to demonstrate the first version of drawline.
------------------------------------------------------------
Sorry, but it is not very exciting.
phoebe:~/CSE308/Chapter5%

Better, more flexible subroutines

Our previous example was quite limited:

sub drawline {
    print "-" x REPEAT_COUNT, "\n";
}

• Only prints dash (-) character.
• Only prints character REPEAT_COUNT times.

Subroutines can accept parameters as input:

drawline "-", LINE_LENGTH;

"Draw a line consisting of the specified character of the specified length."

Better, more flexible subroutines

Subroutines can accept parameters as input:

drawline "-", LINE_LENGTH;

The first parameter ("-") is the character to use; the second parameter (LINE_LENGTH) is the line length.

sub drawline {
    print $_[0] x $_[1], "\n";
}

@_ is called the "default array"; $_[0] and $_[1] are its first two elements.

(This notation works, but it's a little awkward. We'll see something better soon.)

A better drawline subroutine

#! /usr/bin/perl -w
# second_drawline - the second demonstration program for "drawline".

use constant REPEAT_COUNT => 60;

# $_[0] is the first parameter, $_[1] is the second parameter.
sub drawline {
    print $_[0] x $_[1], "\n";
}

print "This is the second_drawline program.\n";
drawline "-", REPEAT_COUNT;
print "Sorry, but it is still not very exciting. However, it is more useful.\n";

# Note variety of ways drawline can be invoked.
drawline "=", REPEAT_COUNT;
drawline "-oOo-", 12;
drawline "- ", 30;
drawline ">>==<<==", 8;

altair:~/CSE308/Chapter5% second_drawline
This is the second_drawline program.
------------------------------------------------------------
Sorry, but it is still not very exciting. However, it is more useful.
============================================================
-oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==
altair:~/CSE308/Chapter5%

Using shift() to process the default array

#! /usr/bin/perl -w
# third_drawline - the third demonstration program for "drawline".

use constant REPEAT_COUNT => 60;

# Each call to shift() returns next item in default array.
sub drawline {
    print shift() x shift(), "\n";
}

print "This is the third_drawline program.\n";
drawline "-", REPEAT_COUNT;
print "Sorry, but it is still not very exciting. However, it is more useful.\n";

drawline "=", REPEAT_COUNT;
drawline "-oOo-", 12;
drawline "- ", 30;
drawline ">>==<<==", 8;

europa:~/CSE308/Chapter5% third_drawline
This is the third_drawline program.
------------------------------------------------------------
Sorry, but it is still not very exciting. However, it is more useful.
============================================================
-oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==>>==<<==
europa:~/CSE308/Chapter5%

Better processing of parameters

What happens if we call a subroutine with too few parameters?

#! /usr/bin/perl -w
# third_drawline - the third demonstration program for "drawline".

use constant REPEAT_COUNT => 60;

sub drawline {
    print shift() x shift(), "\n";
}

print "This is the third_drawline program.\n";
drawline;                       # note missing parameters
print "Sorry, but it is still not very exciting. However, it is more useful.\n";

drawline "=", REPEAT_COUNT;
drawline "-oOo-", 12;
drawline "- ", 30;
drawline ">>==<<==", 8;

europa:~/CSE308/Chapter5% third_drawline
This is the third_drawline program.
Use of uninitialized value in repeat (x) at ./third_drawline line 8.

...

It would be better if there was a reasonable default behavior here.

Better processing of parameters

#! /usr/bin/perl -w
# fourth_drawline - the fourth demonstration program for "drawline".

use constant REPEAT_COUNT => 60;

sub drawline {
    $chars = shift || "-";              # if no parameters present, uses dash as default
    $count = shift || REPEAT_COUNT;     # if count not present, uses REPEAT_COUNT as default

    print $chars x $count, "\n";
}

print "This is the fourth_drawline program.\n";
drawline;
print "Sorry, but it is still not very exciting. However, it is more useful.\n";

drawline "=", REPEAT_COUNT;
drawline "-oOo-", 12;
drawline "- ", 30;
drawline ">>==<<==", 8;

europa:~/CSE308/Chapter5% fourth_drawline
This is the fourth_drawline program.
------------------------------------------------------------
Sorry, but it is still not very exciting. However, it is more useful.
============================================================
-oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo--oOo-
...
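One caveat with the || idiom: it falls back to the default whenever the parameter is false, and the string "0" counts as false in Perl. The sketch below compares it with an explicit defined test. Both subroutine names are hypothetical, and they return the line instead of printing it so the results are easy to compare.

```perl
use strict;
use warnings;

use constant REPEAT_COUNT => 60;

# Same defaulting as fourth_drawline: any false value triggers the default.
sub line_with_or_default {
    my $chars = shift || '-';
    my $count = shift || REPEAT_COUNT;
    return $chars x $count;
}

# Alternative: fall back only when the parameter is actually missing.
sub line_with_defined_default {
    my ( $chars, $count ) = @_;
    $chars = '-'          unless defined $chars;
    $count = REPEAT_COUNT unless defined $count;
    return $chars x $count;
}

print line_with_or_default( '0', 5 ), "\n";         # the "0" pattern is lost
print line_with_defined_default( '0', 5 ), "\n";    # the "0" pattern is kept
print line_with_defined_default(), "\n";            # defaults still apply
```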

Better processing of parameters

"What if" scenarios ...

drawline "===", 10;

==============================
europa:~/CSE308/Chapter5%

drawline "*";

************************************************************
europa:~/CSE308/Chapter5%

drawline 40;

40404040404040404040404040404040404040404040404040404040404040404040404040
4040404040404040404040404040404040404040404040
europa:~/CSE308/Chapter5%

Not what we intended!

drawline 20, "-";

Argument "-" isn't numeric in repeat (x) at ./fourth_drawline line 11.
europa:~/CSE308/Chapter5%

It would be great if drawline could handle its parameters in any order.

Even better processing of parameters

By passing each parameter as a name/value pair, the programmer can supply parameters in any order:

drawline( Pattern => "*" );
drawline( Count => 20 );
drawline( Count => 5, Pattern => " -oOo- ");
drawline( Pattern => "===", Count => 10 );
drawline;

Note, however, that the programmer must now also provide the name of each parameter.

For this call:

drawline( Count => 5, Pattern => " -oOo- ");

the default array is converted into a hash:

%arguments = @_;

@_ (default array)       %arguments (hash)
[0]  Count               Count    5
[1]  5                   Pattern  " -oOo- "
[2]  Pattern
[3]  " -oOo- "

Even better processing of parameters

# Convert default array to hash, then access parameters via names.
sub drawline {
    %arguments = @_;

    $chars = $arguments{ Pattern } || "-";
    $count = $arguments{ Count } || REPEAT_COUNT;
    print $chars x $count, "\n";
}

drawline( Pattern => "*" );
drawline( Count => 20 );
drawline( Count => 5, Pattern => " -oOo- ");
drawline( Pattern => "===", Count => 10 );
drawline;

************************************************************
--------------------
 -oOo-  -oOo-  -oOo-  -oOo-  -oOo- 
==============================
------------------------------------------------------------
europa:~/CSE308/Chapter5%
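A common variation is to list the defaults first and let the caller's pairs override them when the hash is built, since a later pair with the same name replaces an earlier one. This is a sketch of that idea, not the version from the course text; drawline_text is a hypothetical name, and it returns the line rather than printing it.

```perl
use strict;
use warnings;

use constant REPEAT_COUNT => 60;

sub drawline_text {
    # Defaults come first; any pairs passed in @_ override them.
    my %arguments = ( Pattern => '-', Count => REPEAT_COUNT, @_ );

    return $arguments{ Pattern } x $arguments{ Count };
}

print drawline_text( Count => 20 ), "\n";
print drawline_text( Pattern => '===', Count => 10 ), "\n";
print drawline_text( Count => 5, Pattern => ' -oOo- ' ), "\n";
print drawline_text(), "\n";
```

Unlike the || form, this keeps a caller-supplied pattern of "0", because the default is replaced by name rather than by a truth test.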

A more flexible approach

The fact that drawline outputs a newline each time is limiting. Say we want to produce the following output:

+---------------+
|               |
|               |
|               |
+---------------+

Writing this:

print "+";
drawline( Count => 15 );
print "+";

Results in this:

+---------------
+europa:~/CSE308/Chapter5%

Not what we want!

A more flexible approach

Solve part of the problem by removing newline from drawline. The following code fragment works then:

print "+";
drawline( Count => 15 );
print "+\n";

+---------------+
europa:~/CSE308/Chapter5%

Getting a little too ambitious, however, results in this:

print "+", drawline( Count => 15 ), "+\n";

---------------+1+
europa:~/CSE308/Chapter5%

Perl invokes the drawline subroutine before producing any output, so the dashes appear first. The "1" is the return value from drawline.

A more flexible approach

Even better: separate tasks of formatting and printing:

sub drawline {
    %arguments = @_;

    $chars = $arguments{ Pattern } || "-";
    $count = $arguments{ Count } || REPEAT_COUNT;

    return $chars x $count;
}

Later invocations print lines to generate a box:

print "+", drawline, "+\n";
print "|", drawline ( Pattern => " " ), "|\n";
print "|", drawline ( Pattern => " " ), "|\n";
print "|", drawline ( Pattern => " " ), "|\n";
print "+", drawline, "+\n";

europa:~/CSE308/Chapter5% boxes
+------------------------------------------------------------+
|                                                            |
|                                                            |
|                                                            |
+------------------------------------------------------------+
europa:~/CSE308/Chapter5%
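Once drawline returns its line instead of printing it, other subroutines can build on it. The sketch below wraps the box-drawing calls in a helper; box_lines and its width/height parameters are hypothetical additions, and the drawline shown is a strict-friendly restatement of the version above.

```perl
use strict;
use warnings;

use constant REPEAT_COUNT => 60;

sub drawline {
    my %arguments = @_;

    my $chars = $arguments{ Pattern } || '-';
    my $count = $arguments{ Count }   || REPEAT_COUNT;

    return $chars x $count;
}

# Hypothetical helper: returns the lines of a box with the given inner width and height.
sub box_lines {
    my ( $width, $height ) = @_;

    my @lines;
    push @lines, '+' . drawline( Count => $width ) . '+';
    push @lines, '|' . drawline( Pattern => ' ', Count => $width ) . '|' for 1 .. $height;
    push @lines, '+' . drawline( Count => $width ) . '+';

    return @lines;
}

my @box = box_lines( 15, 3 );
print "$_\n" foreach @box;
```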

Visibility and scope

Consider the following simple Perl program:

#! /usr/bin/perl -w
# global_scope - the effect of "global" variables.

sub adjust_up {
    $other_count = 1;
    print "count at start of adjust_up: $count\n";
    $count++;
    print "count at end of adjust_up: $count\n";
}

$count = 10;
print "count in main: $count\n";
adjust_up;
print "count in main: $count\n";
print "other_count in main: $other_count\n";

europa:~/CSE308/Chapter5% global_scope
count in main: 10
count at start of adjust_up: 10
count at end of adjust_up: 11
count in main: 11
other_count in main: 1
europa:~/CSE308/Chapter5%

By default, variables in Perl are accessible anywhere, no matter where they are defined. In other words, Perl variables are "global."

Private variables in Perl

There are times when global accessibility is not what you want. To declare a variable private, use "my":

#! /usr/bin/perl -w
# private_scope - the effect of "my" variables.

sub adjust_up {
    my $other_count = 1;
    print "count at start of adjust_up: $count\n";
    $count++;
    print "count at end of adjust_up: $count\n";
}

my $count = 10;
print "count in main: $count\n";
adjust_up;
print "count in main: $count\n";
print "other_count in main: $other_count\n";

europa:~/CSE308/Chapter5% private_scope
count in main: 10
count at start of adjust_up:
count at end of adjust_up: 1
count in main: 10
other_count in main:
europa:~/CSE308/Chapter5%

Notes:
• adjust_up can't see count (its "count at start" line is empty).
• The increment of count is not visible in main (count is still 10).
• main can't see other_count.
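In private_scope, my $count is declared after adjust_up, so the subroutine never sees it and quietly uses a separate global $count. With use strict enabled, Perl refuses to compile such undeclared variables, which turns this kind of surprise into an error. The sketch below assumes use strict and shows two ways to make the intent explicit; the subroutine adjusted and the scalar names are illustrative.

```perl
use strict;
use warnings;

# Declared before the subroutine, so adjust_up can see this lexical.
my $count = 10;

sub adjust_up {
    my $other_count = 1;    # private to adjust_up
    $count++;               # changes the file-level $count declared above
    return $other_count;
}

# Often clearer: take the value as a parameter and return the result.
sub adjusted {
    my ( $value ) = @_;
    return $value + 1;
}

adjust_up();
my $after_shared_style = $count;
my $after_param_style  = adjusted( 10 );

print "shared variable style: $after_shared_style\n";
print "parameter style:       $after_param_style\n";
```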

Maxims from BBP Chapter 5

Yet more key points to keep in mind as you learn Perl:
• Whenever you think you will reuse code, create a subroutine.
• When determining scope of a variable, consider its visibility.
• Unless good reason not to, always declare variables with my.
• If you must use a global variable, declare it with our.

Debugging advice

A wise and famous saying I once encountered:

“One bug is easy to find. Many bugs will blow your mind.”

Moral: write your programs in small pieces. Thoroughly test each piece before moving on. Do not type in dozens of lines of Perl code and then run it, expecting it to work – it won't. Tracking down and fixing a single bug is doable. A program that contains multiple bugs is usually beyond hope.

Wrap-up

Readings for next time:
• BB&P Chapters 6-8 (more Perl programming).

Remember:
• Come to class having done the readings.
• Check Blackboard regularly for updates.
