Installing & Compiling

Spyder Source

The following pages contain the Perl source code for Spyder. This document presents the Linux version of the source code, together with instructions for installation, compilation and usage. A description of the modifications made to the code for compilation as a Windows executable follows.

Installation and compilation of Spyder on a Linux/Unix or Mac platform is straightforward, requiring only a few steps. Begin by creating a file “Spyder.pl” in your favourite text editor, then copy and paste the included source code into that file. Make “Spyder.pl” executable:

chmod +x Spyder.pl

Copy or move “Spyder.pl” to a folder in the path (e.g. /usr/local/bin). If you do not have sufficient privileges to place it in the path, run Spyder from the folder that contains it and prefix the command with “./” (e.g. ./Spyder.pl ...).

Spyder is run from the command line on all Unix-type systems, using the following format:

Spyder.pl [ -Options [--] [Arguments]]

Options include: in, out, help and man. Both ‘in’ and ‘out’ take arguments; ‘in’ is required, whereas ‘out’ is optional. If ‘out’ is not included, Spyder writes to the default output file “results.txt”. Information about Spyder can be obtained using the options ‘help’ and ‘man’. Options can be given in long or short format (e.g. Long: --man and Short: -m). The short version has a single ‘-’ followed by the first letter of the option, whereas the long version has a double ‘--’ followed by the full name of the option.
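The option rules above can be illustrated with a small, stand-alone sketch. It is not part of Spyder.pl: the helper name parse_spyder_args is hypothetical, and only the ‘in’ and ‘out’ options are modelled. It uses Getopt::Long, the same module as the included source code, and relies on that module's default behaviour of accepting unique abbreviations, which is why “-i” is understood as “in”.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not part of Spyder.pl): parse an argument list with
# the 'in' and 'out' options and return the input and output file names.
sub parse_spyder_args {
    my @args = @_;
    my ($in, $out);

    GetOptionsFromArray(
        \@args,
        'in=s'  => \$in,     # input file name
        'out=s' => \$out,    # output file name (may be left out)
    ) or return;

    # Fall back to the default output file named in the text
    $out = 'results.txt' unless defined $out;

    return ($in, $out);
}

# Short form for 'in', no 'out' given
my ($in, $out) = parse_spyder_args('-i', 'probset.list');
print "in=$in out=$out\n";
```

Run as shown, the sketch should report “probset.list” as the input file and fall back to “results.txt” for the output, matching the last example in the Examples section.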

Examples

Spyder.pl -help

This prints out basic usage for Spyder.

Spyder.pl -man

This opens a more comprehensive man page for Spyder; to exit, type ‘q’.

Spyder.pl -i probset.list --out probset_results.txt

Here Spyder takes “probset.list” as input and writes all output to “probset_results.txt”.

Spyder.pl --in probset.list

Here Spyder takes “probset.list” as input and writes all output to “results.txt”.

Windows Version

Due to differences between operating systems, a series of modifications to the source code were required prior to compilation of the executable. Modifications can be traced in the included source code using the following key:

Red Italicized Code indicates code that was removed, and **Blue Comments** show code that was substituted for the original.
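Several of the traced modifications replace the “/” path separator with “\\” when the raw and summary file names are built. As a point of comparison only, and not the approach taken in the Spyder source, the sketch below builds such a path with the core File::Spec classes, which apply the separator rules of a given platform. The helper name prefixed_path and the folder name “out” are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec::Unix;
use File::Spec::Win32;

# Hypothetical helper (not part of Spyder.pl): join a folder and a prefixed
# file name using the path rules of the given File::Spec class.
sub prefixed_path {
    my ($spec, $prefix, $dir, $file) = @_;
    return $spec->catfile($dir, $prefix . $file);
}

# Same inputs, two sets of platform rules
print prefixed_path('File::Spec::Unix',  'raw_', 'out', 'results.txt'), "\n";
print prefixed_path('File::Spec::Win32', 'raw_', 'out', 'results.txt'), "\n";
```

In a script meant to run unchanged on both platforms, calling File::Spec->catfile directly would select the rules of the operating system the script is running on.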

Source Code

#!/usr/bin/perl -w
#
# Script Name: Spyder.pl
# Created:     Aug 31, 2010
#
# Usage:       Spyder.pl [options]
#
# Description: This script reads in RDP Probematch output files and
#              determines which results differed from the Probe, the location
#              of the difference and the value that differs
#
###################################################################################################
use strict;
use Switch;

use List::Util qw[min max];
use File::Basename;
use Data::Dumper;
use Pod::Usage;

use Getopt::Long;
Getopt::Long::Configure ("no_ignore_case");

################################################################################
# Set Global Variables
################################################################################
my $GAP   = -5;
my $p_VER = '0.01.1';
my ($opt_help, $opt_man, $opt_infile, $opt_format, $opt_outfile, $opt_verbose);

################################################################################
# Determine Time and Date information for current date
################################################################################
my @months = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my @days   = qw(Sun Mon Tue Wed Thu Fri Sat Sun);
my ($sec, $min, $hr, $dayMon, $mon, $yroffset, $daywk, $dayyr, $dls) = localtime();
my $year    = 1900 + $yroffset;
my $theTime = "$hr:$min:$sec, $days[$daywk] $months[$mon] $dayMon, $year";

################################################################################
# Parse command line options
################################################################################
my $PROG_NAME = $0;

GetOptions(
    'help!'    => \$opt_help,
    'man!'     => \$opt_man,
    'in=s'     => \$opt_infile,
    'out=s'    => \$opt_outfile,
    'verbose!' => \$opt_verbose,
) or pod2usage(-verbose => 1) && exit;

pod2usage(-verbose => 1) && exit if defined $opt_help;
pod2usage(-verbose => 2) && exit if defined $opt_man;
pod2usage(-verbose => 1) && exit if !defined $opt_infile;

################################################################################
# Verbose - Screen output - reads different based on input flags
################################################################################
print "Parsing Probematch Output..." if defined $opt_verbose;
print "\n\n" if defined $opt_verbose;

################################################################################
# Sets Output filenames and path information
################################################################################
my ($outfile, $rawout, $summary, $base, $dir);

$outfile = $opt_outfile ? $opt_outfile : "results.txt";
# ** $outfile = dirname($ARGV[0])."\\results_".basename($ARGV[0]); **

$base = basename($outfile);
$dir  = dirname($outfile);
if ($dir eq ".") {
    $rawout  = "raw_".$base;
    $summary = "summary_".$base;
} else {
    $rawout  = $dir."/raw_".$base;
    # ** $rawout = $dir."\\raw_".$base; **
    $summary = $dir."/summary_".$base;
    # ** $summary = $dir."\\summary_".$base; **
}

################################################################################
# Open File to Read In and File to Write Out To
################################################################################
my ($ans);

# Infile
open (IN, "<$opt_infile") or die "Can't open $opt_infile:$!\n";
# ** open (IN, "<$ARGV[0]") or die "Can't open $ARGV[0]:$!\n"; **

# Outfile, Raw Output and Summary Output
if ((-e $outfile) || (-e $summary)) {
    print "\nAt least one of: ('$outfile', '$summary') already exist!! Do you want to overwrite? [y|n] ";
    chomp($ans = <>);

    if (($ans eq 'y') || ($ans eq 'Y')) {
        open (OUT, ">$outfile") or die "Can't open $outfile:$!\n";
        open (RAW, ">$rawout") or die "Can't open $rawout:$!\n";
        open (SUM, ">$summary") or die "Can't open $summary:$!\n";
    } else {
        die "Change output file name from '$outfile'\n";
    }
} else {
    open (OUT, ">>$outfile") or die "Can't open $outfile:$!\n";
    open (RAW, ">>$rawout") or die "Can't open $rawout:$!\n";
    open (SUM, ">>$summary") or die "Can't open $summary:$!\n";
}

################################################################################
# Read infile contents into an array
################################################################################
my @lines = <IN>;
close IN;

################################################################################
# Extract Probe - for comparison
################################################################################
my @pr = split(/ /, $lines[0]);
my $PROBE = $pr[1];
chomp $PROBE;
my $cPROBE = " ".$PROBE;

################################################################################
# Create Similarity matrix
################################################################################
my %SIMILARITY = &similarity();

################################################################################
# Loop through the File a line at a time and align the target to the probe if
# they differ to determine position and whether INDEL or SNP
################################################################################
my (@target, $target, $line);

# Print Header to Outfile and Summary
&print_header();
for $line (@lines) {

    # Skip commented lines - mostly for the first two of input file
    next if $line =~ /#/;
    # Split the values of the line into an array - target is found in block 3
    @target = ();
    @target = split(/ /, $line);

    # Need target as string with space at beginning for alignment
    $target = " ".uc($target[2]);

    if ($target eq $cPROBE) {
        next;
    } else {
        # Create the F matrix and initialize
        my @fmatrix = &fmat($cPROBE, $target, \%SIMILARITY);

        # Perform the traceback to get the alignment
        my ($palign_ref, $talign_ref) = &traceback($cPROBE, $target, \%SIMILARITY, \@fmatrix);
        my @p_align = @$palign_ref;
        my @t_align = @$talign_ref;

        # Analyze the alignment, print out alignment and determine SNPs and INDELS in target
        &align_analysis($line, \@p_align, \@t_align);
    }
}

################################################################################
# Finish Off and Close Files
################################################################################
print OUT "\n#EOF\n";
close OUT;
close RAW;

################################################################################
# Process Raw Output File for the summary file
################################################################################

# Read in Raw Output
open (DAT, "<$rawout") or die "Can't open $rawout:$!\n";
my @raw = <DAT>;
close DAT;

# Process Raw Data for Summary
&process_raw(\@raw);

# Finish Off
print SUM "\n#EOF\n";
close SUM;
unlink($rawout);
exit;

### SUBROUTINES FOLLOW ###

####################################################################################################
# Subroutine Similarity
#    This subroutine creates the similarity matrix for the alignment algorithm
####################################################################################################
sub similarity {

    ############################################################################
    # Variable Declaration
    ############################################################################
    my %SIM = ();
    my (@dat_x, @dat_y, $val_x, $val_y, $key, $sum, $i, $j);

    ############################################################################
    # Define Setup Hashes
    ############################################################################
    my %base_sim = (
        "AA" => 10, "GG" => 7,  "CC" => 9,  "TT" => 8,
        "AG" => -1, "GA" => -1,
        "AT" => -4, "TA" => -4,
        "AC" => -3, "CA" => -3,
        "GT" => -3, "TG" => -3,
        "GC" => -5, "CG" => -5,
        "CT" => 0,  "TC" => 0
    );

    my %LIST = (
        "A" => "A,-,-,-",
        "G" => "-,G,-,-",
        "C" => "-,-,C,-",
        "T" => "-,-,-,T",
        "R" => "A,G,-,-",
        "M" => "A,-,C,-",
        "W" => "A,-,-,T",
        "K" => "-,G,-,T",
        "S" => "-,G,C,-",
        "Y" => "-,-,C,T",
        "H" => "A,-,C,T",
        "B" => "-,G,C,T",
        "D" => "A,G,-,T",
        "N" => "A,G,C,T",
        "V" => "A,G,C,-"
    );

    my @x = qw( A G C T R M W K S Y H B D N V );
    my @y = qw( A G C T R M W K S Y H B D N V );

    ############################################################################
    # Create the matrix with nested loops
    ############################################################################
    foreach $val_x (@x) {
        foreach $val_y (@y) {

            @dat_x = split(/\,/, $LIST{ $val_x });
            @dat_y = split(/\,/, $LIST{ $val_y });
            $sum = 0;

            for $i (0..3) {
                if ($dat_x[$i] eq '-') { next; }

                for $j (0..3) {
                    if ($dat_y[$j] eq '-') { next; }

                    $key = $dat_x[$i].$dat_y[$j];
                    $sum = $sum + $base_sim{$key};
                }
            }
            push @{$SIM{$val_x.$val_y}}, $sum;
        }
    }
    return %SIM;
}

####################################################################################################
# Subroutine fmat
#    This subroutine creates the fmatrix and initializes it
####################################################################################################
sub fmat {

    ############################################################################
    # Passed in Variables
    ############################################################################
    my ($pb, $tg, $sim) = @_;

    ############################################################################
    # Variable Declaration
    ############################################################################
    my (@fmat);
    my ($ch1, $ch2, $ch3);

    ############################################################################
    # Get the length of each sequence and create an array for each
    ############################################################################
    my $pbsize = length($pb);
    my $tgsize = length($tg);
    my @pbarr = split(//, $pb);
    my @tgarr = split(//, $tg);

    ############################################################################
    # Initialize f matrix -- create the fmatrix - with all score values
    ############################################################################
    for my $i (0..$pbsize-1) {
        $fmat[$i][0] = $GAP * $i;
    }
    for my $j (0..$tgsize-1) {
        $fmat[0][$j] = $GAP * $j;
    }
    for my $i (1..$pbsize-1) {
        for my $j (1..$tgsize-1) {

            if (($pbarr[$i] eq '-') || ($tgarr[$j] eq '-')) {
                $ch1 = $fmat[$i-1][$j-1] + $GAP;
            } else {
                $ch1 = $fmat[$i-1][$j-1] + $sim->{$pbarr[$i].$tgarr[$j]}[0];
            }
            $ch2 = $fmat[$i-1][$j] + $GAP;
            $ch3 = $fmat[$i][$j-1] + $GAP;
            $fmat[$i][$j] = max($ch1, $ch2, $ch3);
        }
    }
    return @fmat;
}

####################################################################################################
# Subroutine traceback
#    This subroutine uses the fmatrix and traces back through it to create
#    the alignment
####################################################################################################
sub traceback {

    ############################################################################
    # Passed in Variables
    ############################################################################
    my ($pb, $tg, $sim, $fmatrix) = @_;

    ############################################################################
    # Variable Declaration
    ############################################################################
    my (@pb_align, @tg_align);
    my ($score, $score_diag, $score_up, $score_left, $sim_score);
    my $posI = length($pb) - 1;
    my $posJ = length($tg) - 1;

    ############################################################################
    # Set-up Probe and target arrays - similar to subroutine fmat
    ############################################################################
    my @pbarr = split(//, $pb);
    my @tgarr = split(//, $tg);

    ############################################################################
    # Alignment step - loop through the fmatrix starting at pos ixj to 0x0
    ############################################################################
    while (($posI > 0) && ($posJ > 0)) {

        # Set the values of score, score_diag, score_up and score_left
        $score      = $fmatrix->[$posI][$posJ];
        $score_diag = $fmatrix->[$posI-1][$posJ-1];
        $score_up   = $fmatrix->[$posI][$posJ-1];
        $score_left = $fmatrix->[$posI-1][$posJ];

        # Create the alignment matrices based on scores
        if (($pbarr[$posI] eq '-') || ($tgarr[$posJ] eq '-')) {
            $sim_score = $GAP;
        } else {
            $sim_score = $sim->{$pbarr[$posI].$tgarr[$posJ]}[0];
        }

        if ($score == ($score_diag + $sim_score)) {
            unshift (@pb_align, $pbarr[$posI]);
            unshift (@tg_align, $tgarr[$posJ]);
            $posI--;
            $posJ--;
        } elsif ($score == $score_left + $GAP) {
            unshift (@pb_align, $pbarr[$posI]);
            unshift (@tg_align, "-");
            $posI--;
        } else {
            unshift (@pb_align, "-");
            unshift (@tg_align, $tgarr[$posJ]);
            $posJ--;
        }
    }

    ############################################################################
    # Need to account for any gaps at the beginning
    ############################################################################
    while ($posI > 0) {
        unshift (@pb_align, $pbarr[$posI]);
        unshift (@tg_align, "-");
        $posI--;
    }

    while ($posJ > 0) {
        unshift (@pb_align, "-");
        unshift (@tg_align, $tgarr[$posJ]);
        $posJ--;
    }
    return (\@pb_align, \@tg_align);
}

####################################################################################################
# Subroutine align_analysis
#    This subroutine analyses the alignment for SNPs and indels and outputs
#    the data in a user friendly output and a raw output file
####################################################################################################
sub align_analysis {

    ############################################################################
    # Passed in Variables
    ############################################################################
    my ($ln, $p_aln, $t_aln) = @_;

    ############################################################################
    # Variable Declaration
    ############################################################################
    my ($seqid, $editdist, $fvp_targ, $thrp_prb);
    my $desc = "";
    my (@info, @indel, @sub, @kval);

    ############################################################################
    # Step One: Split Info into separate variables
    ############################################################################
    @info = split(/ /, $ln);
    $seqid    = $info[0];
    $editdist = $info[1];
    $fvp_targ = $info[2];
    $thrp_prb = $info[3];

    chomp @info;
    for my $v (4..@info-1) {
        $desc = $desc.$info[$v]." ";
    }

    ############################################################################
    # Step Two: Determine which base K codes for
    ############################################################################
    my @fvp  = split(//, $fvp_targ);
    my @thrp = split(//, $thrp_prb);

    for my $v (0..@thrp-1) {
        if ($thrp[$v] eq "K") {
            push (@kval, ($v+1).".".uc($fvp[$v]));
        }
    }

    ############################################################################
    # Step Three: Use aligned arrays to determine indels and snps
    ############################################################################
    for my $i (0..$#$p_aln) {
        if (($p_aln->[$i] eq "-") && ($t_aln->[$i] eq "-")) {
            next;
        } elsif ($p_aln->[$i] eq "-") {
            push (@indel, "i.".$i.".".$t_aln->[$i]);
        } elsif ($t_aln->[$i] eq "-") {
            push (@indel, "d.".$i.".".$p_aln->[$i]);
        } elsif ($p_aln->[$i] ne $t_aln->[$i]) {
            push (@sub, "s.".$i.".".$p_aln->[$i].".".$t_aln->[$i]);
        } else {
            next;
        }
    }

    ############################################################################
    # Printing Step - Raw Output
    ############################################################################
    print RAW "$seqid;";

    if (@indel) {
        print RAW join(',', @indel);
    }
    print RAW ";";

    if (@sub) {
        print RAW join(',', @sub);
    }
    print RAW ";";

    if (@kval) {
        print RAW "K.";
        print RAW join(',K.', @kval);
    }
    print RAW "\n";

    ############################################################################
    # Printing Step - User Readable Output
    ############################################################################
    # Data Info and Alignment
    print OUT "------\n";
    print OUT "SeqID: \t $seqid\n";
    print OUT "Edit_Dist: \t $editdist\n";
    print OUT "5' Target 3':\t $fvp_targ\n";
    print OUT "3' Probe 5': \t $thrp_prb\n";
    print OUT "Description: \t $desc\n";
    print OUT "------\n\n";
    print OUT " Alignment: \t @$p_aln [Comp. Probe]\n";
    print OUT " \t @$t_aln [3' Probe]\n\n";

    # Alignment Analysis
    my $indel_size = @indel;
    my $sub_size   = @sub;
    my $k_size     = @kval;

    print OUT " Analysis:\n";
    print OUT " INDEL: $indel_size\n";

    if (@indel) {
        foreach (@indel) {
            my @temp = split(/\./, $_);
            if ($temp[0] eq "d") {
                print OUT "\t\t\tDeletion\t";
            } else {
                print OUT "\t\t\tInsertion\t";
            }
            print OUT "--> Pos: ", ($temp[1] + 1), "\tBase: $temp[2]\n";
        }
    }
    print OUT "\n";
    print OUT " SUBS: $sub_size\n";

    if (@sub) {
        foreach (@sub) {
            my @temp = split(/\./, $_);
            print OUT "\t\t\tSubstitution\t";
            print OUT "--> Pos: ", ($temp[1] + 1), "\tBase: $temp[2] --> $temp[3]\n";
        }
    }
    print OUT "\n";
    print OUT " K_val: $k_size\n";

    if (@kval) {
        foreach (@kval) {
            my @temp = split(/\./, $_);
            print OUT "\t\t\tK\t\t";
            print OUT "--> Pos: $temp[0]\tBase: $temp[1]\n";
        }
    }
    print OUT "\n\n\n";
}

####################################################################################################
# Subroutine process_raw
#    This subroutine analyses the raw data in raw out and produces a summary file with counts
#    of indels and snps, etc
####################################################################################################
sub process_raw {

    ############################################################################
    # Passed in Variables
    ############################################################################
    my $data = shift;

    ############################################################################
    # Variable Declaration
    ############################################################################
    my (@deletion, @insertion, @substitution, @kval);

    ############################################################################
    # Process raw data - Split dels, ins, sub and K into separate arrays
    ############################################################################
    foreach (@{$data}) {

        # Split data results into variants
        my @array  = split(/;/, $_);
        my @indel  = split(/,/, $array[1]);
        my @subs   = split(/,/, $array[2]);
        my @kvalue = split(/,/, $array[3]);
        chomp(@kvalue);

        # Process Insertions and Deletions
        foreach (@indel) {
            my @vars = split(/\./, $_);
            if ($vars[0] eq 'd') {
                push(@deletion, "$vars[1].$vars[2]");
            } else {
                push(@insertion, "$vars[1].$vars[2]");
            }
        }

        # Process Substitutions
        foreach (@subs) {
            my @vars = split(/\./, $_);
            push(@substitution, "$vars[1].$vars[2].$vars[3]");
        }

        # Process K-Value
        foreach (@kvalue) {
            my @vars = split(/\./, $_);
            push(@kval, $vars[2]);
        }
    }

    ############################################################################
    # Count unique values in arrays
    ############################################################################
    my (%del_count, %ins_count, %sub_count, %k_count);
    if (@deletion)     { map { $del_count{$_}++ } @deletion; }
    if (@insertion)    { map { $ins_count{$_}++ } @insertion; }
    if (@substitution) { map { $sub_count{$_}++ } @substitution; }
    if (@kval)         { map { $k_count{$_}++ } @kval; }

    ############################################################################
    # Output to Summary the Counts and Breakdown
    ############################################################################
    print SUM "Summary Counts for $opt_infile\n\n";
    print SUM " Deletions:\n";
    if (%del_count) {
        for my $key ( sort keys %del_count ) {
            my $value = $del_count{$key};
            my @a = split(/\./, $key);
            print SUM "\t\t Pos: $a[0] \t Base: $a[1] \t Count -> $value\n\n";
        }
    } else {
        print SUM "\t\t No Deletions found\n\n";
    }
    print SUM "\n Insertions:\n";
    if (%ins_count) {
        for my $key ( sort keys %ins_count ) {
            my $value = $ins_count{$key};
            my @a = split(/\./, $key);
            print SUM "\t\t Pos: $a[0] \t Base: $a[1] \t Count -> $value\n\n";
        }
    } else {
        print SUM "\t\t No Insertions found\n\n";
    }
    print SUM "\n Substitutions:\n";
    if (%sub_count) {
        for my $key ( sort keys %sub_count ) {
            my $value = $sub_count{$key};
            my @a = split(/\./, $key);
            print SUM "\t\t Pos: $a[0] \t Base: $a[1] -> $a[2] \t Count -> $value\n\n";
        }
    } else {
        print SUM "\t\t No Substitutions found\n\n";
    }

    print SUM "\n K_Value:\n";
    if (%k_count) {
        for my $key ( sort keys %k_count ) {
            my $value = $k_count{$key};
            print SUM "\t\t Base: $key \t Count -> $value\n\n";
        }
    } else {
        print SUM "\t\t No K Values found\n\n";
    }
}

####################################################################################################
# Subroutine print_header
#    This subroutine prints the output file headers
####################################################################################################
sub print_header {

    # Header
    print OUT "------\n";
    print SUM "------\n";
    print OUT "Program Information\n\n";
    print SUM "Program Information\n\n";
    print OUT uc($PROG_NAME), " v.", $p_VER, " [17-Aug-2010] Created By: Dallas Thomas\n\n";
    print SUM uc($PROG_NAME), " v.", $p_VER, " [17-Aug-2010] Created By: Dallas Thomas\n\n";
    print OUT "------\n";
    print SUM "------\n";
    print OUT " Parsed Output Data For:\n\n";
    print SUM " Summary Data For:\n\n";
    print OUT " File: $opt_infile\n";
    print SUM " File: $opt_infile\n";
    print OUT " Date: $theTime\n\n";
    print SUM " Date: $theTime\n\n";
    print OUT " Probe: $PROBE\n";
    print SUM " Probe: $PROBE\n";
    print OUT "------\n";
    print SUM "------\n";
    print OUT "------\n";
    print SUM "------\n";
    print OUT "\n\n";
    print SUM "\n\n";
}

__END__

## Man Page Info Follows ##

=head1 NAME

Spyder.pl

=head1 SYNOPSIS

Spyder.pl [-Options [--] [Arguments...]]

=head1 DESCRIPTION

Parse rdp Output Data

Parses rdp Output showing only the sequence matches where there is a difference between the probe and target. Outputs an alignment of the probe and target as well as an analysis of the alignment. Alignment shows substitutions, insertions and deletions.

Switches can be done in long or short form and are case sensitive
eg:
    Spyder.pl --help
    Spyder.pl -h
    Spyder.pl -v { verbose }

=head1 ARGUMENTS

    --in      name of the input file

    --out     name of the output file
              Notice: If no file name is given will use default "results.txt"
              and save to current directory. Creates a raw output file - "raw_"
              which is used for further analysis. If the given output file or
              "results.txt" already exists, will ask if user wants to overwrite.

    --help    : print Options and Arguments instead of fetching data
    --man     : print complete man page instead of fetching data

=head1 OPTIONS

--verbose prints program progress to standard out

=head1 AUTHOR

Dallas Thomas

=head1 CREDITS

Needleman-Wunsch Algorithm

=head1 TESTED

Perl 5.10.0 Debian Lenny

=head1 BUGS

None that I know of

=head1 TODO

Want to make this windows executable with a GUI

=head1 UPDATES

Aug 31, 2010
    -- Changed code from comparing Probes to comparing probe to target.
    -- Created a new Similarity Matrix