<<

Introduction to Bioinformatics Introduction to Bioinformatics

Prof. Dr. Nizamettin AYDIN

[email protected] Introduction to

http://www3.yildiz.edu.tr/~naydin

1 2

Learning objectives Setting The Technological Scene

• After this lecture you should be able to • One of the objectives of this course is.. – to enable students to acquire an understanding of, and understand : ability in, a (Perl, Python) as the – sequence, iteration and selection; main enabler in the development of programs in the area of Bioinformatics. – basic building blocks of programming; – three ’s: constants, comments and conditions; • Modern are organised around two main – use of variable containers; components: – use of some Perl operators and its pattern-matching technology; – Hardware – Perl input/output – Software – …

3 4

Introduction to the Computing Introduction to the Computing

• Computer: electronic genius? • In theory, computer can compute anything – NO! Electronic idiot! • that’s possible to compute – Does exactly what we tell it to, nothing more. – given enough memory and time • All computers, given enough time and memory, • In practice, solving problems involves are capable of computing exactly the same things. computing under constraints. Supercomputer – time Workstation • weather forecast, next frame of animation, ... PDA – cost • cell phone, automotive engine controller, ... = = – power • cell phone, handheld video game, ...

5 6

Copyright 2000 N. AYDIN. All rights reserved. 1 Layers of Technology Layers of Technology

... – Interacts directly with the hardware – Responsible for ensuring efficient use of hardware resources • Tools... – Softwares that take adavantage of what the operating system has to offer. – Programming languages, databases, editors, interface builders... • Applications... – Most useful category of software – Web browsers, email clients, web servers, word processors, etc...

7 8

Transformations Between Layers How do we solve a problem using a computer? • A systematic sequence of transformations between layers of abstraction. Problems

Algorithms Problem Software Design: Language choose algorithms and data structures Algorithm Instruction Set Architecture Programming: Microarchitecture use language to express design Program Circuits Compiling/Interpreting: convert language to Devices Instr Set Architecture machine instructions

9 10

Deeper and Deeper… Descriptions of Each Level…

• Problem Statement – stated using "natural language" Instr Set Architecture – may be ambiguous, imprecise Processor Design: • Algorithm choose structures to implement ISA – step-by-step procedure, guaranteed to finish Microarch – definiteness, effective computability, finiteness Logic/Circuit Design: gates and low-level circuits to • Program – express the algorithm using a computer language Circuits implement components – high-level language, low-level language Engineering & Fabrication: develop and manufacture • Instruction Set Architecture (ISA) Devices lowest-level components – specifies the set of instructions the computer can perform – data types, addressing mode

11 12

Copyright 2000 N. AYDIN. All rights reserved. 2 …Descriptions of Each Level Many Choices at Each Level

• Microarchitecture Solve a system of equations – detailed organization of a processor implementation Gaussian Jacobi Red-black SOR Multigrid – different implementations of a single ISA elimination iteration • Logic Circuits FORTRAN C C++ Java – combine basic operations to realize Tradeoffs: microarchitecture PowerPC Intel Atmel AVR cost – many different ways to implement a single function performance Centrino Pentium 4 Xeon power (e.g., addition) (etc.) • Devices Ripple-carry adder Carry-lookahead adder – properties of materials, manufacturability CMOS Bipolar GaAs

The successive over-relaxation (SOR) : a method for solving a linear system of equations. 13 14

The Computer Level Hierarchy C Fortran Ada etc. Basic Java

• Each layer is Compiler Compiler an abstraction of the level below it. Byte Code • The machines at each level Assembly Language execute their own particular instructions, calling upon Assembler Interpreter machines at lower levels to perform tasks as required. Executable • Computer circuits ultimately carry out the work. Instruction Set Architecture •Software?

•Program or collection of programs. HW HW HW •Enables the hardware to process data. Implementation 1 Implementation 2 Implementation N

15 16

Programming Stepwise Refinement

• Methodologies for creating computer programs • Also known as systematic decomposition. that perform a desired function. – Problem Solving • Start with problem statement: • How do we figure out what to tell the computer to do? • Convert problem statement into algorithm, using stepwise • Decompose task into a few simpler subtasks. refinement. • Convert algorithm into machine instructions. • Decompose each subtask into smaller subtasks, and – Debugging these into even smaller subtasks, etc.... • How do we figure out why it didn’t work? • Examining registers and memory, setting breakpoints, etc. until you get to the machine instruction level. Time spent on the first can reduce time spent on the second!

17 18

Copyright 2000 N. AYDIN. All rights reserved. 3 Problem Statement Three Basic Constructs

• There are three basic ways to decompose a task: • Because problem statements are written in English, they are sometimes ambiguous and/or incomplete. – Where is “file” located? Task – How big is it? – How do I know when I’ve reached the end? – How should final count be printed? A decimal number?

True Test False – If the character is a letter, should I count both upper-case and False condition Test lower-case occurrences? Subtask 1 condition True • How do you resolve these issues? Subtask 1 Subtask 2 Subtask 2 Subtask – Ask the person who wants the problem solved, or – Make a decision and document it. Sequential Conditional Iterative

19 20

Sequential Conditional

• Do Subtask 1 to completion, • If condition is true, do Subtask 1; then do Subtask 2 to completion, etc. else, do Subtask 2.

Get character input from keyboard True file char False = input?

Count and print the Examine file and Test character. occurrences of a count the number If match, increment character in a file of characters that Count = Count + 1 match counter.

Print number to the screen

21 22

Iterative Why Write Programs?

• Do Subtask over and over, as long as the test condition is true. • Automate computer work that you do by hand – save time & reduce errors

more chars False • Run the same analysis on lots of similar data files to check? Check each element of • Analyze data the file and count the True characters that match. • Make decisions Check next char and • Create new analysis methods count if matches.

23

Copyright 2000 N. AYDIN. All rights reserved. 4 Why Perl? As a software tool: Perl

• What Is Perl? • Fairly easy to learn the basics • PERL is a "Practical Extraction and Report Language“ • Many powerful functions for working with • (or Pathologically Eclectic Rubbish Lister) text: search & extract, modify, combine – freely available for , MVS, VMS, MS/DOS, • Can control other programs , OS/2, , and other operating systems. • Free and available for all operating systems • Perl has powerful text manipulation functions. – It eclectically combines features and purposes of many • Most popular language in bioinformatics command languages. • Many pre-built “modules” are available that – Perl has enjoyed popularity for programming World Wide do useful things Web electronic forms and generally as glue and gateway between systems, databases, and users.

26

History Strengths of Perl

• Originally written by Larry Wall at NASA’s Jet • Very easy to learn Propulsion Labs • Very portable – to process mail on Unix systems – extended by a lot of people and many biologists! • High level language • Started as ‘glue’ language, • Powerful text processing – for the use of Larry and officemates. • It’s free • It combines the best features of several languages. • What makes Perl a good programming language for • Version 1: December 18, 1987 Biological data? • Current stable release is Perl 5.18.2 – Fast in file manipulation • Perl motto: TMTOWTDI-There’s More Than One – DBI modules provide database bridge for other applications Way To Do It – CGI module provides easy web interface

27 28

Getting and Installing Perl What is Perl Used For

• http://www.perl.org/ • CGI (common gateway interface) Programming (dynamically generating web pages). • http://www.perl.com/CPAN/ – (Example websites: www.amazon.com, www.slashdot.org, • http://www.activestate.com/ www.deja.com) • Perl tutorials: • Extracting data from one source and translating it to another format. http://www.internetbiologists.org/IB-perl/index.html • Manipulating databases, simple search and replace http://learn.perl.org/library/beginning_perl/ operation. • Data management in Human Genome Project • Bioinformatics related web pages: • Internet programming, automating administration http://www.geocities.com/bioinformaticsweb/index.html tasks, ...... etc. http://glasnost.itcarlow.ie/~biobook/index.html

29 30

Copyright 2000 N. AYDIN. All rights reserved. 5 Win95 / Win98 / WinME / WinNT / Win2000/W2K / Which Platform to Use WinXP (Win32)

• http://www.perl.com/CPAN/ports/index.html • Starting from Perl 5.005 the Win32 support has been integrated to the Perl • Perl Ports (Binary Distributions) standard distribution. But if you insist on a binary: • ActivePerl (Perl for Win32, Perl for ISAPI, PerlScript, Perl Package • CPAN/ports (Comprehensive Perl Archive Network ) Manager) • Perl runs on over 100 platforms! • Apache/Perl (binaries for both Perl-5.6/Apache-1.0/mod_perl-1 and Perl- • Acorn | AIX | Amiga | Apple | Atari | AtheOS | BeOS | BSD | BSD/OS | 5.8/Apache-2/mod_perl-2) Coherent | Compaq | Concurrent | Cygwin | Darwin | DG/UX | Digital | • DeveloperSide.Net (compiled under VS.NET and includes the latest Digital UNIX | DEC OSF/1 | /ptx | Embedix | EMC | EPOC | versions of Apache2, PHP, MySQL, OpenSSL, mod_perl, Apache::ASP, FreeBSD | Fujitsu | GNU Darwin | Guardian | HP | HP-UX | IBM | IRIX | and a few other components) Japanese | JPerl | | LynxOS | Mac OS | Mac OS X | Macintosh | • IndigoPerl (Perl for Win32, integrated Apache webserver, GUI Package MachTen | MinGW | | MiNT | MorphOS | MPE/iX | MS-DOS | MVS Manager) | NetBSD | NetWare | NEWS-OS | NextStep | NonStop | NonStop-UX | • niPerl (MSI installer, Win32::GUI, Win32::GUI::XMLBuilder, Novell | ODT | Open UNIX | OpenBSD | OpenVMS | OS/2 | OS/390 | Documentation Viewer, WGX, PAR ready, built-in SciTE editor) OS/400 | OSF/1 | OSR | Plan 9 | Pocket PC | PowerMAX | Psion | QNX | • PXPerl (libwin32 included, compiled with Intel C++ Compiler for Reliant UNIX | RISCOS | SCO | Sequent | SGI | Sharp | Siemens | SINIX | maximum performance, Explorer integration (file association etc), self- Solaris | SONY | Sun | Syllable | Stratus | Tandem | Tivo | Tru64 | | configures on install with the local Visual C++ binaries) UNIX | Unixware | U/WIN | VMS | VOS | Win32 | WinCE | Windows 3.1 | • SiePerl for Win32 by Siemens, contains several modules Windows 95/98/Me/NT/2000/XP/VISTA/W7 W8 | z/OS | ……. • Prebuilt Perls by Rich Megginson, a special installer is used. • Strawberry Perl a perl environment for MS Windows.

31 32

IDEs for Perl How to Get Help • Padre Perl Application Development and Refactoring Environment (Windows, Linux, Mac OS X) • Arachno Perl from Scriptolutions (Windows and Linux) • To get information on a particular function • multiplatform IDE has Perl plugins • Komodo from ActiveState (Windows and Linux) >perldoc –f function • Open Perl IDE (Windows) • Perl Builder from Solutionsoft (Windows and Linux) for example >perldoc –f print • PerlDevKit from ActiveState (IDE Windows, Linux and Solaris) • PerlEdit from IndigoStar (Windows and Linux) • Perl Oasis from Johan Lindström (Windows) • PerlWiz from Arctan Computer Ventures (Windows) • SciTE from the SCIntilla project (Windows and X/gtk+) • visiPerl+ from Help Consulting (Windows) or editors (Perl programs are just plain text so any editor will do). – CodeWright | | GNU | Epsilon | gVim | MicroEmacs | MultiEdit | nvi | PFE | SlickEdit | UltraEdit | | | XEMacs or environments (UNIX tool environments, tcsh and zsh are just the shell). – Cygwin | MKS ksh | U/WIN sh | tcsh | (csh/tcsh book) zsh (zsh in general)

33 34

How to Get Help How to Get Help • To search the Perl FAQ for any regular expression or keyword • Websites: www.perl.com, >perldoc –q keyword www.perlclinic.com, for example >perldoc –q reverse www.perlfaq.com, www.perl.org, www.tpj.com, www.activestate.com, www.perlarchive.com • News groups, books

35 36

Copyright 2000 N. AYDIN. All rights reserved. 6 Again: What Is Perl? How to Write Perl Code

• Interpreted Language ? • Form a working folder (directory) – (such as Basic, which needs another program called • Open Notepad (or any ) and type in the perl interpreted to process the code every time you want to run code following the convention the program) • Save the file with extension pl, or plx • Compiled languge ? – (such as C, which uses a compiler to proces the code before • In file manager you double click on the file. the code is ever run) – The program will run (probably a window will appear and disappear) • Perl is in between like java: – interpreter reads and compiles the program at ones, • Go to MSDOS prompt. – not into the specific machine code, • Change to working directory and type perl xxxx.pl – but into a special virtual machine code. • Or you use one of the IDEs • It is also called scripting language

37 38

First Perl Program Debugging Perl

• Here is a simple script to illustrate how a Perl • Debugging program looks: (Print a message to the terminal – Finding the errors and fixing them. Code: • It is a specialized skill and it takes practice #!/niPerl/bin –w to become good at it. • Among the beginner , it is print "merhaba \n"; common banging your head against the • Save this file as merhaba.pl keyboard for what seems like hours, only to • Run it by typing discover the problem was actually in a > perl merhaba.pl completely different part of your script than where you were looking.

39 40

...Debugging...... Debugging...

• Perl tries its best to tell you where the error is. • Use the -c switch – Read the error message carefully. – to check for possible errors. • Sometimes it gives multiple line errors. • By entering the command perl -c scriptname you make Perl to try to compile your script without • Go to the first line number and try to see the error. actually running it • Try to look at an earlier line too, – The -c switch compiles the script without invoking the – sometimes the error doesn’t trigger until a next warnings feature statement. print "merhaba \n"; • After fixing the first error, run the script again, >perl -c merhaba.pl – the next errors may have gone away.

41 42

Copyright 2000 N. AYDIN. All rights reserved. 7 ...Debugging...... Debugging...

• Syntax Errors print "merhaba \n; pint "merhaba \n "; >perl -c hmerhaba1.pl >perl -c hmerhaba2.pl

43 44

Syntax and semantics ...Debugging... • A Perl program may be syntactically correct, but • Use the –w switch semantically wrong. – to tell Perl to warn you about things it thinks might • Semantics has to do with meaning of language. be problems in your script. – This means that the program satisfies the rules and >perl -w hmerhaba3.pl regulations of language but does not do what you expected it to do.

print; "merhaba \n ";

> perl -c hmerhaba3.pl

45 46

...Debugging...... Debugging... • Use the -cw switches (combination of -c and -w) • Try to isolate the problem. – the -cw compiles it with warnings turned on. – By commenting out chunks of code, then rerunning the script, you can often narrow down where the In either case, you can get feedback on any problems your script problem is occurring. might have, without having to actually run it which, – Even better is to avoid the need for this by building may save you time in long and big scripts. and debugging the script in small increments. – Create a simple framework first, get it working, then >perl -c -w hmerhaba3.pl add increasingly complex features on, testing each >perl -cw hmerhaba3.pl component before moving to the next. – In this way you will uncover bugs as you go, and it will usually be obvious where the bug resides; it is in the small section of code you just added. – While it may seem faster to code up the whole thing first, then do all the debugging at the end, it rarely works out that way.

47 48

Copyright 2000 N. AYDIN. All rights reserved. 8 ...Debugging...... Debugging... • Use the strict pragma • Use the strict pragma – Perl is a great language for writing quick, one-off scripts, in part because of its default behavior of – By putting the following line near the of the script: having new variables simply spring into use strict; • you are telling Perl that you are willing to be held to a higher existence on first mention. standard. – This can lead to problems as your script grows, – In particular, you're saying you are willing to declare all however. your variables before using them. – A typo in the name of a variable will mean that – Besides protecting you from typos in your variable names (because the script will abort with an error message during your script is suddenly using a new, different the initial compilation phase if it encounters an undeclared variable from the one you intended, which can be variable), a real head-scratcher to debug. – this also lets you properly "scope“ your variables, thereby In computer programming, a directive or pragma (from "pragmatic") is a limiting their visibility, rather than letting them be "global" language construct that specifies how a compiler (or other translator) should variables that could conceivably interact with other process its input. Directives are not part of the grammar of a programming variables of the same name elsewhere in the script. language, and may vary from compiler to compiler. – All of this translates into big savings in debugging time.

49 50

...Debugging Return to our 1st program

• Resist the temptation to attribute the problem • #!/niPerl/bin –w to some previously undiscovered bug in Perl. • Every line starting with # is comment and ignored by – Every novice Perl eventually comes Perl. up against a bug that defies all efforts to identify • However, # and ! together at the start of the 1st line tell and eradicate it. UNIX how the file should be run. – As the programmer's frustration level mounts, an – In this case the file should be passed to Perl interpreter, which lives in niPerl/bin idea begins to creep into his or her head: • Within Perl, and almost all other programming • it must not be a problem in the script, but is languages, each line in a program is referred to as a something broken in Perl itself. ‘‘statement’’. • It is almost certainly a bug in your script, not in Perl. – Perl statements end with, and are separated from any other statements by, a semicolon, that is, the ‘‘;’’ character.

51 52

...1st program Program 1

• 2nd line print "merhaba \n"; • print function tells perl to display the given text. print "Welcome to the Wonderful World of Bioinformatics!\n"; • Text inside the quotes is not interpreted as code >perl -cw welcome1.pl and is called string. • \n is used to start a new line >perl welcome1.pl #!/niPerl/bin -w print "merhaba \n";

53 54

Copyright 2000 N. AYDIN. All rights reserved. 9 Another version of welcome Iteration (Repetition) • Using the Perl while construct print "Welcome "; print "to "; # The ‘iter1' program - a (Perl) program, print "the "; # which does not stop until someone presses Ctrl-C. print "Wonderful "; print "World "; use constant TRUE => 1; print "of "; use constant FALSE => 0; print "Bioinformatics!"; while ( TRUE ) print "\n"; { print "Welcome to the Wonderful World of Bioinformatics!\n"; >perl -cw welcome1.pl sleep 1; >perl welcome1.pl }

55 56

Running forever ... Iteration

>perl iter1.pl • Rather than use the word iteration or repetition to refer to this mechanism, many programmers favour the use

Welcome to the Wonderful World of Bioinformatics! of the word loop. Welcome to the Wonderful World of Bioinformatics! – In this context, loop is both a noun and a verb. Welcome to the Wonderful World of Bioinformatics! • Typical programmer utterances might be ‘‘the loop prints the message Welcome to the Wonderful World of Bioinformatics! five times’’ or ‘‘this program loops forever’’. Welcome to the Wonderful World of Bioinformatics! Welcome to the Wonderful World of Bioinformatics! – Programs that loop forever, just like the forever program in Welcome to the Wonderful World of Bioinformatics! this section, are referred to as an infinite loop. Welcome to the Wonderful World of Bioinformatics! • The infinite loop is generally regarded as a very bad thing, and Welcome to the Wonderful World of Bioinformatics! programmers are encouraged not to introduce such loops into . programs. .

57 58

Introducing variable containers Introducing variable containers

• Two constant definitions come after the comment lines: • The opposite of a constant is a variable container, or use constant TRUE => 1; variable for short. use constant FALSE => 0; – A variable’s value can change over the lifetime of the program. – Perl treats a value of 1 as true and a value of 0 as false. • When you need to change the value of an item, use a variable container. • A constant is a container within a program whose value • Perl has excellent support for all types of variable cannot be changed under any circumstance. containers. – Although not required, it is a convention to give constants all UPPERCASE names. • The simplest type of variable container is the scalar. – Scalars can hold, a number, a word, a sentence or a disk-file. • Perl is case-sensitive. • Within Perl programs, scalars are given a name prefixed – This means that when naming variables in Perl, case is with a dollar sign ($). significant. – Here are some example scalar names: • So, ‘‘TRUE’’ is a different symbol to ‘‘true’’. • $name, $_address, $programming_101, $z, $abc, $count

59 60

Copyright 2000 N. AYDIN. All rights reserved. 10 Variable containers and loops Running ten times ...

• To demonstrate the use of variable containers within loops, >perl iter2.pl a version of forever that displays ten messages and then stops can be created. Welcome to the Wonderful World of Bioinformatics! # The ‘iter2' program - a (Perl) program, Welcome to the Wonderful World of Bioinformatics! # which stops after ten iterations. Welcome to the Wonderful World of Bioinformatics! use constant HOWMANY => 10; Welcome to the Wonderful World of Bioinformatics! Welcome to the Wonderful World of Bioinformatics! $count = 0; Welcome to the Wonderful World of Bioinformatics! while ( $count < HOWMANY ) Welcome to the Wonderful World of Bioinformatics! { Welcome to the Wonderful World of Bioinformatics! print "Welcome to the Wonderful World of Bioinformatics!\n"; Welcome to the Wonderful World of Bioinformatics! $count++; Welcome to the Wonderful World of Bioinformatics! }

61 62

Variable containers and loops Selection

• In Perl, it is not necessary to set the value of a • One of the basic building blocks of programming is variable container before it’s used. selection. • Perl has a number of rules that are applied to the first • The use of a selection mechanism allows a program to usage of a variable container choose one of a number of possible courses of action. – Perl sets a scalar to zero if it is first used within a numeric • Here’s the general form of the selection statement in Perl: context. if ( some condition is true ) { – This feature can be very convenient. do something – However, it is always a good idea to give variable } containers an explicit starting value, as it indicates else precisely what the intentions for the variable are. { • Any programmer reading the tentimes program should be in no do something else doubt that the $count scalar is to be used within a numeric context. }

63 64

Using the Perl if construct There Really Is MTOWTDI • Here is another variation on the forever program that prints • Where MTOWTDI stands for ‘‘more than the message fivetimes. one way to do it’’. # The 'fivetimes' program - a (Perl) program, # which stops after five iterations. • This philosophy is one of the great strengths use constant TRUE => 1; use constant FALSE => 0; of Perl, but care is needed. use constant HOWMANY => 5; $count = 0; • Next examples illustrate the good and the while ( TRUE ) bad of this philosophy, { $count++; • Starting with a couple of not so good print "Welcome to the Wonderful World of Bioinformatics!\n"; if ( $count == HOWMANY ) examples followed by a couple of much { last; improved ones. } }

65 66

Copyright 2000 N. AYDIN. All rights reserved. 11 The oddeven program The terrible program

# The 'oddeven' program - a (Perl) program, # which iterates four times, printing 'odd' when $count # is an odd number, and 'even' when $count is an even • Here is another program that does exactly the same # number. use constant HOWMANY => 4; thing as oddeven. $count = 0; #! /usr/bin/perl -w while ( $count < HOWMANY ) # The 'terrible' program - a poorly formatted 'oddeven'. { $count++; use constant HOWMANY => 4; $count = 0; if ( $count == 1 ) while ( $count < HOWMANY ) { $count++; { if ( $count == 1 ) { print "odd\n"; } elsif ( $count == 2 ) print "odd\n"; } { print "even\n"; } elsif ( $count == 3 ) { print "odd\n"; } elsif ( $count == 2 ) else # at this point $count is four. { { print "even\n"; } } print "even\n"; } elsif ( $count == 3 ) • Notice that the program statements that make up the { terrible program are exactly the same as those that print "odd\n"; } make up the oddeven program. else # at this point $count is four. { – The difference between the two programs has to do with print "even\n"; } how they are laid out, or formatted. } 67 68

The terrible program The oddeven2 program • Here is another version of oddeven. • It produces exactly the same output as both oddeven and terrible: • Like a lot of modern programming languages, Perl is #! /usr/bin/perl -w classified as free format. # The 'oddeven2' program - another version of 'oddeven'. use constant HOWMANY => 4; • This means that you can write a program using $count = 0; whatever formatting you prefer, as perl can just as while ( $count < HOWMANY ) { easily process a well-formatted program, such as $count++; oddeven, as it can a poorly formatted program, such as if ( $count % 2 == 0 ) terrible. { print "even\n"; • Do yourself and everyone else a favour, and be sure to } else # $count % 2 is not zero. format your programs to be as readable as possible. { • Use plenty of whitespace, blank lines and indentation print "odd\n"; } to make your programs easier to read. }

69 70

Using the modulus operator The oddeven3 program

• That percentage sign (%) is another Perl operator, the modulus, % operator. • The following code is even shorter than the oddeven2. • Given two numbers, ‘‘A % B’’ returns the remainder after A has been divided by B, assuming both are #! /usr/bin/perl -w positive numbers. # The 'oddeven3' program - yet another version of 'oddeven'. use constant HOWMANY => 4; $count = 0; print 5 % 2, "\n"; # prints a '1' on a line. print 4 % 2, "\n"; # prints a '0' on a line. while ( $count < HOWMANY ) print 7 % 4, "\n"; # prints a '3' on a line. { $count++; print "even\n" if ( $count % 2 == 0 ); print "odd\n" if ( $count % 2 != 0 ); }

71 72

Copyright 2000 N. AYDIN. All rights reserved. 12 Processing Data Files Processing Data Files

• The input operator in Perl is • The following is a small program called getlines that <> exploits the program statement in the previous slide: – When Perl encounters this operator within a program, it #! /usr/bin/perl -w looks for and returns a line of input from standard input, # The 'getlines' program which processes lines. • which is the name given to the mechanism that is currently providing input data to the program. while ( $line = <> ) – Unless Perl is told otherwise, the default input mechanism { is the keyboard. print $line; • A program takes a line of data from the keyboard } whenever the input operator is used. • The getlines program has a condition part that uses – Consider the following program statement: <> to look for and return a line from standard input. $line = <>; • The line, when available, is assigned to the $line scalar, – A line is read from the keyboard and put into the $line scalar which is then checked for trueness.

73 74

Processing Data Files Processing Data Files

• It turns out that, in addition to using numerics to represent • Consider the following commmand line statement: true and false, strings also have a truth value. > perl getlines terrible – A string with no characters is false, otherwise it is true. • In addition to standard input, Perl has – standard output, • the default place to display normal messages, – standard error, • the default place to display error messages. > perl getlines terrible welcome3 – Unless told otherwise, Perl uses the screen as the default for both standard output and standard error. • To make things convenient, standard input, standard output and standard error go by the shorthand names of STDIN, STDOUT and STDERR respectively.

75 76

Processing Data Files Introducing Patterns

• As the getlines program uses • Perl has another programming language built • When used in association with a named disk- into it. file, it uses the contents of the disk-file as input, and there can be more than one named – This language within a language makes extensive disk file. use of Perl’s regular expression, pattern-matching technology. • Programmers refer to the list of things on the • The Perl on-line documentation defines a regular command-line that follow a program name as expression to be simply a string that describes a pattern. its command-line arguments or parameters. – The pattern identifies what it is hoped to match. • The last invocation of getlines has two • The actual how of finding the pattern is taken care of by command-line arguments, the word ‘‘terrible’’ the Perl program. and the word ‘‘welcome3’’.

77 78

Copyright 2000 N. AYDIN. All rights reserved. 13 Introducing Patterns Introducing Patterns

• A programming language that allows the programmer • For introducing regular expressions, consider the to specify what is required is often referred to as a following program, called patterns: declarative language. – The programmer declares what’s required, and the #! /usr/bin/perl -w technology works out the details. • Aprogramming language that allows the programmer to # The 'patterns' program - introducing regular expressions. specify exactly how a result is to be arrived at is often referred to as a procedural language. while ( $line = <> ) – The programmer defines the procedure to be followed, and { the technology blindly follows the instructions. print $line if $line =~ /even/; • Most programming languages can be classified as one } or the other, either declarative or procedural. • Remarkably, Perl can be one or the other, or both.

79 80

Introducing Patterns Introducing Patterns

• ‘patterns’ program is very similar to the ‘getlines’ • When programmers refer to a character that program except the print command within the loop’s block. surrounds something of interest, such as the print $line if $line =~ /even/; forward-leaning slash surrounding the patterns • Here’s the English language equivalent: in this section, they call that character a – display the contents of the scalar called $line if and only if delimiter. the scalar called $line contains the pattern ‘‘even’’. • In above stetement, =~ is called the binding operator. • The character delimits the something of • The binding operator compares something (usually a interest. scalar variable container) against a pattern. • The ‘‘/’’ character is the default delimiter for – For now, a pattern is defined as any sequence of characters surrounded by the forward-leaning slash character ‘‘/’’. regular expression patterns in Perl. • In the example above, the pattern is the word ‘‘even’’. • If the contents of $line contains the pattern ‘‘even’’ anywhere in the line, it is said to match.

81 82

Running patterns ...

• To illustrate what’s going on, try the following command-lines: http://en.wikipedia.org/wiki/Regular_expression

> perl patterns terrible http://www.regular-expressions.info/tutorial.html

> perl patterns oddeven http://www.english.uga.edu/humcomp/perl/regex 2a.html

> perl patterns welcome2 http://www.perl.com/doc/manual/html/pod/perlre .html

83 84

Copyright 2000 N. AYDIN. All rights reserved. 14 Input/Output I/O-STDIN

• Data entering a program is referred to as its input, • The standard input stream (STDIN) is the default while data produced by a program is its output. place from which data enters a program. – Rather than refer to (and write) ‘‘input/output’’, most • Typically, STDIN is the keyboard, but it can also be a programmers simply say ‘‘IO’’, which is written as disk-file. I/O. • To read data from STDIN, use the input operator: • I/O facilities are often referred to as streams. – It is possible to have many streams associated with a my $data = ; program, with some of them classed as input streams or and others classed as output streams. my $data = <>; – As a minimum, every Perl program has three standard streams available to it. • Perl is smart enough to know that an ‘‘empty’’ input • STDIN, STDOUT, and STDERR operator actually refers to STDIN by default.

85 86

I/O-STDIN I/O-STDIN

print "Enter a number: "; print "Enter the username: "; $a = ; #<> is okay too $username = ; print "Enter another number: "; chomp $username; $b = ; #<> is okay too if ($username =~ /ibstudent/) { chomp $a; # chomp : removes \n from string print "Welcome IB student!\n\n";} chomp $b; # chomp : removes \n from string else {print "Bad username, sorry!\n\n";} $c = $a + $b; print "The sum of the numbers that you entered is $c"; • Syntax of if, else statement: if (a condition is met) {do something;} • If $a is omitted, it is assumed to be $_ else {do something else;}

87 88

I/O- STDOUT I/O- STDOUT

• The standard output stream (STDOUT) is the default • STDOUT can be altered outside your program, place to which data is sent by a program. with the "redirection" operator. • Typically, STDOUT is the screen, but it can also be a • So if you were running your Perl program and disk-file. you wanted to keep the output for review • To write data to STDOUT, use the output operator: instead of letting it flash by on the screen, you could redirect STDOUT with the ">" symbol print STDOUT $data; and have the output sent to a file like this: or print $data; >perl perltest.pl > output.txt; • Perl is smart enough to know that print sends data to STDOUT by default.

89 90

Copyright 2000 N. AYDIN. All rights reserved. 15 I/O-STDOUT I/O-STDOUT

• Writing the output to a particular file from • An example: within a Perl program: – Following script creates a new web page, saved in – Specify the file you want to use by opening a the file: c:/web/root/index.html. filehandle to it. – Use the new filehandle in print statement, instead open HTML, ">c:/perlex/index.html"; of the default STDOUT filehandle. print HTML "Content-Type: text/html\n\n"; – When finished, close the filehandle. print HTML ""; print HTML "

Written by Perl!

"; open OUTPUT, ">output.txt"; print HTML ""; print OUTPUT "hello world\n"; close HTML; close OUTPUT;

91 92

FILEHANDLE FILEHANDLE

• A filehandle is a special type of variable that is • Reading from file “DNA” and copying each line to associated with an output destination. “DNAcopy": • It is used to tell your program where you want output to go. open IN, “DNA“; open OUT, ">DNAcopy"; – open a file for reading while ($line = ) open FILEHANDLE,"chromosome2" alternative form: { open FILEHANDLE,"< chromosome2" print OUT $line; } – open a file for writing open FILEHANDLE,">myprediction" • Syntax of while statement: – open a file for appending open FILEHANDLE,">>mypredictions" while (something is happening) {do something;}

93 94

FILEHANDLE FILEHANDLE

• Reading from file “DNA” and copying each line to “DNAcopy": • It is a good habit to close the things that you open. • Close the filehandle once you are done with it. open IN, “DNA" or die "Can't open input file: $!\n"; • This will also happen automatically when your program ends. open OUT, ">DNAcopy" or die "Can't open output file: $!\n"; • Some complex functions don’t even work till after you close while ($line = ) the files. { print OUT $line; • Also as long as you keep files open you occupy the system’s } memory.

• Syntax of die function: close IN or warn "Errors while closing filehandle: $!"; or die “output message” || die “output message” • Syntax of warn function: or warn “output message” • Variable $! describes the error such as “file not found” || warn “output message”

95 96

Copyright 2000 N. AYDIN. All rights reserved. 16 Perl Variables Scalar variables

• In Perl the most basic variable type is a scalar variable. • Perl programs use variables to store data in • It holds a single value. memory. • Value can be any kind of data, including, but not limited to, string, integer, float, object and reference to other variables or • Perl is a typeless language that doesn't force sets of variables. • Scalar variables are preceded with dollar sign ($), and consist the programmer to distinguish between the of only alpha-numeric characters. • Following are all valid Perl variable assignments. types of data stored in a variable. $lang = "Perl"; # <-- string. - Notice quotes $version = 5.6; # <-- float. - Notice lack of quotes • Perl provides 3 built-in variable types: $year = 2001; # <-- integer. / – Scalar $x = 10; $value = $x + 1; – Array $number_of_items = 15; – Hash $word = "hello"; $text = "This is a sentence but is still a scalar";

97 98

Array variables... …Array variables...

• Arrays are handy when you want to store more than • You can also create an arbitrary array, and later re-assign it one data in a single variable, but still want to be able elements using a bracket ([ ]) operator. – Important: when you refer to individual elements of an array, you use $ to refer to them independently. sign just like in scalar variables: • Array in Perl is distinguished from scalar variables by its @ sign that proceeds its name. $ desimal [0] = "bir"; • You can initialize an array variable by giving a list of $ desimal [1] = "iki"; values, each separated with comma inside the $ desimal [2] = "üç"; # ... parenthesis: $ desimal [8]= "dokuz";

@desimal = ("bir", "iki", "uc", ... "dokuz"); • Digits inside the [ ] are usually called array's index, or just @array = ( 1, 2 ); index. – In Perl array indices start at 0, not 1. @values = ( $x, $y, 3, 5); • That's why 10th element of an array has an index of 9

99 100

…Array variables Appending elements to an array

• When you want to add an extra element to the end of • Programming languages such as C/C++ require the array, you will need to know the last index of the that all the elements of an array be of the same array. type, such as all integers, all characters, all • Special symbol, $# can be prepended to the name of strings. the array to get the last index number. • For example: • This is not the case with Perl. – You can mix all kinds of data types in an array: $last_index = $#desimal; $desimal[ $last_index + 1 ] = "on"; • or $array[0] = "apple"; # <-- string $desimal[ $#desimal + 1 ] = "on"; $array[1] = 12; # <-- integer • or $array[2] = 3.47 # <-- float push (@desimal, "on");

101 102

Copyright 2000 N. AYDIN. All rights reserved. 17 Hash variables... …Hash variables

• Perl supports hash variables, which are also known as associative arrays. • You can access the values of columns of the – Arrays, because they store multiple values, just like ordinary arrays. table individually using a {} operator. – Associative, because they associate the values of the element not with an index, but through names, also called keys. • Just like in arrays, we use not % sign, but $ to • You generate keys yourself, and you can refer to those values with those keys. refer to individual variables: • Distinguishing signature of a hash is a % sign, which is prepended to its name: $name = $person{"f_name"}; %person = (); # <-- creating an empty hash $email =$person{"l_name"}; %person = ( "l_name" => "AYDIN", "f_name" => "Nizamettin", "email" => '[email protected]' );

103 104

Hash-related functions String & Array... • Just like in arrays, Perl provides several built-in functions for working with hashes. • How to compute string length?. keys() - returns all the keys (names) of the hash as an array: $length = length $variable; @names = keys(%person); length function returns the length of a string values() - returns all the values of the hash as an array: (number of characters) @values = values(%person); • How to find position of character in a string? delete() - deletes a key/value pair from the hash. $length = rindex($variable r, ‘N') + 1; delete $person{email}; rindex function returns the position of the exists() - returns a true value if a specific key of the hash first ‘N’ from the right really exists: • Instead of 'rindex', 'index' can also be used if ( exists( $person{f_name} ) { # do something accordingly... } $q = $a . '?'; # "tag" the end of the string with '?' $x = index ($q, '?'); # get the position of '?'

105 106

…String & Array Example

• Converting a string into an array: • This is the amino acid sequence of human ubiquitin: MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKE – Use 'qw' operator to create an array: GIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLH @hum_ubiq = qw(M Q I F K T L T G K T); LVLRLRGG Notice that the characters have to be separated by spaces • Human UBC gene encodes a precursor composed of – Use 'split' function: nine direct repeats of this sequence, plus an additional $ubiquitin = 'MQIFVKTLTGKT'; valine residue (V) at the C terminus. @array = split(//, $ubiquitin); It splits string $ubiquitin at a separator substring defined • Using Perl, create the sequence of the precursor, within slashes (// defines an empty string) calculate its length, approximate molecular weight, and the corresponding number of nucleotides in mRNA, and finally print out the results.

107 108

Copyright 2000 N. AYDIN. All rights reserved. 18 precursor Example script 1

• Main Entry: pre·cur·sor @ubi = qw(M Q I F V K T L T G K T I T L E V E P S D T I E N V K A K I Q D K E G I P P D Q Q R L I F A G K Q L E • Pronunciation: \pri-ˈkər-sər, ˈprē-ˌ\ D G R T L S D Y N I Q K E S T L H L V L R L R G G); • Function: noun @pre_ubi = (@ubi) x 9; • Etymology: Middle English precursoure, from push (@pre_ubi, 'V'); Latin praecursor,from praecurrere to run before, from prae- $length = @pre_ubi; $mw = $length * 0.11; pre- + currere to run — more at CURRENT $RNA_length = $length * 3; • Date: 15th century print "The sequence of the human ubiquitin precursor • 1 a : one that precedes and indicates the approach of is:\n@pre_ubi\n"; another b :PREDECESSOR print "Its length is $length amino acids.\n"; print "Its approximate molecular weight is $mw daltons.\n"; 2 : a substance, cell, or cellular component from which another print "It is encoded by an mRNA of approximately substance, cell, or cellular component is formed $RNA_length nucleotides.\n"; • synonyms see FORERUNNER • — pre·cur·so·ry \-ˈkərs-rē, -ˈkər-sə-\ adjective

109 110

Example script 2 Perl Operators…

$ubi = • Arithmetic Operators 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ – They perform some sort of mathematical functions QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG'; $pre_ubi = ($ubi x 9) . 'V'; $length = length ($pre_ubi); Operator Function $mw = $length * 0.11; $RNA_length = $length * 3; + Addition print "The sequence of the human ubiquitin precursor - Subtraction, Negative Numbers, is:\n$pre_ubi\n"; Unary Negation print "Its length is $length amino acids.\n"; print "Its approximate molecular weight is $mw daltons.\n"; * Multiplication print "It is encoded by an mRNA of approximately / Division $RNA_length nucleotides.\n"; % Modulus ** Exponent

111 112

…Perl Operators… …Perl Operators…

• To use these, you will place them in your statements • Assignment Operators like a mathematical expression. Operator Function • So, if you want to store the sum of two variables in a = Normal Assignment third variable, you would write something like this: += Add and Assign -= Subtract and Assign $adrevenue=20; *= Multiply and Assign $sales=10; /= Divide and Assign %= Modulus and Assign $total_revenue = $adrevenue + $sales; **= Exponent and Assign • Increment/Decrement Operator Function ++ Increment (Add 1) -- Decrement (Subtract 1)

113 114

Copyright 2000 N. AYDIN. All rights reserved. 19 …Perl Operators… …Perl Operators…

• String Operators • Numeric Comparison Operator Function – These operators are used to compare two numbers, but not to compare strings. . Concatenate Strings – These operators are typically used in some type of .= Concatenate and Assign conditional statement that executes a block of code or initiates a loop. Operator Function $a = "Hello"; == Equal to $b = "World"; != Not Equal to $c = $a . $b; # $c is now "HelloWorld" > Greater than $d = $a . " " . $b; # $d is now "Hello World" < Less than >= Greater than or Equal to <= Less than or Equal to

115 116

…Perl Operators… …Perl Operators… • String Comparison • String Comparison – These are similar to the numerical comparisons, but they work with strings. – The greater-than and less-than operators compare strings using alphabetical order. Operator Function – Something that starts with "a" is greater than eq Equal to something that starts with "c". ne Not Equal to – Also, small letters are greater than capital letters. gt Greater than lt Less than – Thus, "hello" is greater than "Hello". ge Greater than or Equal to – So, that is how it will compare it. le Less than or Equal to

117 118

…Perl Operators Perl Functions...

• Logical Operators • Function is a collection of code (statements), – These are often used when you need to check more than one condition. which can be easily configured by passing lists Operator Function of arguments. && AND || OR • Functions in Perl are called subroutines, and ! NOT have a general syntax of: • So, if you want to see if a number is less than or equal to 10, and also greater than zero: sub (list of arguments) { $number=5; if (($number <= 10) && ($number > 0)) list of statements to execute { return some value } ...code.... }

119 120

Copyright 2000 N. AYDIN. All rights reserved. 20 …Perl Functions Matching and Substitution in Perl… • The ease and power of Perl's pattern matching • Let's create a simple function to compute the area of is one its true strengths and a big reason why the triangle. Perl is as popular as it is. • This function will receive 2 arguments; triangles width and its height and returns final computed value: • Almost every script you write in Perl will have sub ucgen ($genislik, $yukseklik) { some kind of pattern matching operation $alan = $genislik * $yukseklik / 2; because so often you want to seek something return $alan; } out, and then take an action when you find it. • We can now call the above function with various • Matching and substitution are very important arguments, and each time it will return a computed because this is how you do editing "on the value for a triangle: fly". $alan1 = ucgen(3, 4); $alan2 = ucgen(45, 38); • This is how you create content customized to your Web visitor.

121 122

…Matching and Substitution in Perl The Binding Operator... • You need to be able to open HTML templates and swap in information pertaining to your visitor. • When you do a pattern match, you need three things: • Matching and then substituting is just the way to • the text you are searching through do it. • the pattern you are looking for • Also in many other administrative tasks, such as • a way of linking the pattern with the searched text searching through log files or web pages for particular words or sequences, pattern matching is • As a simple example, let's say you want to see whether a string the way to go. variable has the value of "success". • Pattern matching, in Perl at least, is the process of • Here's how you could write the problem in Perl: looking through sections of text for particular $word = "success"; words, letters-within-words, character sequences, if ( $word =~ m/success/ ) { numbers, strings of numbers, html tags. print "Found success\n"; • These more complicated search expressions fall } else { into the category of "regular expressions". print "Did not find success\n";}

123 124

…The Binding Operator... …The Binding Operator

• The "=~" construct, called the binding operator, is what binds • Negative Matching the string being searched with the pattern that specifies the search. – In some cases you are more interested in whether a pattern – The binding operator links these two together and causes the search to does not match a string rather than that it does. In this case take place. you could write • Next, the "m/success/" construct is the matching operator, m//, if ( ! $string =~ m/search text/ ) ... in action. – The "m" stands for matching to make it easy to remember. The slash – but as usual, Perl makes it easier for you and offers you characters here are the "delimiters". more than one way to do it. • They surround the specified pattern. • In this case, there's the "negative" binding operator, • In m/success/, the matching operator is looking for a match of the letter sequence: success. !~, so you could write this: • Generally, the value of the matching statement returns 1 if if ( $string !~ m/search text/ ) ... there was a match, and 0 if there wasn't.

125 126

Copyright 2000 N. AYDIN. All rights reserved. 21 Matching… …Matching...

• Parentheses () group pattern elements. • Square brackets [ ] match a class of single characters. • An asterisk * means that the preceding character, element, or group of elements may occur zero times, one time, or many [0123456789] matches any single digit times. [0-9] matches any single digit • A plus + means that the preceding element or group of elements must occur at least once. [0-9]+ matches any sequence of one or • A question mark ? matches zero or one times. more digits • So: /fr.*nd/ matches "frnd", "friend", "front and back" [a-z]+ matches any lowercase word /fr.+nd/ matches "frond", "friend", "front and back" but not [A-Z]+ matches any uppercase word "frnd". /10*1/ matches "11", "101", "1001", "100000001". [ab n]* matches the null string "", "b", any /b(an)*a/ matches "ba", "bana", "banana", "banananana" number of blanks, "nab a banana" /flo?at/ matches "flat" and "float" but not "flooat"

127 128

…Matching... …Matching...

[^...] matches characters that are not "...": • The Backslash. Other characters simply match themselves, but [^0-9] matches any non-digit character. the characters +?.*^$()[]{}|\ and usually / must be escaped with a backslash \ to be taken literally. Thus: • Curly braces allow more precise specification of repeated fields. For /10.2/ matches "10Q2", "1052", and "10.2" example /10\.2/ matches "10.2" but not "10Q2" or "1052" [0-9]{6} matches any sequence of 6 digits, and /\*+/ matches one or more asterisks [0-9]{6,10} matches any sequence of 6 to 10 digits. /A:\\DIR/ matches "A:\DIR" • Patterns float, unless anchored. The caret ^ (outside [ ]) anchors a pattern to /\/usr\/bin/ matches "/usr/bin“ the beginning, and dollar-sign $ anchors a pattern at the end, so: • If a backslash precedes an alphanumeric character, this sequence /at/ matches "at", "attention", "flat", & "flatter" takes a special meaning, typically a short form of a [ ] character /^at/ matches "at" & "attention" but not "flat" class. For example, \d is the same as the [0-9] digits character /at$/ matches "at" & "flat", but not "attention" class. /^at$/ matches "at" and nothing else. /[-+]?\d*\.?\d*/ is the same as /^at$/i matches "at", "At", "aT", and "AT". /[-+]?[0-9]*\.?\d*/ /^[ \t]*$/ matches a "blank line", one that contains nothing or Either of the above matches decimal numbers: "-150", "- any combination of blanks and tabs. 4.13", "3.1415", "+0000.00", etc.

129 130

…Matching Regular Expressions…

• A simple \s specifies "white space", the same as the • All pattern matching in Perl is based on the concept of character class [ \t\n\r\f] (blank, tab, , carriage regular expressions. return,form-feed). A character may be specified in hexadecimal as a \x followed by two hexadecimal • Regular expressions are an important part of computer digits; \x1b is the ESC character. science, and entire books are devoted to the topic. • A vertical bar | specifies "or". • Regular expressions form a standard way of if ($answer =~ /^y|^yes|^yeah/i ) { print expressing almost any text pattern unambiguously. "Affirmative!"; } • A mechanism to select specific strings from a set of prints "Affirmative!" for $answer equal to "y" or character strings. "yes" or "yeah" (or "Y", "YeS", or "yessireebob, that's • A set of characters, metacharacters, and operators that right"). define a string or group of strings in a search pattern.

131 132

Copyright 2000 N. AYDIN. All rights reserved. 22 …Regular Expressions Strictness…

• A string containing wildcard characters and • Perl breaks a number of the ‘‘golden rules’’ of operations that define a set of one or more possible the traditional programming language strings. • A regular expression (abbreviated as regexp, regex, or – allows variables to be used before they are declared regxp, with plural forms regexps, regexes, or regexen) – Subroutines can be invoked before they are defined is a string that describes or matches a set of strings, • All variables are global by default according to certain syntax rules. • Regular expressions are used by many text editors and – the use of my variables turns a global variable into utilities to search and manipulate bodies of text based a lexical. on certain patterns. – By default, the use of my variables is optional • Many programming languages support regular • However, it is possible to have perl insist on the use of expressions for string manipulation my variables, making their use mandatory.

133 134

…Strictness… …Strictness…

• This insistence is referred to as strictness, and • As programs get bigger, they become harder to is switched on by adding the following line to maintain. the top of a program: – The use of use strict helps keep things organised use strict; and reduces the risk of errors being introduced into programs. • Anything that helps reduce errors is a good thing, even if • This is a directive that it is sometimes inflexible. – tells perl to insist on all variables being declared • Thinking about the scope of variables, and before they are used, using my and our to control the visibility of – all subroutines be declared (or defined) before they variables, becomes important as a program are invoked. grows in size.

135 136

…Strictness Results from bestrict…

• When strictness is enabled, perl checks the • When an attempt is made to execute the declaration of each of a program’s variables bestrict program, perl complains that the before execution occurs. strictness rules have been broken: • Consider the following program: >perl -w strict.pl Global symbol "$message" requires explicit package name at bestrict line 7. #! /usr/bin/perl -w Global symbol "$message" requires explicit package name at bestrict line 9. # bestrict - demonstrating the effect of strictness. Execution of bestrict aborted due to compilation errors. use strict; > $message = "This is the message.\n"; • These ‘‘compilation errors’’ are fixed by print $message; declaring the $message scalar as a my variable • In the code, $message scalar is not declared as my $message = "This is the message.\n"; a lexical (my) or global (our) variable. 137 138

Copyright 2000 N. AYDIN. All rights reserved. 23 …Results from bestrict use subs

# bestrict - demonstrating the effect of strictness. • Perl provides the use subs directive that ... use strict; – can be used in combination with use strict my $message = "This is the message.\n"; • to declare a list of subroutines at the top of the program print $message; For example: >perl -w mystrict.pl use strict; This is the message. use sub qw( drawline biod2mysql); >Exit code: 0 • The use subs directive declares a list of subroutine names that are later defined somewhere in the program’s disk-file.

139 140

Perl One-Liners… …Perl One-Liners

• Perl usually starts with the following line: • Other examples : >#! /usr/bin/perl –w >perl -e 'print "Hello from a Perl one-liner.\n";‘ – a single line of Perl code is provided to perl to execute w: warning immediately from the command-line – instructs perl to warn the programmer when it notices any dubious programming practices >perl -e 'printf "%0.2f\n", 30000 * .12;' – turns perl into a simple command-line calculator • -e switch checks whether a module installed • The printf subroutine is a variant of the more common correctly or not: print, and prints to a specified format. >perl -e 'use ExampleModule' • The ability to use the –e switch on the e: execute command-line in this way creates what is – instructs perl to execute the program statements included within known in the perl world as... the single quotes a one-liner.

141 142

Perl One-Liners: Equivalents… …Perl One-Liners: Equivalents

• Another useful switch is –n, which, when used • When the one-liner is executed, the following with –e, treats the one-liner as if it is enclosed output is generated: with a loop. attgtaatat ctgaatagcc actgattttg taggcacctt tcagtccatc tagtgactaa • Consider this one-liner: >perl -ne 'print if /ctgaatagcc/;' embl.data • Same function can also be implemented using which is equivalent to the following program grep: statement: >grep 'ctgaatagcc' embl.data while ( <> ) { print if /ctgaatagcc/; • When the -n switch is combined with -p, the } loop has a print statement added to the end.

143 144

Copyright 2000 N. AYDIN. All rights reserved. 24 Perl One-Liners: More Options… …Perl One-Liners: More Options…

• Here is a one-liner that prints only those lines from • When executed, the following output is produced: the embl.data disk-file that do not end in four digits: >perl -npe 'last if /\d{4}$/;' embl.data • The above one-liner is equivalent to this program: while ( <> ) { last if /\d{4}$/; } continue { print $_; } • This one-liner is a little harder to do with grep. > grep -v ’[0123456789][0123456789][0123456789][0123456789]$’ embl.data

145 146

…Perl One-Liners: More Options Running Other Programs From Perl…

• The above one-liner is equivalent to this program: • There are two main ways to do this: while ( <> ) – By invoking the program in such a way that after { execution, the calling program can determine whether the called program successfully executed. last if /\d{4}$/; • Perl’s in-built system subroutine behaves in this way } – By invoking the program in such as way that after continue { execution, any resultsfrom the called program are print $_; returned to the calling program. } • Perl’s backticks and qx// operator behaves in this way • Following example program demonstrates each of • This one-liner is a little harder to do with grep. these mechanisms by invoking the DOS utility > grep -v ’[0123456789][0123456789][0123456789][0123456789]$’ embl.data program, dir, that lists disk-files in the current directory

147 148

…Running Other Programs From Perl… …Running Other Programs From Perl

#! /usr/bin/perl –w • The invocation of system results in the dir program executing. Any output from dir is displayed on screen (STDOUT) as # pinvoke - demonstrating the invocation of other programs normal. # from Perl. – As dir executed successfully, a value of zero is returned to use strict; pinvoke and assigned to the $result scalar. • The $result scalar is then printed to STDOUT as part of an appropriately worded message. my $result = system("dir *.*" ); – If the dır program fails, the $result scalar is set to −1. print "The result of the system call was as follows:\n$result\n"; • Perl’s backticks (` and `) also execute external programs from within Perl. – The results from the program are captured and returned to the $result = `dir *.*`; # warning: $result = 'dir *.*'; will not work program. • In the pinvoke program, the results are assigned to the $result scalar, and print "The result of the backticks call was as follows:\n$result\n"; then printed to STDOUT as part of an appropriately worded message. • The qx// operator is another way to invoke the backticks $result = qx/dir *.*/; behaviour: print "The result of the qx// call was as follows:\n$result\n"; – it works exactly the same way as backticks

149 150

Copyright 2000 N. AYDIN. All rights reserved. 25 Results from pinvoke Recovering from Errors…

• It is not always appropriate to die whenever an error occurs. – Sometimes it makes more sense to spot, and then recover from, an error. • This is referred to as exception handling. • Consider the following code: my $first_filename = "itdoesnotexist.txt";

open FIRSTFILE, "$first_filename" or die "Could not open $first_filename. Aborting.\n";

151 152

…Recovering from Errors… …Recovering from Errors…

• Executing the code • The in-built eval subroutine takes a block of code >perl -w errorrec.pl and executes (or evaluates it). Name "main::FIRSTFILE" used only once: possible typo at • When perl invokes eval, anything that happens errorrec.pl line 4. within the eval block that would usually result in a Could not open itdoesnotexist.txt. Aborting. program terminating is caught by perl and does >Exit code: 2 not terminate the program. • This assumes that the itdoesnotexist.txt disk- • Here’s exeption handling by an eval block: file does not exist. eval { my $first_filename = "itdoesnotexist.txt"; – The program terminates as a result of the open FIRSTFILE, "$first_filename" invocation of die. or die "Could not open $first_filename. • It is possible to protect this code by enclosing Aborting.\n"; it within an eval block. };

153 154

…Recovering from Errors Sorting…

• If die is invoked within an eval block, the block immediately terminates and perl sets the internal $@ • Perl provides powerful in-built support for variable to the message generated by die. sorting. • After the eval block, it is a simple matter to check the – sort and reverse, status of $@ and act appropriately. • can be used to sort lists of strings or numbers into • Adding the following if statement after the above eval block: ascending order, descending order or any other customized order. if ( $@ ) { • Following examples demonstrate usage of sort print "Calling eval produced this message: $@"; and reverse. } prints the following message to STDOUT when the – In the following program a list of four short DNA itdoesnotexist.txt disk-file does not exist: sequences is assigned to an array called @sequences, which is then printed to STDOUT Calling eval produced this message: Could not open itdoesnotexist.txt. Aborting.

155 156

Copyright 2000 N. AYDIN. All rights reserved. 26 …Sorting… …Sorting…

#! /usr/bin/perl -w # sortexamples - how Perl's in-built sort subroutine works. • Results from sort examples >perl -w sort1.pl use strict; Before sorting:

my @sequences = qw( gctacataat attgttttta aattatattc cgatgcttgg ); -> gctacataat attgttttta aattatattc cgatgcttgg print "Before sorting:\n\t-> @sequences\n"; Sorted order (default): -> aattatattc attgttttta cgatgcttgg gctacataat my @sorted = sort @sequences; my @reversed = sort { $b cmp $a } @sequences; Reversed order (using sort { $b cmp $a }): my @also_reversed = reverse sort @sequences; -> gctacataat cgatgcttgg attgttttta aattatattc print "Sorted order (default):\n\t-> @sorted\n"; Reversed order (using reverse sort): print "Reversed order (using sort { \$b cmp \$a }):\n\t-> @reversed\n"; print "Reversed order (using reverse sort):\n\t-> @also_reversed\n"; -> gctacataat cgatgcttgg attgttttta aattatattc >Exit code: 0

157 158

…Sorting Another Sorting Example

• my @sorted = sort @sequences; • It is also possible to sort in numerical order using sort – created by invoking the in-built sort subroutine – The following program defines a list of chromosome pair – sorts the array alphabetically in ascending order (from ‘‘a’’ numbers and assigns them to another array, called through to ‘‘z’’). @chromosomes, and the array is then printed to STDOUT: • my @reversed = sort { $b cmp $a } @sequences; – also created by invoking the in-built sort subroutine my @chromosomes = qw( 17 5 13 21 1 2 22 15 ); – sorts the array alphabetically in descending order (from print "Before sorting:\n\t-> @chromosomes\n"; ‘‘z’’ through to ‘‘a’’). • my @also_reversed = reverse sort @sequences; @sorted = sort { $a <=> $b } @chromosomes; @reversed = sort { $b <=> $a } @chromosomes; – created by first sorting the array, then reversing the sorted list by invoking the in-built reverse subroutine. print "Sorted order (using sort { \$a <=> \$b }):\n\t-> @sorted\n"; print "Reversed order (using sort { \$b <=> \$a }):\n\t-> @reversed\n"; • Note that the reverse subroutine reverses the order of elements in a list; it does not sort in reverse order.

159 160

And its results The sortfile Program…

>perl -w sort2.pl • The following program takes any disk-file and sorts Before sorting: the lines in the disk-file in ascending order -> 17 5 13 21 1 2 22 15 #! /usr/bin/perl -w Sorted order (using sort { $a <=> $b }): # sortfile - sort the lines in any file. -> 1 2 5 13 15 17 21 22 use strict; my @the_file; Reversed order (using sort { $b <=> $a }): while ( <> ) -> 22 21 17 15 13 5 2 1 { >Exit code: 0 chomp; push @the_file, $_; • To learn more, use the following command-line to } read the on-line documentation for sort that comes my @sorted_file = sort @the_file; with Perl: foreach my $line ( @sorted_file ) >perldoc -f sort { print "$line\n"; >man sort } 161 162

Copyright 2000 N. AYDIN. All rights reserved. 27 …The sortfile Program HERE Documents

• Consider the requirement to display the following text on • Before running sortfile • After running sortfile screen in exactly the format shown from within a program: Icimde ve evlerde balkon Bir tabut kadar yer tutar Shotgun Sequencing Camasirlarinizi asarsiniz hazir kefen Sezlongunuza uzanin olu This is a relatively simple method of reading Gelecek zamanlarda a genome sequence. Oluleri balkonlara gomecekler It is ''simple'' because Insan rahat etmeyecek it does away with the need to locate Oldukten sonra da Bana sormayin boyle nereye individual DNA fragments on a map before Kosa kosa gidiyorum they are sequenced. Alnindan opmeye gidiyorum Evleri balkonsuz yapan mimarlarin The Shotgun Sequencing method relies on powerful computers to assemble the finished • Same thing could also be done by using Linux sort utility in command-line: sequence. sort sort.data 163 164

Without HERE Documents Output

• Could be done by using a sequence of print statements >perl -w shotgun1.pl as follows: Shotgun Sequencing

print "Shotgun Sequencing\n\n"; This is a relatively simple method of reading print "This is a relatively simple method of reading\n"; a genome sequence. It is ''simple'' because it does away with the need to locate print "a genome sequence. It is ''simple'' because\n"; individual DNA fragments on a map before print "it does away with the need to locate\n"; they are sequenced. print "individual DNA fragments on a map before\n"; print "they are sequenced.\n\n"; The Shotgun Sequencing method relies on print "The Shotgun Sequencing method relies on\n"; powerful computers to assemble the finished print "powerful computers to assemble the finished\n"; sequence. print "sequence.\n"; >Exit code: 0

165 166

With HERE Documents Output

• A better way to do this is to use Perl’s HERE document mechanism >perl -w shotgun2.pl my $shotgun_message = <

The Shotgun Sequencing method relies on The Shotgun Sequencing method relies on powerful computers to assemble the finished powerful computers to assemble the finished sequence. sequence. ENDSHOTMSG >Exit code: 0

print $shotgun_message; 167 168

Copyright 2000 N. AYDIN. All rights reserved. 28 Even Better HERE Documents Output

• It is possible to improve previous program by removing the need for the >perl -w shotgun3.pl $shotgun message scalar and printing the HERE document directly, as follows: Shotgun Sequencing print <Exit code: 0 sequence. ENDSHOTMSG 169 170

HERE Documents Downloading Datasets…

• HERE documents are useful, • Downloading from the Web – especially when it comes to dynamically producing – a highly interactive mechanism HTML documents. – useful for downloading individual data-files • This use of HERE documents will be discussed later – cumbersome for downloading large number of data-files • Some technologies allow the easy integration of data sources across the Internet. • However, it is often convenient to download frequently used datasets and store them locally.

171 172

…Downloading Datasets… …Downloading Datasets…

• The advantages of downloading and storing – Reliability • Accessing local hard-disk copies of data-files is more datasets locally : reliable than network connections and WWW servers. – Ease of access – Stability • accessing data-files on a local hard disk easier than • If the data changes frequently, it is often helpful to writing an interface routine to download them as needed ‘‘freeze’’ it by downloading a copy and using it locally from a – possibly congested – location on the Internet. until all analyses are completed. – Speed – Flexibility • Local hard-disk access, even over a shared , • Often the search facilities that exist on the WWW lack is usually faster than operating through external certain required functionality. networks to Internet locations. – Security – When the processing is performed locally, it may be possible to • Data or results are often sensitive, and sending them to a allocate extra computational resources to the analysis. remote, third-party Internet site may be unacceptable.

173 174

Copyright 2000 N. AYDIN. All rights reserved. 29 …Downloading Datasets… …Downloading Datasets

• The disadvantages of downloading and storing datasets locally : • can be accomplished in a number of ways: – Stale data – by using established sequence analysis programs, • The local copy is a one-time ‘‘snapshot’’ of the dataset at a particular such as EMBOSS point in time. – http://emboss.sourceforge.net/ – At some stage, it will need to be updated or replaced by newer data. – Storage • have specific methods for performing downloads. • The dataset has to be stored somewhere, and some datasets can be – Typically, datasets are accessed via a standard network large. connection to remote Internet sites. – The Protein Databank (PDB) is close to four gigabytes, and the PDB is one of – Frequently, downloads are automated to occur at regular the smaller databases! intervals. • Consequently, storing multiple copies of the PDB is often impractical. – Performance – The wget program, included with most Linux • The centralised specialist services accessible from the WWW are often systems, can be used to do just this. configured with dedicated parallelised systems, designed to service requests as quickly as possible. • wget is an excellent example of GNU software as – If the stored dataset is designed with such systems in mind, it is unlikely that a distributed by the Foundation. local system will be able to match this advanced processing capability. – https://www.gnu.org/software/wget/ • Consequently, some analyses may be slower locally when compared to those performed on the WWW. – http://gnuwin32.sourceforge.net/packages/wget.htm

175 176

Using wget to download PDB data-files Mirroring a dataset

• To download a single data-file via anonymous • wget can be used to mirror datasets. FTP, simply provide the URL of the data-file – to download the entire PDB, which is four required after the wget command. gigabytes of data, stored in over 18000 data-files: – For example, to download the two PDB structures, wget --mirror ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb use these commands: • Such a command should be invoked only when mkdir structures there is a real need to mirror the PDB. cd structures – a download of this size takes a considerable wget ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb1m7t.ent.Z amount of time and disk space. wget ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb1lqt.ent.Z • If such a need exists, once complete, another invocation of the same command downloads only additions or updates to the PDB since the last mirror.

177 178

Smarter mirroring… …Smarter mirroring

• The wget command (in slide 178) results in a deep • The wget command sets a number of options: – --output-file directory tree. • a disk-file into which any message produced by wget is placed. – The actual data-files are found in – --mirror • turns on mirroring. structures/ftp.rcsb.org/pub/pdb/data/structures/all/pdb – --http-user • Such a deep directory structure can be very • sets the web username to use (if needed). – --http-passwd inconvenient and frustrating to navigate. • sets the web password to use (if needed). – --directory-prefix • Following wget invocation can help with this • the place to put the downloaded data-files. problem: – --no-host-directories • the instruction not to use the hostname when creating a mirrored directory wget --output-file=log --mirror --http-user=anonymous \ structure, which is the ‘‘ftp.rcsb.org’’ part. [email protected] \ – --cut-dirs --directory-prefix=structures/mmCIF \ • instructs wget to ignore the indicated number of directory levels. – In the previous example, six directory levels are to be ignored, that is, the --no-host-directories \ ‘‘pub/pdb/data/structures/all/pdb’’ part. --cut-dirs=6 ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb

179 180

Copyright 2000 N. AYDIN. All rights reserved. 30 Downloading a subset of a dataset… …Downloading a subset of a dataset…

• On many occasions, the entire contents of an • The pdbselect program takes the PDB-Select list FTP site might not be required produced in the Non-Redundant Datasets (discusssed • wget can fetch a specific data-file, placing it in later), builds a list of URLs, removes the duplicates the current directory. and then downloads them: #! /usr/bin/perl – Use a command similar to this: # pdbselect - a program that takes a list of PDB ID wget ftp://beta.rcsb.org/pub/pdb/uniformity/data/mmCIF/all/1ger.cif.Z # codes; build a list of URLs for them; • While multiple URLs to data-files can be # and automates the downloading of them supplied on the command-line (separated by # using ’wget’. spaces), it is often more convenient to place the use strict; my $Base_URL = "ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb"; URLs in a data-file and use the ‘‘--input file=’’ my $Output_Dir = "structures"; switch. open URL_LIST, ">pdb_select_url.lst"

181 182

…Downloading a subset of a dataset… …Downloading a subset of a dataset…

or die "Cannot write to file: ’pdb_select_url.lst’\n"; close URL_LIST; while ( <> ) if ( !-e $Output_Dir ) { { if ( /Failed/ ) system "mkdir $Output_Dir"; { } next; if ( !-w $Output_Dir or !-d $Output_Dir ) } { s/ //g; die "ERROR: Cannot access directory: ’$Output_Dir’. Exiting\n"; my ( $Structure, $Length ) = split ( ":", $_ ); } my ( $ID, $Chain ) = split ( ",", $Structure ); system "sort -u pdb_select_url.lst > unique_urls.lst"; $ID =~ tr /[A-Z]/[a-z]/; system "rm $Output_Dir/* > /dev/null"; print URL_LIST "$Base_URL/pdb$ID.ent.Z\n"; system "wget --output-file=log --http-user=anonymous \ } --http-passwd=email\@some.where.net \ --directory-prefix=$Output_Dir -i unique_urls.lst";

183 184

…Downloading a subset of a dataset The Protein Databank…

• This program takes a list of PDB ID codes from • The similarity between the amino acid sequence of a STDIN and downloads them from the URL specified ‘‘new’’ protein and one previously characterized can in the scalar variable $Base_URL6. give an indication of the function of the new protein. – Those structures marked as Failed are skipped, otherwise a – Sequence search algorithms assume some groups of amino URL is built and written to the pdb_select_url.lst file. acids have similar functional roles and consequently, occur – Duplicate structures are filtered out using the sort -u in both sequences. operating system utility. – It is also assumed that these amino acids have similar local – Error-checking is performed to see if the output directory structures (the amino acids arrangement in space). exists (otherwise it is created) and that the directory can be • It is these structures that determine the function of a accessed. protein. • All previous files in it are then deleted using the rm system call. – Although these assumptions are are useful as a working – Finally, wget is invoked with the list of URLs. model.

185 186

Copyright 2000 N. AYDIN. All rights reserved. 31 …The Protein Databank Determining Biomolecule Structures

• Determining the detailed structure of a protein • There are many methods used for gaining is more difficult than finding a DNA or amino information about the structure of a acid sequence. biomolecule • The aim of some structural studies • The two major methods by which the location – to know how the protein (or other biomolecule) of atoms can be determined to a useful ‘‘does what it does’’ accuracy – to alter its function. – X-Ray Crystallography • for example, to design a small molecule that binds to the – Nuclear Magnetic Resonance (NMR). protein, more commonly known as a ‘‘drug’’.

187 188

X-Ray Crystallography… …X-Ray Crystallography…

• a technique for determining the three-dimensional • a tool used for determining the atomic and molecular structure of molecules, structure of a crystal. – including complex biological macromolecules such as • The underlying principle is that the crystalline atoms proteins and nucleic acids. cause a beam of X-rays to diffract into many specific • a powerful tool in the elucidation of the three- directions. dimensional structure of a molecule at atomic resolution. • Data is collected by diffracting X-rays from a single crystal, which has an ordered, regularly repeating arrangement of atoms. • Based on the diffraction pattern obtained from X-ray scattering off the periodic assembly of molecules or • By measuring the angles and intensities of these diffracted atoms in the crystal, the electron density can be beams, a crystallographer can produce a 3D picture of the reconstructed. density of electrons within the crystal.

189 190

…X-Ray Crystallography… …X-Ray Crystallography

• From this electron density image, the mean positions • Viral capsid structure obtained by of the atoms in the crystal can be determined, as well X-ray crystallography. (A) Poliovirus capsid with T=3 as their chemical bonds, their disorder, and various symmetry. other information. (B) Hepatitis B virus capsid with T=4 – The method revealed the structure and function of many • The major shortcomings of X-ray crystallography biological molecules, including vitamins, drugs, proteins, – it is difficult to obtain a crystal of virus particles, which is a and nucleic acids, such as DNA. prerequisite for X-ray crystallography. • Note that the double helix structure of DNA discovered by James Watson and Francis Crick was revealed by X-ray crystallography. – X-ray crystallography generally requires placing the • Recent advances in image reconstruction technology samples in nonphysiological environments, which can occasionally lead to functionally irrelevant conformational have made X-ray crystallography amenable to the changes. structural analysis of much larger complexes, such as virus particles. 191 192

Copyright 2000 N. AYDIN. All rights reserved. 32 Nuclear magnetic resonance… …Nuclear magnetic resonance…

• In NMR, no crystals are used in the process, and • Eventually, the ‘‘flipped’’ state of the nuclei the protein remains in solution throughout the entire experiment. realigns to the normal state, emitting a radio • An intense and very linear magnetic field aligns frequency pulse as it does so. the atomic nuclei of the protein into one of two • The timing of this re-emission of energy is spin states. determined by the electronic environment in • A series of radio frequency pulses is used to which the nucleus is embedded. perturb these by ‘‘flipping’’ some of the nuclei from one spin state to the other. • A feature of this environment is the lectrostatic • As the total amount of energy absorbed is low, the shielding effects of the surrounding nuclei. protein remains undamaged and functions as • The nuclei, in addition to the bonds linking them, normal. can be identified by their spin decay properties.

193 194

…Nuclear magnetic resonance The Protein Databank…

• A problem with NMR methods is the size of the proteins • contains a large collection of previously that can be studied. determined biological structures. – Using current techniques, this equates to a maximum of 200 amino acids. – For inclusion in the PDB, the spatial locations of the • This is low compared to the many hundreds of amino acids that can be atoms have to be determined with sufficient accuracy studied using X-Ray Crystallography. to usefully describe protein structures. • The X-Ray Crystallography and NMR systems are • also includes experimental details of how the complementary in many respects, as both determine, to a high accuracy, the coordinates of the atoms in protein structure was determined, what publications and structures. other databases to consult for more information on – If protein structures determined by X-Ray Crystallography and the structure, some ‘‘derived data’’ and details of NMR are compared, they are generally consistent with each other any ill-defined regions. and moreover are biologically plausible. – While this information is meant to be included in the • This should give the researcher confidence when using PDB, some of it may be missing, incomplete or them. incorrect for some database entries.

195 196

…The Protein Databank… …The Protein Databank

• one of the oldest bioscience data stores, dating back to 1971. • It originally stored the 3D coordinates of protein structures as • PDB Statistics: determined by the Xray Crystallography method. – Overall Growth of Released Structures Per Year • Prior to the PDB, structures were typically published in journals, and many researchers re-entered the information manually into their computers so as to facilitate further manipulation of them. • The original PDB data-file format adopted was a ‘‘flat’’ textual disk- file that was 80 columns wide. • Today, the structures in the PDB are determined by either X-Ray Crystallography or NMR. – Often, many years of effort go into determining an individual structure. • This is reflected in the growth of the number of entries in the PDB over some 40 years.

197 198

Copyright 2000 N. AYDIN. All rights reserved. 33 The PDB Data-file Formats Example structures

• available in one of two formats. • 1LQT – These formats are inter-convertible – A modern, high-resolution ‘‘Oxidoreductase’’ enzyme • PDB flat file structure produced using X-Ray Crystallographic – The original, generic and highly unstructured PDB data-file techniques. format that is still widely used by researchers. • When biologists talk of ‘‘PDB files’’ or ‘‘PDB format’’, they are referring to this data-file format. • The current standard format is the 2.3 version. • mmCIF – The new PDB data-file format that is designed to offer a highly structured, modern replacement to the original PDB Flat File format. • The mmCIF format is often informally referred to as the ‘‘new PDB format’’.

199 200

Example structures Downloading PDB data-files

• 1M7T • PDB structure data-files can be downloaded – A modern protein structure of ‘‘Thioredoxin’’ produced from many web-site locations on the Internet. using NMR. • The RCSB web-site is always a good place to start: – http://www.rcsb.org/pdb/ • Alternatively, the EBI hosts a European mirror. Follow the links from: – http://www.ebi.ac.uk/services/ to access the PDB from the EBI.

201 202

Accessing Data in PDB Entries Accessing Data in PDB Entries

• There are some common sections to all PDB • In a PDB data-file there is a left-right split (per entries: line) and a top-bottom split (per data-file): – those concerned with • Left-right • indexing, – The left-most characters (a maximum of nine) on • bibliographic data, each line indicate what information is present on • notable features the right-hand side. • 3D coordinates. • Top-bottom • Other sections are radically different from each – There is an upper HEADER section that contains other, as they depend on the experimental the annotation about the structure (top) and a lower technique (X-Ray Crystallography or NMR) coordinates section that contains the 3D spatial used to determine the structure. locations of the atoms in the structure (bottom). 203 204

Copyright 2000 N. AYDIN. All rights reserved. 34 A short description of the most important fields in the PDB data-file A short description of the most important fields in the PDB data-file

• HEADER • JRNL – Contains a description of the structure, the date and the PDB ID – One or more literature references that describe the structure. code. • REMARK 1 through REMARK 999 • TITLE – Details of the experimental methods used to determine the – The title of the structure. structure are contained in this subsection (see the example in the • COMPND next section). – Brief details of the structure. • DBREF • SOURCE – Cross links to other databases. – Identifies which organism the structure came from. • SEQRES • KEYWDS – The official amino acid sequence (protein, RNA or DNA) of the – Lists a set of useful words/phrases that describe the structure. structure. • AUTHOR • HELIX/SHEET – The scientists depositing the structure. – Details of the regions of secondary structure found in the protein. • REVDAT – The date of the last revision. • ATOM/HETATM – The 3D spatial coordinates of particular atoms in the protein structure or other molecules such as water or co-factors.

205 206

Accessing PDB Annotation Data Free R and resolution…

• There are many examples of parsing data from the • The REMARK tag, type 2 subsection stores resolution, HEADER section of PDB data-files, all of which whereas the Free R value is quoted in REMARK tag, involve pattern matching. type 3. – Perl is exceptionally good at this. – Here’s a small extract from the 1LQT entry: • Two representative examples exploring REMARK 2 – the relationship between the resolution of a structure and its REMARK 2 RESOLUTION. 1.05 ANGSTROMS. Free R value, both of which are measures of the quality of • In NMR structures, REMARK tag, type 2 and type 3 are the X-Ray Crystallographic structures. present, but the data in them is ‘‘NOT APPLICABLE’’ • the Free R value measures the agreement between the model and the for REMARK tag, type 2 and ‘‘NULL’’ or free text for observed x-ray reflection data. REMARK tag, type 3. • The lower the Free R Value, the better the fit between the model and the observed data. – Extract from the 1M7T structure’s HEADER: REMARK 215 NMR STUDY – the database cross-referencing section used to link to other REMARK 215 THE COORDINATES IN THIS ENTRY WERE GENERATED FROM SOLUTION databases. REMARK 215 NMR DATA. PROTEIN DATA BANK CONVENTIONS REQUIRE THAT REMARK 215 CRYST1 AND SCALE RECORDS BE INCLUDED, BUT THE VALUES ON REMARK 215 THESE RECORDS ARE MEANINGLESS.

207 208

…Free R and resolution A Perl program extracts the resolution and Free R Value from any PDB data-files

• Structural Refinement is the process of iteratively #! /usr/bin/perl -w # free_res - Designed to extract the ’Free R Value’ and ’Resolution’ fitting the model structure into the electron density # quantities from ’PDB data-files’ containing structures map, and details of this refinement are stored in # produced by ’Diffraction’. REMARK tag, type 3. use strict; my $PDB_Path = shift; • Here is an extract : opendir ( INPUT_DIR, "$PDB_Path" ) . or die "Error: Cannot read from mmCIF directory: ’$PDB_Path’\n"; . . my @PDB_dir = readdir INPUT_DIR; REMARK 3 FIT TO DATA USED IN REFINEMENT. close INPUT_DIR; REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT my @PDB_Files = grep /\.pdb/, @PDB_dir; REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM foreach my $Current_PDB_File ( @PDB_Files ) REMARK 3 R VALUE (WORKING + TEST SET) : 0.134 REMARK 3 R VALUE (WORKING SET) : 0.134 { REMARK 3 FREE R VALUE : 0.153 my $Free_R; REMARK 3 FREE R VALUE TEST SET SIZE (%) : NULL my $Resolution; REMARK 3 FREE R VALUE TEST SET COUNT : 2200 open ( PDB_FILE, "$PDB_Path/$Current_PDB_File" ) or die "Cannot open PDB File: ’$Current_PDB_File’\n";

209 210

Copyright 2000 N. AYDIN. All rights reserved. 35 A Perl program extracts the resolution and Free R Value from any PDB data-files A Perl program extracts the resolution and Free R Value from any PDB data-files

while ( ) if ( $Free_R =~ /NULL/ or $Resolution eq "" ) { { if ( /^EXPDTA / and !/DIFFRACTION/ ) last; { } last; else } { if ( /^REMARK 2 RESOLUTION/ ) printf ( "%7s %4.2f %7.3f \n", $Current_PDB_File, { $Resolution, $Free_R ); ( undef, undef, undef, $Resolution ) = split ( " ", $_ ); last; } } if ( /^REMARK 3 FREE R VALUE / ) } { } $Free_R = substr ( $_, 47, 6 ); close ( PDB_FILE ); $Free_R =~ s/ //g; }

211 212

Non-redundant Datasets Reduction of redundancy

• There may be many reasons for redundancy in a • There are two reasons for supporting the reduction dataset. of a database: – Scientific • It is often advantageous to study molecules with similar – Conceptually, to remove bias within the database. structures. • The statistical analysis based upon the non-redundant dataset – This is a classic scientific investigative methodology: change a will be more representative of all the items in the database, small part, then identify the change in structure or function to form hypotheses about the reasons for the change. rather than just the largest dominant group. – Consequently, researchers are encouraged to study similar molecules to those studied previously. – As a practical measure, to reduce the computational – Technological limitations requirements caused by analysing examples that are • In X-Ray Crystallography, it is easier to obtain the structure unnecessary. of a molecule that is similar to one that is already known, as • For example, the PDB-Select structural non-redundant dataset molecules with similar conformations are likely to have similar crystallisation conditions. contains approximately 1600 protein structures, whereas the – This, conveniently, allows two of the most difficult aspects of entire PDB contained approximately 18,000. using X-Ray Crystallography to be dealt with.

213 214

Database cross references… …Database cross references…

• The DBREF subsection gives a list of cross references to other Bioinformatics databases. • The PDB publishes a table of database names – This makes it easier for researchers to integrate biological and their associated, abbreviated codes. datasets. • The second value on the DBREF line is the PDB identifier. – By examining this value, researchers and automatic parsing programs can tell to which structure the entry belongs. • Example DBREF lines : DBREF 1LQT A 1 456 GB 13882996 AAK47528 1 456 DBREF 1LQT B 1 456 GB 13882996 AAK47528 1 456

DBREF 1AFI 1 72 SWS P04129 MERP_SHIFL 20 91

DBREF 1M7T A 1 66 SWS P10599 THIO_HUMAN 0 65 DBREF 1M7T A 67 106 SWS P00274 THIO_ECOLI 68 107

215 216

Copyright 2000 N. AYDIN. All rights reserved. 36 …Database cross references Coordinates section

• The DBREF lines identify the following fields, working • The coordinate data for the locations of atoms in the macromolecular structure is straightforward, especially when from left to right: compared to the annotation contained in the HEADER section – PDB ID code. of the PDB data-file. – Chain identifier (if needed). – The coordinates are presented as points in space, the atoms they represent are actually in motion. – The start of the sequence. – In crystallographic structures, isotropic B-factors, commonly referred to as ‘‘Temperature Factors’’, give us an idea of the vibration of the molecule. – Insertion code. • For very high-resolution structures, Anisotropic Temperature Factors may be included in the ANISOU lines. – End of the sequence. – These provide an idea of the vibration of the molecule in the directions of the – The external database to which the cross reference refers. coordinate axes. – In NMR structures, the variation in position of a particular atom between – The external database accession code. different models in the ensemble can be used as a similar measure of motion or as an indication of the error between the minimisation models. – The database external accession name. • Here is an example from 1M7T: – The start, insertion and end of the sequence in the external database. REMARK 210 REMARK 210 BEST REPRESENTATIVE CONFORMER IN THIS ENSEMBLE : 21 REMARK 210

217 218

Data section Data section

• Referring to the 1LQT x-ray structure, an extract of lines from the coordinate section looks like this:

219 220

Data section Data section

• For the 1M7T NMR structure, an extract of lines from the coordinate section looks like this:

221 222

Copyright 2000 N. AYDIN. All rights reserved. 37 Data section Extracting 3D co-ordinate data

• In each ATOM line, the fields are as follows: • The technique involves extracting the three substrings from each line that contains the X, Y and Z coordinates. • Assuming the data is in $_, three invocations of Perl’s substr subroutine do the trick:

my ( $X, $Y, $Z ) = ( substr( $_, 30, 8 ),

substr( $_, 38, 8 ),

substr( $_, 46, 8 ) );

223 224

The simple_coord_extract program Results from simple_coord_extract ...

#! /usr/bin/perl -w X, Y & Z: 25.150, -8.702, 38.505

# simple_coord_extract - Demonstrates the extraction of X, Y & Z: 23.675, -8.497, 35.069 # C-Alpha co-ordinates from a PDB X, Y & Z: 20.747, -6.252, 34.332 # data-file. X, Y & Z: 17.545, -8.297, 34.292

use strict; X, Y & Z: 15.182, -7.484, 31.454 X, Y & Z: 11.736, -8.952, 30.942 while ( <> ) X, Y & Z: 10.261, -9.014, 27.451 { if ( /^ATOM/ && substr( $_, 13, 4 ) eq "CA " ) X, Y & Z: 6.507, -9.548, 27.173 { my ( $X, $Y, $Z ) = ( substr( $_, 30, 8 ), substr( $_, 38, 8 ), substr( $_, 46, 8 ) );

$X =~ s/ //g; $Y =~ s/ //g; $Z =~ s/ //g;

print "X, Y & Z: $X, $Y, $Z\n"; } }

225 226

Introducing Databases… …Introducing Databases

• Many modern computer systems store vast amounts of • For example, structured data. – if the first column in a row holds a date, then every • Typically, this data is held in a database system. first column in every row must also hold a date. • Database – if the second column holds a name, then every – a collection of one or more related tables. second column must also hold a name, and so on. • Table • The following data corresponds to the – a collection of one or more rows of data. structure, in that there are two columns, the • The rows of data are arranged in columns, with each intersection of first holding a date, the second holding a name: a row and column containing a data item. • Row 1960-12-21 P. Barry – a collection of one or more data items, arranged in columns. 1954-6-14 M. Moorhouse • Within a row, the columns conform to a structure.

227 228

Copyright 2000 N. AYDIN. All rights reserved. 38 Structured data Relating tables…

• Each column can be given a descriptive name. • Extending the Discoveries table to include details of the discovery, an ------additional column is needed to hold the data Discovery_Date Scientist ------Discovery_Date Scientist Discovery 1960-12-21 P. Barry 1954-6-14 M. Moorhouse ------1970-3-4 J. Blow 1960-12-21 P. Barry Flying car 2001-12-27 J. Doe 1954-6-14 M. Moorhouse Telepathic sunglasses • In addition, the structure requires that each data item 1970-3-4 J. Blow Self cleaning child held in a column be of a specific type. 2001-12-27 J. Doe Time travel ------• The inclusion of this new column requires an update to the structure of the table Column name Type restriction ------Discovery_Date a valid Date Column name Type restriction Scientist a String no longer than 64 characters ------• This type information generally goes by one of two Discovery_Date a valid Date names: metadata or schema. Scientist a String no longer than 64 characters Discovery a String no longer than 128 characters

229 230

…Relating tables The problem with single-table databases

• Although the above table structure solves the problem of ------Column name Type restriction uniquely identifying each scientist, it introduces some other ------problems: Discovery_Date a valid Date – If a scientist is responsible for a large number of discoveries, Scientist a String no longer than 64 characters their identification information has to be entered into every row Discovery a String no longer than 128 characters of data that refers to them. Date_of_birth a valid Date • This is time-consuming and wasteful. Telephone_number a String no longer than 16 characters – Every time identification information is added to a row for a particular scientist, it has to be entered in exactly the same way as the identification information added already. • Despite the best of efforts, this level of accuracy is often difficult to ------achieve. Discovery_Date Scientist Discovery Date_of_birth Telephone_number ------– If a scientist changes any identification information, every row in 1960-12-21 P. Barry Flying car 1966-11-18 353-503-555-91910 1954-6-14 M. Moorhouse Telepathic sunglasses 1970-3-24 00-44-81-555-3232 the table that refers to the scientist’s discoveries has to be 1970-3-4 J. Blow Self cleaning child 1955-8-17 555-2837 changed. 2001-12-27 J. Doe Time travel 1962-12-1 - • This is drudgery. 1974-3-17 M. Moorhouse Memory swapping toupee 1970-3-24 00-44-81-555-3232 1999-12-31 M. Moorhouse Twenty six hour clock 1958-7-12 416-555-2000

231 232

Solving the one table problem… …Solving the one table problem

------• The problems described in the previous section are solved by breaking the all-in- one Discoveries table into two tables. Discovery_Date Scientist_ID Discovery • Here is a new structure for Discoveries: ------1954-6-14 MM Telepathic sunglasses ------1960-12-21 PB Flying car Column name Type restriction 1969-8-1 PB A cure for bad jokes ------1970-3-4 JB Self cleaning child Discovery_Date a valid Date 1974-3-17 MM Memory swapping toupee Scientist_ID a String no longer than 8 characters Discovery a String no longer than 128 characters 1999-12-31 MM2 Twenty six hour clock 2001-12-27 JD Time travel ------

Column name Type restriction ------Scientist_ID Scientist Date_of_birth Address Telephone_number Scientist_ID a String no longer than 8 characters ------JB J. Blow 1955-8-17 Belfast, NI 555-2837 Scientist a String no longer than 64 characters JD J. Doe 1962-12-1 Syndey, AUS - Date_of_birth a valid Date MM M. Moorhouse 1970-3-24 England, UK 00-44-81-555-3232 MM2 M. Moorhouse 1958-7-12 Toronto, CA 416-555-2000 Address a String no longer than 256 characters PB P. Barry 1966-11-18 Carlow, IRL 353-503-555-91910 Telephone_number a String no longer than 16 characters

233 234

Copyright 2000 N. AYDIN. All rights reserved. 39 Relational Databases Database system: a definition

• A database system is a computer program (or group • Relating data in one table to that in another of programs) forms the basis of modern database theory. – that provides a mechanism to define and manipulate – It also explains why so many modern database • one or more databases technologies are referred to as Relational Database • A database system Management Systems (RDBMS). – allows databases, tables and columns to be created and named, and structures to be defined. • When a collection of tables is designed to – provides mechanisms to add, remove, update and interact relate to each other they are collectively with the data in the database. referred to as a database. • Data stored in tables can be searched, sorted, sliced, diced and cross-referenced. • It is usually a requirement to give the database • Reports can be generated, and calculations can be a descriptive name. performed.

235 236

Available Database Systems Chosing Database System

• Personal database systems: • Which type of database system is chosen – Designed to run on PCs depends on a number of factors, including (but • Access, Paradox, FileMaker, dBase not limited to): • Enterprise database systems: – Designed to support efficient storage and retrieval of vast amount of data – The amount of data to be stored in the database. • Interbase, Ingres, SQL Server, Informix, DB2, Oracle – Whether the data supports a small personal project or • Open source database systems: a large collaborative one. – Free!!! (Linux!!!) – How much funds (if any) are available towards the • PostgreSQL, MySQL purchase of a database system.

237 238

SQL: The Language of Databases Installing a database system

• Defining data with SQL (structured query • MySQL is a modern, capable and SQL-enabled database system. • It is Open Source and freely available for download from the language) MySQL web-site: http://www.mysql.com • It comes as a standard, installable component of most Linux • SQL provides two facilities: distributions • The following commands switch on MySQL on RedHat and – A database definition Language (DDL) RedHat-like Linux distributions chkconfig --add mysqld • provides a mechanism whereby databases can be created chkconfig mysqld on • If the first chkconfig command produces an error messages like this: – A Data Manipulation Language (DML) error reading information on service mysqld: No such file or directory • provides a mechanism to work with data in tables this means that MySQL is not installed and the second command will also fail. 239 240

Copyright 2000 N. AYDIN. All rights reserved. 40 Installing a database system A Database Case Study: MER

• Once MySQL is installed, it needs to be configured. • A small collection of SWISS-PROT and EMBL entries are taken from the Mer Operon, a bacterial gene cluster that is • The first requirement is to assign a password to the found in many bacteria for the detoxification of Mercury MySQL superuser, known as ‘‘root’’. Hg2+ ions. • These provide the raw data to a database, which is called • The mysqladmin program does this, as follows: MER. • The MER database contains four tables: mysqladmin -u root password 'passwordhere‘ – proteins – A table of protein structure details, extracted from a collection of SWISS-PROT entries. – dnas – A table of DNA sequence details, extracted from a • It is now possible to securely access the MySQL Monitor collection of EMBL entries. command-line utility with the following command, providing – crossrefs – A table that links the extracted protein structures to the correct password when prompted: the extracted DNA sequences. – citations – A table of literature citations extracted from both the SWISS-PROT and EMBL DNA entries. -u root -p

241 242

A Database Case Study: MER Creating the MER database…

• SQL queries can be entered directly at the MySQL Monitor prompt. • Once the raw data is in the database, SQL can be mysql> create database MER; used to answer questions about the data. Query OK, 1 row affected (0.36 sec) mysql> show databases; • For instance: +------+ – How many protein structures in the database are longer | Databases | than 200 amino acids in length? +------+ | MER | – How many DNA sequences in the database are longer | test | than 4000 bases in length? | mysql | – What’s the largest DNA sequence in the database? +------+ 3 rows in set (0.00 sec) – Which protein structures are cross-referenced with • A list of databases is returned by MySQL. which DNA sequences? • There are three identified databases: – MER – The just-created database that will store details on the extracted protein structures, DNA sequences, cross – Which literature citations reference the results from the references and literature citations. previous question? – test – A small test database that is used by MySQL and other technologies to test the integrity of the MySQL installation. – mysql – The database that stores the internal ‘‘system information’’ used by the MySQL database system.

243 244

…Creating the MER database Adding tables to the MER database

• It is possible to use the MySQL superuser to create tables within the MER database. • However, it is better practice to create a user within the database system to have create table proteins authority over the database, and then perform all operations on the MER database ( as this user. • The queries to do this are entered at the MySQL Monitor prompt. accession_number varchar (6) not null, • Here are the queries and the messages returned: code varchar (4) not null, mysql> use mysql; species varchar (5) not null, Database changed last_date date not null, mysql> grant all on MER.* to bbp identified by 'passwordhere'; description text not null, Query OK. 0 rows affected (0.00 sec) mysql> quit sequence_header varchar (75) not null, Bye sequence_length int not null, • The first query tells MySQL that any subsequent queries are to be applied to the sequence_data text not null named database, which in this case is the mysql database. ) • The second query does three things: – creates a new MySQL user called ‘‘bbp’’. $ mysql -u bbp -p MER < create_proteins.sql – assigns a password with the value of ‘‘passwordhere’’ to user ‘‘bbp’’. – grants every available privilege relating to the MER database to ‘‘bbp’’.

245 246

Copyright 2000 N. AYDIN. All rights reserved. 41 Databases and Perl Perl Database Technologies

• Why Program Databases? • A number of third-party CPAN modules provide access – Customised output handling to MySQL from within a Perl program. • Programs can be written to post-process the results of any SQL – One such module is Net::MySQL by Hiroyuki Oyama, query and display them in any number of preferred formats. which provides a stable programming interface to MySQL functionality. – Customised input handling • Users of customised input handling programs do not need to know • In fact, nearly every database system provides a specific anything about SQL – all they need to know and understand is their technology for programmers to use when programming data. their particular database. – Extending SQL – This technology is referred to as an API, an application • Some tasks that are difficult or impossible to do with SQL can be programming interface. programmed more easily. • Unfortunately, the effort expended in learning how to – Integrating MySQL into custom applications use Net::MySQL is of little use when a program has to • Having the power of MySQL as a component of an application can be written to interface with Oracle or Sybase be very powerful.

247 248

Perl Database Technologies Preparing Perl

DBI and DBD::mysql modules need to be installed • The DBI module provides a database $ man DBI independent interface for Perl. $ man DBD::mysql – By providing a generalised API, programmers can program at a ‘‘higher level’’ than the API provided by the database system, in effect insulating $ find `perl -Te 'print "@INC"' ` -name '*.pm' -print | grep 'DBI.pm' programs from changes to the database system. $ find `perl -Te 'print "@INC"' ` -name '*.pm' -print | grep 'mysql.pm' • To connect the high-level DBI technology to a $ locate DBI.pm particular database system, a special driver $ locate mysql.pm converts the general DBI API into the database DBI (previously called DBperl) is a database independent interface module for Perl. system-specific API. DBD: Data Base Description – These drivers are implemented as CPAN modules.

249 250

Checking the DBI installation Programming Databases With DBI

#! /usr/bin/perl -w

#! /usr/bin/perl -w # show_tables - list the tables within the MER database. # Uses "DBI::dump_results" to display results. # check_drivers - check which drivers are installed with DBI. use strict; use DBI qw( :utils ); use strict; use constant DATABASE => "DBI:mysql:MER"; use constant DB_USER => "bbp"; use constant DB_PASS => "passwordhere"; use DBI; my $dbh = DBI->connect( DATABASE, DB_USER, DB_PASS ) or die "Connect failed: ", $DBI::errstr, ".\n"; • DATABASE my @drivers = DBI->available_drivers; – Identifies the data source to use my $sql = "show tables"; • DB_USER my $sth = $dbh->prepare( $sql ); foreach my $driver ( @drivers ) – Identifies the username to use $sth->execute; when connecting to the data source { print dump_results( $sth ), "\n"; • DB_PASS $sth->finish; print "Driver: $driver installed.\n"; – Identifies the password to use } $dbh->disconnect; when authenticating to the data source 251 252

Copyright 2000 N. AYDIN. All rights reserved. 42 The Sequence Retrieval System… …The Sequence Retrieval System…

• Sequence Retrieval System (SRS) is a web-based • SRS makes it very easy to query the entire data database integration system that allows for the set, using common search terms that work across querying of data contained in a maltitude of all the different databases, regardless of what they databases, all through a single . are. • This makes the individual databases appear as if • Everything contained within the SRS is ‘‘tied they are really one big relational database, together’’ by the web-based interface. organised with different subsections: • Figure in the next slide is the database selection – one called SWISS-PROT, page from the EBI’s SRS web-site, which can be – one called EMBL, navigated to from the following Internet address: – one called PDB, – http://srs.ebi.ac.uk – … • SRS is a trademark and the intellectual property of Lion Bioscience

253 254

EBI's SRS Database Selection Page …The Sequence Retrieval System

• SRS is important for two reasons: – It is a useful and convenient service that every Bioinformatician should know about. – It is an excellent example of what can be created when the World Wide Web, databases and programming languages are combined.

• Warning: – Don't create a new data format unless absolutely necessary. – Use an existing format whenever possible

255 256

Web Technologies The Web Development Infrastructure…

• The WWW was invented in 1991 by Tim Berners- • The web server Lee. – a program that when loaded onto a computer system, provides for the publication of data and applications • The ability to publish data and applications on the • often referred to collectively as content Internet, in the form of custom web pages, is now – Examples (apache, Jigsaw, and Microsft’s IIS) considered an essential skill in many disciplines, • The web client including Biology. – a program that can request content from a web server • The development infrastructure of the World and display the content within a graphical window, providing a mechanism whereby user can interact with Wide Web (WWW) is well established and well the contents understood. • The common name for the web client is web browser • There is a standard set of infrastructural – Examples (Chrome, Mozilla, MS Internet Explorer, KDE Konqueror, Opera, , …) components (as suggested by Tim Berners-Lee):

257 258

Copyright 2000 N. AYDIN. All rights reserved. 43 …The Web Development Infrastructure Additional components…

• Transport protocol • Client-side programming – the “language” that the web server and web client use – a technology used to program the web client, providing when communicating with each other. a way to enhance the user’s interactive experience. • Think of this as the set of rules and regulations to which the • Java applets, JavaScript, Macromedia Flash, … client and server must adhere. • Server-side programming – The transport protocol employed by the WWW is called – a technology used to program the web server, HyperText Transport Protocol (HTTP) providing a mechanism to extend the services provided • The content by the web server. – the data and applications published by the web server • Java Servlets, JSP, Python, ASP, PHP, Perl, … • this is textual data formatted to conform to one of the • Backend database technology HyperText Mark-up Language standards (HTML) – a place to store the data to be published, which is – HTML can be enhanced with embedded graphics. accessed by the server-side programming technology. • Data published in the form of HTML is often referred to as • MySQL, … HTML pages or web pages

259 260

…Additional components Creating Content For The WWW…

• The acronym LAMP is used to describe the favoured • There are a number of techniques employed to WWW development infrastructure of many programmers. create HTML: – The letters that form the acronym are taken from the words Linux, Apache, MySQL and Perl/Python/PHP. – Creating content manually • O’Reilly & Associates provides an excellent LAMP web-site, available • Any text editor can be used to create HTML, since HTML is mostly on-line at text. – http://www.onlamp.com. • Special tags within the text guide the web browser when it comes to • The additional components turn the standard web displaying the web page on screen. development infrastructure into a dynamic and powerful • The tags are also textual and any text editor can produce them • Advantages and disadvantages: application development environment. – provides the maximum amount of flexibility as the creator has complete • One of the reasons the WWW is so popular is the fact that control over the process. – to know what’s going on behind the scenes, so learning HTML is highly creating content is so straightforward recommended – Adding a programming language into the mix allows even more – can be time-consuming and tedious, as the creator of the page has to write to be accomplished the content as well as decide which tags to use and where

261 262

…Creating Content For The WWW… …Creating Content For The WWW

– Creating content visually – Creating content dynamically • Special-purpose editors can create HTML pages • Since HTML is text, it is also possible to create HTML visually, displaying the web page as it will appear in the from a program. web browser as it is edited. • Advantages and disadvantages: – HTML pages produced in this way can sometimes be useful – Netscape Composer, FrontPage and Macromedia when combined with a web server that allows for server-side Dreamweaver, .... programming of a backend database • Advantages and disadvantages: • needs a web page creator to write a program to produce – no need to know anything about HTML. even the simplest of pages – The editor adds the required tags to the text that’s entered by the user – unnecessary tags added, • A useful web-site: – HTML pages are larger – http://www.htmlprimer.com

263 264

Copyright 2000 N. AYDIN. All rights reserved. 44 A Simple HTML Page… …A Simple HTML Page

• Content of simple-m.html created manually • Content of simple-k.html created visually – by using KompoZer • http://www.kompozer.net/ A Simple HTML Page "http://www.w3.org/TR/html4/strict.dtd"> A Simple HTML Page This is as simple a web page as there is.

265 266

Producing HTML… …Producing HTML…

• Producing HTML with a Perl program using a HERE document: • HTML file produced by the program:

#! /usr/bin/perl -w # produce_simple - produces the "simple.html" web page using # a HERE document. A Simple HTML Page use strict; This is as simple a web page as there is. print < A Simple HTML Page This is as simple a web page as there is. WEBPAGE

267 268

…Producing HTML

• Another version of HTML generation • The CGI module is designed to make the – written to use Perl’s standard CGI module production of HTML as convenient as #! /usr/bin/perl -w possible.

# produce_simpleCGI - produces the "simple.html" web page using • start_html subroutine produces the tags that # Perl's standard CGI module. appear at the start of the web page.

use strict; • end_html subroutine produces the following HTML, representing tags that conclude a web use CGI qw( :standard ); page: print start_html( 'A Simple HTML Page' ), "This is as simple a web page as there is.", end_html;

269 270

Copyright 2000 N. AYDIN. All rights reserved. 45 Results from produce_simpleCGI Static creation of WWW content

• HTML file produced by the program: • simple.html web page is static appear in exactly the same way every time it is A Simple HTML Page accessed. – It is static, and remains unchanged until someone takes the time to change it. This is as simple a web page as there is. • It rarely makes sense to create such a web page with a program unless you have a special Extra staff at the start is optional. Extra tags tell the web browser exactly which version of HTML the web page conforms to. The CGI module includes these requirement. tags for web browser to optimise its behaviour to the version of HTML identified – Create static web pages either manually or visually

271 272

The dynamic creation of WWW content The dynamic creation of WWW content

• When the web page includes content that is not #! /usr/bin/perl -wT static, it is referred to as dynamic web page. # whattimeisit - create a dynamic web page that includes the – For example a page including current date and time # current date/time. • It is not possible to creat a web page either manually or visually that includes dynamic use strict;

content, and use CGI qw( :standard ); – this is where server side programming technologies come into their own. print start_html( 'What Date and Time Is It?' ), "The current date/time is: ", scalar localtime, end_html;

273 274

Results from whattimeisit And some time later

>perl -wT whattime.pl >perl -wT whattime.pl

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> What Date and Time Is It? xml:lang="en-US">What Date and Time Is It? The current date/time is: Thu Mar 29 18:56:17 The current date/time is: Thu Mar 29 18:59:59 2007>Exit code: 0 2007>Exit code: 0

• This web page, if served up by a web server, changes with each serving, as it is dynamic.

275 276

Copyright 2000 N. AYDIN. All rights reserved. 46 Sending Data To A Web Server…

• Note that use of the “T” command-line option at the • Switch on taint mode on the Perl command line start of the program. • Use CGI module, importing (at least) the standart set – This switches on Perl’s taint mode, of subroutines • which enables a set of special security checks on the behaviour of • Ensure the first print statement within the program is the program. “print header”; • If a server-side program does something that could • Envelope any output sent to STDOUT with calls to potentially be exploited and, as a consequence, pose a the start_html and end_html subroutines sequrity treat, Perl refuses to execute the program • Create a static web page to invoke the server-side when taint mode is enabled. program, providing input as necessary • Always enable ‘‘taint mode’’ for server-side programs • Test your web-site on localhost prior to deployment on the Internet

277 278

…Sending Data To A Web Server match_emblCGI, cont.

#! /usr/bin/perl -wT print start_html( "The results of your search are in!" ); print "Length of sequence is: ", length $sequence, # The 'match_emblCGI' program - check a sequence against the EMBL " characters.

"; # database entry stored in the print h3( "Here is the result of your search:" ); # embl.data.out data-file on the # web server. my $to_check = param( "shortsequence" );

use strict; $to_check = lc $to_check;

use CGI qw/:standard/; if ( $sequence =~ /$to_check/ ) { print header; print "Found. The EMBL data extract contains: $to_check."; } open EMBLENTRY, "embl.data.out" else or die "No data-file: have you executed prepare_embl?\n"; { print "Sorry. No match found for: $to_check."; my $sequence = ; } print p, hr, p; close EMBLENTRY; print "Press Back on your browser to try another search."; print end_html; 279 280

A Search HTML Page The ``Search the Sequence for a Match'' web page

Search the Sequence for a Match Please enter a sequence to match against:

figMERSEARCH.eps

281 282

Copyright 2000 N. AYDIN. All rights reserved. 47 Installing CGIs on a Web Server The ``Results of your search are in!'' web page

$ su

$ cp mersearch.html /var/www/html

$ cp match_emblCGI /var/www/cgi-bin

$ chmod +x /var/www/cgi-bin/match_embl figMERSEARCHFOUND.eps $ cp embl.data.out /var/www/cgi-bin

$

283 284

The ``Sorry! Not Found'' web page Using a HERE document

print <

Please enter another sequence to match against:

figMERSEARCHSORRY.eps

MERFORM

285 286

Better version: ``Results of your search are in!'' web page Searching all the entries in the dnas table

figMERSEARCHBETTER.eps figMERSEARCHMULTI.eps

287 288

Copyright 2000 N. AYDIN. All rights reserved. 48 The ``results'' of the multiple search on the dnas table Installing DB Multi-Search

$ su

$ cp mersearchmulti.html /var/www/html

$ cp db_match_emblCGI /var/www/cgi-bin

$ chmod +x /var/www/cgi-bin/db_match_emblCGI

$ cp /home/barryp/DbUtilsMER.pm /var/www/cgi-bin

$

289 290

Web Automation Strategy to follow when automating interactions with any web page

• Imagine you have 100 sequences to check. • Load the web page of interest into a graphical browser • If it takes average 1 minutes to enter the • Wiev the HTML used to display the web page by selecting the Page Source option from browser’s sequence into text area, entering 100 sequences View menu requires 100 minutes • Read the HTML and make a note of nthe names of the • Why not automate it to save time interface elements and form buttons that are of interest • Write a Perl program that user WWW::Mechanize to • Perl module WWW::Mechanize allows interact with the web page (based on automatch, if programmer to automate interactions with any needed) • Use an appropriate regular expression to extract the web-site interesting bits from the results returned from the web server

291 292

The automatch program… …The automatch program

#! /usr/bin/perl -w $browser->get( URL ); $browser->form( 1 ); $browser->field( "shortsequence", $seq ); # The 'automatch' program - check a collection of sequences against $browser->submit; # the 'mersearchmulti.html' web page. if ( $browser->success ) { use strict; my $content = $browser->content; while ( $content =~ use constant URL => "http://pblinux.itcarlow.ie/mersearchmulti.html"; m[(\w+?)yes]g ) { use WWW::Mechanize; print "\tAccession code: $1 matched '$seq'.\n"; } my $browser = WWW::Mechanize->new; } else while ( my $seq = <> ) { { print "Something went wrong: HTTP status code: ", chomp( $seq ); $browser->status, "\n"; } } print "Now processing: '$seq'.\n"; 293 294

Copyright 2000 N. AYDIN. All rights reserved. 49 Running the automatch program Results from automatch

Accession code: AF213017 matched 'aaattt'. $ chmod +x automatch Accession code: J01730 matched 'aaattt'. Accession code: M24940 matched 'aaattt'. $ ./automatch sequences.txt Now processing: 'acgatccgcaagtagcaacc'. Accession code: M15049 matched 'acgatccgcaagtagcaacc'. Now processing: 'gggcccaaa'. Results from automatch Now processing: 'atcgatcg'. Now processing: 'tcatgcacctgatgaacgtgcaaaaccacag'. Accession code: AF213017 matched 'tcatgcacctgatgaacgtgcaaaaccacag'. Now processing: 'attccgattagggcgta'. . Now processing: 'aattc'. . Accession code: AF213017 matched 'aattc'. Now processing: 'ccaaat'. Accession code: J01730 matched 'aattc'. Accession code: AF213017 matched 'ccaaat'. Accession code: M24940 matched 'aattc'. Accession code: J01730 matched 'ccaaat'. Now processing: 'aatgggc'. Accession code: M24940 matched 'ccaaat'. Now processing: 'aaattt'.

295 296

Viewing the source of the mersearchmulti.html web page

figMERSEARCHSOURCE.eps

297

Copyright 2000 N. AYDIN. All rights reserved. 50