• CSCI-GA.3033.003 • 6/7/2012 Textual Data Processing (Perl)

CSCI-GA.3033.003 Scripting Languages
6/7/2012
Textual data processing (Perl)

Visibility in VBA

[Diagram labels: Project P1, Project P2, Module M1A, Module M1B, Sub S]
• S is visible from M1B if S is not private
• M1B is visible from P2 if P2 has a reference to P1 and M1B is not private

© Martin Hirzel CSCI-GA.3033.003 NYU 6/7/12

Outline

• Perl Basics
• Regular Expressions
• At 7:30pm: Quiz 1

About Perl

• Practical Extraction and Report Language
  – Regular expressions
  – String interpolation
  – Associative arrays
• TIMTOWTDI
  – There is more than one way to do it
  – Make easy jobs easy
  – … without making hard jobs impossible
• Pathologically Eclectic Rubbish Lister


Concepts: Orthogonality

Definition of orthogonal      Language design principle                Violation of orthogonality
At right angles (unrelated)   Uniform rules for feature interaction    VBA object assignments
Not redundant                 Few, but general, features               VBA positional vs. named args

Perl is diagonal rather than orthogonal:
“If I walk from one corner of the park to another, I don’t walk due east and then due north. I go northeast.” [Larry Wall]
⇒ shortcut features even when not orthogonal

Perl: Related Languages

• Predecessors: C, sed, awk, sh
• Successors:
  – PHP (“PHP 1” = collection of Perl scripts)
  – Python, JavaScript (different languages, inspired by Perl’s strengths + weaknesses)
• Perl 5 (current version, since 1994)
• Perl 6
  – Larry Wall has been talking about it since 2001
  – Release unclear; not backward compatible


Perl: How to Write + Run Code

• perl [-w] -e 'perl code'
  – “-w” flag produces warnings
• perl [-w] script.pl
• script.pl
  – Write the file in Vi or Emacs or …
  – chmod u+x script.pl
    Makes script executable
  – #!/usr/bin/perl -w
    In first line of script, specifies interpreter
• perl [-w] -d -e 42
  – Read-eval-print loop (debug the script “42”)

Perl: Lexical Peculiarities

• “term” = token
• Single-line comments: #
• Semicolon required after statements, unless the statement is the last one in a block: {last; in; block}
• Quotes around certain strings (bare words) optional in certain cases (e.g., as hash key)
• v-string: v13.10 = "\x{0D}\x{0A}" (each component is a decimal character code)
• Interpolation; pick-your-own-quotes; Heredocs; POD (Plain Old Documentation)
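The lexical features listed above can be tried out in a few lines. This is a minimal sketch, not code from the lecture; the variable names and strings are made up for illustration.

```perl
use strict;
use warnings;

my $name = 'world';

# Pick-your-own-quotes: q() behaves like '...', qq{} like "...".
my $raw    = q(no $interpolation here);
my $cooked = qq{Hello, $name!};

# Heredoc: a multi-line double-quoted string that ends at the END marker.
my $text = <<"END";
Dear $name,
this is a heredoc.
END

# v-string: each dot-separated number is a character code in decimal.
my $crlf = v13.10;

# Bare word as hash key: the quotes around "lang" are optional here.
my %info = (lang => 'Perl');

print "$raw\n$cooked\n$text";
print "v13.10 is CR LF\n" if $crlf eq "\x0D\x0A";
print "language: $info{lang}\n";
```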


Perl: Types

[Diagram of the type hierarchy. Node labels: scalar, (composite), file handle; numeric, string, void, reference; array, hash; int, double; function reference, (other references), UNIVERSAL (blessed references)]

Perl: Sigils, a.k.a. “Funny Characters”

• Symbol that must appear in front of variable, showing its type
  – $=scalar, @=array, %=hash, &=function, *=typeglob
  – E.g., $a[0] is element 0 of array @a
• Unlike shell, Perl requires sigil also on left-hand side of assignment
• ${id} is the same as $id
• Function sigil & not required for call
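A short sketch of the sigil rules above; the variable names and the helper `double` are hypothetical. The sigil reflects the type of the value being accessed, so one element of `@a` is written `$a[0]`.

```perl
use strict;
use warnings;

my @a = (10, 20, 30);        # @ names the whole array
my %h = (x => 1, y => 2);    # % names the whole hash

my $first = $a[0];           # $ because a single element is a scalar
my $y     = $h{y};           # likewise for a single hash value

my $id = 'same';
my $braced = "${id}";        # ${id} is the same variable as $id

sub double { return 2 * $_[0] }
my $p = double(4);           # call without the & sigil
my $q = &double(4);          # call with the & sigil; same result

print "$first $y $braced $p $q\n";
```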


Perl: Variable Declarations

Kind                    Example                                    Notes
Implicit                print $a + 1; $b = 5;                      Read if non-existent: undef
Local, lexical scope    my $c; my ($d,$e)=(3,4);
Global used locally     sub f{ our $g; print $g++ }                Hides locals; unlimited lifetime
Local, dynamic scope    sub h{print $i;} sub k{ local $i=5; h }    Can also localize single array/hash item

Perl: Interpolation

• Expansion of values embedded in string
• Single-quoted string literal 'abcde'
  – Only interpolate \' and \\
• Double-quoted string literal "abcde"
  – More escape sequences, e.g., \n
  – Variables only, starting with @ or $
  – Use curleys to delimit: "time ${hours}h"
• Trick to interpolate arbitrary expressions
  – "… @{[expr]} …" or "… @{[scalar expr]} …"
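The interpolation rules can be checked with a small sketch; the variables `$hours` and `@list` are illustrative only. The `@{[ ... ]}` trick works because it builds an anonymous array from the expression and immediately dereferences it inside the string.

```perl
use strict;
use warnings;

my $hours = 3;
my @list  = (1, 2, 3);

my $single = 'cost: $hours\n';                 # single quotes: nothing expanded
my $double = "time ${hours}h";                 # curlies delimit the variable name
my $array  = "items: @list";                   # array elements joined by a space
my $expr   = "sum: @{[ $hours + 1 ]}";         # arbitrary expression
my $count  = "count: @{[ scalar @list ]}";     # force scalar context

print "$_\n" for ($single, $double, $array, $expr, $count);
```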


Concepts: Static vs. Dynamic Scoping

Static scoping: bound in closest nesting scope in program text

    #!/usr/bin/perl -w
    $x = 's';
    sub g {
      my $x = 'd';
      return h()
    }
    sub h {
      return $x
    }
    print g(), "\n"; #s
    print $x, "\n"; #s

Dynamic scoping: bound in closest calling function at runtime

    #!/usr/bin/perl -w
    $x = 's';
    sub g {
      local $x = 'd';
      return h()
    }
    sub h {
      return $x
    }
    print g(), "\n"; #d
    print $x, "\n"; #s

Perl: Operators (highest precedence first)

Operators                        Arity  Assoc  Description
(…), "…", `…`, print, sort, …           L      Terms, function call, quoting, list operators (leftward)
->                               2      L      Dereference and member access
++, --                           1      N      Auto-increment, auto-decrement
**                               2      R      Exponentiation
!, ~, \, +, -                    1      R      Negation (!, ~, -), reference (\), no-op (+)
=~, !~                           2      L      Binding to regular expression pattern match
*, /, %, x                       2      L      Multiplicative (x is string repetition)
+, -, .                          2      L      Additive (. is string concatenation)
<<, >>                           2      L      Bitwise shift
eval, sqrt, -f, -e, …            1      N      Named unary operators, file test operators
<, >, <=, >=, lt, gt, le, ge     2      N      Relational (lt, gt, le, ge are for strings)
==, !=, <=>, eq, ne, cmp         2      N      Equality (eq, ne, cmp are for strings) (<=>, cmp yield -1/0/1)
&, |, ^                          2      L      Bitwise (not all same precedence)
&&                               2      L      Logical and (short-circuit)
||, //                           2      L      Logical or (||), defined-or (//) (short-circuit)
.., ...                          2      N      Range (in list context) or bistable (in scalar context)
?:                               3      R      Ternary conditional
=, +=, -=, *=, …                 2      R      Assignment; return l-value of target
,, =>                            2      L      List (in list context) or sequencing (in scalar context)
print, sort, …                          N      List operators (rightward)
not, and, or, xor                2      R      Logical (short-circuit; not all same precedence)
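A few rows of the operator table are easy to misread, so here is a hedged sketch that exercises them. The values are arbitrary, and `//` assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $rule    = '-' x 5;            # string repetition: '-----'
my $joined  = 'ab' . 'cd';        # string concatenation: 'abcd'
my $power   = 2 ** 3 ** 2;        # right-associative: 2 ** (3 ** 2) = 512

my $num_cmp = 10 <=> 9;           # numeric comparison: 1
my $str_cmp = '10' cmp '9';       # string comparison: -1, because '1' lt '9'

my $missing;                      # undef
my $dflt_a  = $missing // 'fallback';   # defined-or: 'fallback'
my $dflt_b  = 0 // 'fallback';          # 0 is defined, so 0
my $dflt_c  = 0 || 'fallback';          # 0 is false, so 'fallback'

my @by_string = sort (10, 9, 100);              # default sort compares strings
my @by_number = sort { $a <=> $b } (10, 9, 100); # numeric comparison

print "$rule $joined $power $num_cmp $str_cmp\n";
print "$dflt_a $dflt_b $dflt_c\n";
print "@by_string | @by_number\n";
```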


Perl: Operators: List vs. Named Unary

• Different precedence rules
• List operator (most user-defined functions)
  – High leftward, low rightward precedence
  – @a = (1,5,sort 9,2); print @a; #1529
• Named unary operator
  – Lower than arithmetic, higher than comparison
  – @a = (1,5,sqrt 9,2); print @a; #1532
• Function call with parenthesized arguments
  – Highest precedence
  – @a = (1,5,sort(3+6),2); print @a; #1592

Perl: Input and Output

• Output
  – print "Hello, world!";
  – print STDERR "boo!";
  – printf "sqrt(%.2f)=%.2f\n", 2, sqrt(2);
• Input
  – $lineFromStdIn = <>;
  – open MYFILE, '…';
  – @allLines = <MYFILE>;
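The sketch below replays the three precedence examples and then reads lines from a file handle. To stay self-contained it opens a handle on an in-memory string instead of a real file; the data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# List operator: sort swallows everything to its right.
my @a = (1, 5, sort 9, 2);         # (1, 5, 2, 9)
# Named unary operator: sqrt takes only the next term.
my @b = (1, 5, sqrt 9, 2);         # (1, 5, 3, 2)
# Parenthesized call: the argument list ends at the closing parenthesis.
my @c = (1, 5, sort(3 + 6), 2);    # (1, 5, 9, 2)
print join('', @a), ' ', join('', @b), ' ', join('', @c), "\n";

# Input: <$fh> reads one line in scalar context, all lines in list context.
my $data = "first\nsecond\nthird\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $one  = <$fh>;                  # one line
my @rest = <$fh>;                  # all remaining lines
close $fh;
chomp($one, @rest);

printf "%s, then %d more lines\n", $one, scalar @rest;
```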


Perl: Arrays

• Resizable
• Literals: list @a=(1,3,5), range @b=2..4
• Indexing: e.g. $a[1]
  – Zero-based; negative index counts from end
  – $#a returns last index of @a, in this case, 2
  – Write to non-existent index auto-vivifies
• Free: undef @a, truncate: $#a=1
• Array slice: using multiple indices, e.g., @a[0,2] or @a[1..2]
• Using array in scalar context: returns length
  – scalar(@a); # 3 = size

Perl: Hashes

• Associative arrays, using hash tables
• Literals: none; use list literals instead
  – %h = (lb => 1, oz => 16, g => 453);
• Indexing: string, bare-word quotes optional
  – $h{"oz"}, $h{oz}
  – Write to non-existent index auto-vivifies
• Free: undef %h; delete: delete $h{oz}; (undef $h{oz} only clears the value and keeps the key)
• Hash slice: returns list, e.g., @a = @h{'lb', 'g'}; returns (1, 453)
• Using hash in scalar context: returns string
  – scalar(%h); # "2/8"=buckets/capacity (since Perl 5.26: the number of keys)
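The array and hash operations above, as a runnable sketch using the same sample values. Note that `delete` removes a key outright, whereas `undef` on an element only clears its value.

```perl
use strict;
use warnings;

my @a = (1, 3, 5);
my @b = (2 .. 4);

my $last       = $a[-1];       # negative index counts from the end: 5
my $last_index = $#a;          # 2
my @slice      = @a[0, 2];     # (1, 5)
my $size       = scalar(@a);   # 3

$a[5] = 11;                    # writing past the end grows the array
my $grown = scalar(@a);        # 6
$#a = 1;                       # truncate to (1, 3)

my %h = (lb => 1, oz => 16, g => 453);
my $oz   = $h{oz};             # bare-word key: 16
my @vals = @h{'lb', 'g'};      # hash slice: (1, 453)

delete $h{oz};                 # removes the key itself
my $has_oz = exists $h{oz} ? 1 : 0;
my @keys   = sort keys %h;     # ('g', 'lb')

print "@b | $last $last_index | @slice | $size $grown | @a\n";
print "$oz | @vals | $has_oz | @keys\n";
```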

• © Martin Hirzel • 3 • CSCI-GA.3033.003 • 6/7/2012 Textual Data Processing (Perl)

Perl: Array and Hash References

• References are scalars, only scalars can be stored in other arrays/hashes
• Literals: “anonymous array/hash composers”
  – $ar=[1,3,5];, $hr={x=>2,y=>0};
• Indexing: e.g., $ar->[1]; $hr->{y};
  – Dereference operator -> optional in nested chain, e.g., $rarh->[123]{abc}
  – Write to non-existent index auto-vivifies all indices on the way
• Using reference in string context: returns string describing the type and address
  – print $ar; # "ARRAY(0x18bef54)"

Perl: Data Structures

    $recipe = [
      {
        qty => 2,
        unit => "cup",
        ing => "flour",
      },
      {
        qty => 4,
        unit => "oz",
        ing => "sugar",
      },
      {
        qty => 3,
        unit => "tbsp",
        ing => "milk",
      },
    ];

[Diagram: $recipe points to an array of three hashes: (qty = 2, unit = "cup", ing = "flour"), (qty = 4, unit = "oz", ing = "sugar"), (qty = 3, unit = "tbsp", ing = "milk")]
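A sketch of how the recipe structure can be traversed through its references. The `%db` hash and its keys are hypothetical, added only to show auto-vivification of intermediate levels.

```perl
use strict;
use warnings;

my $recipe = [
    { qty => 2, unit => 'cup',  ing => 'flour' },
    { qty => 4, unit => 'oz',   ing => 'sugar' },
    { qty => 3, unit => 'tbsp', ing => 'milk'  },
];

# The arrow is needed after the reference, but optional between subscripts.
my $second = $recipe->[1]{ing};            # 'sugar'

# @$recipe dereferences the array reference to loop over its elements.
my @lines;
for my $item (@$recipe) {
    push @lines, "$item->{qty} $item->{unit} $item->{ing}";
}
print "$_\n" for @lines;

# Auto-vivification: the array and the hash in between spring into existence.
my %db;
$db{pantry}[2]{ing} = 'salt';
my $pantry_size = scalar @{ $db{pantry} }; # 3 (indices 0 and 1 are undef)

my $kind = ref $recipe;                    # 'ARRAY'
print "$second $pantry_size $kind\n";
```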

Perl: Control Statements

Proper statements (man perlsyn):
    if(expr){…} [elsif(expr){…}] … [else{…}]
    [label:] while(expr){…} [continue{…}]
    [label:] until(expr){…} [continue{…}]
    [label:] for(expr; expr; expr) {…}
    [label:] foreach [var] (list) {…} [continue{…}]
    [label:] for [var] (list) {…} [continue{…}]
    [label:] {…} [continue{…}]
Modifiers:
    stmt (if | unless | while | until | foreach) expr
Operators (man perlfunc):
    do | eval | sub
    continue | goto | last | next | redo | return
    die | dump | exit

Perl: What is Truth

• Strings "" and "0" are false, everything else is true
• File read <…> returns undef at end-of-file, which converts to ""
• Number 0 converted to string "0"
• Array in scalar context 0 iff empty
• Hash in scalar context "0" iff empty
• Reference "" iff undef
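The truth rules can be probed with a small helper; `truth` and the labels in `%results` are hypothetical names for this sketch. The strings "0.0" and "00" are included because they are numerically zero but still true, since only the exact string "0" is false.

```perl
use strict;
use warnings;

# Report how Perl evaluates a value in boolean context.
sub truth { return $_[0] ? 'true' : 'false' }

my @empty = ();
my %results = (
    'empty string'  => truth(''),
    'string 0'      => truth('0'),
    'number 0'      => truth(0),
    'string 0.0'    => truth('0.0'),
    'string 00'     => truth('00'),
    'undef'         => truth(undef),
    'empty array'   => truth(scalar @empty),
    'any reference' => truth([]),
);

# Statement modifier form: stmt foreach expr
print "$_: $results{$_}\n" foreach sort keys %results;
```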


Concepts: Blocks as Function Arguments

• do{…}, eval{…}, sub{…} look like control statements, but aren’t
  – They are functions with block argument
  – Other examples: grep, map, sort
  – You can write such functions yourself
• This is a lightweight form of callback
• Smalltalk language does same, only more
  – Smalltalk has no control statements at all
  – Method on boolean with block argument
  – (x …

Perl: Writing Subroutines

• Declaration
  – sub [id] [proto] [attrs] [{…}]
  – No id: anonymous; no proto: list operator; no block: declaration without definition
  – To return a value: return expr;
  – If no explicit return: value of last expression
• Arguments
  – Array @_
  – Call-by-reference ($_[i] is alias of actual i)
  – Common idiom: copy arguments into named …
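A sketch of a user-written function that takes a block argument, in the style of grep and map. The names `apply_to_each`, `increment`, and `add` are hypothetical; the `(&@)` prototype is what lets a caller pass a bare block as the first argument.

```perl
use strict;
use warnings;

# Prototype (&@): first argument is a block (code reference), then a list.
sub apply_to_each(&@) {
    my ($code, @items) = @_;      # copy @_ into named variables
    my @out;
    foreach (@items) {            # no loop variable: each item is in $_
        push @out, $code->();     # the block reads $_
    }
    return @out;
}

my @squares = apply_to_each { $_ * $_ } 1, 2, 3;

# Call-by-reference: $_[0] is an alias of the caller's variable.
sub increment { $_[0]++ }
my $n = 41;
increment($n);

# No explicit return: the value of the last expression is returned.
sub add { my ($x, $y) = @_; $x + $y }
my $sum = add(2, 3);

print "@squares | $n | $sum\n";
```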


Perl: Default Variable $_

• Only angle operator in while condition: read line from file handle into $_
• No variable in foreach: use $_ as iteration variable
• No binding in pattern match: use $_
• No explicit operand for certain functions: use $_, e.g., print, chop, chomp, split, …
• Don’t confuse $_ with @_ (arguments)

Outline

• Perl Basics
• Regular Expressions
• At 7:30pm: Quiz 1
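Each use of the default variable listed above appears once in the sketch below. The input is an in-memory string so the example runs without a file; the data and names are made up.

```perl
use strict;
use warnings;

my $input = "alpha 1\nbeta 22\ngamma\n";
open(my $fh, '<', \$input) or die "cannot open in-memory file: $!";

my @hits;
while (<$fh>) {              # only the angle operator: line goes into $_
    chomp;                   # no operand: chomps $_
    next unless /\d/;        # no binding operator: matches against $_
    my @fields = split;      # no operand: splits $_ on whitespace
    push @hits, $fields[0];
}
close $fh;

print "$_\n" foreach @hits;  # no loop variable: items are aliased to $_
```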


Perl: Example of m// Matching

Input:
    -rw-r--r-- 1 hirzel ibm 9404 May 30 10:37 hw01.html
    -rw-r--r-- 1 hirzel ibm 59469 May 30 10:37 hw01.pdf
    -rw-r--r-- 1 hirzel ibm 9142 Jun 1 19:21 index.html

Script:
    #!/usr/bin/perl -w
    while($line = <>) {
      if ($line =~ m/(\d+)\s\w{3}[\d\s:]+(\S+)$/) {
        # print rather than printf, so that a % in a file name is not treated as a format
        print "$2 contains $1 bytes\n";
      }
    }

Output:
    hw01.html contains 9404 bytes
    hw01.pdf contains 59469 bytes
    index.html contains 9142 bytes

Perl: Explanation of m// Matching

    if ($line =~ m/(\d+)\s\w{3}[\d\s:]+(\S+)$/) {
      print "$2 contains $1 bytes\n";
    }

• $line: string to search in
• =~: binding operator
• m//: match operator, true if matched
• (\d+)\s\w{3}[\d\s:]+(\S+)$: regular expression
• (): grouping+capturing
• $1, $2: captured groups


Perl: Inside a Regular Expression

Kind         Construct            Syntax      Example pattern   Matches
Essentials   Character            Itself      b                 b
             Concatenation        e1e2        bc                bc
             Alternative (or)     e1|e2       a|bc              a, bc
             Repetition (≥0)      e*          a*                (empty), a, aa, aaa, …
             Grouping             (e)         (a|b)c            ac, bc
Quantifier   Optional             e?          (a|b)?            (empty), a, b
             Repetition (≥1)      e+          a+                a, aa, aaa, …
             Repetition           e{n}        a{3}              aaa
Char class   Custom class         […]         [a-c]             a, b, c
             Wildcard character   .           .                 a, 3, :, …
             Shortcut class       \…          \d                0,1,2,…,9
Assertion    Zero-width anchor    $,^,\b,…    \b                (word boundary)

Perl: Explanation of s/// substitution

    #!/usr/bin/perl -w
    while($line = <>) {
      if ($line =~ s/.+\s(\d+)\s\w{3}[\d\s:]+(\S+)$/$2 contains $1 bytes/) {
        print $line;
      }
    }

• $line: string to search and modify
• =~: binding operator
• s///: substitution operator, true if matched
• .+\s(\d+)\s\w{3}[\d\s:]+(\S+)$: regular expression
• $2 contains $1 bytes: replacement text
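The constructs in the table can be checked mechanically. The sketch below is one way to do it, with made-up test strings; each pattern is anchored with \A and \z so that it has to match the whole string, which is how the "Matches" column is meant to be read.

```perl
use strict;
use warnings;

# Each entry: pattern from the table, a candidate string, expected result.
my @cases = (
    [ 'a|bc',   'bc',  1 ],
    [ 'a*',     '',    1 ],
    [ '(a|b)c', 'ac',  1 ],
    [ '(a|b)?', 'ab',  0 ],   # optional group matches at most one letter
    [ 'a+',     '',    0 ],   # + needs at least one repetition
    [ 'a{3}',   'aa',  0 ],   # exactly three repetitions required
    [ '[a-c]',  'd',   0 ],
    [ '.',      ':',   1 ],
    [ '\d',     '7',   1 ],
);

my @failures;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $got = ($string =~ /\A(?:$pattern)\z/) ? 1 : 0;
    push @failures, $pattern if $got != $expected;
    print "/$pattern/ on '$string': ", ($got ? 'match' : 'no match'), "\n";
}
print "unexpected results: ", scalar(@failures), "\n";
```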


Perl: Outside a Regular Expression

• Match: m// or simply //; substitute: s///
• Binding: =~ positive, !~ negative
  – Or when missing: bound to $_
• Modifiers: e.g. s///g global substitute
• Return value: true if success, invert for !~
  – Or when assigned to list: groups, e.g.,
    my ($a,$b) = $s =~ /(\w+) (\d+)/;
• Output: $& entire match, $1, … groups
• Find out more: “man perlre”

Perl: Library Functions

• String: chomp, chop, length, substr, uc
• RegExp: m//, s///, split
• Math: abs, cos, exp, sin, sqrt
• Array: pop, push, shift, unshift
• Control: die, do, eval, last, next, sub
• Scoping: local, my, our, package, use
• Files: chdir, chmod, fork, printf, qx//
• And many more (“man perlfunc”)
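A sketch of the match and substitute operators as seen from the outside: return values, the /g modifier, negated binding, and the output variables. The sample strings are arbitrary.

```perl
use strict;
use warnings;

my $s = 'widgets 42';

# In list context a match returns the captured groups.
my ($word, $num) = $s =~ /(\w+) (\d+)/;

# Without /g only the first occurrence is replaced; with /g all of them.
my $text = 'a-b-c';
(my $first_only = $text) =~ s/-/+/;
(my $all        = $text) =~ s/-/+/g;

# !~ inverts the result of the match.
my $no_digits = ($text !~ /\d/) ? 1 : 0;

# After a successful match, $& holds the entire matched text.
my $whole = ($s =~ /\d+/) ? $& : '';

print "$word $num | $first_only $all | $no_digits | $whole\n";
```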


Perl: Documentation

• Included: manpages perl, perldata, perlsyn, perlop, perlre, perlfunc, …
• Book: Programming Perl, 3rd edition. Larry Wall, Tom Christiansen, and Jon Orwant. O’Reilly, 2000 [safari].
• Online:
  – CPAN = Comprehensive Perl Archive Network, http://www.cpan.org
  – http://www.perl.com

Administrative: Last Slide

• Example solutions available
• “Code in ppt file” doesn’t mean “on slide”
• Today’s lecture
  – Associative arrays
  – Regular expressions
  – Basics of Perl
• Next lecture
  – Context
  – Typeglobs
  – Object-oriented Perl


Outline

• Perl Basics
• Regular Expressions
• At 7:30pm: Quiz 1

Quiz 1

• Seat yourself with 2 empty spaces
• Closed-book (don’t consult books, notes, phones, laptops, …)
• Write name on each answer sheet
• Good luck!

• Start time 7:41pm, end time 8:16pm
