ECE750-TXB Lecture 1: Asymptotics

Todd L. Veldhuizen ([email protected])
Electrical & Computer Engineering, University of Waterloo, Canada

February 26, 2007

Motivation

I We want to choose the best algorithm or data structure for the job.
I We need characterizations of resource use, e.g., time, space; for circuits: area, depth.

I Many, many approaches:

I Worst Case Execution Time (WCET): for hard real-time applications

I Exact measurements for a specific problem size, e.g., number of gates in a 64-bit addition circuit.
I Performance models, e.g., R∞, n_{1/2} for latency-throughput, HINT curves for linear algebra (characterize performance through different cache regimes), etc.

I ...

Asymptotic analysis


I We will focus on asymptotic analysis: a good “first approximation” of performance that describes behaviour on big problems.

I Reasonably independent of:

I Machine details (e.g., 2 cycles for add+mult vs. 1 cycle)
I Clock speed, programming language, compiler, etc.

Asymptotics: Brief history

I Basic ideas originated in Paul du Bois-Reymond’s Infinitärcalcül (‘calculus of infinities’), developed in the 1870s.
I G. H. Hardy greatly expanded on Paul du Bois-Reymond’s ideas in his monograph Orders of Infinity (1910) [3].

I The “big-O” notation was first used by Bachmann (1894), and popularized by Landau (hence sometimes called “Landau notation.”)

I Adopted by computer scientists [4] to characterize resource consumption, independent of small machine differences, languages, compilers, etc.

Basic asymptotic notations


Asymptotic ≡ behaviour as n → ∞, where for our purposes n is the “problem size.”
Three basic notations:

I f ∼ g (“f and g are asymptotically equivalent”)

I f ≼ g (“f is asymptotically dominated by g”)

I f ≍ g (“f and g are asymptotically bounded by one another”)

Basic asymptotic notations


f ∼ g means lim_{n→∞} f(n)/g(n) = 1

Example: 3x^2 + 2x + 1 ∼ 3x^2. ∼ is an equivalence relation:

I Transitive: (x ∼ y) ∧ (y ∼ z) ⇒ (x ∼ z)

I Reflexive: x ∼ x

I Symmetric: (x ∼ y) ⇒ (y ∼ x).
Basic idea: we only care about the “leading term,” disregarding more slowly-growing terms.
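A quick numerical illustration of this example (a Python sketch; the names f and g are just for illustration):

def f(x):
    return 3 * x**2 + 2 * x + 1

def g(x):
    return 3 * x**2

# The ratio f(x)/g(x) tends to 1, which is what 3x^2 + 2x + 1 ~ 3x^2 asserts.
for x in [10, 100, 1000, 10000]:
    print(x, f(x) / g(x))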

Basic asymptotic notations

f ≼ g means limsup_{n→∞} f(n)/g(n) < ∞

i.e., f(n)/g(n) is eventually bounded by a finite value.

I Basic idea: f grows more slowly than g, or just as quickly as g.

I ≼ is a preorder (or quasiorder):

I Transitive: (f ≼ g) ∧ (g ≼ h) ⇒ (f ≼ h)
I Reflexive: f ≼ f

I ≼ fails to be a partial order because it is not antisymmetric: there are functions f, g where f ≼ g and g ≼ f but f ≠ g.

I Variant: g ≽ f means f ≼ g.

Basic asymptotic notations


Write f ≍ g when there are positive constants c1, c2 such that

c1 ≤ f(n)/g(n) ≤ c2

for sufficiently large n.

I Examples:

I n ≍ 2n
I n ≍ (2 + sin πn)·n

I ≍ is an equivalence relation.

Strict forms


Write f ≺ g when f ≼ g but not f ≍ g.
I Basic idea: f grows strictly less quickly than g
I Equivalent: f ≺ g exactly when lim_{n→∞} f(n)/g(n) = 0.
I Examples:
I x^2 ≺ x^3
I log x ≺ x

I Variant: f ≻ g means g ≺ f

Orders of growth

We can use ≺ as a “ruler” by which to judge the growth of functions. Some common “tick marks” on this ruler are:

log log n ≺ log n ≺ log^2 n ≺ n ≺ n^2 ≺ n^k ≺ ··· ≺ 2^n ≺ n! ≺ n^n ≺ 2^(2^n)

We can always find in ≺ a dense total order without endpoints, i.e.:

I There is no slowest-growing function;

I There is no fastest-growing function;

I If f ≺ h we can always find a g such that f ≺ g ≺ h.
(The canonical example of a dense total order without endpoints is Q, the rationals.)
I This fact allows us to sketch graphs in which points on the axes are asymptotes.

Big-O Notation

“Big-O” is a convenient family of notations for asymptotics:

O(g) ≡ {f : f ≼ g}

i.e., O(g) is the set of functions f such that f ≼ g.
I O(n^2) contains n^2, 7n^2, n, log n, n^{3/2}, 5, ...

I Note that f ∈ O(g) means exactly f ≼ g.

I A standard abuse of notation is to treat a big-O expression as if it were a term:

x^2 + 2x^{1/2} + 1 = x^2 + O(x^{1/2})

(the O(x^{1/2}) term is ≺ x^2). The above equation should be read as “there exists a function f ∈ O(x^{1/2}) such that x^2 + 2x^{1/2} + 1 = x^2 + f(x).”
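A small numerical illustration (a Python sketch): the part hidden inside the O(x^{1/2}) term really is bounded by a constant multiple of x^{1/2}.

# remainder = (x^2 + 2*sqrt(x) + 1) - x^2; its ratio to sqrt(x) stays bounded (it tends to 2).
for x in [10.0, 1e3, 1e5, 1e7]:
    remainder = (x**2 + 2 * x**0.5 + 1) - x**2
    print(x, remainder / x**0.5)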

Big-O for algorithm analysis

I Big-O notation is an excellent tool for expressing machine/compiler/language-independent complexity properties.

I On one machine an algorithm might take ≈ 5.73 n log n seconds, on another it might take ≈ 9.42 n log n + 3.2n seconds.

I We can wave these differences aside by saying the algorithm runs in O(n log n) seconds.

I O(f (n)) means something that behaves asymptotically like f (n):

I Disregarding any initial transient behaviour;
I Disregarding any multiplicative constants c · f(n);
I Disregarding any additive terms that grow less quickly than f(n).

Basic properties of big-O notation


Given a choice between a sorting algorithm that runs in O(n^2) time and one that runs in O(n log n) time, which should we choose?
1. Gut instinct: the O(n log n) one, of course!
2. But: note that the class of functions O(n^2) also contains n log n. Just because we say an algorithm is O(n^2) does not mean it takes ≍ n^2 time!
3. It could be that the O(n^2) algorithm is faster than the O(n log n) one.

Additional notations


To distinguish between “at most this fast,” “at least this fast,” etc., there are additional big-O-like notations:

f ∈ O(g) ≡ f ≼ g   upper bound
f ∈ o(g) ≡ f ≺ g   strict upper bound
f ∈ Θ(g) ≡ f ≍ g   tight bound
f ∈ Ω(g) ≡ f ≽ g   lower bound
f ∈ ω(g) ≡ f ≻ g   strict lower bound

Tricks for a bad remembering day


I Lower case means strict:
I o(n) is the strict version of O(n)
I ω(n) is the strict version of Ω(n)

I ω, Ω (omega) is the last letter of the Greek alphabet: if f ∈ ω(g) then f comes after g in the asymptotic ordering.

I f ∈ Θ(g): the line through the middle of the theta — asymptotes converge

Notation: o(·)


f ∈ o(g) means f ≺ g

I o(·) expresses a strict upper bound.

I If f(n) is o(g(n)), then f grows strictly slower than g.
I Example: sum_{k=0}^{n} 2^{-k} = 2 − 2^{-n} = 2 + o(1)
I o(1) indicates the class of functions g for which lim_{n→∞} g(n)/1 = 0, i.e., lim_{n→∞} g(n) = 0.
I 2 + o(1) means “2 plus something that vanishes as n → ∞”

I If f is o(g), it is also O(g).
I n! = o(n^n).

Notation: ω(·)


f ∈ ω(g) means f ≻ g

I ω(·) expresses a strict lower bound.

I If f (n) is ω(g(n)), then f grows strictly faster than g.

I f ∈ ω(g) is equivalent to g ∈ o(f).
I Example: Harmonic series
I h_n = sum_{k=1}^{n} 1/k ∼ ln n + γ + O(n^{-1})
I h_n ∈ ω(1) (it is unbounded)
I h_n ∈ ω(ln ln n)
I n! = ω(2^n) (grows faster than 2^n)

Notation: Ω(·)


f ∈ Ω(g) means f ≽ g

I Ω(·) expresses a lower bound, not necessarily strict

I If f (n) is Ω(g(n)), then f grows at least as fast as g.

I f ∈ Ω(g) is equivalent to g ∈ O(f)
I Example: Matrix multiplication requires Ω(n^2) time. (At least enough time to look at each of the n^2 entries in the matrices.)

Notation: Θ(·)

f ∈ Θ(g) means f ≍ g

I Θ(·) expresses a tight asymptotic bound

I If f (n) is Θ(g(n)), then f (n)/g(n) is eventually contained in a finite positive interval [c1, c2].

I Θ(·) bounds are very precise, but often hard to obtain.

I Example: QuickSort runs in time Θ(n log n) on average. (Tight! Not much faster or slower!)

I Example: Stirling’s approximation ln n! ∼ n ln n − n + O(ln n) implies that ln n! is Θ(n ln n)

I Don’t make the mistake of thinking that f ∈ Θ(g) means lim_{n→∞} f(n)/g(n) = k for some constant k.

Algebraic manipulations of big-O

I Manipulating big-O terms requires some thought: always keep in mind what the symbols mean!

I An additive O(f(n)) term swallows any terms that are ≼ f(n):

n^2 + n^{1/2} + O(n) + 3 = n^2 + O(n)

The n^{1/2} and 3 on the l.h.s. are meaningless in the presence of an O(n) term.

I O(f(n)) − O(f(n)) = O(f(n)), not 0!
I O(f(n)) · O(g(n)) = O(f(n) g(n)).
I Example: What is ln n + γ + O(n^{-1}) times n + O(n^{1/2})?

[ln n + γ + O(n^{-1})] · [n + O(n^{1/2})] = n ln n + γ n + O(n^{1/2} ln n)

The terms γ O(n^{1/2}), O(n^{-1/2}), O(1), etc. get swallowed by O(n^{1/2} ln n).

Sharpness of estimates

Example: for a constant c,

ln(n + c) = ln[n(1 + c/n)] = ln n + ln(1 + c/n)
          = ln n + c/n − c^2/(2n^2) + ···   (Maclaurin series)
          = ln n + Θ(1/n)

It is also correct to write

ln(n + c) = ln n + O(n^{-1})
ln(n + c) = ln n + o(1)

since Θ(n^{-1}) ⊆ O(1/n) ⊆ o(1). However, the Θ(1/n) error term is sharper: a better estimate of the error.
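A quick numerical check of the sharper error term (a Python sketch with c = 5): the error ln(n + c) − ln n behaves like c/n.

import math

c = 5
for n in [10, 100, 1000, 10000]:
    error = math.log(n + c) - math.log(n)   # the Theta(1/n) term
    print(n, error, c / n)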

Sharpness of estimates & the Riemann Hypothesis

Example: let π(n) be the number of prime numbers ≤ n. The Prime Number Theorem states that

π(n) ∼ Li(n)    (1)

where Li(n) = ∫_{x=2}^{n} dx/ln x is the logarithmic integral, and Li(n) ∼ n/ln n. Note that (1) is equivalent to:

π(n) = Li(n) + o(Li(n))

It is known that the error term can be improved, for example to

π(n) = Li(n) + O( (n/ln n) · e^{−a√(ln n)} )

Sharpness of estimates & the Riemann Hypothesis


The famous Riemann hypothesis is the conjecture that a sharper error estimate is true:

π(n) = Li(n) + O(n^{1/2} ln n)

This is one of the Clay Institute millennium problems, with a $1,000,000 reward for a positive proof. Sharp estimates matter!

Sharpness of estimates

To maintain sharpness of asymptotic estimates during analysis, some caution is required.
E.g., if f(n) = 2^n + O(n), what is log f(n)?
Bad answer: log f(n) = n + O(n).
More careful answer:

log f(n) = log(2^n + O(n)) = log(2^n (1 + O(n 2^{-n}))) = log(2^n) + log(1 + O(n 2^{-n}))

Since log(1 + δ(n)) ∼ O(δ(n)) if δ ∈ o(1),

log f(n) = n + O(n 2^{-n})

i.e., log f(n) is equal to n plus some value converging exponentially fast to 0.

Sharpness of estimates

log f(n) = n + O(n 2^{-n})

is a reasonably sharp estimate (but, what happens if we take 2^{log f(n)} with this estimate?)
If we don’t care about the rate of convergence we can write

log f(n) = n + o(1)

where o(1) represents some function converging to zero. This is less sharp since we have lost the rate of convergence. Even less sharp is

log f(n) ∼ n

which loses the idea that log f(n) − n → 0, and doesn’t rule out things like log f(n) = n + n^{3/4}.

Asymptotic expansions

An asymptotic expansion of a function describes how that function behaves for large values. Often it is used when an explicit description of the function is too messy or hard to derive.
E.g., if I choose a string of n bits uniformly at random (i.e., each of the 2^n possible strings has probability 2^{-n}), what is the probability of getting ≥ (3/4)n 1’s?
It is easy to write the answer: there are C(n, k) (n choose k) ways of arranging k 1’s, so the probability of getting ≥ (3/4)n 1’s is:

P(n) = 2^{-n} · sum_{k = ⌈(3/4)n⌉}^{n} C(n, k)

This equation is both exact and wholly uninformative.

Asymptotic expansions

Can we do better? Yes!
The number of 1’s in a random bit string has a binomial distribution and is well-approximated by the normal distribution as n → ∞:

2^{-n} · sum_{k = (1/2)n + α√n}^{n} C(n, k) ∼ (1/√(2π)) ∫_{x=α}^{∞} e^{-x²/2} dx = 1 − F(α)

where F(x) = (1/2) [1 + erf(x/√2)] is the cumulative normal distribution. Maple’s asympt command yields the asymptotic expansion:

F(x) ∼ 1 − O( 1/(x e^{x²/2}) )

Asymptotic expansions example

We want to estimate the probability of ≥ (3/4)n 1’s:

(1/2)n + α√n = (3/4)n

gives α = √n/4. Therefore the probability is

P(n) ∼ 1 − F(√n/4)
     ∼ 1 − [1 − O( 1/(√n e^{n/32}) )]
     = O( 1/(√n e^{n/32}) )

So, the probability of having more than (3/4)n 1’s converges to 0 exponentially fast.
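A small Python sketch evaluating the exact sum P(n) confirms the qualitative conclusion (very fast decay); it makes no claim about matching the constant hidden inside the O(·) estimate.

import math

def tail_prob(n):
    # Exact P(n): probability of >= 3n/4 ones among n fair random bits.
    k0 = math.ceil(3 * n / 4)
    return sum(math.comb(n, k) for k in range(k0, n + 1)) / 2**n

for n in [8, 16, 32, 64, 128]:
    print(n, tail_prob(n))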

Asymptotic Expansions

I When taking an asymptotic expansion, one writes

ln n! ∼ n ln n − n + O(1)

rather than

ln n! = n ln n − n + O(1)

Writing ∼ is a clue to the reader that an asymptotic expansion is being taken, rather than just carrying an error term around.

I Asymptotic expansions are very important in average case analysis, where we are interested in characterizing how an algorithm performs for most inputs.

I To prove an algorithm runs in O(f(n)) on average, one technique is to obtain an asymptotic estimate of the probability of running in time greater than f(n), and show it converges to zero very quickly.

Asymptotic Expansions for Average-Case Analysis

I The time required to add two n-bit integers by a no-carry adder is proportional to the longest carry sequence.
I It can be shown that the probability of having a carry sequence of length ≥ t(n) satisfies

Pr(carry sequence ≥ t(n)) ≤ 2^{−t(n) + log n + O(1)}

I If t(n) ≻ log n, the probability converges to 0. We can conclude that the average running time is O(log n).

I In fact we can make a stronger statement:

Pr(carry sequence ≥ log n + ω(1)) → 0

Translation: “The probability of having a carry sequence longer than log n + δ(n), where δ(n) is any unbounded function, converges to zero.”

The Taylor series method of asymptotic expansion

I This is a very simple method for asymptotic expansion that works for simple cases; it is one technique Maple’s asympt function uses.
I Recall that the Taylor series of a C^∞ function about x = 0 is given by:

f(x) = f(0) + x f′(0) + (x^2/2!) f″(0) + (x^3/3!) f‴(0) + ···

I To obtain an asymptotic expansion of some function F(n) as n → ∞:
1. Substitute n = x^{-1} into F(n). (Then n → ∞ as x → 0.)
2. Take a Taylor series about x = 0.
3. Substitute x = n^{-1}.
4. Use the dominating term(s) as the expansion, and the next term as the error term.

Taylor series method of asymptotic expansion: example

Example expansion: F(n) = e^{1 + 1/n}.
Obviously lim_{n→∞} F(n) = e, so we expect something of the form F(n) ∼ e + o(1).
1. Substitute n = x^{-1} into F(n): obtain F(x^{-1}) = e^{1+x}.
2. Taylor series about x = 0: e^{1+x} = e + xe + (x^2/2)e + (x^3/6)e + ···
3. Substitute x = n^{-1}: = e + e/n + e/(2n^2) + e/(6n^3) + ···
4. Since e/n ≻ e/(2n^2) ≻ e/(6n^3) ≻ ···,
   F(n) ∼ e + Θ(1/n)
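The same recipe can be run mechanically in Python with sympy (assumed available); this sketch reproduces the expansion of F(n) = e^{1 + 1/n}:

from sympy import symbols, exp, series

x, n = symbols('x n', positive=True)

# Steps 1-2: with n = 1/x, expand e^(1+x) about x = 0.
expansion = series(exp(1 + x), x, 0, 3)     # E + E*x + E*x**2/2 + O(x**3)
print(expansion)

# Step 3: substitute x = 1/n to read off F(n) = e + e/n + e/(2 n^2) + ...
print(expansion.removeO().subs(x, 1 / n))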

Asymptotics of algorithms

Asymptotics is a key tool for algorithms and data structures:
I Analyze algorithms/data structures to obtain sharp estimates of asymptotic resource consumption (e.g., time, space)

I Possibly use asymptotic expansions in the analysis to estimate e.g. probabilities

I Use these resource estimates to

I Decide which algorithm/data structure is “best” according to design criteria

I Reason about the performance of compositions (combinations) of algorithms and data structures.

References on asymptotics


I Course text: [1], asymptotic notations
I Concrete Mathematics, Ronald L. Graham, Donald E. Knuth and Oren Patashnik, Ch. 9 (Asymptotics) [2]

I Advanced:

I Shackell, Symbolic Asymptotics [6]
I Hardy, Orders of Infinity [3]
I Lightstone & Robinson, Nonarchimedean fields and asymptotic expansions [5]

Bibliography I


[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.
[2] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.

Bibliography II

[3] G. H. Hardy.

Orders of Infinity: The ‘Infinitärcalcül’ of Paul du Bois-Reymond. Hafner Publishing Co., New York, 1971. Reprint of the 1910 edition, Cambridge Tracts in Mathematics and Mathematical Physics, No. 12.
[4] Donald E. Knuth. Big omicron and big omega and big theta. SIGACT News, 8(2):18–24, 1976.

[5] A. H. Lightstone and Abraham Robinson. Nonarchimedean Fields and Asymptotic Expansions. North-Holland Publishing Co., Amsterdam, 1975. North-Holland Mathematical Library, Vol. 13.

Bibliography III


[6] John R. Shackell. Symbolic Asymptotics, volume 12 of Algorithms and Computation in Mathematics. Springer-Verlag, Berlin, 2004.

ECE750-TXB Lecture 2: Resources and Complexity Classes

Todd L. Veldhuizen ([email protected])
Electrical & Computer Engineering, University of Waterloo, Canada

Outline: Resource Consumption; Complexity classes; Bibliography

January 16, 2007

Resource Consumption

To decide which algorithm or data structure to use, we are interested in their resource consumption. Depending on the problem context, we might be concerned with:
I Time and space consumption
I For logic circuits:
I Number of gates
I Depth
I Area
I Heat production

I For parallel/distributed computing:

I Number of processors I Amount of communication required I Parallel running time

I For randomized algorithms:

I Number of random bits used
I Error probability

Machine models

I The performance of an algorithm must always be analyzed with reference to some machine model that defines:
I The basic operations supported (e.g., random-access memory; arithmetic; obtaining a random bit; etc.)
I The resource cost of each operation.
I Some common machine models:

I Turing machine (TM): very primitive, tape-based, used for theoretical arguments only;

I Nondeterministic Turing machine: TM that can effectively fork its execution at each step, so that after t steps it can behave as if it were a superfast parallel machine with e.g. 2t processors;

I RAM (Random Access Machine) is a model that corresponds more-or-less to an everyday single-CPU desktop machine, but with infinite memory;

I PRAM and LogP [2, 3] are popular models for parallel computing.

Machine models

I The performance of an algorithm can change drastically when you change machine models. E.g., many problems believed to take exponential time (assuming P ≠ NP) on a RAM can be solved in polynomial time on a nondeterministic TM.

I Often there are generic results that let you translate resource bounds on one machine model to another:

I An algorithm taking time T (n) and space S(n) on a Turing machine can be simulated in O(T (n) log log S(n)) time by a RAM;

I An algorithm taking time T(n) and space S(n) on a RAM can be simulated in O(T^3(n) (S(n) + T(n))^2) time by a Turing machine.

I Unless otherwise stated, people are usually referring to a RAM or similar machine model.

Machine models

I When you are analyzing an algorithm, know your machine model.

I There are embarrassing papers in the literature in which nonspecialists have “proven” outlandish complexity results by making basic mistakes

I e.g. Assuming that arbitrary precision real numbers can be stored in O(1) space and multiplied, added, etc. in O(1) time. On realistic sequential (nonparallel) machine models, d-digit real numbers take:

I O(d) space I O(d) time to add I O(d log d) time to multiply

Example of time and space complexity

I Let’s compare three containers for storing values: list, tree, sorted array. Let n be the number of elements stored.
I Average-case complexity (on a RAM) is:

                Space   Search time   Insert time
List            Θ(n)    Θ(n)          Θ(n)
Balanced tree   Θ(n)    Θ(log n)      Θ(log n)
Sorted array    Θ(n)    Θ(log n)      Θ(n)

I If search time is important: since log n ≺ n, a balanced tree or sorted array will be faster than a list for sufficiently large n.

I If insert time is important: use a balanced tree.

I Caveat: asymptotic performance says nothing about performance for small cases.

Example: Circuit complexity

I In circuit complexity, we do not analyze programs per se, but a family of circuits, one for each problem size (e.g., addition circuits for n-bit integers).
I Circuits are built from basic gates. The most realistic model is gates that have finite fan-in and fan-out, i.e., gates have 2 inputs and output signals can be fed into at most k inputs.
I Common resource measures are:
I time (i.e., delay, circuit depth)
I number of gates (or cells, for VLSI)
I fan-out
I area
I E.g., addition circuits:

Adder type              Gates        Depth
Ripple-carry adder      ≈ 7n         ≈ 2n
Carry-skip (1L)         ≈ 8n         ≈ 4√n
Carry lookahead         ≈ 14n        ≈ 4 log n
Conditional-sum adder   ≈ 3n log n   ≈ 2 log n

Resource consumption tradeoffs


I Often there are tradeoffs between consumption of resources.
I Example: Testing whether a number is prime. The Miller-Rabin test takes time Θ(k log^3 n) and has probability of error 4^{-k}.
I Choosing k = 20 yields time Θ(log^3 n) and probability of error 2^{-40}.
I Choosing k = (1/2) log n yields time Θ(log^4 n) and probability of error 1/n.
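For concreteness, a minimal Python sketch of a textbook-style Miller-Rabin test with k rounds (not taken from the slides; k trades running time against the 4^{-k} error bound):

import random

def miller_rabin(n, k=20):
    # Probabilistic primality test: k random witnesses, error probability <= 4**(-k).
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, s = n - 1, 0
    while d % 2 == 0:              # write n - 1 = d * 2**s with d odd
        d //= 2
        s += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False           # a is a witness: n is definitely composite
    return True                    # probably prime

print(miller_rabin(2**61 - 1))     # True: 2^61 - 1 is a Mersenne prime
print(miller_rabin(2**61 + 1))     # False: composite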

Resource consumption tradeoffs: time-space

Cracking passwords has a time-space tradeoff:
I Passwords are stored encrypted to make them hard to recover: e.g., htpasswd (web passwords) turns “foobar” into “AjsRaSQk32S6s”
I Brute force approach: if there are n possible passwords,

precompute a database of size O(n) containing every possible encrypted password and its plaintext. Crack passwords in O(log n) time by looking them up in the database.
I Prohibitively expensive in space: e.g., n ≈ 2^64.
I Hellman: can recover plaintext in O(n^{2/3}) time using a database of size O(n^{2/3}).

I MS-Windows LanManager passwords are 14-characters; they are stored hashed (encrypted). With a precomputed database of size 1.4Gb (two CD-ROMs), 99.9% of all alphanumerical password hashes can be cracked in 13.6 seconds [4].

Resource consumption tradeoffs: Area-time

In designing circuits, e.g., VLSI, one is concerned with how much area a circuit takes up vs. how fast it is (its gate depth).
I Often one can sacrifice area for time (depth), and vice versa.

I e.g. Multiplying two n-bit numbers. With A the area and T the time, it is known [1] that for any circuit family

A T^{2α} = Ω(n^{1+α})

This is an “area-time product.”

Kinds of problems


I We write algorithms to solve problems. Resource Consumption I Some special classes of problems: Complexity classes I Decision problems: require a yes/no answer. Example: Does this file contain a valid Java program? Bibliography

I Optimization problems: require choosing a solution that minimizes (maximizes) some objective function. Example: Find a circuit made out of AND, OR, and NOT gates that computes the sum of two 8-bit integers, and has the fewest gates.

I Counting problems: count the number of objects that satisfy some criterion. Example: For how many inputs will this circuit output zero?

Complexity classes

I A complexity class is defined as Outline

I a style of problem Resource Consumption I that can be solved with a specified amount of resources Complexity classes I on a specified machine model Bibliography I Example: P (a.k.a PTIME) is the class of decision problems that can be solved in polynomial time (i.e., d time O(n ) for some d ∈ N) on a Turing machine. I Complexity classes:

I Let us lump together problems according to how “hard” they are

I Are usually defined so as to be invariant under non-radical changes of machine model (e.g., the class P on a TM is the same as the class P on a RAM).

Some basic distinctions

I At the coarsest level of structure, decision problems come in three varieties:

I Problems we can write computer programs to solve. What this course is about! (Program will always stop and say “yes” or “no,” and be right!)

I Problems we can define, but not write computer programs to solve (e.g., deciding whether a Java program runs in polynomial time)
I Problems we cannot even define.
I Consider deciding whether x ∈ A for some set A ⊆ N of natural numbers, e.g., prime numbers.
I In any (effective) notation system we care to choose, there are ℵ0 (countably many) problem definitions. (They can be put into 1-1 correspondence with the natural numbers.)
I There are 2^ℵ0 (uncountably many) problems, i.e., subsets A ⊆ N. (They can be put into 1-1 correspondence with the reals.)

An aside: Hasse diagrams

I Complexity classes are sets of problems

I Some complexity classes are contained inside other complexity classes.

I We can write P ⊆ PSPACE to mean: the class P is contained in the class PSPACE.

I ⊆ is a partial order: reflexive, transitive, anti-symmetric.

I Hasse diagrams are intuitive ways of drawing partial orders.

An aside: Hasse diagrams

I Example: I am a professor and a geek. Professors are people; geeks are people (are too!)
I {me} ⊆ professors
I {me} ⊆ geeks
I professors ⊆ people

I geeks ⊆ people

        people
       /      \
professors   geeks
       \      /
        {me}

Whirlwind tour of major complexity classes

Outline I Some caveats: Resource I There are 462 classes in the Complexity Zoo. Consumption Complexity classes I We’ll see... slightly fewer than that. (Most complexity classes are interesting primarily to structural complexity Bibliography theorists — they capture fine distinctions that we’re not concerned with day-to-day.)

I For every class we shall see, there are many classes above, beside, and below it that are not shown;

I The Hasse diagrams do not imply that the containment is strict: e.g., when the diagram shows NP above P, this means P ⊆ NP, not P ⊂ NP. ECE750-TXB Whirlwind tour of major complexity classes Lecture 2 Todd L. Veldhuizen [email protected]

      Decidable
          |
         EXP
          |
       PSPACE
       /      \
    coNP      NP
       \      /
          P

I EXP = decision – exponential time on TM (aka EXPTIME)

I PSPACE = decision – polynomial space on TM

I P = decision – polynomial time on TM (aka P)

I NP, co − NP: we’ll get to these...

Randomness-related classes

ZPP, RP, coRP, BPP: probabilistic classes (machine has access to random bits)

[Hasse diagram: PTIME ⊆ ZPP; ZPP ⊆ RP and coRP; RP ⊆ NP and BPP; coRP ⊆ coNP and BPP; NP, BPP, coNP ⊆ EXPTIME]

I BPP ≈ problems that can be solved in polynomial time with access to a random number source, with 1 probability of error < 2 . (Run many times and vote: get error as low as you like.)

I ZPP = problems that can be solved in polynomial time with access to a random number source, with zero probability of error.

Polynomial-time and below


PTIME      Polynomial time
NC         “Nick’s class”
LOGSPACE   Logarithmic space
NC1        Logarithmic depth circuits, bounded fan in/out
AC0        Constant depth circuits, bounded fan in/out

Structural complexity theory

I Structural complexity theory = the study of complexity classes and their interrelationships
I Many fundamental relationships are not known:
I Is P = NP? (Lots of industrially important problems are NP, like placement & routing for VLSI, designing communication networks, etc.)

I Is ZPP = P? (Is randomness really necessary?)
I Is NP ⊆ BPP? If so, we can solve those hard problems in NP by flipping coins, with some error so tiny we don’t care.

I Lots of conditional results are known, e.g.: “If BPP contains NP, then RP=NP and PH is contained in BPP; any proof of BPP=P would require showing either NEXP is not in P/poly or that #P requires superpolynomial sized circuits.”

I Luckily (for me and you) this is not a course in complexity theory. We will do basics only. ECE750-TXB Bibliography I Lecture 2 Todd L. Veldhuizen [email protected]

[1] Richard P. Brent and H. T. Kung. The area-time complexity of binary multiplication. J. ACM, 28(3):521–534, 1981.
[2] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Marina Chen, editor, Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, San Diego, CA, May 1993. ACM Press.

Bibliography II


[3] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: a practical model of parallel computation. Commun. ACM, 39(11):78–85, 1996.
[4] Philippe Oechslin. Making a faster cryptanalytic time-memory trade-off. In Dan Boneh, editor, CRYPTO, volume 2729 of Lecture Notes in Computer Science, pages 617–630. Springer, 2003.

ECE750-TXB Lecture 3

Basic Algorithm Analysis, Recurrences, and Z-transforms

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 28, 2007

ECE750-TXB Lecture 3

Todd L. Veldhuizen [email protected]

Part I

Basic Algorithm Analysis

RAM-style machine models

I Unless we are dealing with parallelism, randomness, circuits, etc., for the remainder of this course we will always assume a RAM-style machine.

I RAM = random access memory

I Every memory location can be read and written in O(1) time. (This is in contrast to a Turing machine, where reading a symbol at position p on the tape requires moving the position of the machine across the tape to p, requiring O(p) steps.)

I Memory locations, variables, registers, etc. all contain objects of size O(1). (e.g., 64-bit words)

I Basic operations (addition, multiplication, etc.) all take O(1) time.

ECE750-TXB Styles of analysis Lecture 3 Todd L. Veldhuizen [email protected]

I Worst case: if an algorithm has worst case O(f (n)) time, there are constants c1, c2 such that no input requires more than c1 + c2f (n) time, for n big.

I Average case: average case O(f (n)) time means: the time required by the algorithm on inputs of size n, averaged according to some probability distribution (usually uniform) is O(f (n)).

I Amortized analysis: if a data structure has amortized time O(f(m)) then a sequence of m operations will take O(m · f(m)) time. (Most operations are cheap, but every now and then you need to do something expensive.)

What is n?

I When we say an algorithm takes O(f(n)) time, what does n refer to?

I Default: n is the number of bits required to represent the input.

I However, often we choose n to be a natural description of the “size” of the problem:
I Number of vertices in a graph (input length is O(n^2) bits to specify edges)

I For number-theory algorithms: n is often an integer (input length is O(log n) bits) I For linear algebra, n usually indicates rank:

I Input is O(n) bits for vectors, e.g., dot product; 2 I Input is O(n ) bits for matrices.

I Exactly what n stands for is important:

I Two integers ≤ n can be multiplied in O(log n) time. I Two n-bit integers can be multiplied in O(n log n) time.

Tools for analyzing algorithms

I Asymptotics

I Recurrences, z-transforms

I Combinatorics, Ramsey theory

I Discrepancy

I Probability, Statistics

I Information Theory

I Random objects (e.g., random graphs), zero-one laws

I

I ... pretty much anything else you can think of

No silver bullet

I Finding a bound for the running time of an algorithm is an undecidable problem; it is impossible to write a program that will automatically prove a bound, if one exists, for any program.

I There are very simple algorithms that have extremely long proofs of complexity bounds.

I There are very simple algorithms that nobody knows the running time of! e.g. Collatz problem.

I In any formal system (e.g., ZFC set theory) there are simple algorithms that have a complexity bound, but this cannot be proven.

I There is no finite set of tools that suffice for algorithm analysis.

I However, there are well-defined classes of algorithms that can be analyzed in a systematic way, and we will learn some of these.

ECE750-TXB Recurrence equations Lecture 3 Todd L. Veldhuizen [email protected]

I Recurrence equations are one of the simplest techniques for algorithm analysis, and for simple programs the analysis is easily automated.

I Recipe:

I Write out the algorithm in some suitable programming language or pseudocode, so that every step is expressed in terms of basic operations supported by the machine model that take O(1) time.

I Attach to each statement/syntax block a variable that counts the amount of resource used there (e.g., time)

I Write equations that relate the variables; simplify or approximate as necessary.

I Solve the equations.

Pseudocode language

I A simple pseudocode language:

s   = loc ← e                assignment
    | if e then b [else b]   if statement (else optional)
    | for v = e to e b       for loop
    | v(e, ···, e)           function call
    | return e
b   = s | s b                statement block (one or more statements)
loc = v[e]                   array
    | v                      variable
e   = loc                    location (array or variable)
    | e op e                 operator: + ∗ − etc.
    | e R e                  relation: ≤, =, etc.
    | constant

ECE750-TXB Analysis Rules Lecture 3 Todd L. Veldhuizen [email protected]

I Basic operations (array access, multiply, add, compare, etc.) take O(1) time.

I Represent constant time operations by arbitrary constants c1, c2, etc.

For loops (simple version)

t1:  for i = 1 to n
t2(i):   ...
     end

t1 = c1 + sum_{i=1}^{n} (c2 + t2(i))

I The time required by a loop is: I c1: time required to initialize i = 1 I for each loop iteration (sum):

I some constant overhead c2 (time required to increment i and compare to n) I the time required by the body of the loop t2(i), which might depend on the value of i.

ECE750-TXB If statements (simple version) Lecture 3 Todd L. Veldhuizen [email protected]

t1:  if e then        (evaluating e costs t2)
t3:      ...
     else
t4:      ...

t1 = t2 + c1 + max(t3, t4)

I Time required for an if statement:
I t2 = time required to evaluate branch condition
I c1 = some constant time required for branching
I t3, t4 = time taken in the branches
I We use max(t3, t4) because we are seeking an upper bound on running time.

Function calls

I For each function F, introduce a time function T_F(n): it represents the amount of time used by the function F on inputs characterized by parameters n = (n1, n2, ···). (Usually there is just a single parameter: T_F(n).)
I The variable(s) n should include any values on which the time required by the function depends.
I Example: (naive) function to compute r^s

Exp(r, s)
  p ← 1
  for i = 1 to s
    p ← p ∗ r
  end
  return p

Time depends on s (exponent) but not r (base), so the time function should be T_Exp(s).
I Time required for a function call:

t1:  Exp(x, y)        t1 = c1 + T_Exp(y)

ECE750-TXB Example: Exp(r, s) Lecture 3 Todd L. Veldhuizen [email protected]

Exp(r, s)
t2:   p ← 1                    t1 = t2 + t3 + t5
t3:   for i = 1 to s           t2 = c2
t4:     p ← p ∗ r              t3 = c3 + sum_{i=1}^{s} (c3' + t4)
      end                      t4 = c4
t5:   return p                 t5 = c5

I Solve:

T_Exp(s) = t1 = c2 + c3 + s(c3' + c4) + c5

I So, T_Exp(s) = c + c's.

A simplifying notation

I Write c to mean Θ(1). Anytime a constant is needed, just use c.

I The result is an upper bound on the time, for c sufficiently large.

I Example:

Exp(r, s)
t2:   p ← 1                    t1 = t2 + t3 + t5
t3:   for i = 1 to s           t2 = c
t4:     p ← p ∗ r              t3 = c + sum_{i=1}^{s} (c + t4)
      end                      t4 = c
t5:   return p                 t5 = c

T_Exp(s) = c + cs

I Fine point: each time c occurs, it means Θ(1) (some bounded value): but not the same value at each occurrence.

ECE750-TXB Matrix Multiply Example Lecture 3 Todd L. Veldhuizen [email protected]

Matrix-Multiply(n, A, B, C)
  for i = 1 to n
    for j = 1 to n
      C(i, j) ← 0
      for k = 1 to n
        C(i, j) ← C(i, j) + A(i, k) ∗ B(k, j)
      end
    end
  end

Matrix Multiply: analyze

Matrix-Multiply(n, A, B, C)

t1:  for i = 1 to n
t2:    for j = 1 to n
t3:      C(i, j) ← 0
t4:      for k = 1 to n
t5:        C(i, j) ← C(i, j) + A(i, k) ∗ B(k, j)
         end
       end
     end

t1 = c + sum_{i=1}^{n} (c + t2)
t2 = c + sum_{j=1}^{n} (c + t3 + t4)
t3 = c
t4 = c + sum_{k=1}^{n} (c + t5)
t5 = c

ECE750-TXB Matrix multiply: solve Lecture 3 Todd L. Veldhuizen [email protected]

I Solve:

n  n n ! X X X t1 = c + 2c + 3c + 2c  i=1 j=1 k=1 = c + n · (2c + n · (3c + n · 2c)) = 2cn3 + 2cn2 + 2cn + c

I MatrixMultiply(n, A, B, C) takes Θ(n^3) time.

Analyzing Recursion

I When functions call themselves, we will get time functions defined in terms of themselves.

I Such equations are called recurrences.

I Example:

T(1) = c              Base case
T(n) = c + T(n − 1)   Recursion

This example is easy to solve: T(n) = c + c + ··· + c (n terms) = cn
I Rarely that easy in practice!

Fibonacci example

Fibonacci(n)
  if n ≤ 2 then
    return 1
  else
    return Fibonacci(n − 1) + Fibonacci(n − 2)
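A direct Python transcription of this pseudocode (a sketch, handy for timing experiments):

def fibonacci(n):
    # Naive recursive Fibonacci, exactly as in the pseudocode above.
    if n <= 2:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)

# Timing fibonacci(30) vs fibonacci(35) shows the cost growing by roughly
# a factor of phi^5 ~ 11, matching the Theta(phi^n) result derived below.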

I Analyze base case(s) separately:

T (1) = c T (2) = c

I Recurrence:

T (n) = c + T (n − 1) + T (n − 2) ECE750-TXB Fibonacci example: Call graph Lecture 3 Todd L. Veldhuizen [email protected]

ECE750-TXB Lecture 3

Todd L. Veldhuizen [email protected]

Bibliography

Part II

Z-transforms ECE750-TXB Unilateral Z-transform I Lecture 3 Todd L. Veldhuizen I Transforms give us an alternative representation of [email protected]

functions or series in which certain manipulations Bibliography and/or insights are easier.

I The Z-transform is of special interest to algorithm analysis because it makes solving some recurrences simple.

I To put the Z-transform in context of transforms in general: there are Integral transforms (for functions of the real line) and their discrete cousins Generating functions and Formal power series.

I Integral transforms [1, 2]: Represent a function f (x) by its transform F (p). R I Forward transform F (p) = K(p, x)f (x)dx R I Inverse transform f (x) = L(x, p)F (p)dp

I K, L are kernels. −px I Laplace transform: K(p, x) = e

ECE750-TXB Unilateral Z-transform II Lecture 3 Todd L. Veldhuizen [email protected] iωt I Fourier transform: K(p, x) = e p−1 Bibliography I Mellin transform: K(p, x) = x

I Exotica: Hankel transform, Hilbert transform, Abel transform, ···

I Generating functions / formal power series [6, 2]:

I Represent a sequence/discrete function f (n) by its transform/generating function F (z) P I Forward transform: F (z) = K(z, n)f (n) R I Inverse transform: f (n) = L(n, z)F (z) n I Ordinary Generating Functions: K(z, n) = z −n I ? Z-transforms: K(z, n) = z zn I Exponential Generating Functions: K(z, n) = n! e−z zn I Poisson Generating Functions: K(z, n) = n! I Exotica: Lambert series, Bell series ECE750-TXB Unilateral Z-transform III Lecture 3 Todd L. Veldhuizen [email protected]

I Z-transforms are more-or-less the same as ordinary Bibliography generating functions. The OGF can be obtained from the z-transform, and vice versa, by the substitution z 7→ z −1. The OGF form is more common in combinatorics and theoretical computer science; the Z-transform is more common in engineering, particularly signals and controls.

I Very useful for solving linear difference equations, i.e., equations of the form

f (n) + c1f (n − a1) + c2f (n − a2) + ··· = g(n)

that arise frequently in algorithm analysis.

ECE750-TXB Unilateral Z-transform IV Lecture 3 Todd L. Veldhuizen [email protected]

Bibliography I Linear difference equations are a special case of recurrences. Examples of recurrences not in this class are:

T (n) = T (n/2) + 1 Solution T (n) = Θ(log n) √ T (n) = T ( n) + 1 Solution T (n) = Θ(log log n)

However, we can use Z-transforms to obtain approximate solutions to the above pair of recurrences by performing a change of variables that results in a linear recurrence: r = 2n n for the first, r = 22 for the second. For a survey of recurrence-solving techniques, see [3]. ECE750-TXB Unilateral Z-transform V Lecture 3 Todd L. I Definition of the Z-transform and its inverse: Veldhuizen [email protected] ∞ X −n Z[f (n)] = f (n)z = F (z) (1) Bibliography n=0 I −1 1 n−1 Z [F (z)] = 2πi F (z)z dz (2) C where the contour C must be in the region of convergence of F (z) and encircle the origin.

I The function f (n) is discrete, i.e., f : N → R. I The Z-transform F (z) is complex: F : C → C. I The sum of Eqn. (1) often converges for only part of the complex plane, called the Region of Convergence (ROC) of F (z).

I In practice, never use Eqns. (1,2): instead use tables of transform pairs.

I Standard references: [4, 6, 5] or any DSP book

ECE750-TXB Unilateral Z-transform VI Lecture 3 Todd L. Veldhuizen [email protected] I An important transform pair: the z-transform of f (n) = bn is Bibliography

∞ X 1 Z [bn] = z−nbn = 1 − bz−1 n=0

Note that the sequence f (0), f (1), f (2), ··· = b0, b1, b2, ··· can be read off from the series expansion of F (z):

1 = b0 + b1z−1 + b2z−2 + b3z−3 + ··· 1 − bz−1 This is by definition — compare Eqn. (1). ECE750-TXB Unilateral Z-transform VII Lecture 3 Todd L. Veldhuizen [email protected] I A typical z-transform of a function f (n) looks like:

Bibliography −1 −1 (1 − a1z )(1 − a2z ) ··· F (z) = −1 −1 (1 − b1z )(1 − b2z ) ···

Here we have written F (z) in factored form. −1 I When z = ai , (1 − ai z ) = 0, and F (z) = 0. Such values of z are called zeros. −1 I When z → bi , (1 − bi z ) → 0 and F (z) → ∞. Such values of z are called poles.

I To take the inverse Z-transform of something in the form N(z) F (z) = −1 −1 (1 − b1z )(1 − b2z ) ···

ECE750-TXB Unilateral Z-transform VIII Lecture 3 Todd L. Veldhuizen where the b1, b2, ··· are all distinct, we can use partial [email protected] fractions expansion to write Bibliography

N1(z) N2(z) F (z) = −1 + −2 + ··· (1 − b1z ) (1 − b2z )

n 1 and then use the transform pair Z[b ] = (1−bz−1) to obtain something like

n n f (n) = c1b1 + c2b2 + ···

The term with the largest value of |bi | will be asymptotically dominant, e.g. if f (n) = 2n + 3n, then f (n) ∼ 3n.

I Hence, the asymptotic behaviour of f (n) can be read off directly from F (z): find the pole farthest from the n origin (i.e., the bi with |bi | largest); then f (n) = Θ(bi ). ECE750-TXB Unilateral Z-transform IX Lecture 3 Todd L. Veldhuizen [email protected]

Bibliography

I When the largest pole occurs in a form such as −1 2 −1 3 (1 − bi z ) or (1 − bi z ) etc. (double,triple poles), we need to consult a table of transforms and find what −1 k form (1 − bi z ) will take:

F −1[(1 − bz−1)2] = (n + 1)bn −1 −1 3 1 2 3 n F [(1 − bz ) ] = ( 2 n + 2 n + 1)b

ECE750-TXB Z-transforms Lecture 3 Todd L. Veldhuizen [email protected]

I Two compelling reasons for using Z-transforms: Bibliography 1. Because of the transform pair

Z[f (n − a)] = z−aF (z)

linear difference equations become linear equations that can be solved by simple algebraic manipulation. 2. The asymptotics of f (n) are governed by the pole(s) of F (z) farthest from the origin. If we just want to know Θ(f (n)), we can take the z-transform, and find the outermost pole(s) [2].

I e.g. If the outermost pole(s) of F (z) is a single pole at z = 2, then f (n) is Θ(2n). I e.g. If the outermost pole(s) of F (z) is a double pole at z = 5, then f (n) is Θ(n5n). ECE750-TXB Solving linear recurrences with Z-transforms Lecture 3 Todd L. I Workflow for exact solution: Veldhuizen 1. Use (discrete) δ-functions to encode initial conditions: [email protected] ( 1 when n = a Bibliography δ(n − a) = 0 otherwise 2. Take Z-transform of difference equation(s) to obtain equation(s) in F (z). 3. Solve for F (z). Linear difference equations result in F (z) being a ratio of polynomials in z. Factor the denominator. 4. Use partial fraction expansion to split into a sum of simple terms, and take the inverse Z-transform. I Workflow for asymptotic solution: 1. Disregard initial conditions. 2. Take Z-transform of recurrence. Solve for F (z). Factor denominator. 3. Identify outermost pole(s). If they are > 1, find the inverse Z-transform of the term corresponding to those pole(s). If outermost poles are ≤ 1, the initial conditions may matter ⇒ exact solution.

ECE750-TXB Common Z-transform pairs Lecture 3 Todd L. Veldhuizen [email protected] I Linearity: Bibliography Z [af (n) + bg(n)] = aZ[f (n)] + bZ[g(n)]

I Common transform pairs:

Z[T(n − a)] = z^{-a} T(z)                    shift
Z[δ(n)] = 1                                  impulse
Z[a^n] = 1/(1 − a z^{-1})                    single pole
Z[(n + 1) a^n] = 1/(1 − a z^{-1})^2          double pole
Z[1] = 1/(1 − z^{-1})                        single pole at 1
Z[n] = z^{-1}/(1 − z^{-1})^2                 double pole at 1
Z[n^2] = (z^{-1} + z^{-2})/(1 − z^{-1})^3    triple pole at 1

Bibliography I We use the Z-transform as a unilateral transform: all functions are assumed to be 0 for n < 0.

I Initial conditions must be dealt with by introducing δ-functions.

I E.g., the Fibonacci numbers satisfy the recurrence

f (n) = f (n − 1) + f (n − 2)

with the boundary conditions (BCs) f (0) = f (1) = 1.

I If we evaluate f (0) = f (−1) + f (−2) = 0, it doesn’t satisfy the BC f (0) = 1.

ECE750-TXB Finding boundary conditions II Lecture 3 Todd L. I We add a term δ(n): Veldhuizen [email protected]

f (n) = f (n − 1) + f (n − 2) + αδ(n) Bibliography

Then f (0) = α, so we choose α = 1 to match the BC f (0) = 1. Then try f (1):

f (1) = f (0) + f (−1) +α δ(1) |{z} | {z } |{z} =1 =0 =0 = 1

So, our BC f (1) = 1 is satisfied.

I In general, if the recurrence has a term f (n − k), we may need terms

α0δ(n) + α1δ(n − 1) + ··· + αk δ(n − k)

to account for BC’s. ECE750-TXB Finding boundary conditions III Lecture 3 Todd L. Veldhuizen [email protected]

Bibliography

I However, if we are only interested in asymptotic behaviour, boundary conditions often do not matter: αz−k I Functions αδ(n − k) have a Z-transform 1−z−1 . I Will contribute a pole at z = 1, plus some term(s) to the numerator of the Z-transform when written in factored form.

I If the dominant poles of the Z-transform are > 1, then the pole(s) and zero(s) contributed by δ(n − k) functions do not change the asymptotics.

ECE750-TXB Z-Transforms: Fibonacci Example Lecture 3 Todd L. Example: our time recurrence for the Fibonacci Veldhuizen I [email protected] function. Bibliography t(n) = c + t(n − 1) + t(n − 2)

I Ignore initial conditions; take z-transform: c T (z) = + z−1T (z) + z−2T (z) 1 − z−1

I Solve for T (z): c T (z) = 1 − 2z−1 + z−3 √ 1± 5 I Asymptotics: have poles at z = 1, 2 √ 1+ 5 I Outermost pole (i.e., with |z| maximized) is z = 2 ; dominates asymptotics √ n 1+ 5 I T (n) is Θ(φ ) with φ = 2 ≈ 1.618. ECE750-TXB Z-Transforms: Fibonacci Example Lecture 3 √ Exact solution: via partial fractions. Let φ = 1+ 5 , Todd L. I √ 2 Veldhuizen 1− 5 [email protected] θ = 2 . c Bibliography T (z) = 1 − 2z −1 + z −3 c A B C = (1−z−1)(1−φz−1)(1−θz−1) = 1−z−1 + 1−φz−1 + 1−θz−1 ˛ A = T (z)(1 − z −1)˛ = −c ˛z=1 √ √ −1 ˛ c 5( 5+1)2 B = T (z)(1 − φz )˛ = √ ˛z=φ 10( 5−1) √ √ −1 ˛ c 5( 5−1)2 C = T (z)(1 − θz )˛ = √ (Yech.) ˛z=θ 10(1+ 5) −1 h α i n I Inverse Z transform: Z 1−az−1 = αa f (n) = −c + Bφn + Cθn ∼ Bφn + O(1)

I Partial fractions is tedious: if we only want asymptotics, just read the pole locations and do not bother with an exact inverse transform.

ECE750-TXB Method 2: Maple Lecture 3 Todd L. Veldhuizen The practical method of choice is to use a symbolic algebra [email protected] package like Maple: Bibliography > rsolve({T(n)=c+T(n−1)+T(n−2),T(1..2)=c},T); √ “ √ ”n √ “ √ ”n −1/5 c 5 −1/2 5 + 1/2 + 1/5 c 5 1/2 + 1/2 5 √ „ “√ ”−1«n −1/5 c 5 −2 5 + 1 “√ ” √ „ “ √ ”−1«n “ √ ”−1 −1/5 c 5 − 1 5 −2 − 5 + 1 − 5 + 1 − c

> asympt(%,n,2); √ !n √ 1 + 5 2/5 c 5 + O (1) 2
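If Maple is not at hand, the same kind of computation can be sketched in Python with sympy's rsolve; shown here only for the homogeneous Fibonacci recurrence with Fib(1) = Fib(2) = 1 (the constant-c version above would need extra handling):

from sympy import Function, rsolve, symbols

n = symbols('n', integer=True)
y = Function('y')

# Solve y(n+2) = y(n+1) + y(n) with y(1) = y(2) = 1: Binet's formula.
solution = rsolve(y(n + 2) - y(n + 1) - y(n), y(n), {y(1): 1, y(2): 1})
print(solution)   # the dominant term involves ((1 + sqrt(5))/2)**n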

√ n 1+ 5 So, running time is Θ(φ ) with φ = 2 = 1.6180 ··· ECE750-TXB Fibonacci example cont’d Lecture 3 Todd L. Veldhuizen I So, our little Fibonacci(n) function requires [email protected]

exponential time. Bibliography I Is there a better way?

I Iterate: for i = 2..n sum previous two elements. Requires Θ(n) time.

I Use our Z-transform powers: ( 1 if n ≤ 2 Fib(n) = Fib(n − 1) + Fib(n − 2) otherwise = Fib(n − 1) + Fib(n − 2) + δ(n − 1) − δ(n − 2)

Z-transform, solve, inverse Z-transform:

Fib(n) = aφn + bθn

where a, b are constants, and φ, θ are as before. This can be implemented in O(log n) time.
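One concrete way to get an algorithm using O(log n) arithmetic operations, as an alternative to evaluating a·φ^n + b·θ^n in floating point, is the "fast doubling" recurrence; a Python sketch (not from the slides):

def fib_pair(n):
    # Return (Fib(n), Fib(n+1)) with Fib(1) = Fib(2) = 1, using O(log n) arithmetic ops.
    if n == 0:
        return (0, 1)
    a, b = fib_pair(n // 2)        # a = Fib(k), b = Fib(k+1), k = n // 2
    c = a * (2 * b - a)            # Fib(2k)
    d = a * a + b * b              # Fib(2k+1)
    return (c, d) if n % 2 == 0 else (d, c + d)

def fib(n):
    return fib_pair(n)[0]

print([fib(i) for i in range(1, 11)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]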

ECE750-TXB Bibliography I Lecture 3 Todd L. Veldhuizen [email protected]

Bibliography [1] Brian Davies. Integral transforms and their applications. Springer, 3rd edition, 2005. bib [2] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. 2007. Book draft. bib pdf

[3] George S. Lueker. Some techniques for solving recurrences. ACM Comput. Surv., 12(4):419–436, 1980. bib pdf ECE750-TXB Bibliography II Lecture 3 Todd L. Veldhuizen [email protected]

[4] Alan V. Oppenheim, Alan S. Willsky, and Syed Hamid Bibliography Nawab. Signals and systems. Prentice-Hall signal processing series. Prentice-Hall, second edition, 1997. bib [5] Robert Sedgewick and Philippe Flajolet. An introduction to the analysis of algorithms. Addison-Wesley, 1996. bib [6] Herbert S. Wilf. Generatingfunctionology. Academic Press, 1990. bib ECE750-TXB Lecture 4: Search & Correctness Proofs

Todd L. ECE750-TXB Lecture 4: Search & Veldhuizen Correctness Proofs [email protected] Outline

Bibliography Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 14, 2007

ECE750-TXB Problem: Searching a sorted array Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected]

I Let ⟨T, ≤⟩ be a total order. E.g., T could be the integers Z.
I Inputs:
1. An integer n > 0.
2. An array A[0..n − 1] of elements of T, sorted in ascending order so that (i ≤ j) ⇒ (A[i] ≤ A[j]).
3. An element x of T.

I Specification: Return true if x is equal to some element in the array, false otherwise.

Linear Searching

Todd L. Veldhuizen [email protected] I Let’s analyze the worst-case time complexity of the

following (naive) algorithm on a RAM. Outline

Linsearch(int n, T[] A, T x)
  for i = 0 to n − 1
    if A[i] = x then return true
  end
  return false

I e.g.,

Linsearch(5, [3, 5, 9, 9, 13], 4) returns false
Linsearch(5, [3, 5, 9, 9, 13], 9) returns true
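The same routine as a runnable Python sketch, reproducing the two example calls:

def linsearch(n, A, x):
    # Linear search over A[0..n-1]; worst case Theta(n) comparisons.
    for i in range(n):
        if A[i] == x:
            return True
    return False

assert linsearch(5, [3, 5, 9, 9, 13], 4) is False
assert linsearch(5, [3, 5, 9, 9, 13], 9) is True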

Linear Searching

I Without thinking much we can say this algorithm takes time Θ(n), but let’s get the practice, and be sure:
I Time taken depends on n, but not on x or the contents of A[], assuming that the type T allows comparisons in time Θ(1).

I Let T (n) be the time taken for an array of size n.

Linsearch(int n, T[] A, T x) t for 4 i = 0 to n − 1

t5 t3 if t1 A[i] = x then t2 return true t7 end

t6 return false ECE750-TXB Linear Searching Lecture 4: Search & Correctness Proofs I Write equations: (see rules from Lecture 3) Todd L.  Veldhuizen t1 = c1 [email protected]   t2 = c2  Outline  t3 = c3 + t1 + max(t2, 0)  Bibliography t4 = c4  Pn−1  t5 = t4 + i=0 (c5 + t3)   t6 = c6  t7 = t5 + t6

I (In the above analysis, I analyzed the “one-armed if” (t3) by pretending it was an if ··· then ··· else ··· statement in which the second branch required zero time.)

I Solving,

T (n) = t7 = n(c5 + c3 + c1 + c2) + c4 + c6 So, Linsearch requires Θ(n) time as expected.

ECE750-TXB How To Catch A Lion In A Desert Lecture 4: Search & Correctness Proofs The Bolzano-Weierstraß method. Divide the desert by a Todd L. Veldhuizen line running from north to south. The lion is then either in [email protected] the eastern or in the western part. Let’s assume it is in the Outline eastern part. Divide this part by a line running from east to Bibliography west. The lion is either in the northern or in the southern part. Let’s assume it is in the northern part. We can continue this process arbitrarily and thereby constructing with each step an increasingly narrow fence around the selected area. The diameter of the chosen partitions converges to zero so that the lion is caged into a fence of arbitrarily small diameter. — from How To Catch A Lion In The Desert, Mathematical Methods Not as elegant as the inversion method1, but a good starting point for a search algorithm.

1Place a spherical cage in the desert, enter it and lock from the inside. Perform an inversion with respect to the cage. Then, the lion is inside the cage and you are outside. ECE750-TXB Binary Search Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected] I Search a sorted array A[0..n − 1] by repeatedly dividing

it in half, and searching one of the halves for x. Outline

I The portion of the array we are searching will be l..h Bibliography (for low and high). Initially we’ll have l = 0 and h = n − 1, so that we are searching A[0..n − 1].

I We will design the algorithm around an invariant: a property that is maintained as the algorithm runs.

I The invariant has three parts, each of which must always be true:

I A is sorted. I l ≤ h. Otherwise, A[l..h] is not a valid interval of the array.

I A[l] ≤ x ≤ A[h]. The lion is always in our segment.

ECE750-TXB Binary Search Lecture 4: Search & Correctness BinarySearch(n, A[], x) Proofs Todd L. I Require A sorted. Veldhuizen [email protected] I Let l = 0 and h = n − 1. A[l..h] is our search range.

I If x < A[l] or A[h] < x then x is not in the array; return Outline

false. Bibliography I Otherwise, A[l] ≤ x ≤ A[h] and l ≤ h. We have established the invariant. I Call BinarySearch2(A, x, l, h). BinarySearch2(A[], x, l, h): I Require l ≤ h, A[l] ≤ x ≤ A[h], A sorted (Invariant). I If l = h then return true. I Otherwise, split the search range in two by choosing 1 midpoint i = l + b 2 (h − l)c. Then either: I x ≤ A[i], in which case A[l] ≤ x ≤ A[i]. Return BinarySearch2(A, x, l, i). I A[i] < x, in which case either I A[i] < x < A[i + 1], in which case return false; or I A[i + 1] ≤ x ≤ A[h]. Return BinarySearch2(A, x, i + 1, h). ECE750-TXB Binary Search: Code Lecture 4: Search & Correctness Proofs

BinarySearch2(A, x, l, h)
    requires (l ≤ h) ∧ (A[l] ≤ x ≤ A[h]) ∧ (A sorted)
    if l = h then
        return true
    else
        i ← l + ⌊(h − l)/2⌋
        if x ≤ A[i] then
            return BinarySearch2(A, x, l, i)
        else if x < A[i + 1] then
            return false
        else
            return BinarySearch2(A, x, i + 1, h)
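A runnable Java sketch of the same algorithm on an int array, mirroring the pseudocode above; this is an illustration written for these notes (names and types are assumptions), not the official course code.

    public class BinSearch {
        // requires: l <= h, A[l] <= x <= A[h], and A sorted in ascending order
        static boolean binarySearch2(int[] A, int x, int l, int h) {
            if (l == h)
                return true;                           // invariant forces A[l] = x
            int i = l + (h - l) / 2;                   // midpoint, rounding down
            if (x <= A[i])
                return binarySearch2(A, x, l, i);
            else if (x < A[i + 1])
                return false;                          // A[i] < x < A[i+1]: x is absent
            else
                return binarySearch2(A, x, i + 1, h);
        }

        // Front end: either answers directly or establishes the invariant.
        static boolean binarySearch(int n, int[] A, int x) {
            if (n < 1 || x < A[0] || x > A[n - 1])
                return false;
            return binarySearch2(A, x, 0, n - 1);
        }

        public static void main(String[] args) {
            int[] A = {1, 3, 6, 9, 10, 10, 13};
            System.out.println(binarySearch(7, A, 3));   // true
            System.out.println(binarySearch(7, A, 7));   // false
        }
    }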

ECE750-TXB Binary Search: Example Lecture 4: Search & Correctness Proofs

I Example: n = 7, search for x = 3. Todd L. Veldhuizen [email protected] 0 1 2 3 4 5 6

Outline A[] 1 3 6 9 10 10 13 O O O Bibliography l i h x ≤ A[i]

A[] 1 3 6 9 10 10 13 O O O l i h x ≤ A[i]

A[] 1 3 6 9 10 10 13 O O l, i h ¬(x ≤ A[i]) ∧ ¬(x < A[i + 1])

A[] 1 3 6 9 10 10 13   (l = h: return true)

ECE750-TXB Binary Search: Time analysis for h − l + 1 = 2^k. Lecture 4: Search & Correctness Proofs

I Let n = h − l + 1, the size of the search range.
I We will analyze BinarySearch2(···) only; the ‘setup’ function BinarySearch(···) adds a constant overhead.
I Let T(n) be an (upper bound for) the time complexity.

BinarySearch2(A, x, l, h)

if t10 l = h then

t9 return A[l] = x

else

t8 i ← l + b(h − l)/2c

t6 if x ≤ A[i] then t11 t5 return BinarySearch2(A, x, l, i)

else t t 7 if 3 x < A[i + 1] then

t2 return false t4

else t 1 return BinarySearch2(A, x, i + 1, h)

ECE750-TXB Binary Search: Time analysis Lecture 4: Search & Correctness Proofs

I Analyze for the case where h − l + 1 = 2^k, for some k ∈ N.

I Need to prove that when BinarySearch2 calls itself, we go from a problem of size 2^k to a problem of size 2^{k−1}: then the recurrence will have the form T(n) = ··· + T(n/2) + ···.

Proposition (Both halves have size 2^{k−1}.)
If h − l + 1 = 2^k and i = l + ⌊(h − l)/2⌋, then i − l + 1 = 2^{k−1} and h − (i + 1) + 1 = 2^{k−1}.

Proof.
h − l + 1 = 2^k implies i = l + ⌊(h − l)/2⌋ = l + ⌊(2^k − 1)/2⌋ = l + 2^{k−1} − 1, so i − l + 1 = (l + 2^{k−1} − 1) − l + 1 = 2^{k−1}, and h − (i + 1) + 1 = h − (l + 2^{k−1} − 1 + 1) + 1 = 2^k − 2^{k−1} = 2^{k−1}.

ECE750-TXB Binary Search: Time analysis Lecture 4: Search & Correctness Proofs

I Assuming n = h − l + 1 = 2^k, a recursive call has problem size 2^{k−1} = n/2.
I Write equations: (use c = Θ(1) notation)

    t1 = c + T(n/2)
    t2 = c
    t3 = c
    t4 = c + max(t1, t2)
    t5 = c + T(n/2)
    t6 = c
    t7 = t6 + c + max(t5, t4)
    t8 = c
    t9 = c
    t10 = c
    t11 = { c             if n = 1
          { c + t7 + t8   otherwise

ECE750-TXB Binary Search: Time analysis Lecture 4: Search & Correctness Proofs

I Solving, simplifying, and folding constants, we obtain the recurrence

    T(n) = { c             if n = 1
           { c + T(n/2)    otherwise

I So far, we have only seen recurrences of the form

F (n) = αF (n − a) + βF (n − b) + ··· + G(n)

i.e. linear difference equations.
I Change of variables: Let ξ = log n, and r(ξ) = T(2^ξ). Then T(n/2) = r(ξ − 1). New recurrence:

r(ξ) = c + r(ξ − 1)

Now it is a linear difference equation in ξ.

ECE750-TXB Binary Search: Solve recurrence Lecture 4: Search & Correctness Proofs

I From inspection r(ξ) = c(1 + ξ),² but for practice:

    r(ξ) = c + r(ξ − 1)
      ⇓ z-transform
    R(z) = c/(1 − z^{−1}) + z^{−1}R(z)

I Solve: get R(z) = c/(1 − z^{−1})^2. A double pole at z = 1.

    Z[n] = z^{−1}/(1 − z^{−1})^2          Z[T(n − a)] = z^{−a}T(z)

  Write as R(z) = c · z^{+1} · z^{−1}/(1 − z^{−1})^2. Then

    Z^{−1}[ c · z^{+1} · z^{−1}/(1 − z^{−1})^2 ] = c · Z^{−1}[ z^{−1}/(1 − z^{−1})^2 ] |_{ξ←ξ+1}
                                                 = c · ξ |_{ξ←ξ+1} = c(ξ + 1)

² Note: we treat the c term as if it were c · u(n), where u(n) is the step function: u(n) = 1 if and only if n ≥ 0.
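As a sanity check (not in the original slides), the same result follows by simply unrolling the recurrence, without the z-transform:

    \begin{align*}
    r(\xi) &= c + r(\xi - 1) = 2c + r(\xi - 2) = \cdots = \xi c + r(0),\\
    r(0)   &= c \quad\text{(the case } n = 1\text{)},\\
    \therefore\ r(\xi) &= c(\xi + 1).
    \end{align*}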

ECE750-TXB Binary Search: Solve recurrence Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected] I Therefore r(ξ) = c(ξ + 1). And, T (n) = r(log n), so

T (n) = c(1 + log n). Outline

I So, binary search takes time O(log n) for n = 2^k.

I It turns out (we won’t prove) it is O(log n) for any n ≥ 1. (See assignment 1, section 1, problem 1.)

I Much faster than Linsearch. On an array of size n = 1, 000, 000,

I On average Linsearch will require ≈ 500, 000 comparisons. I BinarySearch requires ≤ 21 comparisons. I There is an even faster search method, interpolation search, that under favourable assumptions takes average time Θ(log log n) steps. ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

I The invariant helps immensely in proving correctness. Todd L. Veldhuizen I To prove: if the preconditions are satisfied, then [email protected] BinarySearch2(A, x, l, h) returns true if and only if Outline there is a k ∈ [l, h] such that A[k] = x. Bibliography I Proof architecture: Progress and Preservation. 1. Progress is made: in each recursive call, the problem becomes strictly smaller. (This implies a base case is eventually reached.) 2. Preservation: the invariant is satisfied at all entries to the function. 2.1 the invariant is satisfied when BinarySearch2 is called from BinarySearch (initial entry). 2.2 the invariant is preserved when BinarySearch2 calls itself recursively. 3. Base cases: when BinarySearch2 returns directly (without calling itself), its return value is correct. 4. (1,2,3) together yield a simple correctness proof by induction over problem size.

ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen A basic rule of inference: in the branches of an if statement, [email protected]

Outline

if ψ then Bibliography (1) ··· else (2) ···

I At point (1), ψ is true.

I At point (2), ψ is false. Example: if ψ were ‘n > m’, where n, m ∈ N, then at (1) n > m is true, and at (2) ¬(n > m) is true, which means n ≤ m. ECE750-TXB Binary Search: Correctness Proof I Lecture 4: Search & Correctness Proofs Lemma (Base cases correct) Todd L. Veldhuizen If the invariant holds, and BinarySearch2 returns directly [email protected] without calling itself, then it returns true if and only if there Outline is a k ∈ [l, h] such that A[k] = x. Moreover, Bibliography BinarySearch2 always returns directly on problems of size n = 1. Proof. There are two cases. 1. The program path taken is:

if (1)l = h then return true

We have l = h, and the invariant says A[l] ≤ x ≤ A[h]; since hT , ≤i is a total order, antisymmetry ((x ≤ y) ∧ (y ≤ x) ⇒ x = y) implies A[l] = x. The return value is true, satisfying the requirement. If the

ECE750-TXB Binary Search: Correctness Proof II Lecture 4: Search & Correctness problem size is h − l + 1 = 1 then l = h, and Proofs BinarySearch2 returns directly. Todd L. Veldhuizen 2. The program path taken is: [email protected] Outline

BinarySearch2(A, x, l, h) Bibliography if (1)l = h then else if (2)x ≤ A[i] then else if (3)x < A[i + 1] then return false

We have (1) l 6= h and (2) x > A[i] and (3) x < A[i + 1]. Putting (2) and (3) together, A[i] < x < A[i + 1], which together with the array being sorted implies x is not in the array, and the return value is false. ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen Lemma (Progress) [email protected]

Each time BinarySearch2 calls itself, the problem size is Outline

strictly smaller. Bibliography Proof. The possible recursion paths are:

BinarySearch2(A, x, l, h) if (1)l = h then else i ← l + b(h − l)/2c ··· (a)BinarySearch2(A, x, l, i) ··· (b)BinarySearch2(A, x, i + 1, h)

From (1) we have l 6= h. Therefore problem size n = (h − l + 1) satisfies n ≥ 2.

ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen We have two cases: [email protected] 1. (Call site (a).) The problem size of the recursive call is Outline i − l + 1, so we must prove i − l + 1 < h − l + 1. This Bibliography is equivalent to i < h. (By contradiction.) Suppose that i ≥ h. Then, substituting i = l + b(h − l)/2c, we obtain b(h − l)/2c ≥ h − l, a contradiction since h − l ≥ 1. 2. (Call site (b).) To prove: h − (i + 1) + 1 < h − l + 1. Equivalent to l < i + 1. From the invariant, l ≤ h, so b(h − l)/2c ≥ 0. Since i = l + b(h − l)/2c, l ≤ i. Therefore l < i + 1. ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected]

Outline

Lemma (Preservation) Bibliography If the invariant is satisfied on a call to BinarySearch2, then it is satisfied on calls by BinarySearch2 to itself. Proof. Recall the invariant is:

(l ≤ h) ∧ (A[l] ≤ x ≤ A[h]) ∧ (A sorted)

We never modify the array, so A remains sorted. There are two cases to consider.

ECE750-TXB Binary Search: Correctness Proof I Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen 1. The program path taken is [email protected]

BinarySearch2(A, x, l, h) Outline if (1)l = h then Bibliography else i ← l + b(h − l)/2c if (2)x ≤ A[i] then return BinarySearch2(A, x, l, i)

We have (1) l 6= h and (2) x ≤ A[i]. Need to prove 1.1 A[l] ≤ x ≤ A[i]. We have A[l] ≤ x from the invariant, and x ≤ A[i] from the branch condition (2), so A[l] ≤ x ≤ A[i]. 1.2 l ≤ i. From the invariant, l ≤ h, so b(h − l)/2c ≥ 0. Therefore l ≤ i = l + b(h − l)/2c. ECE750-TXB Binary Search: Correctness Proof II Lecture 4: Search & Correctness 2. The program path taken is Proofs Todd L. Veldhuizen BinarySearch2(A, x, l, h) [email protected] if (1)l = h then Outline else Bibliography i ← l + b(h − l)/2c if (2)x ≤ A[i] then else if (3)x < A[i + 1] then else return BinarySearch2(A, x, i + 1, h)

Need to prove: 2.1 A[i + 1] ≤ x ≤ A[h]. Since branch (3) was not taken, x ≥ A[i + 1], and from the invariant x ≤ A[h]. Therefore A[i + 1] ≤ x ≤ A[h]. 2.2 i + 1 ≤ h. We proved i < h in the first case of the progress lemma, so i + 1 ≤ h.

ECE750-TXB Binary Search: Correctness Proof: Denouement Lecture 4: Search & Correctness Put everything together: Proofs Todd L. Veldhuizen Lemma (Correct Step) [email protected] If the invariant is initially satisfied, then Outline 1. If the problem is of size n = 1, BinarySearch2 Bibliography returns immediately a correct answer. 2. If the problem is of size n > 1 then BinarySearch2 either immediately returns a correct answer, or calls itself with the invariant satisfied on a problem of size n0 where 1 ≤ n0 < n.

Proof. (1) from the base cases lemma. (2) if it returns immediately it is correct from the base cases lemma. Otherwise, it calls itself: progress lemma gives n0 < n, preservation lemma gives invariants satisfied, and 1 ≤ n0 follows from l ≤ h (invariant). ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Theorem Todd L. If the invariant is satisfied then returns a Veldhuizen BinarySearch2 [email protected] correct answer. Outline Proof. Bibliography By induction on problem size. 1. Base case. To prove: if n = 1 a correct answer is returned. Proof: apply Correct Step Lemma. 2. Induction step. To prove: if a correct answer is returned for problems of size ≤ n (induction hypothesis), then a correct answer is returned for a problem of size n + 1. Proof: apply the Correct Step Lemma: for a problem of size n + 1, BinarySearch2 is correct or calls itself on a problem of size n0 < n + 1. Therefore n0 ≤ n, and from the induction hypothesis a correct answer is returned.

ECE750-TXB Binary Search: The front end Lecture 4: Search & Correctness One last item: the entry routine BinarySearch. It Proofs Todd L. establishes the invariant. Its only requirement is that A is a Veldhuizen sorted array of at least one element. [email protected]

BinarySearch(n, A, x) Outline requires (A sorted) ∧ (n ≥ 1) Bibliography if (1)x < A[0] then return false else if (2)x > A[n − 1] then return false else return (3)BinarySearch2(A, x, 0, n − 1)

For (3) to be reached, A must be sorted, and 1. n ≥ 1 implies 0 ≤ n − 1, establishing l ≤ h; 2. (1) gives x ≥ A[0], and (2) gives x ≤ A[n − 1], establishing A[l] ≤ x ≤ A[h]. ECE750-TXB Binary Search: Correctness Proof Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected]

Outline

I This establishes the correctness of binary search up to a Bibliography basic level of rigour.

I However, proofs by hand are notoriously error-prone: I I did this proof by hand, and given my track record, I give it a 25% chance of being correct.

I Reward of $5 for each error found, up to a maximum of $20. :)

I Gold standard is a formal proof in a system such as Isabelle, Coq, ACL2, etc.

ECE750-TXB In praise of invariants Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected] I A good invariant is an indispensable tool in designing algorithms and data structures. Outline Bibliography I The fine details of the algorithm are often dictated by the need to 1. Preserve the invariant (during recursion, iteration, changing state); 2. Make progress; 3. Handle the base cases correctly.

I If you design an algorithm around an invariant: 1. the invariant guides you in the design; 2. you are more likely to have a correct implementation; 3. the proof of correctness is often easier (and, sometimes, straightforward). ECE750-TXB Bibliography I Lecture 4: Search & Correctness Proofs

Todd L. Veldhuizen [email protected]

Outline

Bibliography ECE750-TXB Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen ECE750-TXB Lecture 5: Veni, Divisi, Vici [email protected] (Divide and Conquer) Divide and Conquer

Abstract Data Todd L. Veldhuizen Types Bibliography [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 14, 2007

ECE750-TXB Divide And Conquer Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] BinarySearch is an (arguably degenerate) instance of a basic algorithm design pattern: Divide and Conquer

Abstract Data Types

Divide And Conquer
1. If the problem is of trivial size, solve it and return.
2. Otherwise:
   2.1 Divide the problem into several problems of smaller size.
   2.2 Conquer these smaller problems by recursively applying the divide and conquer pattern.
   2.3 Combine the answers to the smaller problems into an answer to the whole problem.
(A generic sketch of this pattern appears below.)
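A generic Java sketch of this pattern; the four hooks (isTrivial, solveDirectly, divide, combine) are names invented for this note, and each concrete algorithm supplies its own versions.

    import java.util.ArrayList;
    import java.util.List;

    // Skeleton of the divide-and-conquer pattern over problems P with answers A.
    abstract class DivideAndConquer<P, A> {
        abstract boolean isTrivial(P problem);             // step 1: is it trivially small?
        abstract A solveDirectly(P problem);
        abstract List<P> divide(P problem);                // step 2.1: split into subproblems
        abstract A combine(P problem, List<A> subAnswers); // step 2.3: merge the answers

        A solve(P problem) {
            if (isTrivial(problem))
                return solveDirectly(problem);
            List<P> subproblems = divide(problem);
            List<A> subAnswers = new ArrayList<A>();
            for (P sub : subproblems)
                subAnswers.add(solve(sub));                // step 2.2: conquer recursively
            return combine(problem, subAnswers);
        }
    }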

Todd L. Veldhuizen [email protected] Binary Search as Divide-and-Conquer: Divide and 1. If the problem is of size n = 1, answer is obvious — Conquer Abstract Data return. Types 2. Otherwise: Bibliography

I Split the array into two halves. Principle:

(x in A[l..h]) ≡ (x in A[l..i]) ∨ (x in A[i + 1..j])

I Search the two halves: for one half, call self recursively; for other half, answer is false.

I Combine the two answers: since one answer is always false, and “x or false” is just “x,” simply return the answer from the half we searched.

ECE750-TXB Divide-and-Conquer Recurrences Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen If [email protected]

Divide and 1. The base cases (trivially small problems) require time Conquer

O(1), and Abstract Data 2. a problem of size n is split into k subproblems of size Types Bibliography s(n), and 3. splitting the problem into subproblems and combining the answers takes time f (n) then the general form of the time recurrence is

T (n) = c + kT (s(n)) + f (n)

e.g. for binary search we had k = 1 (we only had to search one half), s(n) = n/2, and f (n) = 0. ECE750-TXB Strassen Matrix Multiplication Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

I Recall that our MatrixMultiply routine took Θ(n^3) time.
I Can we do better? No one thought so until...
I A landmark paper: Volker Strassen, Gaussian Elimination is not optimal (1969). [1]
I An o(n^3) divide-and-conquer approach to matrix multiplication.

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Strassen: “If A, B are matrices of order m·2^{k+1} to be multiplied, write

    A = [ A11  A12 ]    B = [ B11  B12 ]    C = [ C11  C12 ]
        [ A21  A22 ]        [ B21  B22 ]        [ C21  C22 ]

where the A_ik, B_ik, C_ik matrices are of order m·2^k ...

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

“Then compute

    I   = (A11 + A22)(B11 + B22)
    II  = (A21 + A22) B11
    III = A11 (B12 − B22)              (7 subproblems)
    IV  = A22 (−B11 + B21)
    V   = (A11 + A12) B22
    VI  = (−A11 + A21)(B11 + B12)
    VII = (A12 − A22)(B21 + B22)

    C11 = I + IV − V + VII
    C21 = II + IV                      (and combine)
    C12 = III + V
    C22 = I + III − II + VI
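One way to turn these formulas into code is sketched below (in Java, assuming square matrices whose order is a power of two); it follows the seven products and four combinations literally and makes no attempt at the usual optimizations.

    public class Strassen {
        // Multiply n-by-n matrices a and b, where n is a power of two.
        static double[][] multiply(double[][] a, double[][] b) {
            int n = a.length;
            if (n == 1)
                return new double[][] {{ a[0][0] * b[0][0] }};
            int m = n / 2;
            double[][] a11 = block(a, 0, 0, m), a12 = block(a, 0, m, m),
                       a21 = block(a, m, 0, m), a22 = block(a, m, m, m);
            double[][] b11 = block(b, 0, 0, m), b12 = block(b, 0, m, m),
                       b21 = block(b, m, 0, m), b22 = block(b, m, m, m);
            // The seven subproblems.
            double[][] I   = multiply(add(a11, a22), add(b11, b22));
            double[][] II  = multiply(add(a21, a22), b11);
            double[][] III = multiply(a11, diff(b12, b22));
            double[][] IV  = multiply(a22, diff(b21, b11));
            double[][] V   = multiply(add(a11, a12), b22);
            double[][] VI  = multiply(diff(a21, a11), add(b11, b12));
            double[][] VII = multiply(diff(a12, a22), add(b21, b22));
            // Combine.
            double[][] c11 = add(diff(add(I, IV), V), VII);
            double[][] c12 = add(III, V);
            double[][] c21 = add(II, IV);
            double[][] c22 = add(diff(add(I, III), II), VI);
            return join(c11, c12, c21, c22);
        }

        static double[][] add(double[][] x, double[][] y) {
            int n = x.length; double[][] r = new double[n][n];
            for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) r[i][j] = x[i][j] + y[i][j];
            return r;
        }
        static double[][] diff(double[][] x, double[][] y) {
            int n = x.length; double[][] r = new double[n][n];
            for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) r[i][j] = x[i][j] - y[i][j];
            return r;
        }
        static double[][] block(double[][] x, int r0, int c0, int m) {  // m-by-m block at (r0, c0)
            double[][] r = new double[m][m];
            for (int i = 0; i < m; i++) for (int j = 0; j < m; j++) r[i][j] = x[r0 + i][c0 + j];
            return r;
        }
        static double[][] join(double[][] c11, double[][] c12, double[][] c21, double[][] c22) {
            int m = c11.length; double[][] r = new double[2 * m][2 * m];
            for (int i = 0; i < m; i++) for (int j = 0; j < m; j++) {
                r[i][j] = c11[i][j];         r[i][j + m] = c12[i][j];
                r[i + m][j] = c21[i][j];     r[i + m][j + m] = c22[i][j];
            }
            return r;
        }
    }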

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

1. Subproblems: compute 7 matrix multiplications of size n/2.
2. Constructing the subproblems and combining the answers is done with matrix additions/subtractions, taking Θ(n^2) time.
3. Apply the general divide-and-conquer recurrence with k = 7, s(n) = n/2, f(n) = Θ(n^2):

    T(n) = c + 7T(n/2) + Θ(n^2)

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Recall that Θ(n^2) means “some function f(n) for which f(n)/n^2 is eventually restricted to some finite positive interval [c1, c2].” Pick a value c > c2; then eventually f(n) ≤ cn^2. Solve the recurrence:

    T(n) = c + 7T(n/2) + cn^2

This will give an asymptotically correct bound, but possibly T(n) is less than the actual time required for small n.

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

I Let r(ξ) = T(2^ξ), with ξ = log2 n. The recurrence becomes

    r(ξ) = c + 7r(ξ − 1) + c(2^ξ)^2
         = c + 7r(ξ − 1) + c·4^ξ

I Z-transform:

    Z[c]          = c / (1 − z^{−1})
    Z[7r(ξ − 1)]  = 7z^{−1} R(z)
    Z[c·4^ξ]      = c / (1 − 4z^{−1})

I Z-transform version of the recurrence is:

    R(z) = c/(1 − z^{−1}) + 7z^{−1}R(z) + c/(1 − 4z^{−1})

I Solve:

    R(z) = 2c(1 − (5/2)z^{−1}) / [ (1 − z^{−1})(1 − 4z^{−1})(1 − 7z^{−1}) ]

ECE750-TXB Strassen’s method Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

I Look at singularities: zero at z = 5/2, poles at z = 1, 4, 7.
I The pole at z = 7 is asymptotically dominant:

    r(ξ) ∼ c1 · 7^ξ

I Change variables back: ξ = log2 n:

    T(n) ∼ c1 · 7^{log2 n} = c1 · n^{log2 7} = c1 · n^{2.807···}

I Strassen matrix multiplication takes Θ(n^{2.807···}) time.

ECE750-TXB Divide-and-conquer recurrences I Lecture 5: Veni, Divisi, Vici

I Our analysis of Strassen’s algorithm is easily generalized.
I Consider a divide-and-conquer recurrence of the form

    T(n) = kT(n/s) + Θ(n^d)                         (1)

I To obtain an asymptotic upper bound we can solve the related recurrence

    T'(n) = kT'(n/s) + c·n^d                        (2)

I It can be shown that T(n) ⪯ T'(n).
I Analyzing for the case when n = s^ξ with ξ ∈ N, we can change variables to turn Eqn. (2) into:

    r(ξ) = k·r(ξ − 1) + c·(s^d)^ξ

ECE750-TXB Divide-and-conquer recurrences II Lecture 5: Veni, Divisi, Vici

I Z-transform:

    R(z) = k·z^{−1}R(z) + c / (1 − s^d z^{−1})

I Solve:

    R(z) = c / [ (1 − k z^{−1})(1 − s^d z^{−1}) ]

I Which is the dominant pole? Depends on the values of k, s, d.
I Three cases: s^d < k, s^d = k, s^d > k.

1. s^d < k. Then the dominant pole is z = k. Get

    r(ξ) ≍ k^ξ

   Since n = s^ξ, use ξ = log_s n:

    T'(n) ≍ k^{log_s n} = (2^{log k})^{log n / log s} = (2^{log n})^{log k / log s} = n^{log k / log s}

ECE750-TXB Divide-and-conquer recurrences III Lecture 5: Veni, Divisi, Vici

2. s^d = k. Then we get a double pole at z = k. Recall that

    Z^{−1}[ k z^{−1} / (1 − k z^{−1})^2 ] = ξ k^ξ

   End up with

    r(ξ) ≍ ξ k^ξ
    T'(n) ≍ (log_s n)·k^{log_s n} ≍ (log n)·2^{(log k · log n)/log s} = n^{log k/log s}·log n = n^{(log s^d)/log s}·log n = n^d log n

3. s^d > k. Then the dominant singularity is z = s^d. Get

    r(ξ) ≍ (s^d)^ξ
    T'(n) ≍ (s^d)^{log_s n} = (2^{d log s})^{log n/log s} = n^d

ECE750-TXB ‘Master’ Theorem Lecture 5: Veni, Divisi, Vici

Theorem (‘Master’)
The solution to a divide-and-conquer recurrence of the form

    T(n) = kT(⌈n/s⌉) + Θ(n^d)

where s > 1, is

    T(n) = Θ(n^{log k/log s})   if s^d < k
           Θ(n^d log n)         if s^d = k
           Θ(n^d)               if s^d > k

I Examples:
  I Binary search: k = 1, s = 2, d = 0: second case, log k/log s = 0, so T(n) = Θ(n^0 log n) = Θ(log n).
  I Strassen: k = 7, s = 2, d = 2: first case, log k/log s ≈ 2.807, so T(n) = Θ(n^{2.807···}).
(A small helper that selects the case is sketched below.)
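The case analysis is mechanical enough to put in a few lines of Java; this helper (written for this note, not part of the lecture) reports which case applies for given k, s, d.

    public class MasterTheorem {
        // T(n) = k T(n/s) + Theta(n^d), s > 1: return the asymptotic solution.
        // Exact equality of s^d and k is fine here because the parameters are small integers.
        static String solve(double k, double s, double d) {
            double critical = Math.log(k) / Math.log(s);   // log k / log s
            if (Math.pow(s, d) < k)
                return "Theta(n^" + critical + ")";        // case s^d < k
            else if (Math.pow(s, d) == k)
                return "Theta(n^" + d + " log n)";         // case s^d = k
            else
                return "Theta(n^" + d + ")";               // case s^d > k
        }

        public static void main(String[] args) {
            System.out.println(solve(1, 2, 0));   // binary search: Theta(n^0.0 log n), i.e. Theta(log n)
            System.out.println(solve(7, 2, 2));   // Strassen: Theta(n^2.807...)
        }
    }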

ECE750-TXB Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and Conquer

Abstract Data Types Abstract Data Types Bibliography ECE750-TXB Abstract Data Types Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] I A basic software engineering principle: Divide and Separate the interface (what you can do) from Conquer the implementation (how it is done.) Abstract Data Types

Bibliography I An abstract data type is an interface to a collection of data.

I There may be numerous ways to implement an ADT, each with different performance characteristics.

I An ADT consists of 1. Some types that may be required to provide operations, relations, and satisfy properties. e.g. a totally ordered set. 2. An interface: the capabilities provided by the ADT.

ECE750-TXB Abstract Data Types Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen I Notation for types: hT , f1, f2, ··· , R1, R2, · · · i indicates [email protected] a set T together with Divide and I Some operators or functions f1, f2,... Conquer I Some relations R1, R2, ··· . Abstract Data Types This is the standard notation for a structure in logic: Bibliography could be

I an algebra (functions but no relations) e.g. a field hF , +, ∗, −, ·−1, 0, 1i;

I a relational structure (relations but no functions) e.g. a total order hT , ≤i;

I some structure with both functions and relations e.g. an ordered field hF , +, ∗, −, ·−1, 0, 1, ≤i

I Often a type is required to satisfy certain axioms, or belong to a specified class of structures, e.g. (a field, a total order, a distance metric.) ECE750-TXB ADT: Dictionary[K, V ] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen Dictionary[K,V] [email protected]

Stores a set of pairs (key, value), and finds values by Divide and I Conquer key. At most one value per key is permitted. Abstract Data Types I Types: Bibliography I hKi: key type (e.g. a word). I hV , 0i: value type (e.g., a definition of a word). The value 0 is a special value used to indicate absence of a dictionary entry.

I Operations:

I insert(k, v): insert a key-value pair into the dictionary. If an entry for the key is already present, it is replaced.

I V find(k): if (k, v) is in the dictionary, returns v; otherwise returns 0.

I remove(k): deletes the entry for key k, if present.
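Written as a Java interface this ADT might look as follows; the generic parameters and the use of null to play the role of the special value 0 are choices made for this sketch.

    // Dictionary[K, V]: at most one value per key.
    public interface Dictionary<K, V> {
        void insert(K key, V value);   // replaces any existing entry for key
        V find(K key);                 // the value bound to key, or null ("0") if absent
        void remove(K key);            // deletes the entry for key, if present
    }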

ECE750-TXB Abstract Data Types Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] Implementation ADT Divide and Data structure Conquer

The Dictionary ADT may be implemented by a linked list, a sorted array, or a search tree:

    Implementation   insert time   find time   remove time
    Linked list      O(n)          O(n)        O(n)
    Sorted array     O(n)          O(log n)    O(n)
    Search tree      O(log n)      O(log n)    O(log n)

ECE750-TXB The role of ADTs in choosing data structures Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Questions to consider when choosing a data structure: Divide and Conquer

1. What type of data do you need to store? Abstract Data Types I Does it have a natural total order? e.g. integers, strings under lexicographic ordering, database keys, etc. Bibliography

I Does it bear some natural partial order? I Do elements represent points or regions in some metric space, e.g., Rn? (e.g., screen regions, boxes in a three-dimensional space, etc.) The order relation(s) or geometric organization of your data may allow the use of ADTs that exploit those properties to allow efficient access.

ECE750-TXB The role of ADTs in choosing data structures Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and 2. What operations are required? Do you need to Conquer Abstract Data I determine whether a value is in the data set (search)? Types I insert, modify, delete elements? Bibliography I iterate through the elements (order doesn’t matter)? I iterate through the elements in some sorted order? I find the “biggest” or “smallest” element? I find elements that are “close” to some value? These requirements can be compared with the interfaces provided by ADTs, to decide what ADTs might be suitable. ECE750-TXB The role of ADTs in choosing data structures Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and 3. What is the typical mixture of operations you will Conquer perform? Will you Abstract Data Types I insert frequently? Bibliography I delete frequently? I search frequently? Different ADT implementations may offer different performance characteristics. By understanding the typical mixture of operations you can choose an implementation with the most suitable performance characteristics.

ECE750-TXB ADT: Array[V] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] Array[V] Divide and I A finite sequence of cells, each containing a value. Conquer : Abstract Data I Types Types

I hV , 0i: a value type, with a “default value” 0 used to Bibliography initialize cells.

I Operations : I Array(n): create an array of length n, where n ∈ N is a positive integer. Cells are initialized to 0.

I integer length(): returns the length of the array I get(i): returns the value of cell i. It is required that 0 ≤ i ≤ length() − 1.

I set(i, v): sets the value of cell i to v. It is required that 0 ≤ i ≤ length() − 1. ECE750-TXB ADT: Set[V] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] Set[V] Divide and Conquer I Stores a set of values. Permits inserting, removing, and Abstract Data testing whether a value is in the set. At most one Types instance of a value can be in the set. Bibliography

I Types:

I hV i: a value type

I Operations:

I insert(v): adds v to the set, if absent. Inserting an element already in the set causes no change.

I remove(v): removes v from the set, if present; I boolean contains(v): returns true if and only if v is in the set.

ECE750-TXB ADT: Multiset[V] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] Multiset[V] Divide and Conquer I Stores a multiset of values (i.e., with duplicate elements Abstract Data permitted.) Permits inserting, removing, and testing Types whether a value is in the set. Sometimes called a bag. Bibliography

I Types:

I hV i: a value type

I Operations:

I insert(v): adds v to the multiset. I remove(v): removes an instance of v from the multiset, if present;

I boolean contains(v): returns true if and only if v is in the set. ECE750-TXB Stacks, Queues, and Priority Queues Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

I Informally, a queue is a collection of objects “awaiting Divide and their turn.” e.g., customers queueing at a grocery store. Conquer Abstract Data The queueing policy governs “who goes next.” Types

I First In, First Out (FIFO): like a line at the grocery Bibliography store: the element that was added least recently goes next.

I Last In, First Out (LIFO): the item added most recently goes next. A stack: like an “in-box” of work where new items are placed on the top, and what ever is on the top of the stack gets processed next.

I Priority Queueing: items are associated with priorities; the item with the highest priority goes next.

ECE750-TXB ADT: Queue[V] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Queue[V] Divide and Conquer

I A FIFO queue (first in, first out) of objects. Abstract Data Types I Types : Bibliography I hV i: a value type

I Operations :

I insert(v): adds the object v to the end of the queue I boolean isEmpty(): returns true just when the queue is empty

I V next(): returns and removes the value at the front of the queue. It is an error to perform this operation when the queue is empty. ECE750-TXB ADT: Stack[V] Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Stack[V] Divide and Conquer

I A LIFO (last in, first out) stack of objects. Abstract Data Types I Types : Bibliography I hV i: a value type

I Operations :

I push(v): adds a value to the top of the stack. I boolean isEmpty(): returns true just when the stack contains no elements.

I V pop(): returns the value at the top of the stack. It is an error to perform this operation when the stack is empty.

ECE750-TXB ADT: PriorityQueue[P,V] Lecture 5: Veni, Divisi, Vici

Todd L. PriorityQueue[P,V] Veldhuizen [email protected] I A queue of objects, each with an associated priority, in Divide and which an object with maximal priority is always chosen Conquer

next. Abstract Data Types Types : I Bibliography I hP, ≤i: a priority type, with a total order ≤. I hV i: a value type

I Operations :

I insert(p, v): insert a pair (p, v) where p ∈ P is a priority, and v ∈ V is a value;

I boolean isEmpty(): returns true just when the queue is empty;

I (P, V ) next(): returns and removes the object at the front of the queue, which is guaranteed to have maximal priority. It is an error to perform this operation on an empty queue. ECE750-TXB Data Structure: Linked List Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected] I Realizes a list of items, e.g., (2, 3, 5, 7, 11) Divide and I Inserting and removing elements at the front of the list Conquer requires O(1) time. Abstract Data Types

I Searching requires iterating through the list: O(n) time Bibliography I List can be iterated through from front to back. I Basic building block is a node that contains:

I data: A piece of data; I next:A pointer to the next node, or a null pointer if the node is the end of the list.

ECE750-TXB Data Structure: Linked List Lecture 5: Veni, Divisi, Vici

public class LinkedList {
    Node head;   // Pointer to the first node in the list
    ...
}

class Node {
    T data;      // Data item
    Node next;   // Next node in the list, if any

    Node(T data, Node next) {
        this.data = data;   // 'this.' is needed; otherwise the parameters are assigned to themselves
        this.next = next;
    }
}

Todd L. I Insert a new data element: Veldhuizen [email protected] public void insert (T data) Divide and { Conquer

head = new Node(data, head); Abstract Data } Types Bibliography

I Remove the front element, if any: public T removeFirst() { if (head == null) throw new RuntimeException(”List is empty.”); else { T data = head.data; head = head.next; return data; } }

ECE750-TXB Data Structure: Linked List II Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and Conquer Check if the list is empty: Abstract Data I Types public boolean isEmpty() Bibliography { return (head == null); } ECE750-TXB Implementing Stack[V ] with a Linked List Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and I The ADT Stack[V ] is naturally implemented by a Conquer

linked list. Abstract Data Types

public class Stack { Bibliography LinkedList list;

public void push(V v) { list . insert (v); } public boolean isEmpty() { return list .isEmpty(); } public V pop() { return list .removeFirst (); } }

I push(v), isEmpty(), and pop() require O(1) time.

ECE750-TXB Iterators Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen Iterator[V] [email protected]

I An iterator is an ADT that provides traversal through Divide and some container. It abstracts away from the details of Conquer Abstract Data how the data are stored, and presents a simple interface Types

for retrieving the elements of the container one at a Bibliography time.

I Types:

I V : the value type stored in the container

I Operations:

I Iterator(C): initialize the iterator to point to the first element in the container C;

I boolean hasNext(): returns true just when there is another element in the container;

I V next(): returns the next element in the container, and advances the iterator. ECE750-TXB An Iterator for a linked list Lecture 5: Veni, Divisi, Vici public class ListIterator { Todd L. Veldhuizen Node node; [email protected]

ListIterator(LinkedList list)
{ node = list.head; }

boolean hasNext()
{ return (node != null); }   // true while there is still a node to visit

T next() { if (node == null) throw new RuntimeException(”Tried to iterate past end of list ”);

T data = node.data; node = node.next; return data; } }

ECE750-TXB Iterators Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and Conquer

Abstract Data Example usage of an iterator: Types ListIterator iter = new ListIterator(list ); Bibliography while ( iter .hasNext()) { System.out. println (”The next element is ” + iter .next ()); } ECE750-TXB Implementing Queue[V ] with a Linked List Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and Conquer

Abstract Data I Recall that a Queue[V ] ADT requires first-in first-out Types queueing. But, the list as we have shown it only Bibliography supports inserting and removing at one end.

I Simple variant of a linked list: in addition to maintaining a pointer to the head of the list, also maintain a pointer to the tail of the list.

ECE750-TXB Bidirectional Linked List Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

class BiNode {
    T data;        // Data item
    BiNode next;   // Next node in the list, if any
    BiNode prev;   // Previous node in the list, if any

    BiNode(T data, BiNode next) {
        this.data = data;
        this.next = next;
        if (next != null)          // guard: next is null at the end of the list
            next.prev = this;
    }
}

ECE750-TXB Bidirectional Linked List Lecture 5: Veni, Divisi, Vici

[Figure: diagram of a bidirectional linked list]

ECE750-TXB Bidirectional Linked List Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and public class BiLinkedList { Conquer BiNode head; // Pointer to the first node in the list Abstract Data Types BiNode tail; // Pointer to the last node in the list Bibliography

public void insert (T data) { head = new BiNode(data, head); if ( tail == null) tail = head; } ECE750-TXB Bidirectional Linked List Lecture 5: Veni, Divisi, Vici

Todd L. public T removeLast() Veldhuizen { [email protected]

if ( tail == null) Divide and throw new RuntimeException(”List is empty.”); Conquer else { Abstract Data T data = tail .data; Types if ( tail .prev == null) Bibliography { head = null; tail = null; } else { tail = tail .prev; tail .next = null; } return data; } }

ECE750-TXB Implementing a Queue < V > with a Lecture 5: Veni, Divisi, Vici

Bidirectional Linked List Todd L. Veldhuizen [email protected]

Divide and Conquer

Abstract Data public class Queue { Types

BiLinkedList list; Bibliography

public void insert (V v) { list . insert (v); } public boolean isEmpty() { return list .isEmpty(); } public V next() { return list .removeLast(); } }

I insert(v), isEmpty() and next() all take O(1) time. ECE750-TXB Bibliography I Lecture 5: Veni, Divisi, Vici

Todd L. Veldhuizen [email protected]

Divide and Conquer

Abstract Data Types [1] V. Strassen. Bibliography Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969. bib pdf ECE750-TXB Lecture 6: Lists and Trees

Todd L. Veldhuizen ECE750-TXB Lecture 6: Lists and Trees [email protected] Linear Data Structures Todd L. Veldhuizen Trees [email protected] Bibliography

Electrical & Computer Engineering University of Waterloo Canada

February 14, 2007

ECE750-TXB Iterators Lecture 6: Lists and Trees

Todd L. Veldhuizen Iterator[V] [email protected] Linear Data I An iterator is an ADT that provides traversal through Structures some container. It abstracts away from the details of Trees how the data are stored, and presents a simple interface Bibliography for retrieving the elements of the container one at a time.

I Types:

I V : the value type stored in the container

I Operations:

I boolean hasNext(): returns true just when there is another element in the container;

I V next(): returns the next element in the container, and advances the iterator. ECE750-TXB An Iterator for a linked list Lecture 6: Lists and Trees public class ListIterator { Todd L. Veldhuizen Node node; [email protected]

Linear Data ListIterator ( LinkedList list) Structures

{ node = list .head; } Trees

boolean hasNext() { return (node != null); }   // true while there is still a node to visit

T next() { if (node == null) throw new RuntimeException(”Tried to iterate past end of list ”);

T data = node.data; node = node.next; return data; } }

ECE750-TXB Iterators Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures Example usage of an iterator: Trees Bibliography ListIterator iter = new ListIterator(list ); while ( iter .hasNext()) { System.out. println (”The next element is ” + iter .next ()); } ECE750-TXB Bidirectional Linked List Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures I Bidirectional Linked List Trees I Each node has a link to both the next and previous items in the list. Bibliography

I We maintain a pointer to both the front and the back. I We can insert and remove items at both the front and back of the list.

I We will use Bidirectional Linked Lists to illustrate two basic, but extremely useful principles: 1. Maintaining invariants of data structures; 2. Symmetry.

ECE750-TXB Bidirectional Linked List Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] I We have encountered invariants already, in the correctness sketch of binary search. That was an Linear Data invariant for a recursive algorithm, and was required to Structures Trees be true of each recursive invokation of the function. Bibliography Here we discuss data structure invariants. An invariant of a data structure is a property that is required to be always true, except when we are performing some transient update operation.

I Invariants help us to implement data structures correctly: many basic operations can be viewed as disruptions of the invariant (e.g., inserting an element) after which we need to repair or maintain the invariant. ECE750-TXB Bidirectional Linked List Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] I As in a linked list, the basic building block of a bidirectional linked list is a node. Linear Data Structures class BiNode { Trees T data; /∗ Data item ∗/ Bibliography BiNode next; /∗ Next node in the list , if any ∗/ BiNode prev; /∗ Previous node in the list , if any ∗/

BiNode(T data) { this.data = data; next = null; prev = null; } }

ECE750-TXB Bidirectional Linked List: Invariants Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] I Let’s look at a few examples to see what invariants

suggest themselves. Linear Data Structures

Trees

Bibliography ECE750-TXB Bidirectional Linked List: Invariants Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures

I Here are the invariants we will use: Trees

1. (front ≠ null) implies (front.prev = null). (If the list is nonempty, there is no element before the front element.)
2. (back ≠ null) implies (back.next = null). (If the list is nonempty, there is no element after the last element.)
3. (front = null) if and only if (back = null).
4. For any node x,
   4.1 (x.next ≠ null) implies x.next.prev = x;
   4.2 (x.prev ≠ null) implies x.prev.next = x.
(A small checker for these invariants is sketched below.)
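The following method, intended to live inside the list class, checks the four invariants; it is a sketch written for these notes and assumes the front/back and next/prev fields used in the implementation later in this lecture.

    // Returns true iff all four bidirectional-list invariants hold.
    boolean checkInvariants() {
        if (front != null && front.prev != null) return false;    // invariant 1
        if (back != null && back.next != null) return false;      // invariant 2
        if ((front == null) != (back == null)) return false;      // invariant 3
        for (BiNode x = front; x != null; x = x.next) {            // invariant 4
            if (x.next != null && x.next.prev != x) return false;
            if (x.prev != null && x.prev.next != x) return false;
        }
        return true;
    }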

ECE750-TXB Bidirectional Linked List: Symmetry Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] I Bidirectional linked lists have a natural symmetry: if we Linear Data ‘reverse’ the list by swapping the ‘front  back’ Structures pointers and each of the ‘next  prev’ pointers, we get Trees another bidirectional linked list. Bibliography

I We’ll call this the dual list.

I The symmetry extends to operations on the list: insertFront() is ‘dual’ to insertBack(), and removeFront() is dual to removeBack().

I This kind of duality has the following nice property:

I Carrying out a sequence of operations op1,..., opk gives the same result as

I Taking the dual list, carrying out the dual sequence of operations ˆop1, ..., ˆopk, and taking the dual list again. ECE750-TXB Bidirectional Linked List: Symmetry Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures I Example: starting from the list [1, 2], we can Trees I insertBack(3) to give [1, 2, 3]; Bibliography I removeFront() to give [2, 3]. The dual version:

I take the dual list [2, 1]; I insertFront(3) to give [3, 2, 1]; I removeBack() to give [3, 2]; I take the dual list [2, 3]. Get the same answer both ways.

ECE750-TXB Bidirectional Linked List: Symmetry Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

I Why we care about symmetry: if our implementation is Linear Data correct, Structures Trees I We can obtain the routine insertBack() by taking the code for insertFront() and swapping front/back and Bibliography prev/next;

I Ditto for removeBack() and removeFront(); I The set of invariants should not change under swapping front/back and prev/next. Example: For any node x, 1. (x.next 6= null) implies x.next.prev = x; 2. (x.prev 6= null) implies x.prev.next = x. If we swap next/prev, we get 1. (x.prev 6= null) implies x.prev.next = x; 2. (x.next 6= null) implies x.next.prev = x. ECE750-TXB Bidirectional Linked List: Implementation Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures I For operations at the front of the list, there are three Trees cases to consider: Bibliography 1. front=null (empty list) 2. front 6= null and front.next = null (one-element list) 3. front 6= null and front.next 6= null (multi-element list)

I We need to consider each of these cases when we implement, and ensure that in each case, the invariants are maintained.

ECE750-TXB Bidirectional Linked List: Implementation Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] public void insertFront (T data) Linear Data { Structures

BiNode node = new BiNode(data); Trees

Bibliography if ( front == null) /∗ Case 1 ∗/ { front = node; /∗ Both made non−null for Inv. 3 ∗/ back = node; } else { /∗ Case 2,3 ∗/ front .prev = node; /∗ Inv 4.1 ∗/ node.next = front; /∗ Inv 4.2 ∗/ front = node; } } ECE750-TXB Bidirectional Linked List: Implementation Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] public void insertBack(T data) Linear Data { Structures

BiNode node = new BiNode(data); Trees

Bibliography if (back == null) /∗ Case 1 ∗/ { back = node; /∗ Both made non−null for Inv. 3 ∗/ front = node; } else { /∗ Case 2,3 ∗/ back.next = node; /∗ Inv 4.1 ∗/ node.prev = back; /∗ Inv 4.2 ∗/ back = node; } }

ECE750-TXB Bidirectional Linked List: Implementation Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] public T removeFront() { Linear Data if ( front == null) /∗ Case 1 ∗/ Structures throw new RuntimeException(”Empty list.”); Trees else { Bibliography T data = front.data; front = front.next; if ( front == null) /∗ Case 2 ∗/ back = null; else { front .prev = null; /∗ Case 3 ∗/ } return data; } } ECE750-TXB Bidirectional Linked List: Implementation Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] public T removeBack() { Linear Data if (back == null) /∗ Case 1 ∗/ Structures throw new RuntimeException(”Empty list.”); Trees else { Bibliography T data = back.data; back = back.prev; if (back == null) /∗ Case 2 ∗/ front = null; else { back.next = null; /∗ Case 3 ∗/ } return data; } }

ECE750-TXB Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures

Trees

Bibliography Trees ECE750-TXB Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

I Recall that binary search allowed us to find items in a Linear Data sorted array in Θ(log n) time. However, inserting or Structures removing an item from the array took Θ(n) time in the Trees Bibliography worst case.

I Balanced Binary Search Trees offer Θ(log n) search, and also Θ(log n) insert and remove.

I More generally, trees offer a hierarchical decomposition of a search space:

I Spatial searching: R-trees, quadtrees, octtrees, kd-trees; I Databases: BTrees and their kin; I Intervals: interval trees; I ...

ECE750-TXB Binary Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Basic building block is a tree node, which contains: Linear Data I Structures

I A data value, drawn from some total order hT , ≤i; Trees

I A pointer to a left child; Bibliography I A pointer to a right child. ECE750-TXB Binary Trees Lecture 6: Lists and Trees

[Figure: an example binary tree; the root is E, with children C and G; C has children B and D, and B has left child A; G has children F and H, and H has right child I]

I Traversing a tree means visiting all its nodes. There are three common orders for doing this:

I Preorder: a node is visited before its children, e.g., [E, C, B, A, D, G, F , H, I ]

I Inorder: the left subtree is visited, then the node, then the right subtree, e.g., [A, B, C, D, E, F , G, H, I ].

I Postorder: a node is visited after its children, e.g., [A, B, D, C, F , I , H, G, E].

ECE750-TXB Binary Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures

Trees

Bibliography

I Terminology:

I E is the root and is on level 0; I C, G are children of E and are on level 1; I C is the parent of B, D; I C is an ancestor of A (as are E and B); I I is a descendent of G (and E and H); I The sequence (E, C, B, A) is a path from the root to A. ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data I Binary Search Trees satisfy the following invariant: for Structures any tree node x, Trees 1. If x.left 6= null, then x.left.data ≤ x.data; Bibliography 2. If x.right 6= null, then x.data ≤ x.right.data.

I If we want to visit the nodes of the tree in order, we start at the root and recursively: 1. Visit the left subtree; 2. Visit the node; 3. Visit the right subtree. i.e. an inorder traversal.
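In code, the in-order traversal is a three-line recursion; this sketch assumes tree nodes with the data/left/right fields used elsewhere in these notes, and simply prints each key as its "visit".

    // In-order traversal: left subtree, then the node itself, then the right subtree.
    void inorder(Tree t) {
        if (t == null) return;
        inorder(t.left);                  // 1. visit the left subtree
        System.out.println(t.data);       // 2. visit the node
        inorder(t.right);                 // 3. visit the right subtree
    }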

ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected] I Search procedure is very similar to binary search of an Linear Data array. Structures

Trees

boolean contains(int z) Bibliography { if (z == data) return true; else if ((z < data) && (left != null)) return left . contains(z); else if ((z > data) && (right != null)) return right . contains(z);

return false ; } ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

Linear Data Structures I The worst-case performance of contains(z) depends on Trees the height of the tree. Bibliography I The height of a tree is the length of the longest path from the root to a leaf.

I Root: the node at the top of the tree I Leaf (or external node): a node with no children I Internal node: any node that is not a leaf.

I Worst-case time required for contains(z) is O(h), where h is the height of the tree.

ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

I A binary search tree of height h can contain up to 2^h − 1 values.
I Given n values, we can always construct a binary search tree of height at most 1 + log2(n):
  I Sort the n values in ascending order and put them in an array A[0..n − 1].
  I Make the root element A[⌊n/2⌋].
  I Build the left subtree using elements A[0..⌊n/2⌋ − 1].
  I Build the right subtree using elements A[⌊n/2⌋ + 1..n − 1].
  (A sketch of this construction follows below.)
I But, we can also construct a binary tree of height n − 1.
  I Make the root A[0]; make its right subtree A[1], ...
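A sketch of the balanced construction on a sorted int array; it assumes a Tree(data, left, right) constructor like the one used in the rotation code later in this lecture.

    // Build a balanced binary search tree from the sorted slice A[lo..hi].
    static Tree buildBalanced(int[] A, int lo, int hi) {
        if (lo > hi)
            return null;                              // empty slice: no subtree
        int mid = lo + (hi - lo) / 2;                 // middle element becomes the root
        return new Tree(A[mid],
                        buildBalanced(A, lo, mid - 1),
                        buildBalanced(A, mid + 1, hi));
    }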

Todd L. Veldhuizen I To achieve O(log n) search times, it is necessary to have [email protected] the tree balanced, i.e., have all leaves roughly the same Linear Data distance from the root. Structures

I This is easy if the contents of the tree are fixed. Trees Bibliography I This is not easy if we are adding and removing elements dynamically.

I We can aim for average-case balanced, i.e., the probability of having a badly balanced tree → 0 as n → ∞.

I Example: treaps

I We can have deterministic balancing that guarantees balance in the worst case.

I red-black trees; I AVL trees; I 2-3 trees; I B-trees; I splay trees.

ECE750-TXB Enumeration of Binary Search Trees Lecture 6: Lists and Trees

I A useful fact: the number of valid binary search trees on n keys is given by the Catalan numbers:

    C_n = (1/(n+1)) · C(2n, n)
        ∼ 4^n · 1/(√π · n^{3/2})

  (Sequence A000108 in the Online Encyclopedia of Integer Sequences.)

I First few values:

    n : C_n        n  : C_n
    1 : 1          7  : 429
    2 : 2          8  : 1430
    3 : 5          9  : 4862
    4 : 14         10 : 16796
    5 : 42         11 : 58786
    6 : 132        12 : 208012

ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

I A naive insertion strategy, which does not guarantee balance, is the following:

Linear Data void insert (int z) Structures { Trees if (z == data) return; Bibliography else if (z < data) { if ( left == null) left = new Tree(z); else left . insert (z); } else if (z > data) { if ( right == null) right = new Tree(z); else right . insert (z); } }

Note the symmetry: can swap left/right and <, >.

ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees Naive insertion works fairly well in the average case [1]. Todd L. I Veldhuizen Theorem [email protected] The expected height of a binary search tree constructed by Linear Data Structures inserting a sequence of n random values is ∼ c log n with Trees

c ≈ 4.311. Bibliography I Equivalently, inserting n values in a randomly chosen order.

I Using Markov’s inequality, we can say that if H is a random variable giving the height of a tree after the insertion of n keys chosen uniformly at random, then

    Pr(H ≥ αn) ≤ E[H]/(αn) = (c log n)/(αn) = O((log n)/n)

i.e., the probability of a tree having height linear in n converges to zero. So, badly balanced trees are very unlikely for large n.

ECE750-TXB Binary Search Trees Lecture 6: Lists and Trees

[Figure: a binary search tree built by 100 random insertions]

ECE750-TXB Rotations Lecture 6: Lists and Trees I A rotation is a simple, local operation that makes some Todd L. Veldhuizen subtrees shorter and others deeper. [email protected] I Rotations preserve the inorder traversal, i.e., the order Linear Data of keys in the tree remains the same. Structures I Any two binary search trees on the same set of keys can Trees be transformed into one another by a sequence of Bibliography rotations.1 I Rotations are a common method to restore balance to a tree.

        D                                  B
       / \       rotate right at D →      / \
      B   E                               A   D
     / \         ← rotate left at B          / \
    A   C                                   C   E

1 In fact, something even more interesting is true: for each n there is a sequence of rotations that produces every possible binary tree without any duplicates, and eventually returns the tree to its initial configuration (i.e., the rotation graph Gn, where vertices are trees and edges are rotations, contains a Hamiltonian path [2]).

ECE750-TXB Rotations Lecture 6: Lists and Trees

Todd L. Veldhuizen I Code to rotate right: [email protected]

Tree rotateRight () Linear Data { Structures if ( left == null) Trees throw new RuntimeException(”Cannot rotate here”); Bibliography

Tree A = left . left ; Tree B = left ; Tree C = left . right ; Tree D = this; Tree E = right;

return new Tree(B.data, A, new Tree(D.data, C, E)); }

I Code to rotate left: use duality, swap left/right.
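For concreteness, here is what that dual looks like when the swap is carried out mechanically on rotateRight(); the variable names mirror the right-rotation code, and this is a sketch rather than code from the original slides.

    Tree rotateLeft ()
    {
        if ( right == null)
            throw new RuntimeException("Cannot rotate here");

        // Mirror image of rotateRight(): the right child moves up.
        Tree A = right.right;
        Tree B = right;
        Tree C = right.left;
        Tree D = this;
        Tree E = left;

        return new Tree(B.data, new Tree(D.data, E, C), A);
    }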

ECE750-TXB Rotation example Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

I Badly balanced binary tree: rebalance by rotating left at a, then left at c:

    a                    b                    b
     \                  / \                  / \
      b                a   c                a   d
       \        →           \       →          / \
        c                    d                 c   e
         \                    \
          d                    e
           \
            e

ECE750-TXB Iterator for a binary search tree I Lecture 6: Lists and Trees

class BSTIterator implements Iterator {
    Stack stack;

    public BSTIterator(BSTNode t)
    {
        stack = new Stack();
        if (t != null)             // guard: an empty tree has nothing to push
            fathom(t);
    }

    public boolean hasNext()
    { return !stack.empty(); }

    public Object next()
    {
        BSTNode t = (BSTNode)stack.pop();
        if (t.right_child != null)

ECE750-TXB Iterator for a binary search tree II Lecture 6: Lists and Trees

Todd L. Veldhuizen [email protected]

            fathom(t.right_child);
        return t;
    }

    void fathom(BSTNode t)
    {
        // Push t and then every node on its leftmost path, so the smallest
        // unvisited key ends up on top of the stack.
        do {
            stack.push(t);
            t = t.left_child;
        } while (t != null);
    }
}

Todd L. Veldhuizen [email protected]

Linear Data Structures [1] Luc Devroye. Trees

A note on the height of binary search trees. Bibliography Journal of the ACM (JACM), 33(3):489–498, 1986. bib pdf

[2] J. M. Lucas, D. R. van Baronaigien, and F. Ruskey. On rotations and the generation of binary trees. Journal of Algorithms, 15(3):343–366, November 1993. bib ps ECE750-TXB Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. ECE750-TXB Lecture 7: Red-Black Trees, Veldhuizen Heaps, and Treaps [email protected] Red-Black Trees

Heaps Todd L. Veldhuizen Treaps [email protected] Bibliography

Electrical & Computer Engineering University of Waterloo Canada

February 14, 2007

ECE750-TXB Binary Search Trees Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] I Recall that in a binary tree of height h the time required to find or insert an element is O(h). Red-Black Trees Heaps I In the worst case h = n, the number of elements. Treaps

I To keep h ∈ O(log n) one needs a balancing strategy. Bibliography I Balancing strategies may be either:

I Randomized: e.g. a random insert order results in expected height of c log n with c ≈ 4.311.

I Deterministic (in the sense of not random).

I Today we will see an example of each:

I Red-black trees: deterministic balancing I Treaps: randomized. Also demonstrate persistence and unique representation. ECE750-TXB Red-black trees Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] I Red-black trees are a popular form of binary search tree with a deterministic balancing strategy. Red-Black Trees Heaps I Nodes are coloured red or black. Treaps

I Properties of the node-colouring ensure that the longest Bibliography path to a leaf is no more than twice the length of the shortest path.

I This ensures height of ≤ 2 log2(n + 1), which implies search, min, max in O(log n) worst-case time.

I Insert and Delete can also be performed in O(log n) worst-case time.

I Invented by Bayer [2], red-black formulation due to Guibas and Sedgewick [9]. Other sources: [5, 10].

ECE750-TXB Red-Black Trees: Invariants Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Balance invariants:
1. No red node has a red child.
2. Every path from a node down to a leaf contains the same number of black nodes.
(A small recursive checker for these invariants is sketched below.)
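A recursive checker for these two invariants, written for this note; it assumes a node type RBNode with left, right, and an isRed flag, and returns the black height of a valid subtree, or −1 if either invariant is violated.

    // Returns the black height of the subtree rooted at x, or -1 on a violation.
    static int checkRB(RBNode x) {
        if (x == null)
            return 0;                                      // null leaves contribute black height 0
        if (x.isRed && ((x.left != null && x.left.isRed) ||
                        (x.right != null && x.right.isRed)))
            return -1;                                     // invariant 1: red node with a red child
        int lh = checkRB(x.left);
        int rh = checkRB(x.right);
        if (lh < 0 || rh < 0 || lh != rh)
            return -1;                                     // invariant 2: black heights must agree
        return lh + (x.isRed ? 0 : 1);
    }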

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

ECE750-TXB Red-Black Trees: Balance I Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees Let bh(x) be the number of black nodes along any path Heaps from a node x to a leaf, excluding the leaf. Treaps Bibliography Lemma The number of internal nodes in the subtree rooted at x is at least 2bh(x) − 1.

Proof. ECE750-TXB Red-Black Trees: Balance II Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen By induction on height: [email protected] 1. Base case: If x has height 0, then x is a leaf, and Red-Black Trees bh(x) = 0; the number of internal (non-leaf) Heaps bh(x) descendents of x is 0 = 2 − 1. Treaps 2. Induction step: assume the hypothesis is true for height Bibliography ≤ h. Consider a node of height h + 1. From invariant (2), the children have black height either bh(x) − 1 (if the child is black) or bh(x) (if the child is red). By induction hypothesis, each child subtree has at least 2bh(x)−1 − 1 internal nodes. The total number of internal nodes in the subtree rooted at x is therefore ≥ (2bh(x)−1 − 1) + 1 + (2bh(x)−1 − 1) = 2bh(x) − 1.

ECE750-TXB Red-Black Trees: Balance Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Theorem Red-Black Trees A red-black tree with n internal nodes has height at most Heaps Treaps 2 log2(n + 1). Bibliography Proof. Let h be the tree height. From invariant 1 (a red node must have both children black), the black-height of the root must be ≥ h/2. Applying Lemma 1.1, the number of internal nodes n of the tree satisfies n ≥ 2h/2 − 1. Rearranging, h ≤ 2 log2(n + 1). ECE750-TXB Red-Black Trees: Balance Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] I As with all non-randomized binary search trees, balance must be maintained when insert or delete operations are Red-Black Trees performed. Heaps Treaps

I These operations may disrupt the invariants, so Bibliography rotations and recolourings are needed to restore them.

I Insert for red-black tree: 1. Insert the new key as a red node, using the usual binary tree insert. 2. Perform restructurings and recolourings along the path from the newly added leaf to the root to restore invariants. 3. Root is always coloured black.

ECE750-TXB Red-Black Trees: Balance Lecture 7: Red-Black Trees, Heaps, and Treaps

I Four cases for red nodes with red children: Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

I Restructure/recolour to correct: each of the above cases becomes ECE750-TXB Red-Black Trees: Example Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen I Insertion of [1,2,3,4,5] into a red-black tree: [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

I Implementation of rebalancing is straightforward but a bit involved.

ECE750-TXB Heaps and Treaps Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. I Treaps are a randomized search tree that combine Veldhuizen TRees and hEAPS. [email protected]

Red-Black Trees I First, let’s look at heaps. Heaps I Consider determining the maximum element of a set. Treaps I We could iterate through the array and keep track of Bibliography the maximum element seen so far. Time taken: Θ(n).

I We could build a binary tree (e.g. red-black). We can obtain the maximum (minimum) element in O(h) time by following rightmost (leftmost) branches. If tree is balanced, requires O(n log n) time to build the tree, and O(log n) time to retrieve the maximum element.

I A heap is a highly efficient data structure for maintaining the maximum element of a set. It is a rudimentary example of a dynamic algorithm/data structure. ECE750-TXB Dynamic Algorithms Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. I A static problem is one where we are given an instance Veldhuizen of a problem to solve, we solve it, and are done (e.g., [email protected]

sort an array). Red-Black Trees I A dynamic problem is one where we are given a problem Heaps to solve, we solve it. Treaps

I Then the problem is changed slightly and we resolve. Bibliography I ...ad infinitum.

I The challenge goes from solving a single instance of a problem to maintaining a solution as the problem is modified.

I It is usually more efficient to update the solution than recompute from scratch.

I e.g., binary search trees can be viewed as a method for dynamically maintaining an ordered list as elements are inserted and removed.

ECE750-TXB Heaps Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen I A heap dynamically maintains the maximum element in [email protected] a collection (or, dually, the minimum element). A binary heap can: Red-Black Trees Heaps I Obtain the maximum element in O(1) time; Treaps I Remove the maximum element in O(log n) time; Bibliography I Insert new element in O(log n) time. Heaps are a natural implementation of the PriorityQueue ADT.

I There are several flavours of heaps: binary heaps, binomial heaps, fibonacci heaps, pairing heaps. The more sophisticated of these support merging (melding) two heaps.

I We will look at binary heaps. ECE750-TXB Binary Heap Invariants Lecture 7: Red-Black Trees, Heaps, and Treaps

1. A binary heap is a complete binary tree of height h − 1, Todd L. Veldhuizen plus a possibly incomplete level of height h filled from [email protected] left to right. Red-Black Trees

2. The key stored at each node is ≥ the key(s) stored in Heaps

its children. Treaps

Bibliography

ECE750-TXB Binary Heap Lecture 7: Red-Black Trees, Heaps, and Treaps I A binary heap may be stored as a (1-based) array, where Todd L. Veldhuizen I Parent(j) = bj/2c [email protected] I LeftChild(i) = 2 ∗ i Red-Black Trees I RightChild(i) = 2 ∗ i + 1 Heaps I e.g., [17, 11, 13, 9, 6, 2, 12, 4, 3, 1] is an array Treaps representation of the heap: Bibliography ECE750-TXB Heap operations Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] I To insert a key k into the heap:

I Place k at the next available position. Red-Black Trees

I Swap k with its parent(s) until the heap invariant is Heaps

satisfied. (Takes O(log n) time.) Treaps

I The maximum element is just the key stored at the Bibliography root, which can be read off in O(1) time.

I To delete the maximum element:

I Place the key at the last heap position at the root (overwriting the current maximum), and decrease the size of the heap by one.

I Choose the largest of the root and its two children, and make this the root; perform this procedure recursively until the heap invariant is satisfied.

ECE750-TXB Heap: insert example Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps I Example: insert 23 into the heap and restore the heap Treaps invariant. Bibliography ECE750-TXB Heap: delete-max example Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

I To delete the max element, move the element from the last position (2) to the root;

I To restore heap invariant, swap root with the largest child greater than it, if any, and repeat down the heap.

ECE750-TXB Treaps Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] Treaps (binary TRee + hEAP) Red-Black Trees I a randomized binary search tree Heaps

I with O(log n) average-case insert, delete, search Treaps

I with O(∆ log n) average-case union, intersection, ⊆, ⊇, Bibliography where ∆ = |(A \ B) ∪ (B \ A)| is the difference between the sets

I uniquely represented (to be explained)

I easily made persistent (to be explained)

I Due to Vuillemin [14] and independently, Seidel and Aragon [11]. Additional references: [3, 16, 15]. ECE750-TXB Treaps: Basics Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees I Keys are assigned (randomly chosen) priorities. Heaps I Two total orders on keys: Treaps I The usual key order; Bibliography I A randomly chosen priority order, often obtained by assigning each key a random integer, or using an appropriate hash function

I Treaps are kept sorted by key in the usual way (inorder tree traversal visits keys in order).

I The heap property is maintained wrt the priority order.

ECE750-TXB Treap ordering Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen I Each node has key k and priority p [email protected]

I Ordering invariants: Red-Black Trees

Heaps (k , p ) 2 2 Treaps t KK tt KK Bibliography tt KK tt KK tt K (k1, p1)(k3, p3)

k1 ≤ k2 ≤ k3 Key order

 p ≥ p 2 p 1 Priority order p2 ≥p p3

Every node has a higher priority than its descendents. ECE750-TXB Treaps: Basics Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees I If priorities are chosen randomly, the tree is on average balanced, and insert, delete, search take O(log n) time Heaps Treaps I Random priorities behave like a random insertion order: the structure of the treap is exactly that obtained by Bibliography inserting the keys into a binary search tree in descending order of heap prioritity.

I If keys are unique (no duplicates), and priorities are unique, then the treap has the unique representation property

ECE750-TXB Unique representation Lecture 7: Red-Black Trees, Heaps, and Treaps

I Unique representation: each set is represented by a Todd L. Veldhuizen unique data structure [1, 13, 12] [email protected] I Most tree data structures do not have this property: depending on order of inserts, deletes, etc. the tree can Red-Black Trees have different forms for the same set of keys. Heaps n −3/2 −1/2 Treaps I Recall there are Cn ∼ 4 n π ways to place n keys in a binary search tree (Catalan numbers). e.g. Bibliography C20 = 6564120420. Deterministic (i.e., not randomized) uniquely I √ represented search trees are known to require Ω( n) worst-case time for insert, delete, search [12].

I Treaps are randomized (not deterministic), and have O(log n) average-case time for insert, delete, search

I If you memoize or cache the constructors of a uniquely represented data structure, you can do equality testing in O(1) time by comparing pointers. ECE750-TXB Treap: Example Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps Treap A1 = R.insert("f"); // Insert the key f Treaps Treap A2 = A1.insert("u"); // Insert the key u Bibliography

Treap B1 = R.insert("u"); // Insert the key u into R Treap B2 = R.insert("f"); // Insert the key f

ECE750-TXB Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography ECE750-TXB Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

ECE750-TXB Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography ECE750-TXB Canonical forms Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] I The structure of the treap does not depend on the order on which the operations are carried out. Red-Black Trees Heaps Treaps give a canonical form for sets: if A, B are sets, I Treaps

we can determine whether A = B by constructing treaps Bibliography containing the elements of A and B, and comparing them. If the treaps are the same, the sets are equal.

I Treaps give an easy decision procedure for equality of terms modulo associativity, commutativity, and idempotency.

I Treaps are very useful in program analysis (e.g., for compilers) for solving fixpoint equations on sets.

ECE750-TXB Persistent Data Structures Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen Literature: [7, 8, 4, 6] [email protected]

Red-Black Trees I Partially persistent: Can access previous versions of a data structure, but cannot derive new versions from Heaps them (read-only access to a linear past.) Treaps Bibliography I Fully persistent: Can make changes in previous versions of the data structure: versions can “fork.”

I Any linked data structure with constant bounded in-degree can be made fully persistent with amortized O(1) space and time overhead, and worst case O(1) overhead for access [7]

I Confluently persistent: Can branch into two versions of the data structure, and later reconcile these branches ECE750-TXB The Version Graph Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. The version graph shows how versions of a data structure Veldhuizen are derived from one another. [email protected]

I Vertices: Data structures Red-Black Trees Heaps I Edges: Show how one data structure was derived from another Treaps Bibliography I Treaps example:

R B }} BB }} BB }} BB ~}} B A1 B1

  A2 B2

ECE750-TXB Version graph Lecture 7: Red-Black Trees, Heaps, and Treaps I Partial persistence: version graph is a linear sequence of Todd L. Veldhuizen versions, each derived from the previous version. [email protected]

I Partial/full persistence: get a version tree Red-Black Trees I Confluent persistence: get a version DAG (directed Heaps acyclic graph) Treaps Bibliography X A {{ AA {{ AA {{ AA }{{ A Y 1 Z     Y 2  C  CC  CC  CC  C! Ö W ECE750-TXB Purely Functional Data Structures Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

I Literature: [10] Heaps Treaps I Functional data structures: cannot modify a node of the data structure once it is created. (One implication: Bibliography no cyclic data structures.)

I Functional data structures are by nature partially persistent: we can always hold onto pointers to old versions of the data structure.

ECE750-TXB Scopes Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Partial persistence is very useful for managing scopes in Veldhuizen I [email protected] compilers and program analysis. Red-Black Trees I A scope is a representation of the names that are visible Heaps at a given program point: Treaps

int foo(int a, int b) Bibliography { // S1 int x = a*a, y = b*b, z=0; // S2

for (int k=0; k < x; ++k) // S3 for (int l=0; l < y; ++l) // S4 ++c; // S5 return x; } ECE750-TXB Scopes Example Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected]

Red-Black Trees

Heaps

Treaps

Bibliography

ECE750-TXB Bibliography I Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] [1] A. Andersson and T. Ottmann. Faster uniquely represented dictionaries. Red-Black Trees In IEEE, editor, Proceedings: 32nd annual Symposium Heaps on Foundations of Computer Science, San Juan, Puerto Treaps Bibliography Rico, October 1–4, 1991, pages 642–649, 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA, 1991. IEEE Computer Society Press. bib pdf

[2] Rudolf Bayer. Symmetric binary B-trees: Data structure and maintenance algorithms. Acta Inf, 1:290–306, 1972. bib ECE750-TXB Bibliography II Lecture 7: Red-Black Trees, Heaps, and Treaps

[3] Guy E. Blelloch and Margaret Reid-Miller. Todd L. Veldhuizen Fast set operations using treaps. [email protected] In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 16–26, Red-Black Trees Heaps Puerto Vallarta, Mexico, June 1998. bib ps Treaps [4] Adam L. Buchsbaum and Robert E. Tarjan. Bibliography Confluently persistent deques via data-structural bootstrapping. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pages 155–164. ACM Press, 1993. bib pdf ps

[5] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Intoduction to algorithms. McGraw Hill, 1991. bib

ECE750-TXB Bibliography III Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [email protected] [6] P. F. Dietz. Fully persistent arrays. Red-Black Trees In F. Dehne, J.-R. Sack, and N. Santoro, editors, Heaps Proceedings of the Workshop on Algorithms and Data Treaps Bibliography Strucures, volume 382 of LNCS, pages 67–74, Berlin, August 1989. Springer. bib [7] James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In ACM Symposium on Theory of Computing, pages 109–121, 1986. bib pdf ECE750-TXB Bibliography IV Lecture 7: Red-Black Trees, Heaps, and Treaps [8] Amos Fiat and Haim Kaplan. Todd L. Making data structures confluently persistent. Veldhuizen [email protected] In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-01), pages Red-Black Trees 537–546, New York, January 7–9 2001. ACM Press. Heaps bib pdf Treaps Bibliography [9] Leonidas J. Guibas and Robert Sedgewick. A dichromatic framework for balanced trees. In FOCS, pages 8–21. IEEE, 1978. bib [10] Chris Okasaki. Purely Functional Data Structures. Cambridge University Press, Cambridge, UK, 1998. bib [11] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5):464–497, 1996. bib pdf ps

ECE750-TXB Bibliography V Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [12] Lawrence Snyder. [email protected]

On uniquely representable data structures. Red-Black Trees In 18th Annual Symposium on Foundations of Heaps Computer Science, pages 142–146, Long Beach, Ca., Treaps USA, October 1977. IEEE Computer Society Press. bib Bibliography [13] R. Sundar and R. E. Tarjan. Unique binary search tree representations and equality-testing of sets and sequences. In Baruch Awerbuch, editor, Proceedings of the 22nd Annual ACM Symposium on the Theory of Computing, pages 18–25, Baltimore, MY, May 1990. ACM Press. bib pdf ECE750-TXB Bibliography VI Lecture 7: Red-Black Trees, Heaps, and Treaps

Todd L. Veldhuizen [14] Jean Vuillemin. [email protected] A unifying look at data structures. Red-Black Trees

Communications of the ACM, 23(4):229–239, 1980. Heaps

bib pdf Treaps

Bibliography [15] M. A. Weiss. A note on construction of treaps and Cartesian trees. Information Processing Letters, 54(2):127–127, April 1995. bib [16] Mark Allen Weiss. Linear-time construction of treaps and Cartesian trees. Information Processing Letters, 52(5):253–257, December 1994. bib pdf ECE750-TXB Lecture 8: Treaps, Tries, and Hash Tables

Todd L. ECE750-TXB Lecture 8: Treaps, Tries, and Veldhuizen Hash Tables [email protected] Review: Treaps

Tries Todd L. Veldhuizen Hash Tables [email protected] Bibliography

Electrical & Computer Engineering University of Waterloo Canada

February 1, 2007

ECE750-TXB Review: Treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen I Recall that a binary search tree has keys drawn from a [email protected] totally ordered structure hK, ≤i Review: Treaps I An inorder traversal of the tree recovers the keys in Tries ascending order. Hash Tables

Bibliography d

b h

a c f i ECE750-TXB Review: Treaps Lecture 8: Treaps, Tries, and Hash Tables Recall that a heap has priorities drawn from a totally Todd L. I Veldhuizen ordered structure hP, ≤i [email protected]

I The priority of a parent is ≥ that of its children (for a Review: Treaps max heap.) Tries Hash Tables I The largest priority is at the root. Bibliography

23

11 14

7 1 6 13

ECE750-TXB Review: Treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. I In a treap, nodes contain a pair (k, p) where k ∈ K is a Veldhuizen [email protected] key, and p ∈ P is a priority. Review: Treaps I A Treap is a mixture of a binary search tree and a heap: Tries

Hash Tables I A binary search tree with respect to keys; Bibliography I A heap with respect to priorities.

(d,23)

(b,11) (h,14)

(a,7) (c,1) (f,6) (i,13) ECE750-TXB Review: Unique Representation Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I If the keys and priorities are unique, then treaps have the unique representation property: given a set of (k, p) Review: Treaps pairs, there is only one way to build the tree. Tries Hash Tables I For the heap property to be satisfied, there is only one (k, p) pair that can be the root: the one with the Bibliography highest priority.

I The left subtree of the root will contain all keys < k, and the right subtree of the root will contain all keys > k.

I Of the keys < k, the one with the highest priority must occupy the left child of the root. This then splits constructing the left subtree into two subproblems.

I etc.

ECE750-TXB Review: Unique Representation Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen I Example: to build a treap from {(i, 13), (c, 1), (d, 23), (b, 11), (h, 14), (a, 7), (f , 6)}, [email protected] unique choice of root: (d, 23) Review: Treaps

Tries (d, 23) T Hash Tables jjjj TTTT j Bibliography {(c, 1), (b, 11), (a, 7)} {(i, 13), (h, 14), (f , 6)}

I To build the left subtree, pick out the highest priority element: (b, 11). And so forth.

(d, 23) T t TTTT ttt (b, 11) {(i, 13), (h, 14), (f , 6)} u KK uuu KK (a, 7) (c, 1) ECE750-TXB Review: Unique Representation Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen I Data structures with the unique representation can be [email protected] checked for equality in O(1) time by using caching (also known as memoization): Review: Treaps Tries I Implement the data structure in a purely functional style Hash Tables (a node’s fields are never altered after construction. Any changes require creating a new node.) Bibliography

I Maintain a map from (key, priority, lchild, rchild) tuples to already constructed nodes.

I Before constructing a node, check the cache to see if it already exists; if so, return the pointer to that node. Otherwise, construct the node and add it to the cache.

I If two treaps contain the same keys, their root pointers will be equal: can be checked in O(1) time.

I Checking and maintaining the cache requires additional time overhead.

ECE750-TXB Review: Balance of treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

I Treaps are balanced if the priorities are chosen Review: Treaps randomly. Tries Hash Tables I Recall that building a binary search tree with a random insertion order results in a tree of expected height Bibliography c log n, with c ≈ 4.311. I A treap with random priorities assigned to keys has exactly the same structure as a binary search tree created by inserting keys in descending order of priority

I Descending order of priority is a random order; I Therefore treaps have expected height c log n with c ≈ 4.311. ECE750-TXB Insertion into treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

I Insertion for treaps is much simpler than that for Review: Treaps red-black trees. Tries 1. Insert the (k, p) pair as for a binary search tree, by key Hash Tables alone: the new node will be placed somewhere at the Bibliography bottom of the tree. 2. Perform rotations along the path from the new leaf to the root to restore invariants:

I If there is a node x whose right subchild has a higher priority, rotate left at x. I If there is a node x whose left subchild has a higher priority, rotate right at x.

ECE750-TXB Insertion into treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I Example: the treap below has just had (e, 19) inserted

as a new leaf. Rotations have not yet been performed. Review: Treaps

(d,23) Tries Hash Tables

Bibliography (b,11) (h,14)

(a,7) (c,1) (f,6) (i,13)

(e,19)

I f has a left subchild with greater priority: rotate right at f . ECE750-TXB Insertion into treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I After rotating right at f : Review: Treaps (d,23) Tries

Hash Tables

(b,11) (h,14) Bibliography

(a,7) (c,1) (e,19) (i,13)

(f,6)

I h has a left subchild with greater priority: rotate right at h.

ECE750-TXB Insertion into treaps Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I After rotating right at h: Review: Treaps

(d,23) Tries

Hash Tables

Bibliography (b,11) (e,19)

(a,7) (c,1) (h,14)

(f,6) (i,13)

I Heap invariant is satisfied: all done. ECE750-TXB Lecture 8: Treaps, Tries, and Hash I Treaps are easily made persistent (retain previous Tables Todd L. versions) by implementing them in a purely functional Veldhuizen style. Insertion requires duplicating at most a sequence [email protected]

of nodes from the root to a leaf: an O(log n) space Review: Treaps

overhead. The remaining parts of the tree are shared. Tries

Hash Tables I E.g. the previous insert done in a purely functional style: Bibliography

Version 2 Version 1

(d,23) (d,23)

(e,19)

(b,11) (h,14)

(a,7) (c,1) (f,6) (i,13)

ECE750-TXB Strings Lecture 8: Treaps, Tries, and Hash Tables

I A string is a sequence of characters drawn from some Todd L. Veldhuizen alphabet Σ. We will often use Σ = {0, 1}: binary [email protected] strings. Review: Treaps We write Σ∗ to mean all finite strings1 composed of I Tries characters from Σ. (∗ is the Kleene closure.) Hash Tables ∗ I Σ contains the empty string . Bibliography ∗ I If w, v ∈ Σ are strings, we write w · v or just wv to mean the concatenation of w and v.

I Example: given w = 010 and v = 11, w · v = 01011. hΣ∗, ·, i is an example of a monoid: a set (Σ∗) together with an associative binary operator (·) and an identity element (). For any strings u, v, w ∈ Σ∗,

u · (v · w) = (u · v) · w v = v = v

1Infinite strings are very useful also: if we write a real number x ∈ [0, 1] as a binary number e.g. 0.101100101000 ··· , this is a representation of x by an infinite string from Σω. ECE750-TXB Tries Lecture 8: Treaps, Tries, and Hash Tables I Recall that we may label the left and right links of a Todd L. Veldhuizen binary tree with 0 (for left) and 1 (for right): [email protected]

Review: Treaps 0 y @@ 1 yyy @@ Tries x : 0 ÓÓ ::1 Hash Tables  ÓÓ : y z Bibliography 

I To describe a path in the tree, one can list the sequence of left/right branches to take from the root. E.g., 10 gives y, 11 gives z.

I The set of all paths from the root to leaves is P◦ = {0, 10, 11}

I The set of all paths from the root to leaves or internal nodes is: P• = {, 0, 1, 10, 11}, where  is the empty string indicating the path starting and ending at the root.

ECE750-TXB Tries Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

◦ Review: Treaps I The set P is prefix-free: no string is an initial segment Tries of any other string. Otherwise, there would be a path Hash Tables to a leaf passing through another leaf! Bibliography • • • I The set P is prefix-closed: if wv ∈ P , then w ∈ P also. i.e., P• contains all prefixes of all strings in P•.2

2We can define • as an operator by A• ≡ {w : wv ∈ A}. • is a closure operator. A useful fact: every closure operator has as its range a complete lattice, where meet and join are given by (X u Y )• = X • ∩ Y • and (X t Y )• = (X • ∪ Y •)•. Applying this fact to the representation of binary trees by strings, • induces a lattice of binary trees. ECE750-TXB Tries Lecture 8: Treaps, Tries, and Hash • Tables I Given a binary tree, we can produce a set of strings P ◦ Todd L. or P that describe all paths (resp. all paths to leaves). Veldhuizen [email protected] • ◦ I The converse is also true: given a set P or P , we can reproduce the tree.3 Review: Treaps Tries Example: the set {100, 11, 001, 01} is prefix free, and I Hash Tables

the corresponding tree can be built by simply adding Bibliography the paths one-by-one to an initially empty tree:

O ooo OOO 0 ooo OOO1 ooo OOO ooo  OOO ?o ? 0  ?? 1 0  ?? 1  ??  ??  ??  ??   ?   ? ?  ?? 1 0  ??  ??   ?   

3Formally we can say there is a bijection (a 1-1 correspondence) between binary trees and prefix-closed  (resp. prefix-free) sets.

ECE750-TXB Tries Lecture 8: Treaps, Tries, and Hash Tables I A tree constructed in this way — by interpreting a set Todd L. of strings as paths of the tree — is called a trie. (The Veldhuizen term comes from reTRIEval; pronounced either “tree” [email protected] or “try” depending on taste. Tries were invented by de Review: Treaps la Briandais, and independently by Fredkin [5].) Tries I The most common use of a trie is to implement a Hash Tables DictionaryhK, V i, i.e., maintaining a map Bibliography f : K * V by associating each k ∈ K with a path through the trie to a node where f (k) is stored.4 I Tries find applications in bioinformatics, coding and compression, sorting, SAT solving, routing, natural language processing, very large databases (VLDBs), data mining, etc. I Binary Decision Diagrams (BDDs) are essentially tries with caching and sharing of subtrees. I Recent survey by Flajolet [4]. 4The notation K * V indicates a partial function from K to V : a function that might not be defined for some keys. ECE750-TXB Trie example: word list Lecture 8: Treaps, Tries, and Hash Tables I Example: build a trie to store english words: trie, larch, Todd L. saxophone, tried, saxifrage, squeak, try, squeak, Veldhuizen squeaky, squeakily, squeakier. [email protected] I Common implementation variants of a trie: Review: Treaps I associate internal nodes with entries also, if one occurs Tries there. (Can use 1 bit on internal nodes to indicate Hash Tables whether a key terminates there.) Bibliography

I when a node has only one descendent, end the trie there, rather than including a possibly long chain of nodes with single children.

I Use the trie to store keys only; implicitly the values we are storing are V = {0, 1}. The function the trie represents is a map χ : K → {0, 1} where χ is the characteristic function of the set: χ(k) = 1 if and only if k is in the set.

I Use the alphabet {a, b, ··· , z}. I Instead of having a 26-way branch in each node, put a little BST at each node with up to 26 elements in it (a “ternary search trie” [1])

ECE750-TXB Trie example: wordlist Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

Review: Treaps saxifrage Tries i Hash Tables x o larch saxophone squeakier a e Bibliography l i l s q u e a k squeak squeakily t y squeaky r i e d trie tried y

try ECE750-TXB Trie example: coding Lecture 8: Treaps, Tries, and Hash Tables Suppose we want to transmit (or compress) data. Todd L. I Veldhuizen [email protected] I At the receiving (or decoding) end, we will have a long

string of bits to decode. Review: Treaps

Tries I A simple but effective strategy is to build a codebook that maps binary codewords to plaintext. The incoming Hash Tables transmission is then just a sequence of codewords that Bibliography we will replace, one by one, with their corresponding plaintext.5

I A code that can be described by a trie, with outputs only at the leaves, is an example of a uniquely decodeable code: there is only one way an encoded message can be decoded. Specifically, such codes are called prefix codes or instantaneous codes.

5This strategy is asymptotically optimal (achieves a bitrate ≤ H +  for any  > 0) for stationary ergodic random processes, with an appropriate choice of codebook.

ECE750-TXB Trie example: coding Lecture 8: Treaps, Tries, and Hash Tables

I Example: to encode english, we might assign codewords Todd L. to sequences of three letters, giving the most frequent Veldhuizen words shorter codes: [email protected] Three-letter combination Codeword Review: Treaps the 000 Tries and 001 Hash Tables for 010 Bibliography are 011 but 100 not 1010 you 1011 all 1100 . . . . etc 11101101 . . . . qxw 1111011001101001 I These codewords are chosen to be a prefix-free set. ECE750-TXB Trie example: coding Lecture 8: Treaps, Tries, and Hash Tables

Todd L. I For decoding messages we build a trie: Veldhuizen [email protected]

0 1 Review: Treaps Tries

0 1 0 1 Hash Tables

Bibliography 01 01 01 0 1

the and for are but

0 1 0 1 0 1

not you all

0 1

ECE750-TXB Trie example: decoding Lecture 8: Treaps, Tries, and Hash Tables I Incoming message: 100101001010111100 Todd L. Veldhuizen [email protected] I To decode: start at root of trie, follow path given by

bits. When a leaf is reached, output the word there, Review: Treaps

and return to the root. Tries

Hash Tables

Bibliography 100 1010 010 1011 1100 |{z} |{z} |{z} |{z} |{z} but not for you all

I This requires substantially fewer bits than transmitting as ASCII text (24 bits per 3-letter sequence).

I A good code assigns short codewords to frequently-occurring strings; if a string occurs with probability pi , one wants the codeword to have length about − log2 pi . I Later in the course we shall see how such codes can be constructed optimally using a greedy algorithm. ECE750-TXB Tries: Kraft’s inequality Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I Kraft’s inequality is a simple constraint on the lengths of codewords in a prefix code (equivalently, leaf depths Review: Treaps in a binary tree.) Tries Theorem (Kraft) Hash Tables Bibliography Let (d1, d2,...) be a sequence of code lengths of a code. There is a prefix code with code lengths d1, d2,... (equivalently, a binary tree with leaves at depth d1, d2,...) if and only if

n X 2−di ≤ 1 (1) i=1

ECE750-TXB Tries: Kraft’s inequality I Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I Positive example: the codeword lengths 3, 3, 2, 2 satisfy 1 1 1 1 3 Review: Treaps Kraft’s inequality: 8 + 8 + 4 + 4 = 4 . Possible trie realization: Tries Hash Tables 0 o OO 1 Bibliography ooo OOO 0 ?o 1 0  ??  0 ? 1    ??       I Negative example: the codeword lengths 3, 3, 3, 2, 2, 2   9 violate Kraft’s inequality: sum is 8 . I Kraft’s inequality becomes an equality for trees in which every internal node has two children. ECE750-TXB Tries: Kraft’s inequality Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Two ways to prove Kraft’s inequality: Veldhuizen [email protected] I Put each node of a binary tree in correspondence with a subinterval of [0, 1] on the real line: root is [0, 1], its children get Review: Treaps [0, 1 ] and [ 1 , 1]. Each node at depth d receives an interval of 2 2 Tries length 2−d and splits it in half for its children. The union of the Hash Tables intervals at the leaves is ⊆ [0, 1], and the intervals at the leaves are pairwise disjoint, so the sum of their interval lengths is ≤ 1. Bibliography

I Kraft’s inequality can also be proved with a simple induction argument. The list of valid codeword length sequences can be generated from the initial sequence h1, 1i (codewords {0, 1}) by the rewrite rules k → k + 1, k + 1 (expand a node into two children) and k → k + 1 (expand a node to have a single child). Base case: with h1, 1i obviously 2−1 + 2−1 = 1. Induction step: if sum is ≤ 1, consider expanding a single element of the sequence: have either the rewrite k → k + 1, k + 1, and 2k ≥ 2k−1 + 2k−1; or the rewrite k → k + 1, and 2k ≥ 2k−1. So rewrites never increase the “weight” of a node.

ECE750-TXB Tries: Kraft’s inequality I Lecture 8: Treaps, Tries, and Hash Tables

It is occasionally useful to have an infinite set of codewords Todd L. Veldhuizen handy, in case we do not know in advance how many [email protected] different objects we might need to code. For an infinite set of codewords (or infinite binary tree), Review: Treaps Kraft’s inequality implies Tries Hash Tables + + ∗ Bibliography dk ≥ c + log k + log log log k infinitely often (2)

where

log+ x ≡ log x + log log x + log log log x + ···

with the sum taken only over the positive terms, and log∗ x is the “iterated logarithm” — ( 0 if x ≤ 1 log∗ x = 1 + log∗(log x) otherwise ECE750-TXB Tries: Kraft’s inequality II Lecture 8: Treaps, Tries, and Hash Tables

See e.g., [2, 9]. Todd L. Where does this bound come from? Well, a necessary condition for Veldhuizen [email protected] ∞ X −d 2 k ≤ 1 Review: Treaps k=0 Tries

Hash Tables P∞ −dk to hold is that the series k=0 2 converges. For example, if −dk 1 Bibliography dk = log k, then 2 = k , the Harmonic series. The Harmonic series diverges, so Kraft’s inequality can’t hold. We can parlay this into an inequality by remembering the “comparison test” for convergence of series: if ak , bk are two positive series, and P P ak ≤ bk for all k, then ak ≤ bk . If we stick the Harmonic series in −dk for ak and 2 for bk , we get:

1 −dk P −dk If k ≤ 2 for all k then ∞ ≤ 2 .

The premiss of this test must be false if P 2−dk does not diverge to −dk 1 −dk 1 infinity. Therefore 2 must be < k for at least some k. If 2 < k for only some finite number of choices of k, the series would still −dk −dk 1 diverge. So, a necessary condition for 2 to converge is that 2 < k

ECE750-TXB Tries: Kraft’s inequality III Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

Review: Treaps

for infinitely many terms. Taking logarithms and multiplying through by Tries −1 we get dk > log k for infinitely many i. Hash Tables We can generalize this by saying that if g ∈ ω(1) is any diverging 0 Bibliography function, then dk > − log g (k) for infinitely many k. (The Harmonic series bound follows from choosing g(x) = log x.) Unfortunately there is no “slowest growing function” g(x) from which we could obtain a tightest possible bound. Eqn. (2) is from [2]; Bentley credits the result to Ronald Graham and Fan Chung, apparently unpublished. ECE750-TXB Tries: Variations on a theme I Lecture 8: Treaps, Tries, and Hash Tables

There are many useful variants of tries [4]: Todd L. Veldhuizen I Multiway branching: instead of choosing Σ = {0, 1}, [email protected] one can choose any finite alphabet, and allow each Review: Treaps node to have |Σ| children. Tries

I Paged trie: each node is required to have a minimal Hash Tables number of leaves descended from it; when this Bibliography threshold is not met, the subtree is converted into a compact form (e.g., an array of keys and values) suitable for secondary storage. This technique can also be used to increase performance in main memory [6].

I Patricia tries [7] (“Practical Algorithm To Retrieve Information Coded in Alphanumeric6”) Introduce skip pointers to avoid long sequences of single-branch nodes like

0 1 1 0 / / / /

    

ECE750-TXB Tries: Variations on a theme II Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

I LC-Trie: the first few levels of a big trie tend to be Review: Treaps almost a complete binary tree of some depth, which can Tries be collapsed into an array of pointers to tries [8]. Hash Tables Bibliography I Ternary Search Tries (TSTs): a blend of a trie and a BST; can require substantially less space than a trie. For a large |Σ|, replace a |Σ|-way branch at each internal node with a BST of depth ≤ log |Σ|.

6Almost better than my all-time favourite strained CS acronym, PERIDOT: “Programming by Example for Real-time Interface Design Obviating Typing.” Great project, despite the acronym. ECE750-TXB Hash Tables Lecture 8: Treaps, Tries, and Hash Tables I Suppose we wanted to represent the following set: Todd L. Veldhuizen [email protected] M = {35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657} Review: Treaps

Tries Given some x ∈ N, we want to quickly test whether x ∈ M. Hash Tables Bibliography I Binary search trees: require following a path through a tree — perhaps not fast enough for our problem.

I Super fast way: allocate an array of 4657 bytes. Set ( 0 if i 6∈ M A[i] = 1 if i ∈ M

Then, on a RAM, can test whether x ∈ M with a single memory access to A[i] (a constant amount of time). However, space required by this strategy is O(sup M).

ECE750-TXB Hash Tables Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected]

I Obviously the array A would contain mostly empty Review: Treaps space. Can we somehow “compress” the array but still Tries support fast access? Hash Tables Bibliography I Yes: allocate a much smaller table B of length k. Define a function h : [1, 4657] → [1, k] that maps indices of A to indices of B, can be computed quickly, and ensures that if x, y ∈ M and x 6= y, then h(x) 6= h(y) i.e., no two elements of M have the same index in B.

I Then, x ∈ M if and only if B[h(x)] = x. ECE750-TXB Hash Tables Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen I For our example, h(x) = x mod 17 does the trick. Here is the array B: [email protected] Review: Treaps j B[j] j B[j] j B[j] Tries 0 0 6 0 12 0 1 35 7 0 13 0 Hash Tables 2 0 8 1691 14 0 Bibliography 3 139 9 1760 15 3789 4 395 10 1795 16 4657 5 0 11 3632

I e.g.: x = 1691: h(x) = 8, and B[8] = 1691, so x ∈ M.

I e.g.: x = 1692: h(x) = 9, and B[9] = 1760 6= 1692, so x 6∈ M.

I This is a hash table. h(x) = x mod 17 is called a hash function.

ECE750-TXB Hash Functions Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [email protected] I A hash function is a map h : K → H from some

(usually large) key space K to some (usually small) set Review: Treaps of hash values H. In our example, we were mapping Tries from K = [1, 4657] to H = [1, 17]. Hash Tables Bibliography I If the set M ⊆ K is chosen uniformly at random, keys are uniformly distributed (i.e., each k ∈ K has the same probability of appearing in a set to represent). In this case the hash function should distribute the keys evenly amongst elements of H, i.e., we want that |h−1(y)| ≈ |h−1(z)| for y, z ∈ H.7

I For a nonuniform distribution on keys, one just wants to choose h so that the distribution induced on H is close to uniform.

7Recall that for a function f : R → S, f −1(s) ≡ {r : f (r) = s}. ECE750-TXB Hash Functions Lecture 8: Treaps, Tries, and Hash Tables I We will describe some hash functions where K = N Todd L. Veldhuizen (keys are nonnegative integers). These are easily [email protected] adapted to other kinds of keys (e.g., strings) by interpreting the binary representation of the key as an Review: Treaps integer. Tries Hash Tables Some commonly used hash functions are the following: Bibliography 1. Division: use h(k) = k mod m where m = |H| is usually chosen to be a prime number far away from any power of 2. (Note.8)

I For long bit strings, use Horner’s rule for evaluating polynomials in Z/mZ (will explain.) 2. Multiplication: use h(k) = bm{kφ}c, where 0 < φ < 1

is an irrational number√ and {x} ≡ x − bxc. A popular 5−1 choice of φ is φ = 2 . 8A particularly terrible choice would be m = 256, which would hash objects based only on their lowest 8 bits. e.g., the hash of a string would depend only on its last character.

ECE750-TXB Lecture 8: Treaps, Multiplication hash functions: Example√ 5−1 Tries, and Hash Example of multiplication hash function using φ = 2 , and hash Tables table with m = 100 slots: Todd L. Veldhuizen key {kφ} bm{kφ}c [email protected] 1 0.618034 61. 2 0.236068 23. Review: Treaps

3 0.854102 85. Tries 4 0.472136 47. Hash Tables 5 0.090170 9. Bibliography 6 0.708204 70. 7 0.326238 32. 8 0.944272 94. 9 0.562306 56. 10 0.180340 18. 11 0.798374 79. 12 0.416408 41. 13 0.034442 3. 14 0.652476 65. 15 0.270510 27. 16 0.888544 88. 17 0.506578 50. Idea is that the third column (the hash slots) ‘looks like’ a random sequence. ECE750-TXB Multiplication hash functions Lecture 8: Treaps, Tries, and Hash Tables I The reason why h(k) = bm{kφ}c is a reasonable hash Todd L. function is interesting. Veldhuizen [email protected] I The short answer is that the sequence {kφ} for k = 1, 2, 3,... ‘kind of behaves like’ a random real Review: Treaps drawn from (0, 1). So, h(k) = bm{kφ}c ‘looks like’ a Tries randomly chosen hash function. Hash Tables A less sketchy explanation: Bibliography 1. {kφ} is uniformly distributed on (0, 1): asymptotically, the proportion of {kφ} falling in an interval (α, β) where (α, β) ⊆ (0, 1) is (β − α). Just like a uniform distribution on (0, 1)! 2. {kφ} satisfies an ergodic theorem: if we sample a suitably well-behaved9 function f at points {kφ} and average, this converges to the integral: m 1 X Z 1 f ({kφ}) → f (x)dx m k=1 0 Just like a uniform distribution on (0, 1)! See [3]. Variously called Weyl’s ergodic principle, Weyl’s equidistribution theorem. However, {kφ} is emphatically not a random sequence. 9Continuously differentiable and periodic with period 1 ECE750-TXB Hash Functions Lecture 8: Treaps, Tries, and Hash Tables I To evaluate whether a hash function is a good choice Todd L. for a set of data S ⊆ K, one can see how the observed Veldhuizen distribution of keys into hash table slots compares to a [email protected] uniform distribution. Review: Treaps I Suppose there are n keys and m hash slots. Compute Tries the observed distribution of the keys: Hash Tables |{k : h(k) = i}| Bibliography pˆ = i n I To measure how far from uniform, compute m ˆ X D(P||U) = log2 m + pˆi log2 pˆi i=1

Convention: 0 log2 0 = 0. I This is the Kullback-Leibler divergence of the observed distribution Pˆ from the uniform distribution U. It may be thought of as the “distance” from Pˆ to U. I The smaller D(Pˆ||U), the better the hash function. ECE750-TXB Bibliography I Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [1] Jon L. Bentley and Robert Sedgewick. [email protected] Fast algorithms for sorting and searching strings. In SODA ’97: Proceedings of the eighth annual Review: Treaps Tries ACM-SIAM symposium on Discrete algorithms, pages Hash Tables 360–369, Philadelphia, PA, USA, 1997. Society for Bibliography Industrial and Applied Mathematics. bib [2] Jon Louis Bentley and Andrew Chi Chih Yao. An almost optimal algorithm for unbounded searching. Information Processing Lett., 5(3):82–87, 1976. bib pdf

[3] Bernard Chazelle. The Discrepancy Method — Randomness and Complexity. Cambridge University Press, Cambridge, 2000. bib

ECE750-TXB Bibliography II Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [4] Philippe Flajolet. [email protected]

The ubiquitous digital tree. Review: Treaps

In Bruno Durand and Wolfgang Thomas, editors, Tries

STACS, volume 3884 of Lecture Notes in Computer Hash Tables Science, pages 1–22. Springer, 2006. bib pdf Bibliography

[5] Edward Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960. bib [6] Steffen Heinz, Justin Zobel, and Hugh E. Williams. Burst tries: a fast, efficient data structure for string keys.

ACM Trans. Inf. Syst., 20(2):192–223, 2002. bib ECE750-TXB Bibliography III Lecture 8: Treaps, Tries, and Hash Tables

Todd L. Veldhuizen [7] Donald R. Morrison. [email protected]

PATRICIA—practical algorithm to retrieve information Review: Treaps

coded in alphanumeric. Tries

J. ACM, 15(4):514–534, 1968. bib pdf Hash Tables

Bibliography [8] Stefan Nilsson and Gunnar Karlsson. IP-address lookup using LC-tries. IEEE Journal on Selected Areas in Communications, 17:1083–1092, June 1999. bib [9] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry, volume 15 of Series in Computer Science. World Scientific, 1989. bib ECE750-TXB Lecture 9: Hashing

Todd L. Veldhuizen [email protected] ECE750-TXB Lecture 9: Hashing Outline

Hashing Todd L. Veldhuizen Bibliography [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 6, 2007

ECE750-TXB Hash tables Lecture 9: Hashing Todd L. Veldhuizen [email protected]

Outline

Hashing

I Recall that a hash table consists of Bibliography

I m slots into which we are placing items; I A map h : K → [0, m − 1] from key values to slots.

I We put n keys k1, k2,..., kn into locations h(k1), h(k2),..., h(kn).

I In the ideal situation we can then locate keys with O(1) operations. ECE750-TXB Horner’s Rule I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Horner’s rule gives an efficient method for evaluating hash functions for sequences, e.g., strings. Outline Hashing

I Consider a hash function of the form Bibliography

h(k) = k mod m

I If we wish to hash a string such as “hello,” we can interpret it as a long binary number: in ASCII, “hello” is

01101000 01100101 01101100 01101100 01101111 | {z } | {z } | {z } | {z } | {z } h e l l o

I As a sequence of integers, “hello” is [104, 101, 108, 108, 111]. We want to compute

(104 · 232 + 101 · 224 + 108 · 216 + 108 · 28 + 111 · 20) mod m

ECE750-TXB Horner’s Rule II Lecture 9: Hashing Todd L. Veldhuizen I Horner’s rule is a general trick for evaluating a [email protected]

polynomial. We write Outline

Hashing 3 2 2 ax + bx + cx + d = (ax + bx + c)x + d Bibliography = ((ax + b)x + c)x + d

So that instead of computing x3, x2,... we have only multiplications:

t1 = ax + b

t2 = t1x + c

t3 = t2x + d

I Trivia: some early CPUs included an instruction opcode for applying Horner’s rule. May be making a comeback! ECE750-TXB Horner’s Rule III Lecture 9: Hashing Todd L. I To use Horner’s rule for hashing: to compute Veldhuizen (a · 224 + b · 216 + c · 28 + d) mod m, [email protected]

Outline 8 t1 = (a · 2 + b) mod m Hashing

8 Bibliography t2 = (t1 · 2 + c) mod m 8 t3 = (t2 · 2 + d) mod m

Note that multiplying by 2k is simply a shift by k bits. I Why this works. In short, algebra. The integers Z form a ring under multiplication and addition. The hash function h(k) = k mod m can be interpreted as a homomorphism from the ring Z of integers to the ring Z/mZ of integers modulo m. Homomorphisms preserve structure in the following sense: if we write + for integer addition, and ⊕ for addition modulo m,

h(a + b) = h(a) ⊕ h(b)

i.e., it doesn’t matter whether we compute (a + b) mod m or compute (a mod m) and (b mod m) and add with modular

ECE750-TXB Horner’s Rule IV Lecture 9: Hashing arithmetic: we get the same answer either way. Similarly, if we Todd L. write × for multiplication in , and ⊗ for multiplication in /m , Veldhuizen Z Z Z [email protected] h(a × b) = h(a) ⊗ h(b) Outline Horner’s rule works precisely because h : Z → Z/mZ is a Hashing homomorphism: Bibliography h(((a × 28 + b) × 28 + c) × 28 + d) = (((h(a) ⊗ h(28) ⊕ h(b)) ⊗ h(28) ⊕ h(c)) ⊗ h(28) ⊕ h(d)) This can be optimized to use fewer applications of h, as above. In this form it is obvious why m = 28 is a horrible choice for a hash table size: 28 mod 28 = 0, so (((h(a) ⊗ h(28) ⊕ h(b)) ⊗ h(28) ⊕ h(c)) ⊗ h(28) ⊕ h(d)) = (((h(a) ⊗ 0 ⊕ h(b)) ⊗ 0 ⊕ h(c)) ⊗ 0 ⊕ h(d)) = h(d) i.e., the hash value depends only on the last byte. Similarly, if we used m = 216, we would have h(216) = 0, which would remove all but the last two bytes from the hash value computation. For background on algebra see, e.g., [1, 9, 7]. ECE750-TXB Collisions Lecture 9: Hashing Todd L. Veldhuizen [email protected]

I A collision occurs when two keys map to the same Outline location in the hash table, i.e., there are distinct Hashing x, y ∈ M such that h(x) = h(y). Bibliography

I Strategies for handling collisions: 1. Pick a value of m large enough so that collisions are rare, and can be easily dealt with e.g., by maintaining a short “overflow” list of items whose hash slot is already occupied. 2. Pick the hash function h to avoid collisions. 3. Put another data structure in each hash table slot (a list, tree, or another hash table); 4. If a hash slot is full then try some other slots in some fixed sequence (open addressing).

ECE750-TXB Collision Strategy 1: Pick m big I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Let’s see how big m must be for the probability of collisions to be small. Outline Hashing I Two cases: Bibliography I n > m: then there must be a collision, by the pigeonhole principle.1

I n ≤ m: may or may not be a collision.

I The “birthday problem”: what is the probability that amongst n people, at least two share the same birthday?

I This is a hashing problem: people are keys, days of the year are slots, and h maps people to their birthdays.

I If n ≥ 23, then the probability of two people having the 1 same birthday is > 2 . (Counterintuitive, but true.) I The “birthday problem” analysis is straightforward to adapt to hashing. ECE750-TXB Collision Strategy 1: Pick m big II Lecture 9: Hashing Todd L. I Suppose the hash function h and the distribution of Veldhuizen keys cooperate to produce a uniform distribution of keys [email protected]

into hash table slots. Outline Hashing I Recall that with a uniform distribution, probability may be computed by simple counting: Bibliography

# outcomes in which E happens Pr(event E happens) = # outcomes

I First we count the number of hash functions without collisions:

I There are m choices of where to put the first key; m − 1 choices of where to put the second key; ... m − n + 1 choices of where to put the nth key.

I The number of hash functions with no collisions is n m! 2 m = m · (m − 1) ··· (m − n + 1) = (m−n)! . (Note .) I Next we count the number of hash functions allowing collisions:

ECE750-TXB Collision Strategy 1: Pick m big III Lecture 9: Hashing Todd L. I There are m choices of where to put the first key; m Veldhuizen choices of where to put the second key; ... m choices of [email protected] where to put the nth key. Outline n I The number of hash functions allowing collisions is m . Hashing I The probability of a collision-free arrangement is Bibliography m! p = (m − n)! · mn

I Asymptotic estimate of ln p, assume m n:

n2 n  n3  ln p ∼ − + + O (1) 2m 2m m2

Here we have used Stirling’s approximation and n “ n2 ” ln(m − n) = ln m − m − O m2 . 2 2 I Two cases: If n ≺ m then ln p → 0. If n m then ln p → −∞. ECE750-TXB Collision Strategy 1: Pick m big IV Lecture 9: Hashing Todd L. I Recall that if Veldhuizen [email protected]

ln p = x +  Outline

Hashing

then Bibliography

p = ex+ = ex e = ex 1 +  + 2 + ···  Taylor series = ex (1 + O()) if  ∈ o(1)

I Probability of a collision-free arrangement is

3 − n(n−1) ! − n(n−1) n e 2m p ∼ e 2m + O m2

I Interpretation:

ECE750-TXB Collision Strategy 1: Pick m big V Lecture 9: Hashing Todd L. Veldhuizen [email protected]

Outline

2 Hashing I If m ∈ ω(n ) there are no collisions (almost surely). 2 Bibliography I If m ∈ o(n ) there is a collision (almost surely).

I i.e., if we want a low probability of collisions, our hash table has to be quadratic (or more) in the number of items.

1If m + 1 pigeons are placed in m pigeonholes, there must be two pigeons in the same hole. (Replace “pigeons” with “keys,” and “pigeonholes” with “hash slots.”) 2The handy notation mm is called a “falling power” [8]. ECE750-TXB Threshold functions Lecture 9: Hashing Todd L. 1 2 Veldhuizen I m = 2 n is an example of a threshold function: [email protected] I ≺ the threshold, asymptotic probability of event is 0 Outline I the threshold, asymptotic probability of event is 1. Hashing £C£C Bibliography

Prob. of no collision 1

0 XX nn2− n2 n2+ n3 Hash table size (m)

ECE750-TXB Collision Strategy 1: pick m big Lecture 9: Hashing Todd L. Veldhuizen [email protected]

Outline I Picking m big is not an effective strategy for handling Hashing collisions. Bibliography I For n = 1000 elements, this table shows how big m must be to achieve the desired probability of no collisions: p m 0.1 5 000 000 0.01 50 000 000 10−6 500 000 000 000 10−9 500 000 000 000 000 ECE750-TXB Collision Strategy 1: pick m big Lecture 9: Hashing Todd L. Veldhuizen [email protected] I The analysis of collisions in hashing demonstrates two Outline pigeonhole principles. Hashing I The simplest pigeonhole principle states that if you put Bibliography ≥ m + 1 pigeons in m holes, there must be one hole with ≥ 2 pigeons.

I With respect to hash tables, the pigeonhole applies as follows: If a hash table with m slots is used to store ≥ m + 1 elements, there is a collision.

I The probability-of-collision analysis of the previous slide demonstrates a probabilistic pigeonhole principle: if you √ put ω( n) pigeons in n holes, there is a hole with ≥ 2 pigeons almost surely (i.e., with probability converging to 1 as n → ∞.)

ECE750-TXB Collision Strategy 2: pick h carefully I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Can we pick our hash function h to avoid collisions? Outline I For example, if we use hash functions of the form Hashing h(k) = bm{kφ}c Bibliography

we could try random values of φ ∈ (0, 1) until we found one that was collision-free.

I We have a probability of success − n p ∼ e 2m(m−1) (1 + o(1))

I Geometric distribution:

I Probability of success p, probability of failure 1 − p

I Each trial independent, identically distributed.

I Probability that k tries are needed for success = (1 − p)k−1p −1 I Mean: p ECE750-TXB Collision Strategy 2: pick h carefully II Lecture 9: Hashing Todd L. Veldhuizen [email protected]

I Number of values of φ we expect to try before we find a Outline

collision-free hash table for n = 1000: Hashing

Bibliography m # Expected failures before success 1000 10217 2000 10109 10000 1022 100000 147

I Picking hash functions randomly in this manner is unlikely to be practical.

I There are better strategies: see [6, 2].

Collision Strategy 3: secondary data structures
I By far the most common technique for handling collisions is to put a secondary data structure in each hash table slot:
I A linked list ("chaining")
I A binary search tree (BST)
I Another hash table
I Let α = n/m be the load factor: the average number of items per hash table slot.

I Assuming uniform distribution of keys into slots:

I Linked lists require 1 + α steps (on average) to find a key;

I Suitable BSTs require 1 + max(c log α, 0) steps (on average).3

I Using secondary hash tables of size quadratic in the number of elements in the slot, one can achieve O(1) lookups on average, and require only Θ(n) space.

Analysis of secondary hash tables
I Let N_i be a random variable indicating the number of items landing in slot i.
I E[N_i] = α
I Var[N_i] = n · (1/m)(1 − 1/m)   (binomial/Bernoulli variance)
I The space required for the secondary hash tables is proportional to

    E[ Σ_{1≤i≤m} N_i² ] = Σ_{1≤i≤m} E[N_i²] = Σ_{1≤i≤m} (Var[N_i] + α²)
                        = m · ( n·(1/m)(1 − 1/m) + n²/m² )
                        ∼ n²/m + n − n/m

  Plus space Θ(m) for the primary hash table, this gives Θ(m + n²/m + n). Choosing m = Θ(n) yields linear space.

³The max(···) deals with the possibility that α < 1, in which case log α < 0.
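Below is a minimal, illustrative sketch of strategy 3 (the class and method names are ours, not the lecture's); an ordinary Python list plays the role of the per-slot secondary structure.

    # Chaining: each of the m slots holds a small secondary structure
    # (here just a list) of (key, value) pairs.
    class ChainedHashTable:
        def __init__(self, m: int = 8):
            self.m = m
            self.slots = [[] for _ in range(m)]

        def _slot(self, key):
            return hash(key) % self.m

        def insert(self, key, value):
            chain = self.slots[self._slot(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)   # overwrite existing key
                    return
            chain.append((key, value))

        def find(self, key):
            # Expected ~1 + alpha probes when keys hash uniformly.
            for k, v in self.slots[self._slot(key)]:
                if k == key:
                    return v
            return None

    t = ChainedHashTable(m=4)
    for i in range(10):
        t.insert(i, i * i)
    print(t.find(7))   # 49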

Collision Strategy 4: open addressing
I Open addressing is a family of techniques for resolving collisions that do not require secondary data structures. This has the advantage of not requiring any dynamic memory allocation.

I In the simplest scenario we have a function s : H → H that is ideally a permutation of the hash values, for example the “linear probing” function

s(x) = (x + 1) mod m

I When we attempt to insert a key k, we look in slot h(k), s(h(k)), s(s(h(k))), etc. until an empty slot is found.

I To find a key k, we look in slot h(k), s(h(k)), s(s(h(k))), etc. until either k or an empty slot is found.

I However, this use of a fixed probe sequence performs badly as the hash table becomes fuller: we tend to get "clumps/clusters," i.e., long sequences h(k), s(h(k)), s(s(h(k))), ... where all the slots are occupied (see e.g. [10]).

I Performance can be good for not-very-full tables, e.g. α < 2/3. As α → 1, operations begin to take Θ(√n) time [5].
I Quadratic probing offers less clumping: try slots h₀(k), h₁(k), ··· where

    h_i(k) = (h(k) + i²) mod m

h(k) is an initial fixed hash function. If m prime, the sequence hi (k) will visit every slot. I Double hashing uses two hash functions, h1 and h2:

hi (k) = (h1(k) + i · h2(k)) mod m

h1(k) gives an initial slot to try; h2(k) gives a ‘stride’ (reduces to linear probing when h2(k) = 1.)
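As an illustration (our own sketch, not code from the lecture), the linear-probing and double-hashing probe sequences can be written as small generators:

    # Probe sequences for open addressing; 'find' scans until it sees the
    # key or an empty slot (None), mirroring the lookup rule above.
    def linear_probe(h, k, m):
        start = h(k) % m
        for i in range(m):
            yield (start + i) % m              # s(x) = (x + 1) mod m, iterated

    def double_hash_probe(h1, h2, k, m):
        start, stride = h1(k) % m, (h2(k) % m) or 1   # stride must be nonzero
        for i in range(m):
            yield (start + i * stride) % m

    def find(table, k, probe):
        for slot in probe:
            if table[slot] is None:            # empty slot: key is absent
                return None
            if table[slot] == k:
                return slot
        return None

    m = 7
    table = [None] * m
    for k in [10, 17, 3]:                      # insert with linear probing
        for slot in linear_probe(hash, k, m):
            if table[slot] is None:
                table[slot] = k
                break
    print(find(table, 17, linear_probe(hash, 17, m)))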


I Under favourable conditions, an open addressing scheme behaves like a geometric distribution when searching for an open slot: the probability of finding an empty slot is 1 − α, so the expected number of trials is 1/(1 − α). Note the catastrophe as α → 1.

Summary of collision strategies

    Strategy                 E[access time]          Space
    Choose m big             O(1)                    Ω(n²)
    Linked list              1 + α                   O(n + m)
    Binary search tree       1 + max(c log α, 0)     O(n + m)
    Secondary hash tables    O(1)                    O(n + m)
    Open addressing          1/(1 − α)               O(m)

I Open addressing can be quite effective if α ≪ 1, but fails catastrophically as α → 1.

I If unexpectedly n ≫ m (e.g. we have far more data than we designed for), then α → ∞. For example, if m ∈ O(1) and n ∈ ω(1):
I Linked lists have O(n) accesses;
I BSTs have O(log n) accesses, offering a gentler failure mode.
I If the hash function is badly nonuniform:
I Linked lists can be O(n);
I BSTs will have O(log n);
I Secondary hash tables may require O(n²) space.

I To summarize: hash table + BST will give fast search times, and let you sleep at night.

I To maintain O(1) access times as n → ∞, it is necessary to maintain m ≍ n (i.e., m = Θ(n)). This can be done by choosing an allowable interval α ∈ [c₁, c₂]; when α > c₂, resize the hash table to make α = c₁. So long as c₂ > c₁, this strategy adds O(1) amortized time per insertion, as in dynamic arrays.

Applications of hashing

Outline

Hashing I Hashing is a ubiquitous concept, used not just for maintaining collections but also for Bibliography

I cryptography

I combinatorics

I data mining

I computational geometry

I databases

I router traffic analysis

I An example: probabilistic counting

ECE750-TXB Probabilistic Counting I Lecture 9: Hashing Todd L. Veldhuizen [email protected]

Outline

I Problem: estimate the number of unique elements in a LARGE collection (e.g., a database, a data stream) without requiring much working space

I Useful for query optimization in databases [11]:

I e.g. to evaluate A ∩ B ∩ C can do either A ∩ (B ∩ C) or (A ∩ B) ∩ C

I one of these might be very fast, one very slow.

I have rough estimates of |B ∩ C| vs |A ∩ B| to decide which strategy will be faster. ECE750-TXB Probabilistic Counting I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Less serious (but more readily understood) example: Outline I Shakespeare’s complete works: Hashing I N=884,647 words (or so) Bibliography I n=28,239 unique words (or so)

I w = average word length I Nmax ≈ n = prior estimate on n I Problem: estimate n — the number of unique words used. Approaches: 1. Sorting: Put all 884,647 words in a list and sort, then count. (Time O(Nw log N), space O(Nw)) 2. Trie: Scan through the words and build a trie, with counters at each node; requires O(nw) space (neglecting size of counters.) 3. Super-LogLog Probabilistic Counting [3]: Use 128 bytes of space, obtain estimate of 30897 words (error 9.4%).

ECE750-TXB Probabilistic Counting I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Inputs: a multiset A of elements, possibly with many duplicates (e.g., Shakespeare’s plays) Outline Hashing I Problem: estimate card(A): the number of unique elements in A (e.g., number of distinct words Bibliography Shakespeare used)

I Simple starting idea: hash the objects into an m-element hash table. Instead of storing keys, just count the number of elements landing in each hash slot.

I Extreme cases to illustrate the principle:

I Elements of A are all different: will get an even distribution in the hash table.

I Elements of A are all the same: will get one hash table slot with all the elements!

I The shape of the hash table distribution reflects the frequency of duplicates. ECE750-TXB Probabilistic Counting Lecture 9: Hashing Todd L. Veldhuizen [email protected]

Outline

Hashing I Linear Counting [11] Bibliography I Compute hash values in the range [0, Nmax) I Maintain a bitmap representing which elements of the hash table would be occupied, and estimate n from the sparsity of the hash table. I Uses Θ(Nmax) bits, e.g., on the order of card(A) bits.

I Room for improvement: the precise sparsity pattern doesn’t matter: just the number of full vs. empty slots.

ECE750-TXB Probabilistic Counting I Lecture 9: Hashing Todd L. Veldhuizen [email protected] I Probabilistic Counting [4] I Compute hash values in the range [0, Nmax) Outline I Instead of counting hash values directly, count the Hashing occurrence of hash values matching certain patterns: Bibliography

    Pattern      Expected occurrences
    xxxxxxx1     2⁻¹ · card(A)
    xxxxxx10     2⁻² · card(A)
    xxxxx100     2⁻³ · card(A)
    xxxx1000     2⁻⁴ · card(A)
    ...          ...

Use these counts to estimate card(A).
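A rough single-counter sketch of the bit-pattern idea (ours, not the exact estimator of [4]): hash each element and record the largest number of trailing zero bits seen. That maximum is typically close to log₂(card(A)), so 2 raised to it gives an order-of-magnitude estimate; the real estimators (and Super-LogLog) combine many such observations and correct the bias.

    import hashlib

    def trailing_zeros(x: int, width: int = 32) -> int:
        return (x & -x).bit_length() - 1 if x else width

    def rough_cardinality(items) -> int:
        r = 0
        for it in items:
            h = int.from_bytes(hashlib.sha1(str(it).encode()).digest()[:4], "big")
            r = max(r, trailing_zeros(h))
        return 2 ** r      # order-of-magnitude estimate only

    data = [i % 5000 for i in range(100_000)]   # 5000 distinct values
    print(rough_cardinality(data))              # roughly a few thousand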

I To improve accuracy, use m different hash functions.
I Uses Θ(m log Nmax) storage, and delivers accuracy of O(m^(−1/2)).

I Super-LogLog [3] requires Θ(log log Nmax) bits. With 1.28 kB of memory one can estimate card(A) to within an accuracy of 2.5% for Nmax ≤ 130 million.

I Probabilistic counters: count to N using log log N bits:

[Diagram: a probabilistic counter as a chain of states; each transition to the next state is taken with probability 1/2, 1/4, 1/8, 1/16, ..., so the counter advances roughly once per doubling of the count.]

Need log N states, which can be encoded in log log N bits.
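A sketch of such a counter (a Morris-style counter; the class name is ours): the stored exponent c only increments with probability 2^(−c), so c ≈ log₂ N and only about log log N bits are stored.

    import random

    class ProbCounter:
        def __init__(self):
            self.c = 0

        def increment(self):
            # Advance the state with probability 2**-c.
            if random.random() < 2.0 ** (-self.c):
                self.c += 1

        def estimate(self) -> int:
            # E[2**c - 1] equals the true count.
            return 2 ** self.c - 1

    pc = ProbCounter()
    for _ in range(100_000):
        pc.increment()
    print(pc.c, pc.estimate())   # c is ~17; the estimate is ~1e5 on average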

Bibliography

[1] Stanley Burris and H. P. Sankappanavar. A Course in Universal Algebra. Springer-Verlag, 1981.
[2] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, and Friedhelm Meyer auf der Heide. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994.
[3] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, ESA, volume 2832 of Lecture Notes in Computer Science, pages 605–617. Springer, 2003.
[4] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, September 1985.
[5] Philippe Flajolet, Patricio V. Poblete, and Alfredo Viola. On the analysis of linear probing hashing. Algorithmica, 22(4):490–515, 1998.
[6] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.
[7] Joseph A. Gallian. Contemporary Abstract Algebra. D. C. Heath and Company, Toronto, 3rd edition, 1994.
[8] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.
[9] Saunders MacLane and Garrett Birkhoff. Algebra. Chelsea Publishing Co., New York, third edition, 1988.
[10] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, Reading, MA, 1996.
[11] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.

ECE750-TXB Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 1, 2007

ECE750-TXB Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen [email protected] Part I

Design Tradeoffs ECE750-TXB Theme: Design Tradeoffs Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. I Tradeoffs between design parameters: A recurring Veldhuizen theme in algorithms & data structures. [email protected]

I Examples:

I By making a hash table bigger, we can decrease α (the load factor) and achieve faster search times. (A tradeoff between space and time.)

I In designing circuits to add n-bit integers, we can obtain very low delays (the maximum number of gates between inputs and outputs) by increasing the number of gates: trading time (delay) for area (number of gates)

I In many tasks we can trade the precision of an answer for time and space, e.g., responding quickly to database queries with an estimate of the answer, rather than the exact answer.

ECE750-TXB Theme: Design Tradeoffs Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

I Design tradeoffs are often parameterizable.
I For example, in speed/accuracy tradeoffs we don't usually have to choose either speed or accuracy. Instead we have a parameter ε (the allowable error) that we can adjust.
I With ε large we get fast (but possibly not very accurate) answers.
I As ε → 0 we get very accurate answers that take longer to compute.

I Let’s look at an example of a tradeoff in the design of data structures. ECE750-TXB Design Tradeoff: Hash tables vs. BSTs Lecture 10: Design Tradeoffs, Introduction to Average-Case I Consider representing a collection of n keys drawn from Analysis an ordered structure hK, ≤i. Todd L. Veldhuizen [email protected] I A (balanced) binary search tree (BST) has Θ(log n) search times.

I A hash table has Θ(1) search (if we keep the size  number of elements, and choose an appropriate hash function.)

I Difference between these two data structures:

I A BST allows us to iterate through the elements in order, using Θ(log n) working space. The Θ(log n) space is used to record the path from the root to the iterator position in a stack.

I Items in a hash table are not stored in order — if we want to iterate through them in order, we need extra space and time, e.g. Θ(n) space for a temporary array and Θ(n log n) time to sort the items.

ECE750-TXB Design Tradeoff: Hash tables vs. BSTs Lecture 10: Design Tradeoffs, Introduction to Average-Case I We can view BSTs and hash tables as two points in a Analysis design space: Todd L. Veldhuizen Data structure Search time Working space for [email protected] ordered iteration Hash table Θ(1) Θ(n) Binary Search Tree Θ(log n) Θ(log n)

I Suppose:
I We have a very large (n = 10⁹) collection of keys that barely fits in memory

I Dynamic: keys added and removed frequently. I We need fast search, fast insert, fast remove.

I Red-black: height is ≈ 2 log n ≈ 61 levels

I We need to be able to iterate through the collection in order.

I There is not enough room in memory to create a temporary array for sorting; also, this would be prohibitively slow.

Design Tradeoff: Hash tables vs. BSTs
I Let's make a simple data structure that will offer a smoother tradeoff between search time and the working space required for an ordered iteration.
I If you think of BST + hash table as two points in a design space, we want a structure that will 'interpolate' smoothly between them.

[Figure: search time vs. working space for ordered iteration; the binary search tree sits at (log n, c log n), the hash table at (n, 1), and the proposed structure interpolates between them.]

Design Tradeoff: Hash tables vs. BSTs
I Consider a hash table of m slots, using a BST in each slot to resolve collisions:

[Figure: a hash table whose slots each point to a small binary search tree.]

I Observation:

I When m = 1 we have a degenerate hash table with a single slot.

I All the keys are put in a single BST.

I So, choosing m = 1 essentially gives us a BST: we can iterate through the keys in order, search requires c log n steps, where c reflects the average tree depth. ECE750-TXB Design Tradeoff: Hash tables vs. BSTs II Lecture 10: Design Tradeoffs, Introduction to What about the case m = 2? Average-Case I Analysis I We have a hash table with two slots. If hash function is Todd L. good, we get two BSTs of roughly n/2 keys apiece. Veldhuizen [email protected] I Search time is about c log(n/2). I Can we iterate through the keys in order?

I Yes: have two iterators, one for each tree. Initially the two iterators point at the smallest key in their tree. I At each step of the iteration, choose the iterator that is pointing at the smaller of the two keys. Retrieve that key, and advance the iterator.

I Generalize: if we choose an arbitrary m,

I We will have m BSTs of average size n/m

I Search times will be around c log(n/m), assuming m  n. I To iterate through the keys in order,

I Obtain m iterators, one for each tree. I At each step, choose the iterator pointing at the smallest key, retrieve that key and advance the iterator.

I To do this efficiently, we need a fast way to maintain a collection (of iterators) that lets us quickly obtain the one with the smallest value (the iterator pointing at the smallest key).

I Easy: a min-heap.

I Our algorithm for ordered iteration will look like this (a code sketch follows below):
  1. Create an array of m BST iterators, one for each hash table slot.
  2. Turn this array into a min-heap, ordering iterators by the key they are pointing at. (The heap can be built in O(m) time.)
  3. To obtain the next element,
     3.1 Remove the least element from the min-heap. (This takes O(log m) time.)
     3.2 Obtain its key, and advance the iterator. (Advancing a BST iterator requires O(1) amortized time.)
     3.3 Put the iterator back into the min-heap. (This takes O(log m) time.)
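A minimal sketch of this ordered iteration using Python's standard heapq module; plain sorted lists stand in for the per-slot BSTs, and each heap entry is (current key, slot index, iterator).

    import heapq

    def ordered_iteration(slots):
        """slots: a list of m sorted sequences (one per hash-table slot)."""
        heap = []
        for i, s in enumerate(slots):
            it = iter(s)
            first = next(it, None)
            if first is not None:
                heap.append((first, i, it))
        heapq.heapify(heap)                         # O(m)
        while heap:
            key, i, it = heapq.heappop(heap)        # O(log m)
            yield key
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, i, it))  # O(log m)

    slots = [[2, 9, 15], [1, 4], [], [3, 7, 8]]
    print(list(ordered_iteration(slots)))   # [1, 2, 3, 4, 7, 8, 9, 15]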

I Overall, O(n(1 + log m)) time, assuming m  n. I The space required for iterating through the keys in order is O(m(1 + log(n/m))):

I We need m iterators, one per hash table slot. I Each iterator requires space O(1 + log(n/m)), on average, for a stack recording its position in the tree. (The 1 + ··· handles the case where n = m.)

I The number of steps for searching is on average 1 + c log(n/m), where c is a constant depending on the kind of BST we choose. The constant 1 is added to reflect visiting the correct slot in the hash table; and to handle the case where m = n, in which case c log(n/m) = 0, and having 0 search steps doesn’t make sense.

I Looking at these complexities, a sensible parameterization is m = n^(1−β).
I When β = 0, m = n and we get a hash table;
I When β = 1, m = 1 and we get a BST.
I Space and time:
I The number of search steps is ≈ 1 + c log(n/m) = 1 + c log(n/n^(1−β)) = 1 + c log n^β = 1 + βc log n.
I β directly multiplies our search time: choosing β = 1/2 halves our search time.

I Working space for ordered iteration is O(m(1 + log(n/m))) = O(n^(1−β)(1 + log n^β)).
I E.g., if we choose β = 1/2 we are twice as fast as a BST for searching, and require O(√n log n) working space for ordered iteration.
I The amount of extra space we need for ordered iteration, relative to the space needed to store the keys, is ≈ n^(1−β)(1 + β log n)/n = n^(−β)(1 + β log n). NB: if β > 0 the relative space overhead for supplying ordered iteration is → 0.

I Let's look at some real-life numbers. Take n = 10⁹ keys.
I Assume we use red-black trees, so that the average depth of keys in a tree of n/m elements is ≤ 2 log(n/m).

    Parameter β   #Search steps      Space for iter.          Space overhead
                  1 + 2β log n       n^(1−β)(1 + β log n)     n^(−β)(1 + β log n)
    (Hash)  0     1                  1 000 000 000            100%
            1/8   4.7                355 237 568              35%
            1/4   16.0               47 654 705               4.7%
            1/2   31.9               504 341                  0.05%
            3/4   45.8               4 165                    0.0004%
            7/8   53.3               362                      0.00004%
    (BST)   1     60.8               31                       0.000003%

I e.g. Choosing β = 1/4, we can get searches 4 times faster than the plain red-black tree, and have only a 4.7% space overhead for ordered iteration.

I Choosing β = 1/2, we can get searches twice as fast as a plain red-black tree, with a 0.05% space overhead for ordered iteration.
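The formulas in the table header are easy to evaluate; a quick computation of ours (assuming base-2 logarithms and the red-black constant c = 2) gives values of the same order as those shown:

    import math

    n = 10**9
    for beta in [0, 1/8, 1/4, 1/2, 3/4, 7/8, 1]:
        steps = 1 + 2 * beta * math.log2(n)          # search steps
        space = n ** (1 - beta) * (1 + beta * math.log2(n))  # iteration space
        overhead = space / n
        print(f"beta={beta:<6} steps={steps:5.1f} "
              f"space={space:14.0f} overhead={overhead:.2%}")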

ECE750-TXB Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen [email protected]

Part II Bibliography

Introduction to Average-Case Analysis ECE750-TXB Average-case Analysis Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen I Worst-case analysis is very important for some [email protected] applications (e.g., real-time systems: worst-case execution time), and in theoretical computer science. Bibliography

I However, average-case performance is usually more important for practical engineering work.

I Given a choice between a data structure that always finds an item in 253 steps, vs one that finds an item in 5 steps on average (but with probability 10−12 takes more than 10000 steps), we would usually choose the fast-on-average data structure.

I In practice we are usually interested in performance for the average case, rather than worst case.

ECE750-TXB Average-case Analysis Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

I We can often find algorithms + data structures that are Todd L. Veldhuizen much more efficient on average than the best [email protected] worst-case data structures. Bibliography I We shall see that randomness can be an extremely effective tool for achieving good average-case performance.

I Example: uniquely represented dictionaries.

I It is known that deterministic (no randomness) tree-based uniquely represented dictionaries require Ω(√n) time for insert/search operations

I Sundar-Tarjan trees [3] achieve this bound. I Treaps are uniquely represented and on average achieve O(log n) search and insert. However, with vanishingly small probability they may require O(n) time. ECE750-TXB Average-case Analysis Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis I Example: QuickSort Todd L. I QuickSort is the standard sorting algorithm. It achieves Veldhuizen O(n log n) time on average, and is faster in practice [email protected] than other algorithms. However, in the worst case it Bibliography requires O(n2) time.

I Merge Sort requires O(n log n) time in the worst case, but is slower in practice than QuickSort.

I Example: Searching

I Binary search requires O(log n) time in the worst case. I Interpolation search [1] requires O(log log n) time on average, if the data is uniformly distributed. However, it can require O(n) time with pathological distributions.

I The function log log n is so slowly growing as to be effectively constant: log log 10¹⁰ ≈ 5; log log 10²⁰ ≈ 6; log log 10⁴⁰ ≈ 7.
I We can often use hashing to make an arbitrary key distribution uniform.

ECE750-TXB Some key ideas we will explore I Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis 1. We can always obtain an algorithm that combines the Todd L. best average case and best worst-case bounds of any Veldhuizen [email protected] algorithms we have available. (We needn’t settle for “fast on average, but occasionally catastrophically Bibliography slow.”) 2. If an algorithm has O(f (n)) average time, the probability of taking ω(f (n)) time goes to zero. 3. The amount of randomness, or entropy, of an input distribution plays a critical role in the performance of algorithms.

I Randomness can help average-case performance: if there are comparatively few “worst cases,” then as long as we have at least a certain amount of randomness in the inputs, those worst cases do not contribute to the average running time. ECE750-TXB Some key ideas we will explore II Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen [email protected]

Bibliography I Randomness can hurt average-case performance: we can design BSTs so that search time depends not on the number of keys n, but just on the amount of randomness in the distribution of keys we are asked to search for. In this case, the less randomness, the better!

ECE750-TXB Best of both worlds Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

I Algorithms that are good in the average case occasionally break down on pathological examples, e.g., inputs that make QuickSort run in O(n²) time, cause search trees of height O(n), etc.

I Suppose we have a pair of algorithms:

I Algorithm A has average-case time Θ(f(n)) and worst-case time Θ(g(n))
I Algorithm B has average-case time Θ(f′(n)) and worst-case time Θ(g′(n))

I We can always construct an algorithm that has the best of both:
I average-case time Θ(min(f(n), f′(n))) and
I worst-case time Θ(min(g(n), g′(n))).

Best of both worlds
I Easy: Run A and B in parallel (or interleaved), and return the result of the first finisher.

[Figure: the input is fed to Algorithm A and Algorithm B in parallel; the result of the first finisher is returned.]

I E.g. we can simultaneously perform a binary search and an interpolation search: this gives an algorithm with O(log log n) average case, and O(log n) worst case [2].

I However, note that this may entail larger constant factors and extra space compared to using a single algorithm.
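A minimal sketch of the interleaved construction (ours, under the assumption that each algorithm can be expressed as a step-at-a-time generator that yields while working and returns its answer):

    def run_first_finisher(gen_a, gen_b):
        # Alternate one step of each algorithm; return the first result.
        gens = [gen_a, gen_b]
        while True:
            for g in gens:
                try:
                    next(g)                      # one "step" of this algorithm
                except StopIteration as done:
                    return done.value

    def countdown(label, steps):
        # Stand-in for an algorithm that takes 'steps' units of work.
        for _ in range(steps):
            yield
        return label

    print(run_first_finisher(countdown("fast", 3), countdown("slow", 10)))  # fast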

ECE750-TXB Applications of randomness Lecture 10: Design Tradeoffs, Here are a few of the applications of randomness we will Introduction to I Average-Case encounter: Analysis I Modelling the distribution of inputs, so we can: Todd L. Veldhuizen I compute average-case performance; [email protected] I determine the structure of “typical” inputs, so we can tune our algorithm to them; Bibliography I Exploiting the amount of randomness (entropy) of inputs to achieve better performance;

I Using randomness to force some quantity of interest into a desirable distribution (e.g., height of a treap);

I Using randomness to foil an adversary: e.g. our algorithm/data structure could perform poorly only if the entity generating the queries could predict some random sequence we were using

I Using randomness to break symmetry, e.g., leader election in distributed systems;

I Using randomness to efficiently approximate answers in extremely small amounts of time or space.

I Random distributions of inputs can cause complexity classes to collapse, so that problems that were hard to solve efficiently suddenly become efficiently solvable on average.

Typical inputs

Todd L. Veldhuizen I We can use tools of average-case analysis to [email protected] characterize the “typical case” for which we should tune our algorithms. Bibliography

I Simple example: find the first nonzero bit in a binary string of n bits.1

I Simple strategy: scan the string from left to right, stop when a 1 is encountered.

I Clearly has worst case O(n). I What does a typical input look like?

I With a uniform distribution on n-bit strings, each bit is 0 or 1 with probability 1/2.

ECE750-TXB Typical inputs II Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis I Waiting time to encounter a 1 follows a geometric distribution: probability of success p = 1 . Todd L. 2 Veldhuizen [email protected] Bitstring pattern Probability 1 Bibliography 1xxxxx ··· 2 1 01xxxx ··· 4 1 001xxx ··· 8 . . . .

I The mean of a geometric distribution is 1/p = 2.
I Conclusion: on average we encounter a 1 after 1/p = 2 bits. The running time of our naive "scan left to right looking for a 1" algorithm is Θ(1): it does not depend on n.

¹In practice, many CPUs have a single instruction that will compute this for you.

I This is an example of an exponential concentration:

[Figure: probability (1/2, 1/4, 1/8, 1/16, ...) vs. location of the first nonzero bit (1, 2, 3, 4, ...); the probability falls off exponentially.]

I Probability is concentrated around the mean.
I The probability of the first nonzero bit being ≥ 2 + δ is ≤ 2^(−δ−1) = O(2^(−δ)): it falls off exponentially quickly.
I Exponential concentrations are enormously useful.
I An exponential concentration can swallow any polynomial function: if f(n) = O(n^a) for a ∈ N, then

    Σ_{δ=0}^{∞} O(2^(−δ)) · f(c + δ) = O(1)

  The tail of the distribution can contribute only an O(1) factor when we compute an average of f(···).

ECE750-TXB Typical inputs Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

I On a more practical note, understanding typical inputs can help us engineer fast implementations.²
I Consider our "first nonzero bit" example:
I From our analysis we know that with probability 15/16 ≈ 0.94 we will encounter a 1 in the first 4 bits.
I Design: have a lookup table indexed by the first four bits: 94% of the time we can just glance in the table and return.

I For the other ≈ 6% of the time, scan the remaining bits to find the first 1.
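A small sketch of this design (ours, not from the lecture): a 16-entry table answers 4-bit prefixes, with a scan as the rare fallback.

    # Precompute the answer for every 4-bit prefix; fall back to a scan
    # for the ~6% of inputs whose first four bits are all zero.
    FIRST_ONE = {bits: next((i for i, b in enumerate(bits) if b == "1"), None)
                 for bits in (format(v, "04b") for v in range(16))}

    def first_nonzero_bit(s: str):
        head = FIRST_ONE.get(s[:4].ljust(4, "0"))
        if head is not None:                      # ~94% of random inputs
            return head
        for i in range(4, len(s)):                # rare slow path
            if s[i] == "1":
                return i
        return None

    print(first_nonzero_bit("0001011"))   # 3
    print(first_nonzero_bit("0000001"))   # 6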

2Again, this is only ‘for example’: in practice if you were at all interested in performance you would be using a single cpu instruction for this. ECE750-TXB Average-case time Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

I We will consider several equivalent definitions of average-case performance. First, an informal definition: an algorithm runs in average time O(f(n)) (respectively Θ, Ω, o, ω) if the average time T(n) is O(f(n)), where

    T(n) = Σ_{all inputs w of size n} Pr(w) · T(w)

  with Pr(w) the probability of input w and T(w) the time of the algorithm on input w.

ECE750-TXB Average-case time Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis I Let’s make this more precise. Todd L. Veldhuizen I For each n, let Kn be all possible inputs of size n. [email protected]

I For each n, let µ_n : K_n → R be a probability distribution on K_n.
I µ_n(w) ≥ 0 for all w ∈ K_n. (Probabilities are nonnegative.)
I Σ_{w∈K_n} µ_n(w) = 1. (Probabilities sum to 1.)
I Example: Bit strings
I K_n = {0, 1}^n, e.g., K_2 = {00, 01, 10, 11}.
I We often choose the uniform distribution µ_n(w) = 1/2^n, e.g. µ_2(00) = 1/4, µ_2(01) = 1/4, etc.

I (Kn)n∈N is a family of sets indexed by n.

I Similarly, (µn)n∈N is a family of distributions. We often call (µn)n∈N an asymptotic distribution. ECE750-TXB Average-case time Lecture 10: Design Tradeoffs, Introduction to Average-Case Analysis

Todd L. Veldhuizen [email protected] I With these definitions in hand, our first formal

definition:

Definition (1). Let the average time T(n) for inputs of size n be given by

    T(n) = Σ_{w∈K_n} µ_n(w) T(w)

An algorithm has average-case time O(f (n)) if and only if T (n) ∈ O(f (n)).

Bibliography

[1] Yehoshua Perl, Alon Itai, and Haim Avni. Interpolation search—a log log n search. Commun. ACM, 21(7):550–553, 1978.
[2] Nicola Santoro and Jeffrey B. Sidney. Interpolation-binary search. Inform. Process. Lett., 20(4):179–181, 1985.

[3] R. Sundar and R. E. Tarjan. Unique binary search tree representations and equality-testing of sets and sequences. In Baruch Awerbuch, editor, Proceedings of the 22nd Annual ACM Symposium on the Theory of Computing, pages 18–25, Baltimore, MD, May 1990. ACM Press.

ECE750-TXB Lecture 11: Probability

Todd L. Veldhuizen ECE750-TXB Lecture 11: Probability [email protected]

Bibliography

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 28, 2007

ECE750-TXB Twentieth-Century Probability Lecture 11: Probability

I Two foundational questions:
  1. How to define probability of infinite sequences in a meaningful way?
  2. What is randomness? What is a "random sequence"?

I Two landmarks:
  1. Andrei N. Kolmogorov (1933): The rigorous formulation of probability theory using measure theory, allowing a consistent treatment of both finite and infinite sample spaces.
  2. Per Martin-Löf (1966): An acceptable definition of random sequences using constructive measure theory. Martin-Löf's definition implies a sequence is random if a computer is incapable of compressing it. [2]

Todd L. Veldhuizen I Modern probability theory is based on the concept of a [email protected]

measure. A measure generalizes the idea of “volumes,” Bibliography “lengths,” “probabilities,” and so forth.

I Consider defining a “length measure” that assigns a measure to subsets of the real line.

I Recall the following notations for closed and open intervals:

[a, b] = {x ∈ R : a ≤ x ≤ b} (a, b) = {x ∈ R : a < x < b} where we require a ≤ b.

I We will say that the interval [a, b] has measure b − a.

I The empty set ∅ has measure 0.

I This is the usual definition of the measure of an interval, called Lebesgue measure.

ECE750-TXB Probability and Measure II Lecture 11: Probability I Lebesgue measure is a function µ that assigns measures Todd L. to certain subsets of the reals R. For example, Veldhuizen µ([1, 2]) = 1. [email protected]

I What should the measure of [1, 2] ∪ [3, 5] be? Bibliography

I [1, 2] and [3, 5] are disjoint sets, so we should be able to add their measures: µ([1, 2]) = 1 and µ([3, 5]) = 2, so we can set µ([1, 2] ∪ [3, 5]) = 3.

I In general if X and Y are disjoint sets,

µ(A ∪ B) = µ(A) + µ(B)

I What should be the measure of the open interval (1, 2) (that is, [1, 2] with the endpoints removed)?

I Note that [1, 2] = (1, 2) ∪ {1} ∪ {2}.

I We can write the set {1} as [1, 1], and by our previous definition

µ([1, 1]) = 0

similarly for {2}: “points have no length.” ECE750-TXB Probability and Measure III Lecture 11: Probability

I We can then apply the disjoint-sets rule to say that

    µ([1, 2]) = µ({1} ∪ (1, 2) ∪ {2}) = µ({1}) + µ((1, 2)) + µ({2})

  where µ({1}) = µ({2}) = 0;

therefore µ([1, 2]) = µ( (1, 2) ).

I By similar reasoning, µ([a, b]) = µ((a, b) ).

I The disjoint-sets rule implies we can combine any finite number of disjoint sets:

    µ(A₁ ∪ A₂ ∪ ··· ∪ A_n) = Σ_{i=1}^{n} µ(A_i)

Should this extend also to an infinite collection of disjoint sets?

ECE750-TXB Probability and Measure IV Lecture 11: Probability

I For example, we can construct the interval (0, 1] as the union of the intervals

    (0, 1] = (1/2, 1] ∪ (1/4, 1/2] ∪ (1/8, 1/4] ∪ ···

  We would like to say that

    µ( ⋃_{i=0}^{∞} (2^(−i−1), 2^(−i)] ) = Σ_{i=0}^{∞} 2^(−i−1) = 1

I We could also build the interval (0, 1] as a union of points: (0, 1] = ⋃_{0<x≤1} {x}. Each point has measure 0, so summing the measures would give µ((0, 1]) = 0, not 1.

This is an inconsistency. ECE750-TXB Probability and Measure V Lecture 11: Probability

Todd L. I To avoid this particular inconsistency, we can restrict Veldhuizen the union-of-disjoint-sets rule to countable sequences of [email protected] sets, i.e., collections of sets that can be put into Bibliography one-to-one correspondence with the naturals:

I If A₁, A₂, ··· are a countable sequence of pairwise disjoint sets, then

    µ( ⋃_{i∈N} A_i ) = Σ_{i∈N} µ(A_i)

I However, even this restriction (to countable sequences of sets) is not enough. In certain flavours of set theory (e.g., ZFC: usual set theory with the axiom of choice), measures cannot be defined to every subset of R in a consistent way; this leads to contradictions like being able to chop up the unit interval (measure 1) into pieces and reassemble it into something of measure 2. (If interested, see Vitali sets, Banach-Tarski paradox.)

ECE750-TXB Probability and Measure VI Lecture 11: Probability

Todd L. I Measure theory sidesteps inconsistencies by declaring Veldhuizen certain sets to be nonmeasurable. In Lebesgue measure [email protected]

on the real line, measurable sets are defined by the Bibliography following rules: 1. R is measurable, and has measure µ(R) = ∞. 2. All intervals of the form [a, b] are measurable. 3. If A is measurable, then its complement R \ A is measurable. 4. The union of a finite or countable sequence of measurable sets is measurable.

I Sets that cannot be constructed by the above rules are deemed nonmeasurable (e.g., Vitali sets).
I So, a measure space consists of three things:
  1. A set Ω on which measures are being defined (e.g., R)
  2. A set F of measurable sets. Each X ∈ F is a subset of Ω.
  3. A measure µ : F → [0, ∞].

Probability Spaces

Todd L. I A probability space comprises Veldhuizen [email protected] I A sample space of outcomes. For continuous n distributions the sample space is often R or R ; for Bibliography discrete distributions the sample space is often Z, N, {0, 1}, etc. I A class of measurable sets of the sample space;

I A probability measure that assigns probabilities to events.

I A probability space is a triple (Ω, F , µ) where

I Ω is a sample space. The elements x ∈ Ω are outcomes. I F is a collection of subsets of Ω we call events (the measurable sets); I µ : F → R is a probability measure. I Of the events F , these properties are required: I Ω ∈ F (we can measure the whole sample space) I F is closed under complementation and countable union:

I If X ∈ F then so is its complement (Ω \ X ) ∈ F ;

ECE750-TXB Probability Spaces II Lecture 11: Probability

I If X₁, X₂, ··· are in F, then so is ⋃_{i∈N} X_i.
I The probability measure µ must satisfy these properties (the Kolmogorov axioms):
  1. Probabilities are nonnegative: for every X ∈ F, µ(X) ≥ 0.
  2. µ(Ω) = 1. (Probabilities sum to 1.)
  3. For any finite or countable sequence of pairwise disjoint events X₁, X₂, ···,

     µ( ⋃_i X_i ) = Σ_i µ(X_i)

(The probability of one of the events Xi happening is the sum of their probabilities.)

I Finite probability spaces are very simple. We usually take F = 2^Ω (the powerset of Ω), i.e., every subset of Ω is measurable.
I Example: a random bit. We can define the probability space by Ω = {0, 1}, F = {∅, {0}, {1}, {0, 1}}, and

    µ(∅) = 0
    µ({0}) = 1/2
    µ({1}) = 1/2
    µ({0, 1}) = 1

I We can take a product of two probability spaces. For a uniform distribution on two bits, we can use the probability space

(Ω2, F2, µ2) = (Ω, F , µ) × (Ω, F , µ)

which has

I Sample space Ω2 = Ω × Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}, i.e., all combinations of two bits;

ECE750-TXB Probability Spaces IV Lecture 11: Probability

I F2 = F × F , containing all pairs of events drawn from Todd L. Veldhuizen F ; [email protected] I Probability measure µ2 defined by Bibliography

µ2(X , Y ) = µ(X )µ(Y )

for all X , Y . Note this implies that events in the first probability space of the product are independent from those of the second.

I We can repeat this process to obtain probability measures µ3, µ4,... on any finite number of bits. I One of the useful consequences of using the measure-theoretic treatment of probability is the Kolmogorov extension theorem,

which says that the finite distributions µ1, µ2, µ3,... (on bit sequences of length 1, 2, 3, etc.) define a unique stochastic process: ω I a probability space whose sample space Ω is infinite binary sequences, ECE750-TXB Probability Spaces V Lecture 11: Probability

Todd L. Veldhuizen [email protected]

I with an appropriate set F of measurable sets of Bibliography sequences, I with a probability measure µω on those measurable sets; I such that the finite projections of µω (e.g., the first k bits) match the finite distributions µ1, µ2, ··· . I Any set of finite distributions satisfying the Kolmogorov consistency conditions can be extended to a random process in this way. This particular stochastic process is an example of a Bernoulli process, in which outcomes are sequences of digits drawn from a binary alphabet.

ECE750-TXB Probability Basics I Lecture 11: Probability

Todd L. Veldhuizen I Informally, we will just write Pr(··· ) to mean the [email protected] probability of some event; the implication is that “··· ” Bibliography specifies some measurable event X ∈ F .

I For example, when we write Pr(Z ≥ 0) we are referring to µ(X ) where X is the set of outcomes X ∈ F in which Z ≥ 0. (Strictly speaking, it would not make sense to write µ(Z ≥ 0) because ‘Z ≥ 0’ is a formula, rather than a subset of Ω.) Probability essentials. 1. Independence. Two events X and Y are said to be independent if and only if

µ(X ∩ Y ) = µ(X ) · µ(Y )

i.e., Pr(both X and Y happen) = Pr(X ) · Pr(Y ) ECE750-TXB Probability Basics II Lecture 11: Probability

Similarly, events X₁, X₂, ··· are independent if and only if

    µ( ⋂_i X_i ) = Π_i µ(X_i)

2. Union bound. For any finite or countable set of events X₁, X₂, ···,

    µ( ⋃_i X_i ) ≤ Σ_i µ(X_i)    (1)

Eqn. (1) is an equality when the Xi are pairwise disjoint. The union bound is often used to obtain an upper bound on the probability of some rare event happening.

I Example: what is the probability that a binary string of length n chosen uniformly at random contains a run of ≥ √n ones? (For convenience, we limit ourselves to n = k² for some integer k.)

ECE750-TXB Probability Basics III Lecture 11: Probability

I Uniformly at random means the probability space is defined as a product measure Ω^n with Ω = {0, 1} and µ₁(0) = µ₁(1) = 1/2.
I Define X_i to be the event that the string contains a run of √n ones starting at position i, where i ≤ n − √n.
I The probability of a run of √n 1's starting at position i is µ_n(X_i) = 2^(−√n). (This is obvious, but to be finicky we could consider every possible string having such a run starting at i. Each such string w is an event {w}; the events are pairwise disjoint; there are 2^(n−√n) such strings, each with probability 2^(−n), so summing over the pairwise disjoint events (cf. Kolmogorov's 3rd axiom) the sum comes out to 2^(−√n).)
I The events X₁, X₂, ···, X_{n−√n} are definitely not independent. However, we can use the union bound to obtain

    Pr(run of √n ones) ≤ Σ_{i=1}^{n−√n} µ(X_i) = (n − √n) · 2^(−√n)

So, the probability of having a run of √n ones is O(n · 2^(−√n)), which goes to zero very quickly. (Note we used O(·), not Θ(·), because we have a possibly loose upper bound.)

3. Inclusion-Exclusion. For any two events X₁ and X₂,

µ(X1 ∪ X2) = µ(X1) + µ(X2) − µ(X1 ∩ X2)

More generally, for any events X₁, X₂, ···,

    µ( ⋃_i X_i ) = Σ_i µ(X_i) − Σ_{i<j} µ(X_i ∩ X_j) + Σ_{i<j<k} µ(X_i ∩ X_j ∩ X_k) − ···

ECE750-TXB Probability Basics V Lecture 11: Probability

I This can be easily remembered by looking at Venn Todd L. Veldhuizen diagrams: sum up the areas, subtract the things you [email protected] counted twice, then add in the things you subtracted too many times, ... Bibliography

I Inclusion-Exclusion is a particular instance of a general principle called Möbius inversion.
4. Random variables. In the measure-theoretic treatment of probability, a random variable Z is a function Z : Ω → V from outcomes to some set V. Commonly V is R (a continuous random variable), N (a discrete random variable), or {0, 1} (an indicator variable or Bernoulli variable).

I Example. Take bitstrings of length n again. Let Z be a random variable counting the number of ones in the string. Then formally Z is a function from {0, 1}n → N, so that for example

Z(010010110) = 4 ECE750-TXB Probability Basics VI Lecture 11: Probability

5. Indicator (Bernoulli) random variables. In the special case that a random variable takes on only the values 0 and 1, it is called an indicator variable or Bernoulli random variable. We can associate each event E ∈ F with an indicator variable Z_E that is 1 if E occurs, and 0 otherwise, i.e.

    Z_E(X) = 1 if X ∈ E, and 0 otherwise.

6. Expectation. For a random variable Z, we write E[Z] for the expected value of Z. This is simply the average over the sample space. For a finite probability space (i.e., the number of outcomes |Ω| is finite),

    E[Z] ≡ Σ_{X∈Ω} Z(X) µ({X})    (2)

But in the general case (i.e., a possibly infinite sample space) the expectation is an integral over the probability

space; in the case that Ω = R this is the familiar integral on the real line defined in terms of pdfs. I.e., if F(x) = µ((−∞, x]) (the cdf), and f(x) = F′(x) (the pdf), then

    E[Z] = ∫_{−∞}^{+∞} x f(x) dx

7. Linearity of Expectation. If Z₁, Z₂, ... are random variables, then

    E[ Σ_i Z_i ] = Σ_i E[Z_i]

I The usefulness of this cannot be overstated! The random variables may be far from independent, but we can still sum their expectations.

Todd L. Veldhuizen I In particular, the combination of indicator variables with [email protected] linearity of expectation is very powerful, and one of the most basic tools of the Probabilistic Method [1]. Bibliography

I In algorithm analysis, we can sometimes choose a set of indicator variables Z₁, Z₂, ... where each Z_i represents some piece of work we may or may not have to do. The expected value of the sum E[Z₁ + Z₂ + ···] is an upper bound on the average amount of work we need to do. For example, in [3, §2.5] you can find an ingeniously simple analysis of the average time complexity of QuickSort using this method.
I It can also be used to characterize what the typical "largest occurrence" of some pattern is in a random object. For example, what is the expected length of the longest run of 1's in a random n-bit string?

I Let t be the length of a run. (We will solve for t to find the most likely situation.)

ECE750-TXB Probability Basics IX Lecture 11: Probability

Todd L. Veldhuizen [email protected] I Let X1, X2,..., Xn−t+1 be indicator random variables,

where X_i = 1 just when a run of ≥ t 1's starts at position i. The probability of a run of t 1's is 2^(−t), so Pr(X_i = 1) = 2^(−t).
I Let Y = X₁ + X₂ + ... + X_{n−t+1} be the number of runs of length ≥ t. Although the X_i are not independent, we can use linearity of expectation to obtain

    E[Y] = E[X₁ + X₂ + ··· + X_{n−t+1}]
         = E[X₁] + E[X₂] + ··· + E[X_{n−t+1}]
         = (n − t + 1) · 2^(−t)

To find a likely value for t, we can set E[Y ] = 1: i.e., we want to find the value of t for which we expect to have one run of t 1’s. ECE750-TXB Probability Basics X Lecture 11: Probability

I This is a simple asymptotics exercise: taking logarithms of E[Y] = 1, we obtain

    t = log(n − t + 1)
      = log(n) − Θ(t/n)
      = log(n) − Θ((log n)/n)

I Note also the similarity to the union bound: if we view the X_i as events (rather than random variables), we can say

    Pr( ⋃_i X_i ) ≤ Σ_i Pr(X_i) = (n − t + 1) · 2^(−t)

This works because the expected value of an indicator variable is just its probability of being 1.
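As a quick sanity check (ours, not from the lecture), a short simulation shows the longest run of 1's concentrating near log₂ n:

    import math, random

    def longest_run(bits: str) -> int:
        best = cur = 0
        for b in bits:
            cur = cur + 1 if b == "1" else 0
            best = max(best, cur)
        return best

    n = 1 << 16
    runs = [longest_run("".join(random.choice("01") for _ in range(n)))
            for _ in range(20)]
    print(math.log2(n), sum(runs) / len(runs))   # both close to 16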

ECE750-TXB Probability Basics XI Lecture 11: Probability

I We can combine the above results to obtain a concentration inequality for the length of the longest run: set t = log(n) + δ. Then

    Pr(run of length ≥ log(n) + δ) ≤ (n − log n − δ + 1) · 2^(−log n − δ)
                                   = (n − log n − δ + 1) · (1/n) · 2^(−δ)
                                   = (1 − o(1)) · 2^(−δ)

So, with very little work we have obtained an exponential concentration for the length of the longest run.

8. Markov's inequality. This inequality gives us quick but loose bounds on the deviation of random variables from their expectation. If X is a random variable and α > 0, then

    Pr(|X| ≥ α) ≤ E[|X|] / α

Todd L. Veldhuizen [email protected] If X takes on only positive values, then the | · |’s may be dropped. Bibliography

I Example. Recall that the expected height of a binary search tree constructed by inserting a random sequence of n keys is c log n, with c ≈ 4.311. If we let H be a random variable representing the height of the tree, then Markov’s inequality gives

    Pr(H ≥ βn) ≤ (c log n)/(βn) = O((log n)/n)

So, the probability of getting a tree of height Θ(n) tends to zero as O((log n)/n).

ECE750-TXB Probability Basics XIII Lecture 11: Probability

Todd L. Veldhuizen [email protected]

I Example. Suppose an algorithm runs in average-case time f(n) but has worst-case time g(n), where g(n) ≻ f(n). What is the probability that the algorithm will take time g(n)? Treat the running time as a random variable; applying Markov's inequality, we immediately obtain

    Pr(running in time ≥ g(n)) ≤ f(n)/g(n)

Since f (n) ≺ g(n), the probability of running in time ≥ g(n) is tending to zero, at least as quickly as the ratio between the average and worst-case time. ECE750-TXB Probability Basics XIV Lecture 11: Probability

Todd L. Veldhuizen 9. Variance and Standard Deviation. Recall that the [email protected] variance and standard deviation of a random variable X are: Bibliography

    Var[X] = E[(X − E[X])²] = E[X²] − (E[X])²
    σ[X] = (Var[X])^(1/2)

A common special case: if X is a Bernoulli random variable with probability β of being 1, then Var[X] = β(1 − β). If X₁, X₂, ... are independent random variables, then

    Var[ Σ_i X_i ] = Σ_i Var[X_i]

ECE750-TXB Probability Basics XV Lecture 11: Probability

10. Chebyshev’s bound. Chebyshev’s bound, like Markov’s Todd L. Veldhuizen bound, is a tail inequality that gives a bound on how [email protected] slowly probability can drop as you move away from the mean. Bibliography 1 Pr(|X − [X ]| ≥ aσ[X ]) ≤ E a2

11. Distributions. A probability space (Ω, F, µ) on Ω = N or Ω = R coincides with our familiar idea of a "probability distribution." A random variable X : Ω → Ω_X has an associated distribution (probability space) (Ω_X, F_X, µ_X), where
I The sample space is Ω_X;
I The measurable events are given by

    F_X = {E ⊆ Ω_X : X⁻¹(E) ∈ F}

I The probability measure is µ_X(E) = µ(X⁻¹(E)).

I Consider a uniform distribution on 3-bit strings, i.e., Ω = {0, 1}³, and a random variable X that counts the number of bits that are 1, e.g. X(011) = 2. Then

    µ_X(2) = µ(X⁻¹(2)) = µ({110, 101, 011})

I For a continuous random variable, e.g., something of the form X : Ω → R, the familiar probability density function (pdf) and cumulative distribution function (cdf) are:

    F(x) = µ((−∞, x])
    f(x) = d/dx F(x)

ECE750-TXB Bibliography I Lecture 11: Probability

Todd L. Veldhuizen [email protected]

[1] Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, second edition, 2000.
[2] Per Martin-Löf. The definition of random sequences. Information and Control, 9(6):602–619, December 1966.
[3] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

ECE750-TXB Lecture 12: Markov Chains and their Applications

Todd L. ECE750-TXB Lecture 12: Markov Chains Veldhuizen and their Applications [email protected] Bibliography

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 20, 2007

ECE750-TXB Markov Chains I Lecture 12: Markov Chains and their Applications

Todd L. I The probabilistic counter is a simple example of a Veldhuizen [email protected] Markov chain. Roughly speaking, a Markov chain is a finite state machine with probabilistic transitions. Bibliography Consider the follow two-state Markov chain with transition probabilities as shown:

p " 1−p A B 1−p 9 b f p /.-,()*+ /.-,()*+ Let fA(n) be the probability that the machine is in state A at step n, similarly fB (n). Obviously the machine can only be in one state at a time, so

fA(n) + fB (n) = 1 ECE750-TXB Markov Chains II Lecture 12: Markov Chains and their Applications

Todd L. Let’s consider the scenario where the machine is in state Veldhuizen A to begin: [email protected] Bibliography fA(0) = 1

fB (0) = 0

These are the initial conditions. We can write equations to describe how the system evolves. For each state, we look at the incident edges, and write the probability of being in that state at time n in terms of where it was at time n − 1. We reach state A at time n

I with probability (1 − p) if we were in state A at time n − 1;

I with probability p if we were in state B at time n − 1

ECE750-TXB Markov Chains III Lecture 12: Markov Chains and This leads to the equations their Applications

Todd L. fA(n) = (1 − p)fA(n − 1) + pfB (n − 1) Veldhuizen [email protected] fB (n) = pfA(n − 1) + (1 − p)fB (n − 1) Bibliography To encode the initial conditions we add δ(n) to the equation for fA; this will result in fA(0) = 1. In matrix form, we can write the equations as:  f (n)   1 − p p   f (n − 1)   δ(n)  A = A + fB (n) p 1 − p fB (n − 1) 0 | {z } P These are called the Chapman-Kolmogorov equations. The matrix P is the transition matrix. Taking Z-transforms, we obtain

     −1    FA(z) 1 − p p z FA(z) 1 = −1 + FB (z) p 1 − p z FB (z) 0 | {z } P ECE750-TXB Markov Chains IV Lecture 12: Markov Chains and their Applications

Todd L. Rearranging, Veldhuizen [email protected]   Bibliography          1 − p p −1 1 0  FA(z) 1  z −  =  p 1 − p 0 1  FB (z) 0 | {z } | {z } | {z } | {z } P I F x

which we can write as the equation

(Pz−1 − I)F = x (1)

or,

z−1(P − zI)F = x (2)

ECE750-TXB Markov Chains V Lecture 12: Markov Chains and I If you are familiar with eigenvalues, the term (P − zI) should their Applications look conspicuous. Note that we can solve Eqn. (1) by Todd L. −1 Veldhuizen left-multiplying through by z(P − zI) to obtain [email protected] −1 F = z(P − zI) x Bibliography Furthermore, (P − zI)−1 can be written in terms of the −1 adj K adjoint and determinant: recall that K = |K| , where |K| is the determinant. We can then write the solution as: z · adj (P − zI) F = x |P − zI| So, the poles of the functions in F will occur at values of z where

|P − zI| = 0

Compare this to the characteristic equation for the eigenvalues of P:

|P − λI| = 0 ECE750-TXB Markov Chains VI Lecture 12: Markov Chains and their Applications

Todd L. I the poles are located at λ1, λ2, ··· , where the λi are Veldhuizen the eigenvalues of the transition matrix P. [email protected]

Bibliography Solving for FA(z), we obtain

(1 − (1 − p)z−1) F (z) = A (1 − z−1)(1 − (1 − 2p)z−1)

This has poles at z = 1 and 1 − 2p, and a zero at 1 − p.

I The pole at z = 1 reflects the limiting distribution of −1 c the chain. (Recall that Z [ 1−z−1 ] = c · u(n).) I The pole at 1 − 2p produces transient behaviour (so long as 0 < p < 1.)

I If p = 0 or p = 1, the zero at 1 − p cancels one of the poles.

ECE750-TXB Markov Chains VII Lecture 12: Markov Chains and Taking the inverse transform, we obtain. their Applications Todd L. 1 1 Veldhuizen f (n) = + (1 − 2p)n [email protected] A 2 2 |{z} | {z } Bibliography pole z=1 pole z=1−2p 1 1 f (n) = − (1 − 2p)n B 2 2

where we have used fB (n) = 1 − fA(n). Depending on the value of p, this two-state Markov chain can exhibit five distinct behaviours: 1. p = 0: the machine always stays in state 0. The only possible sequence is AAAAAA ··· . 2. p = 1: always get a strict alternation between states: ABABABAB ··· 1 3. p < 2 : get monotone, exponentially fast approach to 1 limiting distribution fA(n) = fB (n) = 2 . 1 1 4. p = 2 : get fA(n) = fB (n) = 2 for n ≥ 1. Every sequence of A0s and B0s is equally likely. ECE750-TXB Markov Chains VIII Lecture 12: Markov Chains and 1 their Applications 5. p > 2 : get oscillating decay to limiting distribution of 1 Todd L. fA(n) = fB (n) = 2 . Veldhuizen [email protected] I Markov Chains.

I A (finite) Markov chain is a set of states Bibliography S = {s1,..., sn} together with a matrix P of transition probabilities pij of moving from state si to state sj . ω I The sample space is Ω = S , i.e., infinite sequences of states.

I For an initial distribution u, the distribution after n steps is Pnu. (n) I Write pij for the probability of going from state i to state j in n steps. (Note: this is the entry (i,j) from the matrix Pn.) + (n) Write i → j if there is an n > 0 such that pij > 0, i.e., j can be reached from i. If i →+ j and j →+ i, we say i and j are communicating, and we can write i ↔ j. The relation ↔ is an

ECE750-TXB Markov Chains IX Lecture 12: Markov Chains and equivalence relation, and it partitions the states into their Applications

classes of states that communicate with each other. Todd L. Classification of Markov chains: Veldhuizen I [email protected] Markov chain U kkk UUU Bibliography kkk UUUU kkk UU Reducible Irreducible/Ergodic ii RRR iiii RRR iiii RRR Aperiodic/Mixing Periodic

I Irreducible: all states are communicating. This means there is only one class of long-term behaviours. If a chain is irreducible, there is a limiting distribution u such that Pu = u, and the chain spends a proportion of time u_i in state i. This chain is irreducible:
  [Figure: a three-state chain on A, B, and C whose probability-1 transitions form a cycle.]

Markov Chains X
and the distribution u = [1/3 1/3 1/3]^T is a limiting distribution. In particular, for a sample sequence,

  u_i = lim_{n→∞} (1/n) (# times in state i)   with probability 1.
(This is a Cesàro limit: the probability of being in state i after n steps might not converge — it might be 0, 0, 1, 0, 0, 1, ... — but the Cesàro limit does converge.) An irreducible Markov chain is ergodic — meaning that sample-space averages coincide with time averages (with probability 1). (Note that some authors use "ergodic" as a synonym for "mixing," which is not quite correct.)

I A Markov chain is aperiodic if for all i, j,

  gcd{n : p_ij^(n) > 0} = 1
otherwise, the chain is called periodic.

Markov Chains XI
I If a chain is not irreducible, it is called reducible. There are multiple equivalence classes under ↔; this means there is more than one possible long-term behaviour. A simple example is:
  [Figure: states A, B, C; from B the chain moves to A or to C, each with probability 1/2; A and C each loop back to themselves with probability 1.]
I An irreducible chain is mixing if for any initial distribution p_0,
  lim_{n→∞} P^n p_0 = u
where u is the limiting distribution. This chain is mixing:

  [Figure: a three-state mixing chain on A, B, and C; the transition probabilities shown are 3/4 and 1/4.]

Markov Chains XII

A chain that is mixing "forgets" its initial conditions, in the sense that the distribution P^n p_0 is asymptotically independent of the distribution p_0.
I An absorbing state is one that communicates only with itself:
  [Figure: a single state A with a self-loop of probability 1.]

A chain is called absorbing if every state communicates with an absorbing state.

I Applications discussed in lecture:

I PageRank (Google)

I Anomaly detection

ECE750-TXB Convergence Time of Markov Chains I Lecture 12: Markov Chains and their Applications

I Recall that the z-transform of the Chapman-Kolmogorov equations has the form
  F = (N(z) / |P − zI|) x

where P is the transition matrix, N(z) is some polynomial in z, and x encodes the initial conditions.

I The inverse z-transform of f_i(n) will have terms corresponding to the poles of N(z)/|P − zI|; for example,
  f_i(n) = u_i + α_2 (λ_2)^n + α_3 (λ_3)^n + ···

where λk are the poles; equivalently the eigenvalues of the transition matrix P. By convention we number the eigenvalues so that

  |λ_1| ≥ |λ_2| ≥ |λ_3| ≥ ···

Convergence Time of Markov Chains II
The largest eigenvalue, λ_1, always satisfies λ_1 = 1, which generates the limiting distribution term u_i.
I The rate of convergence to the limiting distribution is governed by |λ_2|, the magnitude of the second-largest pole/eigenvalue. (Or, the largest eigenvalue with |λ_i| < 1, if there are multiple eigenvalues equal to 1.) For example, if λ_2 = 0.99, we get a term of the form α_2 (0.99)^n, which has a half-life of ∆n ≈ 69 (very slow convergence). If λ_2 = 0.7, the half-life is ∆n ≈ 2 (very fast convergence); a small numeric check of these figures appears after this list.

I |λ2| is sometimes referred to as the SLEM: Second Largest Eigenvalue Modulus. (Modulus = magnitude).

I In designing randomized algorithms that can be modelled as Markov chains, we can optimize the convergence time by minimizing |λ2|.
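As a quick check of the half-lives quoted above: the number of steps needed for a transient term α_2 λ_2^n to fall by half is ∆n = ln(1/2) / ln|λ_2|. A tiny sketch (my own code):

public class HalfLife {
    // Steps needed for |lambda2|^n to decay by a factor of two.
    static double halfLife(double lambda2) {
        return Math.log(0.5) / Math.log(lambda2);
    }
    public static void main(String[] args) {
        System.out.println(halfLife(0.99)); // ~68.97: very slow convergence
        System.out.println(halfLife(0.7));  // ~1.94:  very fast convergence
    }
}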

ECE750-TXB Leader Election I Lecture 12: Markov Chains and their Applications Scenario: have some number of computers that can Todd L. I Veldhuizen communicate with each other. We want them to [email protected]

randomly elect one of them to be the leader. Each Bibliography computer must run the same algorithm.

I One method for leader election is the following:

I Initially, each machine thinks it is the leader.

I At each time step, machines broadcast whether they think they are the leader or not.

I If a machine thinks it is the leader, but some other machine also does, it flips a coin to decide whether to drop out or stay in the leadership race.

I The process is finished when there is only one machine that thinks it is the leader.

I If nobody thinks they are the leader, start all over again with everyone thinking they are the leader.

I Goal: minimize the amount of time needed to elect a leader. ECE750-TXB Leader Election II Lecture 12: Markov Chains and I Implies: make the Markov chain converge to a stable their Applications

configuration as quickly as possible. Todd L. I Implies: choose the dropout probability p in order to Veldhuizen [email protected] make the secondary pole as close to the origin as possible. Bibliography

I E.g., with 2 players.
I 4 states: 11 (everyone thinks they could be a leader), 10, 01 (one leader), 00 (both drop out).
  [Figure: from state 11 the chain stays at 11 with probability (1−p)², moves to 01 or 10 each with probability p(1−p), and moves to 00 with probability p²; 01 and 10 are absorbing; 00 returns to 11 with probability 1.]

Leader Election III
I We will show that the average time to elect a leader is minimized when p = (1/2)(3 − √5) ≈ 0.381966.
I Equations:
  [ f_00(n) ]   [ 0  0  0  p²      ] [ f_00(n−1) ]   [ 0    ]
  [ f_01(n) ] = [ 0  1  0  p(1−p)  ] [ f_01(n−1) ] + [ 0    ]
  [ f_10(n) ]   [ 0  0  1  p(1−p)  ] [ f_10(n−1) ]   [ 0    ]
  [ f_11(n) ]   [ 1  0  0  (1−p)²  ] [ f_11(n−1) ]   [ δ(n) ]

I The eigenvalues/pole locations are:

λ1 = 1

  λ_2 = 1
  λ_{3,4} = (1/2)p² − p + 1/2 ± (1/2)√(p⁴ − 4p³ + 10p² − 4p + 1)
(via Maple.)

Leader Election IV


I The positive pole dominates. To find the p that minimizes it, set dλ_3/dp = 0 to obtain
  p = (1/2)(3 − √5)

I This yields λ3 ≈ 0.618, corresponding to a half-life of about 1.44. (Choosing p = 1/2, an unbiased coin flip, would yield λ3 ≈ 0.64, and half-life ≈ 1.55.)
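The comparison between the biased and unbiased coin can also be checked by simulation. The sketch below is my own, not from the lecture; it models the two-player protocol exactly as in the chain above (the restart through state 00 costs one step) and estimates the mean number of rounds until a single leader remains. The biased choice p ≈ 0.382 should come out slightly ahead of p = 0.5.

import java.util.Random;

public class LeaderElectionSim {
    // Mean number of rounds to elect a leader among 2 machines: each candidate
    // drops out with probability p; if everyone drops out, the next round
    // restarts with both machines as candidates.
    static double meanRounds(double p, int trials, Random rng) {
        long total = 0;
        for (int t = 0; t < trials; t++) {
            int candidates = 2, rounds = 0;
            while (candidates != 1) {
                rounds++;
                if (candidates == 0) { candidates = 2; continue; } // restart costs a step
                int stillIn = 0;
                for (int i = 0; i < candidates; i++)
                    if (rng.nextDouble() >= p) stillIn++;          // stay in with prob 1-p
                candidates = stillIn;
            }
            total += rounds;
        }
        return (double) total / trials;
    }
    public static void main(String[] args) {
        Random rng = new Random(1);
        System.out.println("p = 0.382: " + meanRounds(0.382, 1_000_000, rng)); // roughly 2.43
        System.out.println("p = 0.500: " + meanRounds(0.500, 1_000_000, rng)); // roughly 2.50
    }
}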


ECE750-TXB Lecture 13: Information Theory

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

February 22, 2007

ECE750-TXB Entropy I Lecture 13: Information Theory

I The central concept of information theory is entropy, Todd L. Veldhuizen which measures: [email protected] 1. The “amount of randomness” in a distribution; 2. How many bits are required, on average, to represent an Bibliography object, assuming the distribution is known to both the encoder and decoder.

I Entropy is a functional from distributions to R: if µ is a distribution, then the entropy H(µ) is a real number describing “how random” the distribution µ is.

I The following requirements lead to a unique definition of entropy (up to a multiplicative constant). Suppose µ is a distribution on {1, 2,..., n}; we can treat µ as a n vector of R , i.e.,

µ = [p1 p2 ··· pn]

where p1 is the probability of outcome 1, etc. ECE750-TXB Entropy II Lecture 13: Information 1. Continuity: H(µ) should be a continuous function of µ. Theory Using a δ −  definition of continuity, for each µ and Todd L. Veldhuizen each  > 0 there is a δ > 0 such that [email protected]

  ‖µ − µ′‖ < δ implies |H(µ) − H(µ′)| < ε

2. Maximality: H(µ) attains its maximum value when 1 1 1 µ = [ n n ··· n ], i.e., a uniform distribution is the “most random.” 3. Additivity: If µ and µ0 are two distributions then

H(µ × µ0) = H(µ) + H(µ0)

i.e., the entropy of a product of probability spaces is the sum of their entropy. 4. Expandability: If we expand the distribution µ from the domain {1, 2,..., n} to the domain {1, 2,..., n + 1}, then

H([p1 p2 ··· pn]) = H([p1 p2 ··· pn 0])

ECE750-TXB Entropy III Lecture 13: Information Theory

I The unique function satisfying these conditions is
  H(µ) = β Σ_{i=1}^{n} −p_i log₂ p_i
where 0 log 0 = 0 to satisfy continuity. The constant β is usually taken to be 1 so that the entropy can be interpreted as "the number of bits needed to represent an outcome."
I Example. Let µ = [1/n 1/n ··· 1/n]. Then

  H(µ) = Σ_{i=1}^{n} −p_i log p_i = n · (−(1/n) log(1/n)) = log n

Entropy IV

Todd L. With a uniform distribution on n outcomes, log n bits Veldhuizen are needed to represent an outcome. The uniform [email protected]

distribution is the only distribution on n outcomes that Bibliography has entropy log n.

I Example. Let µ = [0 0 1 0 0 ··· 0]. Then

H(µ) = (n − 1)(−0 log 0) − (1 log 1) = 0

The only distributions with H(µ) = 0 are those where a single outcome has probability 1. 1 1 I Example. Let µ = [ 2 2 ]. Then H(µ) = 1: one bit is required to represent the outcome of a uniform distribution on two outcomes (e.g., a fair coin flip.)

Entropy V
I Example. Let µ = [p (1 − p)]. (A Bernoulli RV with probability p.) Then
  H(µ) = −p log p − (1 − p) log(1 − p)
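The entropy functional is a one-line computation; a small sketch of it (my own code), reproducing the examples above:

public class Entropy {
    // H(mu) = sum_i -p_i log2 p_i, with the convention 0 log 0 = 0.
    static double entropy(double[] mu) {
        double h = 0;
        for (double p : mu)
            if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        return h;
    }
    public static void main(String[] args) {
        System.out.println(entropy(new double[]{0.25, 0.25, 0.25, 0.25})); // log2 4 = 2
        System.out.println(entropy(new double[]{0, 0, 1, 0}));             // 0: point mass
        System.out.println(entropy(new double[]{0.5, 0.5}));               // 1: fair coin
        System.out.println(entropy(new double[]{0.9, 0.1}));               // ~0.469: biased coin
    }
}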

  [Figure: entropy of a Bernoulli random variable, H(µ) versus the probability p; H rises from 0 at p = 0 to 1 at p = 1/2 and falls back to 0 at p = 1.]

Noiseless coding theorem I

∗ Todd L. I Recall {0, 1} is the set of finite binary strings. Veldhuizen ∗ [email protected] I Let C ⊆ {0, 1} be a prefix-free set of codewords. (Prefix-free means no codeword occurs as a prefix of Bibliography another codeword; see the lecture on tries.)

I A code for µ is a function c : dom µ → C. (Note that we can treat c as a random variable.)

I The average code length of c is X c = E[c] = µ(x)|c(x)| x∈dom µ

I Shannon’s noiseless coding theorem states that: 1. c ≥ H(µ). That is, no code can achieve an average code length less than the entropy. 2. There exists a code achieving c ≤ 1 + H(µ).

ECE750-TXB Applications of information theory to algorithms Lecture 13: Information Theory

Todd L. Veldhuizen [email protected]

I Information theory is applied in algorithm design and Bibliography analysis in several ways: 1. Deriving lower bounds on time or space required for an algorithm or data structure using a uniform distribution. (Usually called “information-theoretic lower bound.”) 2. Designing structures for searching that exploit the entropy of the distribution on keys. 3. The noiseless coding theorem can be used to derive probability bounds. (Often such arguments are phrased in terms of Kolmogorov complexity; the technique is called “The Incompressibility Method.”) ECE750-TXB Information-theoretic lower bounds I Lecture 13: Information Theory

I Information theory provides a quick method for obtaining lower bounds, i.e., Ω(·)-bounds, on time and space.

I Example: sorting an array. Consider the problem of Bibliography sorting an array of n integers using comparison operations and swaps.

I Assume no two elements of the array are equal.

I Assume comparisons such as “a[i] ≤ a[j]” are the only method of obtaining information about the input array ordering.

I There are n! orderings of the input array, but only one possible output ordering. 00 I Each comparison test of the form “a[i] ≤ a[j] yields a true-or-false answer, i.e., at most one bit of information about the ordering of the input array. I To distinguish amongst the n! possible orderings of the input array, at least log n! tests are required, on average. This can be established in several ways:

ECE750-TXB Information-theoretic lower bounds II Lecture 13: Information Theory

I Decision tree: consider the sequence of comparisons performed by a sorting algorithm as a path through a tree, where each tree node represents a comparison test; the left branch is taken if a[i] > a[j], and the right branch is taken if a[i] ≤ a[j]. Each leaf of the tree is a rearrangement (sorting) of the input array. Since there are n! possible orderings of the input array, the tree must have ≥ n! leaves; since it is a binary tree, it must be of depth ≥ log n!.
I It must be possible to recover the initial array ordering from the sequence of test outcomes; so, the sequence of test outcomes constitutes a "code" for input array orderings. Apply the noiseless coding theorem: at least log n! comparisons are required, on average.

Information-theoretic lower bounds III

Todd L. I After sorting the input array, all information about the Veldhuizen original ordering is lost. If we wanted to “run the [email protected] algorithm backwards” and recover the original input Bibliography array, we would need log n! bits of information to reproduce the original, unsorted array. We say that sorting the array incurs an “irreversibility cost” of log n! bits. RAM and Turing-machine models allow at most O(1) bits of information to be “erased” at each step. Therefore Ω(log n!) = Ω(n log n) time steps are required. (These ideas are stock-in-trade of “thermodynamics of computation,” which investigates (among other things) the minimum amount of heat that must be produced when computations are performed.) From one of these three arguments, one can conclude that any comparison-based sorting algorithm requires Ω(log n!) = Ω(n log n) worst-case time.

ECE750-TXB Information-theoretic lower bounds IV Lecture 13: Information Theory

Todd L. I It is possible to sort in o(n log n) time in certain Veldhuizen circumstances: for example, to sort a large array of [email protected] values in the range [0, 255] one can simply maintain an Bibliography array of 256 counters, scan the array to build a histogram, and then expand the histogram into a sorted array. This can be done with O(n) operations.
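A sketch of the histogram idea just described (my own code, assuming values in the range [0, 255]):

public class CountingSort {
    // Sort an array of values in [0, 255] in O(n) time using a histogram.
    static void countingSort(int[] a) {
        int[] count = new int[256];
        for (int v : a) count[v]++;          // build the histogram: O(n)
        int i = 0;
        for (int v = 0; v < 256; v++)        // expand it back into sorted order
            for (int c = 0; c < count[v]; c++)
                a[i++] = v;
    }
    public static void main(String[] args) {
        int[] a = {200, 3, 3, 255, 0, 17};
        countingSort(a);
        System.out.println(java.util.Arrays.toString(a)); // [0, 3, 3, 17, 200, 255]
    }
}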

I However, due to either the second or third argument above (the coding argument or the reversibility cost argument), it is never possible to sort an array in less than O(H) operations, where H is the entropy of the ordering of the input array.

I Example: Binary Decision Diagrams (BDDs). BDDs are a very popular representation for boolean functions.

I A boolean function on k variables is a function f(x_1, ..., x_k) : {0, 1}^k → {0, 1}.

Information-theoretic lower bounds V
I Example: the majority function on three variables MAJ(x, y, z) is true when a majority of its inputs are true:
  x y z | MAJ(x, y, z)
  0 0 0 | 0
  0 0 1 | 0
  0 1 0 | 0
  0 1 1 | 1
  1 0 0 | 0
  1 0 1 | 1
  1 1 0 | 1
  1 1 1 | 1

I Note that a boolean function on k variables has 2^k possible input combinations (hence, 8 lines in the above truth table).

I BDDs are popular because they can often represent a boolean function very compactly, e.g., linear in the number of variables. Can they always do this?

Information-theoretic lower bounds VI
I Put a uniform distribution on boolean functions of k variables. There are 2^k possible input combinations; a boolean function can be true or false for each input combination; hence there are 2^{2^k} boolean functions on k variables.
I Under a uniform distribution, the entropy is log₂ 2^{2^k} = 2^k.
I Any representation of boolean functions can be viewed as a "code" to represent the boolean function in memory. Applying the noiseless coding theorem, we get

  c̄ ≥ 2^k

That is, any representation for boolean functions requires an average code length of 2^k bits — exponential in the number of inputs.

I We can therefore say: any representation of boolean functions requires Ω(2^k) bits per function, on average, with a uniform distribution.

Information-theoretic lower bounds VII

Todd L. Veldhuizen [email protected]

Bibliography

I The fact that BDDs often achieve small representations suggests that: 1. the distribution on boolean functions used in practice has quite low entropy; 2. BDDs are a reasonably efficient code for that distribution.

ECE750-TXB Entropy and searching I Lecture 13: Information Theory

I Scenario: retrieving records from a database. Todd L. Veldhuizen [email protected] I Let K be a set of keys, and n = |K|. We have seen that

binary search, or binary search trees, yield worst-case Bibliography time of O(log n).

I It turns out that the best average-case performance that can be achieved does not depend on n per se, but rather on the entropy of the input distribution on keys.

I Suppose µ is a probability distribution on K, indicating the frequency with which keys are requested.

I In some applications, search problems have highly nonuniform distributions on input keys.

I Often the probability distribution on very large key sets follows a distribution where the mth most popular key has probability ≺ m−1 (cf. Zipf’s law, Chris Anderson’s article The Long Tail.).

I Let H = H(µ) be the entropy of the input distribution. ECE750-TXB Entropy and searching II Lecture 13: Information Theory

Todd L. Veldhuizen [email protected] I We can achieve search times of O(H) by placing

commonly requested keys close to the root of the tree. Bibliography

I Example: suppose we are performing dictionary lookups on the following set of keys:

  Key              Probability
  entropy          1/2
  caffeine         1/4
  Markov           1/16
  thermodynamics   1/16
  convex           1/16
  stationary       1/16

ECE750-TXB Entropy and searching III Lecture 13: Information Theory

I We can use the following search tree:
  [Figure: a binary search tree with "entropy" at the root, "caffeine" and "stationary" as its children, and "convex", "markov", and "thermodynamics" as leaves.]

I We have placed the most commonly requested keys (entropy, caffeine) close to the root.

I Average depth is

  (1/2)(1) + (1/4)(2) + (1/16)(2) + 3 · (1/16)(3) ≈ 1.69
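A quick arithmetic check of this average depth (my own code; the depths are read off the tree above, with the root at depth 1):

public class AverageDepth {
    public static void main(String[] args) {
        // Keys in order: entropy, caffeine, stationary, convex, Markov, thermodynamics.
        double[] prob  = {1.0/2, 1.0/4, 1.0/16, 1.0/16, 1.0/16, 1.0/16};
        int[]    depth = {1,     2,     2,      3,      3,      3};
        double avg = 0;
        for (int i = 0; i < prob.length; i++) avg += prob[i] * depth[i];
        System.out.println(avg);   // 1.6875, i.e. about 1.69 comparisons on average
    }
}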

I There are good algorithms to design such search trees offline (i.e., when the distribution is known): one can build binary search trees that achieve optimal search times e.g. [1]. ECE750-TXB Entropy and searching IV Lecture 13: Information Theory

I However, in some situations it is impractical to build Todd L. Veldhuizen such trees offline, since the contents of the database are [email protected] changing, or the key distribution is not stationary (e.g., every day brings different “hot topics” that people are Bibliography searching for.)

I Splay trees [2] are fascinating binary search trees that reorganize themselves in response to input key distributions. The underlying idea is very simple: each time a key is requested, it is moved to the root of the tree. In this way, popular keys tend to hang out close to the root.

I Splay trees are known to be optimal for a static distribution on keys. Their performance for nonstationary distributions is a longstanding open question: the “dynamic optimality conjecture.”

ECE750-TXB Bibliography I Lecture 13: Information Theory

Todd L. Veldhuizen [email protected]

Bibliography
[1] Kurt Mehlhorn. Nearly optimal binary search trees. Acta Informatica, 5(4):287–295, 1975.
[2] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652–686, July 1985.

ECE750-TXB Lecture 14: Typical Inputs

Todd L. Veldhuizen ECE750-TXB Lecture 14: Typical Inputs [email protected]

Bibliography

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

February 27, 2007

ECE750-TXB Asymptotic Distributions I Lecture 14: Typical Inputs

Todd L. Veldhuizen I Scenario for average-case analysis: [email protected] I n is the input “size” Bibliography I Inputs: (Kn)n∈N is a family of sets indexed by n, giving the possible inputs to an algorithm of size n. I For each n, there is a probability distribution µn on inputs.

I Example: if an algorithm operates on binary strings, we could choose size to mean “length of the string,” and n Kn = {0, 1} (strings of length n.) I An asymptotic distribution is a family of probability distributions (µn)n∈N where µn is a probability measure on the sample space Kn.

I When our meaning is obvious, we will write µn(w) for the probability of the input w ∈ Kn. (If µn is a measure then to be fastidious we should write µn({w}), where {w} is the event that the outcome is w. But, writing µ(w) is clearer.) ECE750-TXB Asymptotic Distributions II Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

Bibliography I Example: for binary strings of length n, the uniform distribution on {0, 1}n is defined by

1 µ (w) = n 2n

n since there are |Kn| = 2 binary strings of length n.

I To design algorithms that behave well on average, it helps to know what properties are “typical” for the distribution of inputs.

ECE750-TXB Sets of asymptotic measure 1 I Lecture 14: Typical Inputs

S Todd L. Let K = Kn be all possible inputs of any length. I i∈N Veldhuizen [email protected] I For a class of inputs A ⊆ K, the asymptotic measure of A, if it exists, is given by Bibliography

µ∞(A) = lim µn(A ∩ Kn) n→∞

I Note that in general the limit may not exist. For example, taking Kn to be binary strings, the probability of the set

A = {w ∈ {0, 1}∗ : |w| is even}

alternates between 0 and 1:

µ3(A ∩ K3) = 0

µ4(A ∩ K4) = 1

µ5(A ∩ K5) = 0 ECE750-TXB Sets of asymptotic measure 1 II Lecture 14: Typical Inputs . . Todd L. Veldhuizen [email protected] and so the limit fails to exist. Bibliography I If µ∞(A) = 1, we can say

I A has asymptotic measure 1;

I A is almost surely true;

I A happens almost surely;

I The phrases with probability 1, almost certain, almost always, and with high probability are also used.

I The abbreviation a.s. is commonly used for almost surely.

I Let’s look at a few examples of almost sure properties of random binary strings: 1. Runs of 1’s 2. The balance of 0’s and 1’s 3. The position of the first nonzero bit 4. The number of prime divisors of the string when interpreted as a base-2 integer.

ECE750-TXB Runs in random strings I Lecture 14: Typical Inputs

Todd L. I Example: Runs of 1’s in binary strings. Veldhuizen n [email protected] I Let Kn = {0, 1} be binary strings of length n, and µn the uniform distribution. Bibliography

I Define the random variable R : Kn → N to be the length of the longest run of 1’s. For example,

R(0100111110100) = 5
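Computing R(w) for a given string takes a single pass; a small sketch (my own code):

public class LongestRun {
    // Length of the longest run of 1's in a binary string.
    static int longestRun(String w) {
        int best = 0, current = 0;
        for (char c : w.toCharArray()) {
            current = (c == '1') ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }
    public static void main(String[] args) {
        System.out.println(longestRun("0100111110100")); // 5, as in the example above
    }
}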

I Recall that in previous lectures we obtained a concentration inequality for R:

Pr(R ≥ t) ≤ (n − t + 1)2−t

We used the union bound: let Xi be the event that a run of t 1’s starts at position i; then

Pr(R ≥ t) = µ(X1 ∪ · · · Xn−t+1) ECE750-TXB Runs in random strings II Lecture 14: Typical Inputs n−t+1 Todd L. X Veldhuizen ≤ µ(Xi ) [email protected] i=1 Bibliography = (n − t + 1)2−t

where we are requiring that t ≤ n of course.

I Assume t ≺ n, set Pr(R ≥ t) = 1 and take logarithms:  t  t ≤ log(n − t + 1) = log n − Θ n

I Choosing t(n) = log n + δ, we obtain

  Pr(R ≥ log n + δ) ≤ (n − log n − δ + 1) 2^{−log n − δ}
                    = (1/n)(n − log n − δ + 1) 2^{−δ}
                    = (1 − o(1)) 2^{−δ}

ECE750-TXB Runs in random strings III Lecture 14: Typical Inputs

Todd L. Veldhuizen Conversely, [email protected]

Pr(R < log n + δ) = 1 − Pr(R ≥ log n + δ) Bibliography > 1 − (1 − o(1))2−δ

I If δ ∈ ω(1) then Pr(R < log n + δ) → 1.

I We can say “Almost surely, a binary string chosen uniformly at random does not have a run of length log n + ω(1).”

I Define

Aδ ≡ {w ∈ K : longest run length is < log |w| + δ(|w|)}

(Aδ is a family of sets of strings, indexed by a function δ.) ECE750-TXB Runs in random strings IV Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

I For any function δ, Bibliography

−δ(n) µn(Aδ ∩ Kn) > 1 − (1 − o(1))2

A less sharp, but clearer statement is:

−δ(n) µn(Aδ ∩ Kn) = 1 − O(2 )

Note δ ∈ ω(1) implies µ∞(Aδ) = 1.

I Aδ is an example of what we shall call a typical set.

ECE750-TXB Balance of 0’s and 1’s I Lecture 14: Typical Inputs

Todd L. I Example: Balance of 0’s and 1’s in a string. Veldhuizen [email protected] I Choose a binary string of length n uniformly at random, Bibliography and define the random variables Y1,..., Yn by: ( +1 if the i thbit is 1 Yi = −1 if the i thbit is 0

Then E[Yi ] = 0, and

2 Var[Yi ] = E[(Yi − E[Yi ]) ] = 1 Pn I Let Y = i=1 Yi . I Y can be interpreted as a “random walk” on Z, where each bit of the string indicates whether to move up or down.

I |Y | is the discrepancy between the number of zeros and ones. ECE750-TXB Balance of 0’s and 1’s II Lecture 14: Typical Inputs

Todd L. The expectation and variance of Y are: Veldhuizen I [email protected]

E[Y ] = 0 Bibliography n X Var[Y ] = Var[Yi ] = n i=1

I To bound the discrepancy |Y | we can use: Theorem (Chernoff inequality)

Let Y_1, ..., Y_n be discrete, independent random variables with E[Y_i] = 0 and |Y_i| ≤ 1 for all i. Let Y = Σ_{i=1}^{n} Y_i, and σ² = Var[Y] be the variance of Y. Then

  Pr(|Y| ≥ λσ) ≤ 2 e^{−λ²/4}

ECE750-TXB Balance of 0’s and 1’s III Lecture 14: 2 Typical Inputs I Applying the Chernoff inequality with σ = n, we obtain Todd L. √ −λ2/4 Veldhuizen Pr(|Y | ≥ λ n) ≤ 2e [email protected]

Bibliography I Let’s work the right-hand side of the inequality into the form 2−δ. Setting 2−δ = 2e−λ2/4 and solving we obtain

λ = 2pln 2(1 + δ)

I Substituting, Pr(|Y | ≥ 2pn(δ + 1) ln 2) ≤ 2−δ

I Let Bδ be the set of binary strings satisfying this bound: n p o Bδ = w ∈ K : discrepancy < 2 |w|(δ + 1) ln 2

I As in the previous example,

−δ µn(Bδ ∩ Kn) = 1 − O(2 ) ECE750-TXB First nonzero bit I Lecture 14: Typical Inputs

Todd L. Example: First nonzero bit in a string. Veldhuizen I [email protected] I As before consider binary strings of length n under a uniform distribution. Bibliography

I Let Y be an R.V. indicating the position of the first nonzero bit: for example,

Y (000010110111) = 5

1 I Y has a geometric distribution with probability p = 2 : 1 [Y ] = = 2 E 1 − p δ k X 1 Pr(Y ≤ δ) = = 1 − 2−δ 2 k=1

ECE750-TXB First nonzero bit II Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

Bibliography

I Let Cδ be strings whose first nonzero bit is at position ≤ δ; then

−δ µn(Cδ ∩ Kn) = 1 − 2

I Almost surely, a binary string of length n has a 1 in a position ≤ f (n) for any f ∈ ω(1). ECE750-TXB Erd¨os-Kac theorem I Lecture 14: Typical Inputs

Todd L. I Example: Number of prime divisors. Veldhuizen [email protected] I Let w be a binary string of length n chosen uniformly at random. Bibliography

I We can interpret w as a number (written in base 2): for example, given the string w = 010011, we can take

0100112 = 19

I Let W be a random variable counting the number of prime divisors of w.

I The Erd¨os-Kac theorem [1] states that the distribution of W converges to a normal distribution:

  E[W] = ln n + ln ln 2 + o(1)
  Pr(a ≤ (W − E[W]) / √E[W] ≤ b) = (1/√(2π)) ∫_a^b e^{−t²/2} dt + o(1)

ECE750-TXB Erd¨os-Kac theorem II Lecture 14: Typical Inputs

I Choosing a = −b and integrating the normal Todd L. Veldhuizen distribution, [email protected]

!   Bibliography W − E[W ] b Pr −b ≤ p ≤ b = 1 − erfc √ E[W ] 2

I We employ the following inequality, found on the internet (MathWorld) so it must be true:

2 e−α2 erfc(α) < √ √ π α + α2 + 2

I This yields

! b2 W − [W ] 2 e− 2 Pr −b ≤ E ≤ b > 1 − √ p π q 2 E[W ] √b + b + 2 2 2 ECE750-TXB Erd¨os-Kac theorem III Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected] I If b = b(n) = ω(1) then Bibliography !  b2  W − E[W ] − Pr −b ≤ p ≤ b > 1 − O e 2 E[W ]

where we have deliberately made the asymptotic bound less sharp to make the next step easier: setting

   b2  O 2−δ = O e− 2

√ we obtain b = 2δ ln 2.

ECE750-TXB Erd¨os-Kac theorem IV Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected] Therefore the number of prime divisors W satisfies I Bibliography

 p   −δ Pr |W − E[W ]| ≤ 2δ ln 2 · E[W ] > 1 − O 2 (1)

where

E[W ] = ln n + ln ln 2 + o(1)

I Let Dδ be the set of strings w ∈ K satisfying Eqn. (1), where n = |w|. ECE750-TXB Typical sets I Lecture 14: Typical Inputs The following definition of typical sets is loosely inspired by a similar Todd L. Veldhuizen idea in information theory, but using a parameter δ resembling the [email protected] “randomness deficiency” of Kolmogorov complexity [3, 2]. Bibliography Definition Let Aδ be a family of sets indexed by functions δ : N → R. We say Aδ is typical if

−δ(n) µn(Aδ ∩ Kn) = 1 − O(2 )

I We will call Aδ a typical set, even though strictly speaking it is a family of sets indexed by δ.

I The following properties are straightforward: 1. If δ ∈ ω(1) then µ∞(Aδ) = 1. 2. If Aδ ⊆ Bδ, and Aδ is a typical set, then so is Bδ. S 3. The set of all possible inputs Kδ = K = Kn is n∈N typical.

ECE750-TXB Typical sets II Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

I A typical set represents an almost sure property with an Bibliography exponential concentration inequality:

I Every input is in Aδ almost surely when δ ∈ ω(1); −δ I The probability of not being in Aδ falls off as O(2 ).

I The intersection Aδ ∩ Bδ ∩ Cδ ∩ ... of any finite number of typical sets is also typical. We prove this for the intersection of two sets; any finite number follows by induction. Proposition

If Aδ and Bδ are typical, so is Cδ = Aδ ∩ Bδ. ECE750-TXB Typical sets III Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

I We will use the following elementary probability identity. Bibliography

Pr(α ∧ β) = Pr(¬¬(α ∧ β)) = 1 − Pr(¬(α ∧ β)) = 1 − Pr(¬α ∨ ¬β) = 1 − [Pr(¬α) + Pr(¬β) − Pr((¬α) ∧ (¬β))] = 1 − (1 − Pr(α)) − (1 − Pr(β)) + Pr((¬α) ∧ (¬β)) = Pr(α) + Pr(β) − 1 + Pr(¬α ∧ ¬β)

Proof.

ECE750-TXB Typical sets IV Lecture 14: Typical Inputs

Let A_{δ,n} = A_δ ∩ K_n, and similarly for B_{δ,n}. Write Ā_{δ,n} for the complement K_n \ A_{δ,n}. We start from the following identity:

  µ(A_{δ,n} ∩ B_{δ,n}) = µ(A_{δ,n}) + µ(B_{δ,n}) − 1 + µ(Ā_{δ,n} ∩ B̄_{δ,n})

Note that

  µ(Ā_{δ,n}) = 1 − µ(A_{δ,n}) = 1 − (1 − O(2^{−δ})) = O(2^{−δ})

and similarly for µ(B̄_{δ,n}). Since µ(Ā_{δ,n} ∩ B̄_{δ,n}) ≤ max(µ(Ā_{δ,n}), µ(B̄_{δ,n})),

  µ(Ā_{δ,n} ∩ B̄_{δ,n}) = O(2^{−δ})

Therefore

  µ(A_{δ,n} ∩ B_{δ,n}) = (1 − O(2^{−δ})) + (1 − O(2^{−δ})) − 1 + O(2^{−δ}) = 1 − O(2^{−δ})

Typical sets V

I Binary strings chosen uniformly at random have all of the following properties, almost surely, for any δ ∈ ω(1):
1. A run of 1's no longer than log n + δ;
2. The discrepancy between the number of 0's and 1's is less than √(4n(δ + 1) ln 2);
3. The first nonzero bit appears at a position ≤ δ;
4. When viewed as a base-2 integer, the string has ln n + ln ln 2 prime divisors, ± δ^{1/2} √(2 ln 2 (ln n + ln ln 2)).
I For example, choosing δ = 10 (2^{−δ} ≈ 10^{−3}), a 1024-bit string has, with fairly high probability:
1. A run of ≤ 20 bits;
2. A discrepancy of ≤ 176 bits;
3. A 1 in the first 10 positions;
4. About 6.5 ± 9.5 prime divisors.

ECE750-TXB Typical sets VI Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

Bibliography

I (Note that the constant factors associated with the concentration inequality 1 − O(2−δ) may change when we take intersections of typical sets. For these examples I am just using δ = 10 and hiding the constant factor inside the waffly “fairly high probability.”) ECE750-TXB Typical sets as a filter Lecture 14: Typical Inputs

I We have established the following properties of typical Todd L. sets: Veldhuizen [email protected] 1. If Aδ is typical, and Aδ ⊆ Bδ, then Bδ is typical. 2. If Aδ and Bδ are typical, then Aδ ∩ Bδ is typical. Bibliography S 3. The set of all inputs Kδ = K = Kn is typical. n∈N 4. The empty set ∅ is not typical. I The typical sets form a mathematical structure called a filter. K I A filter on a set K is a collection F ⊆ 2 of subsets of K satisfying these properties: 1. If A ∈ F and A ⊆ B, then B ∈ F . 2. If A, B ∈ F then (A ∩ B) ∈ F ; 3. K ∈ F ; 4. ∅ 6∈ F . I Filters are a bit abstract, but powerful. One useful application is an ultraproduct, which can be used to 1 construct a single (infinite) structure that embodies the Σ1 1 properties of typical inputs. (Σ1 properties are definable by second-order sentences of the form

∃R1,..., Rk . ψ(R1,..., Rk ) — which includes first-order 1 sentences. For example, χ-colourability of graphs is a Σ1 property.)

ECE750-TXB Typical sets and average-case time I Lecture 14: Typical Inputs

Todd L. I Say an algorithm runs in time O(f (n)) on a typical set Veldhuizen [email protected] Aδ if for any δ ∈ O(1), the algorithm has worst-case performance O(f (n)) on Aδ. Bibliography

I Question: does running in time O(f (n)) on a typical set imply average-case time O(f (n))?

I Answer: not necessarily — it’s easy to construct counterexamples. Consider the following algorithm on strings:

  function Broken(w)
    if w = 111···11 then
      wait for 2^{2^{|w|}} seconds
    return

I It returns right away, unless the string is all 1’s, in n which case it takes O(22 ) time (where n = |w|). ECE750-TXB Typical sets and average-case time II Lecture 14: Typical Inputs

Todd L. I So, it runs in O(1) time on a typical set. (Using, for Veldhuizen [email protected] example, the set Aδ of strings with runs of length

< log |w| + δ.) Bibliography −n −n 2n 2n−n I Average time is (1 − 2 ) · c + 2 · O(2 ) = O(2 ). Doubly-exponential!

I Suppose the worst-case running time of the algorithm can be expressed in the form O(g(n, δ)): note that O(g(n, O(1))) gives worst-case time on a typical set.

I The average-case time is then:

log |Kn| X T (n) = O(2−δ)O(g(n, δ)) δ=0

log |Kn| X = O(2−δg(n, δ)) δ=0

ECE750-TXB Typical sets and average-case time III Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

Bibliography

P∞ −k c Note that anything of the form k=0 2 k where c ∈ O(1) converges to a constant — an exponential concentration swallows any polynomial. g(n,δ) I If g(n,O(1)) is at most polynomial in δ, then worst-case time on the typical set equals average-case time. ECE750-TXB Example: No-carry adder I Lecture 14: Typical Inputs

Todd L. Veldhuizen [email protected]

I The no-carry adder is a simple algorithm for adding Bibliography binary numbers.

I Let x0, y0 be n-bit integers. The no-carry adder repeats the following iteration:

xi+1 = xi ⊕ yi

yi+1 = (xi &yi ) LSH 1

where ⊕ is bitwise XOR, & is bitwise AND, and LSH 1 shifts left by one bit. At each iteration xi holds a partial sum, and yi holds carry bits. The iteration continues until yi = 0.

ECE750-TXB Example: No-carry adder II Lecture 14: Typical Inputs

I Example: to calculate the sum of Todd L. Veldhuizen [email protected] x0 = 011011102 Bibliography y0 = 000000102

the following steps occur:

  x_1 = 01101100₂    y_1 = 00000100₂
  x_2 = 01101000₂    y_2 = 00001000₂
  x_3 = 01100000₂    y_3 = 00010000₂
  x_4 = 01110000₂    y_4 = 00000000₂
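The iteration is only a few lines of code. The sketch below is my own; it reproduces the trace above and counts the iterations.

public class NoCarryAdder {
    // Add two non-negative integers using only XOR, AND and a left shift,
    // returning the number of iterations used.
    static int add(long x, long y) {
        int iterations = 0;
        while (y != 0) {
            long carry = (x & y) << 1;   // carry bits, shifted into place
            x = x ^ y;                   // partial sum without carries
            y = carry;
            iterations++;
        }
        System.out.println("sum = " + Long.toBinaryString(x));
        return iterations;
    }
    public static void main(String[] args) {
        long x0 = 0b01101110, y0 = 0b00000010;
        System.out.println("iterations = " + add(x0, y0)); // sum = 1110000, 4 iterations
    }
}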

I How many iterations are required?

Example: No-carry adder III
I The number of iterations is determined by the length of the longest "carry sequence," i.e., the longest span across which a carry must be propagated.
I For there to be a carry sequence of length t, there must be a bit position where x_0 and y_0 are both 1, followed by t − 1 positions where x_0 and y_0 have opposite bits:

x0 = 01101110

y0 = 00000010

I The probability of a carry sequence of length t is easily bounded by employing the union bound: let Zi be the event that x0, y0 match in bit positions i through i + t − 2. Then [ X Pr( Zi ) ≤ Pr(Zi ) = (n − t + 1)2−t+1

ECE750-TXB Example: No-carry adder IV Lecture 14: This is very close to the equation for a run of 1’s; using Typical Inputs Todd L. t = log n + δ, we obtain Veldhuizen [email protected] [ −δ Pr( Zi ) ≤ 1 − O(2 ) Bibliography

I So, the number of iterations is O(g(n, δ)) where g(n, δ) = log n + δ

I To calculate the average case: n X T (n) = O(2−δ)O(log n + δ) δ=0 n X = O(2−δ log n + δ2−δ) δ=0 = O(log n)

P −δ P −δ since δ log n2 = log n · δ 2 = log n, and P −δ δ δ2 = O(1). ECE750-TXB Bibliography I Lecture 14: Typical Inputs

[1] P. Erdős and M. Kac. The Gaussian law of errors in the theory of additive number theoretic functions. Amer. J. Math., 62:738–742, 1940.
[2] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer-Verlag, New York, 2nd edition, 1997.
[3] V. G. Vovk. The Kolmogorov-Stout law of the iterated logarithm. Mat. Zametki, 44(1):27–37, 154, 1988.

ECE750-TXB Lecture 16: Randomized Algorithms

Todd L. ECE750-TXB Lecture 16: Randomized Veldhuizen Algorithms [email protected] Bibliography

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 6, 2007

ECE750-TXB Stochastic algorithms and data structures I Lecture 16: Randomized Algorithms I A stochastic algorithm or data structure is one with Todd L. Veldhuizen access to a stream of random bits (a.k.a. coin flips). [email protected] These random bits can be used to make or influence I Bibliography decisions about how to proceed. The intended effect might be to:

I Avoid worst cases

I Achieve an average-case performance even for arbitrary inputs;

I Use the random bits to guess answers, if good answers are plentiful.

I In understanding stochastic algorithms/data structures there are two distributions to keep in mind: 1. The distribution of inputs; 2. The distribution of the random bits being used.

I Some possibly familiar examples of stochastic algorithms include simulated annealing, genetic algorithms, Kernighan-Lin graph partitioning, etc. ECE750-TXB Stochastic algorithms and data structures II Lecture 16: Randomized Algorithms

Todd L. Veldhuizen I A randomized algorithm (or data structure) is one that [email protected] offers good performance for any input, with high probability. i.e., there are no classes of inputs/operation Bibliography sequences for which the performance is asymptotically poor.

I A classic application of randomization is to Hoare’s QuickSort. To sort an array A[1 ... n]: 1. If n = 1 then done. 2. Otherwise, choose a pivot element A[i] by examining some finite number of elements of the array. 3. Partition the array into three parts: items > A[i], items < A[i], and items = A[i]. 4. Recursively sort the first two partitions, and merge the resulting arrays.

ECE750-TXB Stochastic algorithms and data structures III Lecture 16: Randomized Algorithms

I If the pivot element is chosen deterministically, then we Todd L. 2 Veldhuizen can force the algorithm to take Θ(n ) time by designing [email protected] the input array carefully. For example, a common Bibliography heuristic is “median of three”: choose the pivot to be the median of A[1], A[bn/2c], A[n]. By placing the maximum elements of the array in these positions, the array is partitioned into subarrays of size n − 3, 3, and 0. Repeating this design recursively yields an array for which QuickSort requires Θ(n2) time.

I In Randomized Quicksort, one chooses the pivot element uniformly at random from 1 ... n. Then, it is impossible to design a worst-case array input without knowing in advance the random bits being used to choose the pivot.
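A minimal sketch of randomized QuickSort (my own code; it uses a single uniformly random pivot and a two-way partition rather than the three-way partition described above):

import java.util.Random;

public class RandomizedQuickSort {
    private static final Random RNG = new Random();

    static void sort(int[] a, int lo, int hi) {        // sorts a[lo..hi] in place
        if (lo >= hi) return;
        int pivot = a[lo + RNG.nextInt(hi - lo + 1)];   // pivot chosen uniformly at random
        int i = lo, j = hi;
        while (i <= j) {                                // partition around the pivot value
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        sort(a, lo, j);
        sort(a, i, hi);
    }
    public static void main(String[] args) {
        int[] a = {5, 1, 4, 1, 5, 9, 2, 6};
        sort(a, 0, a.length - 1);
        System.out.println(java.util.Arrays.toString(a)); // [1, 1, 2, 4, 5, 5, 6, 9]
    }
}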

I Performance for randomized algorithms is usually measured as “worst average case”: ECE750-TXB Stochastic algorithms and data structures IV Lecture 16: Randomized Algorithms I The time required for an input w, which we write T (w), Todd L. is no longer a deterministic function, but a random Veldhuizen variable of the coin flip sequence used by the algorithm. [email protected] Random bits (coin flips) s = 011001101100... Bibliography

input w Randomized Algorithm

I We measure the time required by the algorithm as

maxw∈Kn Es [T (w)]

I The maximum over all inputs w ∈ Kn of length n I of the expectation with respect to the random bit sequence s of the running time.

I The input distribution is ignored: one is concerned with the worst-case (with respect to inputs) of the average time (with respect to the random bits).

ECE750-TXB Randomized Equality Protocol I Lecture 16: Randomized Algorithms

Todd L. reliable Machine A communication Machine B Veldhuizen File of n bits File of n bits [email protected]

Bibliography I Consider the problem of maintaining a mirror of a large database across a reliable network connection. Both machine A and B have a copy of the database, and we wish to determine whether the files are the same.

I Any algorithm achieving zero error for arbitrary files must transmit ≥ n bits: one can do no better than just transmitting the entire file from machine A to B.

I Why? Each bit transmitted can be thought of as the outcome of some test performed on a file. If t tests are performed, and t < n, then there are 2t test outcomes and 2n > 2t possible files; by pigeonhole there must be two different files with the same test outcomes. ECE750-TXB Randomized Equality Protocol II Lecture 16: Randomized Algorithms I Note that if we transmit, e.g., an md5 checksum, there Todd L. exist pairs of files that are different but have the same Veldhuizen checksums, called hash collisions. (In fact there are [email protected]

growing databases one can access on the internet to Bibliography attempt to produce md5 hash collisions.)

I There is a simple randomized algorithm that: 1. Transmits O(log n) bits; 2. Achieves an astronomically low error probability, and this probability can be made as low as desired; 3. It is impossible to produce “hash collisions” that reliably cause the algorithm to wrongly report files are equal when they are not.

I Randomized Equality Protocol: I Alice has a file x = x0x1 ··· xn−1, and Bob has a file y = y0y1 ··· yn−1.(xi , yi are bits; we interpret x and y as large integers.) 2 I Alice chooses a prime p uniformly at random in [2, n ]. (This prime can be represented in ≤ 2 log n bits.)

ECE750-TXB Randomized Equality Protocol III Lecture 16: Randomized Algorithms

I Alice computes Todd L. Veldhuizen [email protected] s = x mod p Bibliography and transmits s and p to Bob. (This requires ≤ 2d2 log ne bits, plus change.)

I Bob computes

q = y mod p

If q = s, Bob outputs “x = y.” If q 6= s, Bob outputs “x 6= y.” 16 I Note that for a file of 10 bytes (≈ 900 Tb), the amount of data transmitted is ≈ 256 bytes.

I To analyze the error, we take the usual “worst-case average” approach: for the worst possible choice of files, what is the average probability of error? ECE750-TXB Randomized Equality Protocol IV Lecture 16: Randomized Algorithms

I Say a prime p is ‘bad’ for (x, y) if x mod p = y Todd L. Veldhuizen mod p, but x 6= y. Otherwise, say p is ‘good’ for (x, y). [email protected] Our general approach is to prove that the ‘good’ primes Bibliography vastly outnumber the bad ones, and so our chance of picking a ‘good’ prime is high.

I The probability that an error occurs is

#bad primes in [2, n2] #primes in [2, n2]

2 I The number of primes in [2, n ] is

  π(n²) ∼ Li(n²) ∼ n² / ln n²
(Prime number theorem; Li is the logarithmic integral.)

Randomized Equality Protocol V
I An error occurs when x ≠ y but x, y are the same modulo p, i.e., we can write
  x = x′ · p + s
  y = y′ · p + s
for some integers x′, y′. Then p divides (x − y), since x − y = x′ · p + s − (y′ · p + s) = (x′ − y′) · p. Let r = |x − y|. Since r ≤ 2^n, r has ≤ n − 1 prime divisors. The probability ε of p being a prime divisor of r is therefore
  ε ≤ (n − 1) / π(n²) ∼ (2 ln n) / n
Therefore the probability of error is
  ε ≤ (1 − o(1)) (2 ln n) / n

Randomized Equality Protocol VI
For example, if n = 10^16, the error probability is ε ≈ 10^{−14}.
I This is a specific example of a general pattern: "abundance of witnesses." The principle is that if x ≠ y and p does not divide (x − y), then p is a "witness" to the fact "x ≠ y." There are lots of witnesses, so if we choose a potential witness (a prime) at random, we're likely to find one.

I To get an even lower error, we can repeat the protocol k times: Alice chooses k primes uniformly at random from [2, n²] and transmits x mod p_i for each prime. With k independent trials, and failure probability ε in each, the probability of k failures is ≤ ε^k. For example, with n = 10^16 and k = 10, by sending ≈ 2 kb of data, we can obtain a probability of error ≈ 10^{−141}. This is an example of success amplification.
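A sketch of one round of the protocol using java.math.BigInteger (my own code; the class and method names are illustrative, the prime is drawn from [2, n²] by rejection sampling, and isProbablePrime stands in for an exact primality test):

import java.math.BigInteger;
import java.security.SecureRandom;

public class RandomizedEquality {
    static final SecureRandom RNG = new SecureRandom();
    static final BigInteger TWO = BigInteger.valueOf(2);

    // Draw a (probable) prime uniformly from [2, n^2] by rejection sampling.
    static BigInteger randomPrime(BigInteger nSquared) {
        while (true) {
            BigInteger p = new BigInteger(nSquared.bitLength(), RNG);
            if (p.compareTo(TWO) >= 0 && p.compareTo(nSquared) <= 0
                    && p.isProbablePrime(40))
                return p;
        }
    }

    // One round: Alice sends (p, x mod p); Bob compares against y mod p.
    static boolean probablyEqual(BigInteger x, BigInteger y, int nBits) {
        BigInteger nSquared = BigInteger.valueOf(nBits).pow(2);
        BigInteger p = randomPrime(nSquared);
        return x.mod(p).equals(y.mod(p));
    }

    public static void main(String[] args) {
        BigInteger x = new BigInteger(1 << 16, RNG);       // a random 65536-bit "file"
        BigInteger y = x.add(BigInteger.ONE);              // a file that differs from x
        System.out.println(probablyEqual(x, x, 1 << 16));  // true: equal files never err
        System.out.println(probablyEqual(x, y, 1 << 16));  // false with high probability
    }
}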

ECE750-TXB Classification of randomized algorithms I Lecture 16: Randomized Algorithms

Todd L. Veldhuizen [email protected] Stochastic algorithms: use random bits in some way I Bibliography 1. Las Vegas algorithms: no error; use coin flips to avoid worst cases; get good worst-case expected time. 2. Monte Carlo algorithms: allow some probability of error.

2.1 One-sided Monte Carlo (1MC): a NO answer is always correct, a YES answer has probability of error (i.e., false positives are possible). 2.2 Bounded error Monte Carlo (2MC): computes a 1 function f (w) with probability ≥ 2 + δ of being correct, δ > 0. 2.3 Unbounded error Monte Carlo (UMC). ECE750-TXB One-sided Monte Carlo I Lecture 16: Randomized Algorithms

Todd L. I Recall that a decision problem is described by some set Veldhuizen L; we are asked to decide “Is w ∈ L?” [email protected]

I An algorithm A is a One-sided Monte Carlo when: Bibliography 1 I If x ∈ L then Prob(A(x) = 1) ≥ 2 . I If x 6= L then Prob(A(x) = 0) = 1.

I The randomized-equality protocol we saw was a one-sided Monte Carlo algorithm:

I It had zero probability of error if the files were equal, and some probability of error when the files were unequal.

I To match the definition of one-sided MC, we could take the set being decided to be pairs of files of n bits that differ. (A NO answer to the decision problem = the files are equal.) 1 I Since the probability of error is < 2 , can get an error δ with t ≤ − log δ repetitions.

ECE750-TXB Bounded-Error Monte Carlo I Lecture 16: Randomized Algorithms

Todd L. Also known as Two-Sided Monte Carlo. Veldhuizen I [email protected] ∗ ∗ I Computes a function f :Σ → Σ , e.g., a function of binary strings. Bibliography 1 I Probability of being correct is ≥ 2 + , for some  > 0 constant.

I Since  > 0, to obtain an error probability < δ we need only a constant number of iterations, independent of n.

I If  ∈ o(1) then might need exponentially many repetitions (in n) to achieve an error probability < δ.

I Success amplification:

I Run the algorithm t times.

I If an output appears at least dt/2e times, output it (i.e., majority vote).

I Otherwise, output “?” (algorithm fails.) ECE750-TXB Bounded-Error Monte Carlo II Lecture 16: Randomized Algorithms

Todd L. Veldhuizen [email protected]

Bibliography I A tedious analysis shows that to achieve an error probability < δ, it suffices to choose

2 ln δ t ≥ ln(1 − 42)

I Note this formula does not depend on the length of the input.

ECE750-TXB Unbounded Error Monte Carlo Lecture 16: Randomized Algorithms

I Have probability 1/2 + ε(n) of being correct, i.e., better than chance (but possibly not much better!)
I Using the same formula as before, to obtain an error δ, the number of repetitions required is
  t ≥ (2 ln δ) / ln(1 − 4ε²(n))
If ε ∈ o(1), then 1 − 4ε²(n) → 1, and ln(1 − 4ε²(n)) → 0. So, t ∈ ω(1).
I Need t ∈ Ω(ε^{−2}) to keep the error bounded.

I Could be that exponentially many repetitions of the algorithm are required. ECE750-TXB Bibliography I Lecture 16: Randomized Algorithms

Todd L. Veldhuizen [email protected] [1] Rajiv Gupta, Scott A. Smolka, and Shaji Bhaskar. On randomization in sequential and distributed Bibliography algorithms. ACM Comput. Surv., 26(1):7–86, 1994. bib pdf

[2] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, 1997 edition, 1995. bib [3] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Comput. Surv., 28(1):33–37, 1996. bib ECE750-TXB Lecture 17: Algorithms for binary relations and graphs

ECE750-TXB Lecture 17: Algorithms for Todd L. Veldhuizen binary relations and graphs [email protected]

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 8, 2007

ECE750-TXB Binary Relations I Lecture 17: Algorithms for binary relations and graphs

2 Todd L. I Recall that a binary relation on a set X is a set R ⊆ X . Veldhuizen [email protected] I We may interpret a binary relation as a directed graph G = (X , R).

I Some common axioms relations may satisfy: 1. Transitive (T ):

∀x, y, z . (R(x, y) ∧ R(y, z) → R(x, z))

/ #/ ;/ If there is a path from x to z, there is an edge from x to z.     ECE750-TXB Binary Relations II Lecture 17: Algorithms for 2. Reflexive: ∀x . R(x, x) binary relations and graphs

Todd L. Veldhuizen < / b [email protected] Every vertex has an edge to itself. 3. Symmetric (S)   ∀x, y . R(x, y) → R(y, x)

 Z If there is an edge from x to y, there is an edge from y to x.   Usually one draws the graph without arrows:

and it is called simply a “graph” rather than a directed graph.  

ECE750-TXB Binary Relations III Lecture 17: Algorithms for 4. Antisymmetric (A) binary relations and graphs

Todd L. ∀x, y . R(x, y) ∧ R(y, x) → (x = y) Veldhuizen [email protected] When the relation is reflexive, transitive and also antisymmetric, it is a partial order.

I A rough classification of binary relations:

    Binary Relation/Directed Graph
      Graph (S)
      Preorder/Quasiorder (T,R)
        Equivalence (T,R,S)
        Partial order/Poset (T,R,A)
          Tree order
            Total order

Binary Relations IV

Todd L. Veldhuizen [email protected]

I Good algorithms for managing the common classes of binary relations are known. If you can identify the abstract relation(s) underlying a problem, this may lead you directly to efficient algorithms.

ECE750-TXB Lecture 17: Algorithms for binary relations and graphs

Todd L. Veldhuizen [email protected]

Part I

Equivalence Relations ECE750-TXB Equivalence relations and partitions I Lecture 17: Algorithms for binary relations and graphs

Todd L. Veldhuizen I An equivalence relation ∼ is a binary relation that is [email protected] reflexive, transitive, and symmetric. (The most familiar example: equality, “=”).

I Pictured as a graph, an equivalence relation is a collection of cliques:

  [Figure: the relation drawn as a graph: a clique on {a, b, c, d} and a clique on {e, f, g}.]

2 I For an equivalence ∼ ⊆ X , we write

I [a]∼ = {b ∈ X : a ∼ b} for the equivalence class of a;

ECE750-TXB Equivalence relations and partitions II Lecture 17: Algorithms for binary relations and graphs

I X / ∼ for the set of equivalence classes induced by ∼: Todd L. Veldhuizen [email protected] X / ∼ = {[a]∼ : a ∈ X }

X / ∼ is a partition. (Recall that a partition of a set X is a collection of subsets Y1,..., Yk of X that are S pairwise disjoint and satisfy Yi = X .)

I Example: In the above figure, the equivalence classes are {{a, b, c, d}, {e, f , g}}.

I Example: take N with a ∼ b ≡ (a mod 5 = b mod 5). The equivalence classes N/ ∼ are {{0, 5, 10,...}, {1, 6, 11,...},..., {4, 9, 14,...}}.

I Common algorithmic problems we encounter with equivalence classes:

I Answering queries of the form “Is a ∼ b?” ECE750-TXB Equivalence relations and partitions III Lecture 17: Algorithms for binary relations and graphs

I Maintaining an equivalence relation as we progressively Todd L. decide objects are equivalent. (This results from an Veldhuizen [email protected] inductively defined equivalence relation.) Example: the Nelson-Oppen method for equational reasoning [7].

I Maintaining an equivalence relation as we progressively decide objects are not equivalent. (This results from a co-inductive definition of equivalence [6].) Example: minimizing states of a DFA [4], maintaining bisimulations, congruence closure [3].

I A system of representatives is the primary means for efficient manipulation of equivalence relations.

I A system of representatives for ∼ is a function s :(X / ∼) → X choosing a single element from each block of the partition, such that

a ∼ b if and only if s(a) = s(b)

ECE750-TXB Equivalence relations and partitions IV Lecture 17: Algorithms for binary relations and graphs

Todd L. Veldhuizen I Example: to reason about equivalence of integers [email protected] modulo 5, we could choose the representatives 0, 1, 2, 3, and 4. The integer 1 represents the equivalence class [1]∼ = {1, 6, 11, 16,...}. I With a means to quickly compute representatives, we can test whether a ∼ b by computing the representatives of the equivalence classes [a]∼ and [b]∼, then using equality.

I If the equivalence relation is static, one can precompute a system of representatives as e.g., a table. If the equivalence relation is discovered dynamically, more sophisticated methods are needed. ECE750-TXB Disjoint Set Union I Lecture 17: Algorithms for binary relations and graphs

I Disjoint Set Union is algorithms-speak for maintaining Todd L. an inductively-defined equivalence relation: Veldhuizen [email protected] I Initially we have a set of objects, none of which are known to be equivalent.

I We gradually discover that objects are equivalent, and we wish to maintain a representation of the equivalence relation that lets us quickly answer queries of the form “Is a ∼ b?”

I Interface:

I union(a, b): include a ∼ b in the equivalence relation

I find(a): returns an equivalence class representative (ECR) for a.

I There is wonderfully elegant data structure due to Tarjan [8] that performs these operations in O(nα(n)) time, where α(n) ≤ 3 for n less than (cosmologists’ best estimate of) the number of particles in the universe.

ECE750-TXB Disjoint Set Union II Lecture 17: Algorithms for binary relations I Tarjan’s data structure maintains the equivalence and graphs relation on the set X as a forest — a collection of trees. Todd L. Veldhuizen Each node in a tree is an element of the set X , each [email protected] tree is an equivalence class, and each root is an equivalence class representative.

b e Ñ@ O ^>> O ÑÑ >> ÑÑ > a c d f O g

A forest representation of the equivalence classes {{a, b, c, d}, {e, f , g}}.

I Each element has a pointer to its parent; to determine the equivalence class representative, we follow the parent pointers to the root of the tree. ECE750-TXB Disjoint Set Union III Lecture 17: Algorithms for binary relations I The efficiency of the representation depends on how and graphs deep the trees are. To keep the trees shallow, two Todd L. Veldhuizen techniques are employed: (i) path compression; and (ii) [email protected] ‘union by rank.’

I Record representation: for each element x ∈ X , we track

I parent(x): a pointer to the parent of x, or a pointer to itself if it is the root (alternately, a null pointer can be used.)

I rank(x): indicates how deep trees are (but, not depth per se).

I Pseudocode for find(a):
  find(a)
    if parent(a) ≠ a then
      parent(a) ← find(parent(a))
    return parent(a)

ECE750-TXB Disjoint Set Union IV Lecture 17: Algorithms for binary relations and graphs This recursively follows the parent pointers up to the Todd L. root, then rewrites all the parent pointers so they point Veldhuizen [email protected] directly at the root, called “path compression”:

  [Figure: two small trees rooted at f.]
Left: tree. Right: after calling find(c).
I A simple way to implement union(a, b): just make the root of a's tree have b as a parent.
  union(a, b)
    parent(find(a)) ← b

Disjoint Set Union V
However, this can lead to poorly balanced trees. For better asymptotic efficiency, one can track how deep the trees are and always make the deeper tree the parent of the shallower tree: called "union by rank."
  union(a, b)
    pa ← find(a)
    pb ← find(b)
    if pa = pb then return
    if rank(pa) > rank(pb) then
      parent(pb) ← pa
    else
      parent(pa) ← pb
      if rank(pa) = rank(pb) then rank(pb) ← rank(pa) + 1

ECE750-TXB Disjoint Set Union VI Lecture 17: Algorithms for binary relations and graphs

I Tarjan proved that using both path compression and union by rank, a sequence of n calls to union and find requires O(n α(n)) time, where α(n) ≤ 3 for
  n ≤ 2^(2^(2^(···^2)))

i.e., a tower of 65536 powers-of-two. The function α(n) is the ‘inverse’ of the Ackermann function; see CLR [2] or [8] for details.
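The pseudocode above translates almost directly into Java. A sketch (my own code) with both path compression and union by rank:

public class DisjointSetUnion {
    private final int[] parent, rank;

    DisjointSetUnion(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;   // each element starts as its own root
    }

    int find(int a) {                                 // with path compression
        if (parent[a] != a) parent[a] = find(parent[a]);
        return parent[a];
    }

    void union(int a, int b) {                        // union by rank
        int pa = find(a), pb = find(b);
        if (pa == pb) return;
        if (rank[pa] > rank[pb]) {
            parent[pb] = pa;
        } else {
            parent[pa] = pb;
            if (rank[pa] == rank[pb]) rank[pb]++;
        }
    }

    public static void main(String[] args) {
        DisjointSetUnion dsu = new DisjointSetUnion(7);
        dsu.union(0, 1); dsu.union(1, 2); dsu.union(2, 3);   // {a, b, c, d}
        dsu.union(4, 5); dsu.union(5, 6);                    // {e, f, g}
        System.out.println(dsu.find(0) == dsu.find(3));      // true:  a ~ d
        System.out.println(dsu.find(0) == dsu.find(4));      // false: a is not ~ e
    }
}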

I For any practical purpose, the time required by Tarjan’s algorithm is indistinguishable from O(n) for a sequence of n operations; or O(1) per operation amortized time (to come.) ECE750-TXB Lecture 17: Algorithms for binary relations and graphs

Todd L. Veldhuizen [email protected]

Bibliography Part II

Graphs

ECE750-TXB Representation of Graphs I Lecture 17: Algorithms for binary relations and graphs

I Here are four common methods of representing graphs.

I If the graph is large (e.g., infinite), the structure is not known beforehand, etc., we may choose an implicit representation for the graph, where vertices and edges are computed on-the-fly as needed. For example, the graph G = (N, E) where (x, y) ∈ E if and only if y divides x is an infinite graph whose edges can be computed on the fly by factorization.

I An explicit representation is one where we directly encode the structure of the graph in a data structure. Some common methods for this:

Representation of Graphs II

I Adjacency matrix: an n × n matrix A of 0's and 1's, with Aij = 1 if and only if (vi, vj) ∈ E. Row i indicates the out-edges for vertex i, and column i indicates the in-edges.

 0 1 1 0   0 0 0 1  A =    0 0 0 1  0 0 0 0

[Figure: the corresponding directed graph on vertices a, b, c, d, with edges a → b, a → c, b → d, c → d.]

Representation of Graphs III

I Adjacency lists: each vertex maintains a set of vertices to/from which there is an edge, e.g.

out(a) = {b, c}
out(b) = {d}
out(c) = {d}
out(d) = ∅

I If the graph structure is static (i.e., not changing as the algorithm runs), it is common to represent lists of in- and out- edges as vectors, for efficiency.

I For more elaborate algorithms on e.g. weighted graphs, a representation of this sort is commonly used:

Representation of Graphs IV


public class Edge {
    Vertex x, y;       // endpoints: the edge runs from x to y
    double weight;
}

public class Vertex {
    Set out;           // edges leaving this vertex
    Set in;            // edges entering this vertex
}

Depth-First Search I

I One of the commonest operations on a graph is to visit the vertices of the graph one by one in some desired order. This is commonly called a search.

I In a depth-first search, we explore along a single path into the graph as far as we can until no new vertices can be reached; then we return to some earlier point where new vertices are still reachable and continue. (Think of exploring a maze.)

I Example: a depth-first search starting at the centre vertex of a graph. [Figure: the visited edges are highlighted in yellow.]

Depth-First Search II

I As we visit each new vertex, we perform some action there. The choice of action depends on what we hope to accomplish; for now we will just call it "visiting the vertex," but later we will see examples of specific useful actions. We might choose to visit the vertex the first time we see it (preorder), or the last time we see it (postorder).

I Here is a recursive implementation of depth-first search. It uses a set Seen to track which vertices have been visited. One can also include a flag field as part of the vertex data structure that can be “marked” to indicate the vertex has been seen.

Depth-First Search III

dfs(x)
    dfs(x, ∅)

dfs(x, Seen)
    if x ∉ Seen
        Seen ← Seen ∪ {x}
        preorderVisit(x)        // do something
        for each edge (x, y), dfs(y, Seen)
        postorderVisit(x)       // do something

I This search is easily implemented in a nonrecursive version, using a stack data structure to keep track of the current path into the graph:

Depth-First Search IV

dfs(x)
    Seen ← ∅
    Stack S
    push(S, x)
    while S is not empty,
        y ← pop(S)
        if y ∉ Seen then
            Seen ← Seen ∪ {y}
            preorderVisit(y)
            for each edge (y, z), push(S, z)
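The nonrecursive search translates almost directly into Java. This is only an illustrative sketch: the graph is assumed to be stored as adjacency lists indexed by vertex number, and "visiting" is just a println.

import java.util.*;

class DepthFirstSearch {
    // adj.get(x) lists the vertices reachable from x by a single edge.
    static void dfs(List<List<Integer>> adj, int start) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int y = stack.pop();
            if (seen.add(y)) {                      // add() returns true iff y was not yet seen
                System.out.println("visit " + y);   // preorder visit
                for (int z : adj.get(y))
                    stack.push(z);
            }
        }
    }
}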

Topological Sort I

I A Directed Acyclic Graph (DAG) is a graph in which there are no cycles (i.e., no paths from a vertex back to itself).

I The reflexive, transitive closure of a DAG is a partial order. (If you add to a DAG an edge (x, y) whenever there is a path from x to y, plus self-loops (x, x), the resulting edge relation is a partial order: reflexive, transitive, and antisymmetric.)

I Every finite partial order can be extended to a total order: i.e., if ⊑ is a partial order on a finite set, there is a total order ≤ such that (x ⊑ y) ⇒ (x ≤ y); or, more obtusely, ⊑ ⊆ ≤. (The axiom of choice implies this for infinite sets also.)

Topological Sort II

I Example: let V = N² (pairs of natural numbers), and for all i, j, put edges (i, j) → (i + 1, j) and (i, j) → (i, j + 1). Then the transitive reflexive closure of this graph is a partial order ⊑ where (i, j) ⊑ (i′, j′) if and only if i ≤ i′ and j ≤ j′.

[Figure: the infinite grid graph on N², and the Hasse diagram of the first few elements (0,0), (1,0), (0,1), (2,0), (1,1), (0,2).]

ECE750-TXB Topological Sort III Lecture 17: Algorithms for binary relations and graphs

I One way to extend ⊑ to a total order:

[Figure: a path threading through the grid of pairs — an example of what computer scientists call "dovetailing."]

I Topological sort is a method for obtaining a total-order extension of a partial order.

Topological Sort IV

I Example: Suppose we want to evaluate a digital circuit:

[Figure: a small digital circuit with signals a, b, c, d, e.]

Build a graph where signals are vertices, and an edge indicates that one signal depends upon another (a 'dependence graph'):

[Figure: the dependence graph — e at the top, d and c in the middle, a and b at the bottom.]

Topological Sort V

The transitive, reflexive closure of this graph yields an order ⊒ where, e.g., 'e ⊒ d' means signal e can be evaluated only after signal d. Extending ⊒ to a total order ≥ gives us a valid order in which to evaluate the signals, e.g.,

e ≥ d ≥ c ≥ b ≥ a

If we evaluate signals in the order a, b, c, d, e we respect the dependencies.

I Other examples:

I Ordering the presentation of topics in a course or paper.

I Solving equations

I Makefiles

I Planning (keeping track of task dependencies)

I Spreadsheets and dataflow languages [5]

I Ordering static initializers in programming languages

I Dynamization of static algorithms, e.g. [1]

Topological Sort VI

I Here is an algorithm for topological sort based on depth-first search. Note that there are many ways in which a partial order can be extended to a total order; this is just one method.

TopologicalSort(V, E)
    Set visited; List order;
    for x ∈ V
        dfs(x, visited, order)

dfs(x, visited, order)
    if x ∉ visited
        visited.add(x)
        for each out edge (x, y)
            dfs(y, visited, order)
        order.insertBack(x)
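A Java sketch of the same idea, under the assumption that the dependence graph is a DAG stored as adjacency lists where adj.get(x) lists the vertices x depends on (its out-edges); the names are illustrative only.

import java.util.*;

class TopologicalSort {
    // Returns a list in which every vertex appears after all the vertices it depends on.
    static List<Integer> sort(List<List<Integer>> adj) {
        boolean[] visited = new boolean[adj.size()];
        List<Integer> order = new ArrayList<>();
        for (int x = 0; x < adj.size(); x++)
            dfs(adj, x, visited, order);
        return order;
    }

    private static void dfs(List<List<Integer>> adj, int x,
                            boolean[] visited, List<Integer> order) {
        if (visited[x]) return;
        visited[x] = true;
        for (int y : adj.get(x))
            dfs(adj, y, visited, order);
        order.add(x);   // postorder: all of x's dependencies are already in the list
    }
}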

Topological Sort VII

I We search the dependence graph depth-first, visiting vertices in postorder; as each vertex is finished, we insert it at the back of the list.

I Example: for the circuit example, a depth-first search might visit the vertices in the order a, b, d, c, e.

Connected components of undirected graph I

I Defn: A set of vertices Y ⊆ V is connected if for every a, b ∈ Y there is a path from a to b. Y is a maximal connected component if it cannot be enlarged, i.e., for any connected set of vertices Y′ with Y ⊆ Y′, Y = Y′.

I Note that the connected components of a graph form a partition of the vertices:


The connected components are {{a, b, g, q}, {c, d, e}}.

I Using Tarjan’s disjoint set union, there is a very simple algorithm for connected components:

Connected components of undirected graph II


1. Have a parent pointer and rank associated with each vertex (e.g., by creating a separate record for each vertex, or by storing these fields directly in the vertex data structure).
2. For each edge (a, b), call union(a, b).

No searching is necessary! The complexity is O((|E| + |V|) α(|E| + |V|)), 'practically' linear in the number of vertices and edges.

Bibliography I

[1] Umut A. Acar, Guy E. Blelloch, Robert Harper, Jorge L. Vittes, and Shan Leung Maverick Woo. Dynamizing static algorithms, with applications to dynamic trees and history independence. In SODA '04: Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 531–540, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.

[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Introduction to algorithms. McGraw Hill, 1991.

Bibliography II

[3] Peter J. Downey, Ravi Sethi, and Robert Endre Tarjan. Variations on the common subexpression problem. Journal of the ACM (JACM), 27(4):758–771, 1980.

[4] J. E. Hopcroft. An n log n algorithm for minimizing the states in a finite automaton. In Z. Kohavi, editor, Theory of Machines and Computations, pages 189–196. Academic Press, 1971.

[5] Wesley M. Johnston, J. R. Paul Hanna, and Richard J. Millar. Advances in dataflow programming languages. ACM Comput. Surv., 36(1):1–34, 2004.

Bibliography III

[6] Y. N. Moschovakis. Elementary Induction on Abstract Structures. North-Holland, Amsterdam, 1974.

[7] Greg Nelson and Derek C. Oppen. Fast decision procedures based on congruence closure. Journal of the ACM (JACM), 27(2):356–364, 1980.

[8] R. E. Tarjan. Efficiency of a good but not linear disjoint set union algorithm. Journal of the ACM (JACM), 22:215–225, 1975.

ECE750-TXB Lecture 18: Graph Algorithms

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 13, 2007

Weighted Graphs

I A weighted graph is a triple G = (V , E, w) where w is a weight function: often

w : E → R ∪ {+∞}   or   w : E → Q⁺

I Often w(x, y) > 0 and represents a “distance” or “score.”

I Example: vertices are cities, and edges represent driving times between adjacent cities.

Distance metric on graphs I

I If edge weights are positive, we can define a distance metric (or quasimetric) on the graph. You are familiar with Euclidean distance, for example, in R²:

d(x, z) = √( (x₁ − z₁)² + (x₂ − z₂)² )

We can define distances in graphs in such a way that they share many of the useful properties of Euclidean distance.

I A distance metric d : V² → R satisfies:
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x) (symmetry)
4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality)

If the symmetry axiom (3) is omitted, d is called a quasimetric. (For weighted directed graphs, a quasimetric may be appropriate.)

Distance metric on graphs II

I A set V together with a distance metric d : V² → R is called a metric space.

I A connected graph with nonnegative edge weights can be turned into a metric space:
1. Define the length of a path to be the sum of edge weights along the path;
2. Define d(x, x) = 0 and d(x, y) to be the minimum path length from x to y.

I In R² we can define open and closed discs:

{(x, y) : √(x² + y²) < r}
{(x, y) : √(x² + y²) ≤ r}

Distance metric on graphs III

I In a metric space (V , d) we can define open and closed balls:

Br (x) = {y ∈ V : d(x, y) < r}

E.g. if we construct a graph of settlements in Ontario where edges indicate roads and weights are driving times, then a ball is e.g., settlements that are within two hours of Waterloo.

Breadth-first search I

I Breadth-first search is another method to visit all the vertices of a graph. Conceptually, we put weights of 1 on each edge. Then from some starting vertex x, we consider balls of radius r around x and take r → ∞; we visit vertices in the order they are added to the ball.

Breadth-first search II

I Basic scheme: we maintain a queue of vertices that are just outside the current ball.

BFS(V, E, x)
    Seen ← ∅
    Queue Q
    Enqueue(Q, x)
    While Q is not empty,
        Get y = next element in queue.
        if y ∉ Seen,
            Seen ← Seen ∪ {y}
            Visit y
            For each edge (y, z) ∈ E,
                Enqueue(Q, z)

This algorithm is linear in the number of edges.
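The only change from the iterative depth-first search sketched earlier is the use of a queue instead of a stack; a hedged Java sketch, with the same illustrative adjacency-list representation:

import java.util.*;

class BreadthFirstSearch {
    static void bfs(List<List<Integer>> adj, int start) {
        Set<Integer> seen = new HashSet<>();
        Queue<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int y = queue.remove();
            if (seen.add(y)) {
                System.out.println("visit " + y);   // vertices are visited in order of hop distance
                for (int z : adj.get(y))
                    queue.add(z);
            }
        }
    }
}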

Single-source shortest paths I

I The BFS algorithm is easily modified to solve the following problem: given a connected graph with nonnegative edge weights and a specified vertex x, compute d(x, y) for all y ∈ V . That is, find the length of the shortest path from x to every other vertex in the graph.

I Intuition: again, consider balls centered around x, but use edge weights. We want to visit vertices in order of their distance from x, so we modify the BFS algorithm to use a priority queue. We put pairs (z, d) into the priority queue, where z is a vertex, and d is the length of some path from x to z. The priority queue orders (z, d) pairs by d, using e.g., a min heap, so that at each step we can efficiently retrieve the next closest vertex to x.

Single-source shortest paths II

SSSP(V, E, x, w)
    Seen ← ∅
    PriorityQueue PQ
    Put (x, 0) in PQ.
    While PQ is not empty,
        Get (y, d) from PQ (least element).
        if y ∉ Seen then
            Seen ← Seen ∪ {y}
            Visit (y, d)
            For each edge (y, z),
                put (z, d + w(y, z)) in PQ.

I Time complexity: if we use a min-heap, this achieves O(|E| log |E|) time. It is possible to get this down to O(|V | log |V | + |E|) by the use of somewhat exotic data structures such as Fibonacci heaps.
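A min-heap version of SSSP in Java (essentially Dijkstra's algorithm). This is a sketch under assumed conventions: adj.get(y) holds int pairs {z, w(y,z)} with nonnegative integer weights, and unreachable vertices end up with distance Integer.MAX_VALUE.

import java.util.*;

class SingleSourceShortestPaths {
    static int[] sssp(List<List<int[]>> adj, int start) {
        int n = adj.size();
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        boolean[] seen = new boolean[n];
        // priority queue of {vertex, tentative distance}, least distance first
        PriorityQueue<int[]> pq = new PriorityQueue<>(Comparator.comparingInt((int[] p) -> p[1]));
        pq.add(new int[]{start, 0});
        while (!pq.isEmpty()) {
            int[] top = pq.remove();
            int y = top[0], d = top[1];
            if (seen[y]) continue;        // y was already settled with a shorter path
            seen[y] = true;
            dist[y] = d;                  // "visit (y, d)"
            for (int[] e : adj.get(y))    // e = {z, w(y,z)}
                if (!seen[e[0]])
                    pq.add(new int[]{e[0], d + e[1]});
        }
        return dist;
    }
}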

Transitive Closure I

I Let G = (V, E) be a graph.

I The transitive closure of G is G′ = (V, E∗) where (x, y) ∈ E∗ if there is a path from x to y in G.

I Define T(E) = {(x, y) : there is a path from x to y in E}.

I Then E∗ = T(E).

I T is a closure operator:
1. E ⊆ T(E) (nondecreasing)
2. (E₁ ⊆ E₂) ⇒ (T(E₁) ⊆ T(E₂)) (monotone)
3. T(T(E)) = T(E) (idempotent/fixpoint)

I The complexity of transitive closure is closely linked to that of matrix multiplication.

Transitive Closure II

I There is a path of length 2 from i to j if there is some vertex k such that E(i, k) ∧ E(k, j). We can write this as:

E²(i, j) = ⋁_{k∈V} E(i, k) ∧ E(k, j)

where ⋁_{k∈V} is a disjunction over all vertices k ∈ V (⋁ is to ∨ as Σ is to +, Π is to ×, etc.)

[Figure: paths of length 2 from i to j through possible intermediate vertices.]

Transitive Closure III

I Compare to matrix multiplication: if B = AA, then

b_{ij} = Σ_k a_{ik} · a_{kj}

I If A is the adjacency matrix of the graph, then to find paths of length 2 we can compute the matrix product A2 in the boolean ring (B, +, ·, 0, 1) where B = {0, 1}, addition is disjunction (α + β) ≡ (α ∨ β), and multiplication is conjunction (α · β) ≡ (α ∧ β).

I To find paths of any length, we can write

A∗ = I + A + A2 + A3 + ··· (1)

where we need only compute terms up to A^{n−1}, where n = |V|, the number of vertices. With the leading I term, Eqn. (1) gives a reflexive transitive closure. The difference between transitive closure and reflexive-transitive closure is trivial: the latter has E∗(x, x) for every x.

Transitive Closure IV

I The obvious method of evaluating Eqn. (1) requires O(n⁵) time. If we write

A∗ = I + A(I + A(I + A(I + A(I + ··· )))) (2)

we can compute A∗ with O(n⁴) operations.

I By using power trees, e.g., A⁸ = ((A²)²)², we can compute the transitive closure with O(n³ log n) operations; or O(n^γ log n), where γ is the exponent of e.g. Strassen matrix multiplication.

I There is a simple algorithm (Warshall's Algorithm) that computes transitive closure from the adjacency matrix in O(n³) time; a sketch appears below.

I In practice, the best way to compute transitive closure depends strongly on the anticipated structure of the input graph, its size, density, planarity, etc. There is a large literature on algorithms for TC.
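For reference, a sketch of Warshall's algorithm in Java, working directly on a boolean adjacency matrix (the class and method names are mine):

class TransitiveClosure {
    // reach starts as a copy of the adjacency matrix and ends as the transitive closure.
    static boolean[][] warshall(boolean[][] adjacency) {
        int n = adjacency.length;
        boolean[][] reach = new boolean[n][];
        for (int i = 0; i < n; i++)
            reach[i] = adjacency[i].clone();
        for (int k = 0; k < n; k++)              // allow k as an intermediate vertex
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    reach[i][j] = reach[i][j] || (reach[i][k] && reach[k][j]);
        return reach;
    }
}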

Transitive Closure V

I Perhaps surprisingly, there is an algorithm computing transitive closure in O(n2) average time, for a uniform distribution on graphs.

I The G(n, p) random graph model is a distribution on graphs of n vertices where each edge is present independently with probability p. Choosing p = 1/2 gives a uniform distribution on graphs.

I In G(n, 1/2), transitive closure can be computed in O(n²) time on average.

I The reason why: with probability 1, every vertex is at most two steps away from every other vertex:

I Let x, y ∈ V be vertices; there are n − 2 choices of intermediate vertices to make a path of length 2 from x to y. With each intermediate vertex w, we have a probability 1/4 of having both the edge (x, w) and the edge (w, y).

I The probability of there being no w such that E(x, w) ∧ E(w, y) is (3/4)^{n−2}.

Transitive Closure VI

I There are (n choose 2) choices of x and y. Let Z be the event, "there exist (x, y) such that there is no w where E(x, w) ∧ E(w, y)." Then, using the union bound,

Pr(Z) ≤ (n choose 2) (3/4)^{n−2} = O(n² (3/4)ⁿ) = o(1)

This probability goes to 0 very fast as n → ∞.

I To turn this insight into an algorithm, consider paths of length 1, 2,.... For each pair (x, y), consider all possible intermediate sequences of vertices; stop when an intermediate sequence is found that gives a path from x to y. Then, stop when a path is found between every pair of vertices.

I To find paths of length 2, the number of intermediate vertices that need to be examined follows a geometric distribution, with a mean of (1/4)⁻¹ = 4 vertices.

Transitive Closure VII

I And, with probability tending to 1, we can stop after only considering paths of length 2. Because the probability is converging exponentially, we get an average time complexity of O(n2).

I That transitive closure can be computed quickly in the G(n, 1/2) random graph model is an instance of a much deeper pattern arising from "zero-one laws" in finite model theory.

I Note that we can write transitive closure as an iteration of a first-order sentence:

E 0(x, y) = ⊥ E k+1(x, y) = (x = y) ∨ E k (x, y) ∨ ∃w . E k (x, w) ∧ E(w, y)

“There is a path of length ≤ k + 1 from x to y if there is a path of length k, or there is a vertex w so there is a path of length k from x to w, and an edge from w to y.”

I This is an example of a FO+lfp (first order logic with least fixpoint) definition.

Transitive Closure VIII

I For every FO+lfp definable relation, there is an FO definable relation (i.e., without iteration) that is equivalent with probability tending to 1 as n → ∞, in the G(n, 1/2) random graph model. For example, the “approximate” transitive closure given by:

Eˆ ∗(x, y) = (x = y) ∨ ∃w . E(x, w) ∧ E(w, y)

is equal to the real transitive closure E ∗ with asymptotic probability 1.

I cf. Almost-everywhere equivalence [2].

Strongly Connected Components I

I Strongly connected components is the directed graph analogue of connected components.

I Consider this set of equations:

x = 3 + y
y = x − 2
z = x + 4w
w = z − 8

We can solve such systems of linear equations with Gaussian elimination in time O(n³).

Strongly Connected Components II

I If we look at the dependence graph, we discover something useful:

[Figure: the dependence graph — x and y depend on each other, z and w depend on each other, and {z, w} depends on {x, y}.]

We do not need to solve the whole system at once; instead we can first solve {x, y} and then solve {z, w}. {{x, y}, {z, w}} are the strongly connected components.

I A subset of vertices Y ⊆ V is strongly connected if for each x, y ∈ Y, there is a path from x to y, and a path from y to x.
I Write x . y if there is a path from x to y.
I . is transitive and reflexive, but not necessarily antisymmetric.
I . is a preorder.
I Any preorder can be decomposed into:

Strongly Connected Components III

1. An equivalence relation ∼, where V / ∼ are the strongly connected components:

x ∼ y ≡ (x . y) ∧ (y . x)

2. A partial order ≤ on V / ∼:

[x]∼ ≤ [y]∼ ≡ x . y

(This is a common method of constructing hierarchies. For example, in structural complexity theory, a reducibility relation defines a preorder on problems, and the resulting partial order ≤ on interreducible problems gives a hierarchy of what are sometimes called 'degrees.' The class of NP-complete problems we shall see later can be constructed this way.)

Strongly Connected Components IV

I Example: for the set of equations above, we obtain the equivalence relation ∼ given by the partition {{x, y}, {z, w}}, and the corresponding partial order is given by this Hasse diagram:

{x, y}

{z, w}

I The graph obtained by merging all the elements of the strongly connected components into “supernodes”, and keeping whatever edges remain between supernodes, is called a condensation graph.

I Strongly connected components can sometimes be used to decompose a problem into a collection of smaller problems that can be solved in order. For example, to solve a system of linear equations efficiently, one can

Strongly Connected Components V

1. Compute the strongly connected components;
2. Compute a topological sort of the resulting condensation graph;
3. Solve the smaller systems of equations according to the topological sort order.

This technique is used in program analysis and compilers, for efficient solution of lattice equations.

I A simple (but not terribly efficient) method to compute the strongly connected components:
1. Compute the transitive closure E∗;
2. For each (x, y), if both E∗(x, y) and E∗(y, x), then do union(x, y) with e.g., Tarjan's disjoint set union.

I However there is a much more efficient algorithm due to Kosaraju, which you can find in CLR [1], which uses two passes of depth-first search. This is quite suitable for an explicit graph representation.

Spanning trees I

I A spanning tree of a connected, undirected graph G = (V , E) is a subset T ⊆ E such that for all x, y ∈ V , there is a unique path from x to y through T .

[Figure: a graph with the edges of a spanning tree highlighted in yellow.]

I Proposition: |T | = |V | − 1. (By induction on the number of vertices.)

I Proposition: If |T | = |V | − 1 and (V , T ) is connected, it is a spanning tree.

I These two properties make finding a spanning tree dead easy: any set of |V | − 1 edges that do not contain a cycle form a spanning tree.

I Generic spanning tree algorithm:

Spanning trees II

1. Set T = ∅.
2. Pick an edge e ∈ E.
3. If T ∪ {e} has no cycle, set T ← T ∪ {e}.
4. If |T| < |V| − 1, go to 2.

Minimum Spanning Tree (MST) I

I Problem: given a weighted graph (V, E, w), find a spanning tree T for G such that

Σ_{(x,y)∈T} w(x, y)

is minimized.

I Kruskal’s algorithm: always pick the lowest-cost edge that does not cause a cycle. (This is an example of a greedy algorithm.) PriorityQueue Q of edges, ordered ascending by weight. Put each edge in Q. While |T | 6= |V | − 1, Take edge e from Q. If T ∪ {e} has no cycle, set T ← T ∪ {e}.

Minimum Spanning Tree (MST) II

I To detect cycles, keep track of connectivity of T with Tarjan's disjoint set union:

PriorityQueue Q of edges, ordered ascending by weight.
Put each edge in Q.
While |T| ≠ |V| − 1,
    Take edge e = (x, y) from queue.
    If find(x) ≠ find(y),
        union(x, y)
        T ← T ∪ {e}

I The time required is O(|E| log |E|): dominated by the time to maintain the heap.
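Putting the pieces together, a Java sketch of Kruskal's algorithm that reuses the DisjointSets class sketched earlier (the edge representation {x, y, weight} and the names are assumptions of this illustration):

import java.util.*;

class Kruskal {
    // Vertices are 0..n-1; each edge is an int[] {x, y, weight}.
    // Assumes the graph is connected; returns the edges of a minimum spanning tree.
    static List<int[]> mst(int n, List<int[]> edges) {
        edges.sort(Comparator.comparingInt((int[] e) -> e[2]));   // ascending by weight
        DisjointSets ds = new DisjointSets(n);
        List<int[]> tree = new ArrayList<>();
        for (int[] e : edges) {
            if (ds.find(e[0]) != ds.find(e[1])) {   // adding e does not create a cycle
                ds.union(e[0], e[1]);
                tree.add(e);
                if (tree.size() == n - 1) break;
            }
        }
        return tree;
    }
}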

Bibliography I

[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald R. Rivest. Introduction to algorithms. McGraw Hill, 1991.

[2] Lauri Hella, Phokion G. Kolaitis, and Kerkko Luosto. Almost everywhere equivalence of logics in finite model theory. The Bulletin of Symbolic Logic, 2(4):422–443, December 1996.

ECE750-TXB Lecture 19: Greedy Algorithms and Dynamic Programming

Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 15, 2007

Multi-stage Decision Problems I

I A multi-stage decision problem is one where we can think of the problem as having a tree structure: each tree node represents a decision point, and each decision leads us to another decision to be made.

I For example: choose as many elements of the set {1, 3, 7, 9, 13} as possible without exceeding 21. We can view this as a tree:

{}

{1} {3} {7} {9} {13}

{1,3} {1,7} {1,9} {1,13}

Here we have decided to include 1 in the set, which opens up a further set of choices.

Greedy algorithms I

I A greedy algorithm always chooses the most attractive option available to it: each decision is made based on its immediate reward.

I For example, we might pick elements of the set {1, 3, 7, 9, 13} according to their benefit-cost ratio: picking 3, for example, increases the size of the set by 1 (benefit) while increasing the sum by 3 (cost), so we would assign it a benefit-cost ratio of 1/3. Choosing elements according to this scheme gives the set {1, 3, 7, 9}.

I Kruskal’s minimum spanning tree algorithm is another example of a greedy algorithm: at each step it chooses the edge with minimum cost, without thinking ahead as to how that might affect future decisions.

Greedy algorithms II

I Greedy algorithms only rarely give optimal answers. Surprisingly often they give “reasonably-good” answers; for example, greedy algorithms are one method of obtaining approximate solutions to NP-hard optimization problems.

I There is a greedy algorithm for constructing a prefix-free code that yields an optimal code, called a Huffman code.

I Recall that the entropy of a discrete distribution µ on a set of symbols S is given by

H(µ) = − Σ_{s∈S} µ(s) log µ(s)

And, from Shannon’s noiseless coding theorem, we know there exists a prefix code achieving an average code length of c ≤ H(µ) + 1.

I Huffman’s algorithm uses a greedy method to construct an optimal code: Greedy algorithms III 1. Start with a collection of singleton trees, one for each symbol. For each tree we keep track of its probability; initially each singleton tree has the probability of its symbol. 2. While there is more than one tree, 2.1 Choose two trees with least probabilities; 2.2 Combine the two trees into one by making a new root whose left child is one tree, and whose right child is the other. The probability of the new tree is the sum of the probabilities of the subtrees. We label the edge to the left subtree with “0”, and the edge to the right subtree with “1”. The end result is a trie giving an optimal prefix code. I Example: consider this set of symbols Symbol Probability a 0.3 b 0.2 c 0.2 d 0.1 e 0.1 f 0.05 g 0.05

Greedy algorithms IV

This distribution has H(µ) = 2.546. Huffman's algorithm proceeds this way: the two symbols with least probability are f and g, so these are combined into a little tree with combined probability 0.1. We then have several choices of what to do next; we choose to combine d and e.

[Figure: the collection of trees after combining f-g and d-e.]

We then combine the subtrees d-e and f-g, and so forth. The final result is this tree:

Greedy algorithms V

[Figure: the final Huffman code trie.]

Which gives us the code table:

Symbol   Code
a        10
b        00
c        01
d        1100
e        1101
f        1110
g        1111

Greedy algorithms VI

This achieves an average code length of c = 2.6, only slightly more than the entropy of 2.546.
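The greedy construction is short enough to sketch in Java; the Node class, the use of Character for symbols, and the priority-queue ordering are my own assumptions, not code from the lecture.

import java.util.*;

class Huffman {
    // A node of the code trie: a leaf holds a symbol, an internal node holds two subtrees.
    static class Node {
        final double prob;
        final Character symbol;    // null for internal nodes
        final Node left, right;
        Node(double prob, Character symbol, Node left, Node right) {
            this.prob = prob; this.symbol = symbol; this.left = left; this.right = right;
        }
    }

    // Repeatedly combine the two least-probable trees until one remains.
    static Node build(Map<Character, Double> p) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingDouble((Node t) -> t.prob));
        for (Map.Entry<Character, Double> e : p.entrySet())
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.remove(), b = pq.remove();
            pq.add(new Node(a.prob + b.prob, null, a, b));   // edge to a is "0", to b is "1"
        }
        return pq.remove();
    }

    // Walk the trie to read off the code table.
    static void codes(Node t, String prefix, Map<Character, String> table) {
        if (t.symbol != null) { table.put(t.symbol, prefix); return; }
        codes(t.left, prefix + "0", table);
        codes(t.right, prefix + "1", table);
    }
}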

I Huffman’s algorithm is a rare exception: greedy algorithms rarely give optimal answers because they are myopic.

I For a greedy algorithm to give an optimal answer, a problem must have an underlying ‘matroid’ structure [4].

I Dynamic programming may succeed where greedy algorithms fail to yield an optimal answer; it can be more farsighted.

Variational Problems I

I Certain optimization problems can be formulated as routefinding problems, possibly in an abstract sense. For example, finding a shortest driving route between two cities; or finding a “least cost” path through a multistage decision problem.

I We consider first the continuous analogue of these problems, sometimes called trajectory optimization.

I Consider the problem of designing a tobogganning hill (or, if you like, a downhill ski slope.)

Variational Problems II

I Making the usual unrealistic assumptions (no friction, no air resistance, no brakes), we want to design a hill that lets people travel from A to B in the shortest time possible.

[Figure: a curve descending from point A to point B.]

I This is a classical problem solved by Bernoulli, called the Brachistochrone problem. It launched the study of the calculus of variations, a method for solving continuous, infinite-dimensional optimization problems such as finding a curve of optimal shape. Variational Problems III

I In a variational problem, one has a functional to minimize or maximize. A functional maps functions to real values, typically via an integral. For our tobogganning problem, we want to find a function (say) y(x), and the cost function would look something like:

∫_P C(x, y, ẏ) dx

where ẏ = dy/dx, and the functional assigns to each curve y(x) the time required to travel from A to B.

I Using the calculus of variations, one can obtain from the functional a differential equation that can be solved to find an optimal path; the process is analogous to the way one can find the extrema of a convex function F(z) by setting dF(z)/dz = 0. However, our interest is in discrete analogues of variational problems, which can be solved efficiently by dynamic programming.

Variational Problems IV

I Variational problems possess two important properties common to a wide class of problems that can be tackled with dynamic programming: 1. We can write the cost of a path as a sum of the cost of subpaths; for example, the time required to descend the hill is the sum of the time required to reach the midpoint, plus the time to travel from the midpoint to the end. 2. Any subpath of an optimal path is optimal. Otherwise, we could excise the suboptimal subpath and replace it with an optimal subpath and decrease the overall cost, contradicting the premiss that the path is optimal.

I Dynamic programming can be used to solve discretized versions of variational problems; this was one of the earliest applications. Discretizing such a problem results in a discrete routefinding problem.

Variational Problems V

I In a discrete routefinding problem, we have a state space S, and in each state there are moves we can make, each with an associated cost. Given a pair of states a, b ∈ S, we want to find a minimal cost path between them. Abstractly, we can think of S as a weighted graph, where states are vertices, edges are moves, and each edge has a cost.

I For example, this graph shows driving distances between some towns close to Waterloo:

Variational Problems VI

[Figure: a weighted graph of towns — Waterloo, Cambridge, Guelph, Brantford, Milton, Hamilton — with edges labelled by driving distances.]

We might want to find the shortest path between, say, Waterloo and Hamilton.

Variational Problems VII

I For a more abstract example, suppose we have a sequence of operations we wish to perform on a data structure: inserting keys, finding keys, and iterating through the keys, for example. Some operations can be performed very efficiently on a linked list, for example, inserting a key. Others can be performed quickly on a binary search tree, such as finding a key. If we have a long sequence of inserts followed by a long sequence of find operations, it might make sense to start with a linked list, and then switch to a binary search tree. How can we determine the optimal times to switch configurations of the data structure? This is an example of a metrical task system [2], in which one has a sequence of tasks to perform, and configurations the system can switch between (e.g., linked list and BST). Each configuration has different costs for performing the tasks; and switching between configurations has a specified cost.

Variational Problems VIII

For the linked list/BST example, the possible strategies form a graph like this:

[Figure: a layered graph with one column per task (1, 2, 3, ..., n) and two rows of vertices, one for the linked list configuration and one for the BST configuration.]

Before performing each task, we have the option of switching between a linked list and a tree. Edges are weighted with the cost of performing a task in a given configuration, or with the cost of switching between configurations. Finding an optimal strategy for switching between representations consists of finding a minimal path from the start vertex (far left) to the end vertex (far right).

Dynamic Programming I

I Consider finding the shortest path from a to b through the following graph:

[Figure: a directed graph with edges a → d (weight 2), a → c (weight 1), d → f (weight 3), d → e (weight 4), c → e (weight 3), f → b (weight 2), e → b (weight 2).]

I There are only two edges incident on b. An optimal path will either:

I Go from a to f, then take the edge from f to b;

I Go from a to e, then take the edge from e to b.

I Write d(x, y) for the shortest path between vertices. Using the above idea, we can write this equation for d(a, b):

d(a, b) = min(2 + d(a, f ), 2 + d(a, e))

Dynamic Programming II

I We could turn this idea into a recursive procedure. (Assume the graph is acyclic.)

d(r, s)
    if r = s then        // base case
        return 0
    otherwise
        dist ← ∞
        for each edge (x, s),
            dist ← min(dist, d(r, x) + weight(x, s))
        return dist

Dynamic Programming III

I However, this procedure would solve the same subproblems over and over again. Graphs of this form would require exponential time:

[Figure: a chain of vertices a, x, y, z, b with two edges between each consecutive pair.]

We would call d(a, z) twice, d(a, y) four times, d(a, x) eight times, etc.

I Instead of writing a recursive function, consider the following system of equations for the example graph shown earlier.

d(a, b) = min(2 + d(a, f), 2 + d(a, e))
d(a, f) = min(3 + d(a, d))
d(a, e) = min(4 + d(a, d), 3 + d(a, c))
d(a, c) = min(1 + d(a, a))
d(a, d) = min(2 + d(a, a))

Dynamic Programming IV

I These are called the Bellman equations in honour of Richard Bellman, who invented dynamic programming [1]. The reason why it is called ‘dynamic programming’ is amusing; see [3].

I Incidentally: these equations use only the operations min and +, under which the integers form an algebraic structure sometimes called a tropical semiring [5], also called min-plus algebras (or max-plus algebras). Tropical, because a person strongly associated with them is Imre Simon, who did his PhD at Waterloo in the 1970's and became a professor at the University of São Paulo!

I Now draw a dependence graph for the values d(r, s), where edges go from terms on the left-hand side of an equation to things appearing on the right-hand side:

[Figure: the dependence graph, with edges d(a,b) → d(a,f), d(a,b) → d(a,e), d(a,f) → d(a,d), d(a,e) → d(a,d), d(a,e) → d(a,c), d(a,d) → d(a,a), d(a,c) → d(a,a).]

Dynamic Programming V

I Using topological sort, we obtain an order in which we can solve the equations. (For example, perform a depth-first search starting from d(a, b).)

d(a, a) = 0
d(a, d) = 2
d(a, c) = 1
d(a, e) = min(4 + 2, 3 + 1) = 4
d(a, f) = 5
d(a, b) = min(2 + 5, 2 + 4) = 6

So, the shortest path from a to b is of length 6.

I A few things to note:

I We only had to evaluate each equation once; the number of calculations was linear in the number of edges. This is a vast improvement over the potentially exponential recursive version!

I The dependence graph looks just like the original graph, but with the edges reversed.

Dynamic Programming VI

I In solving the equations, we actually find a shortest path from a to every vertex. If we draw these paths all on the same graph, we get the directed-graph analogue of a spanning tree, called an arborescence:

[Figure: the shortest-path arborescence rooted at a, using edges a → d (2), a → c (1), d → f (3), c → e (3), e → b (2).]

I The shortest path from a to b contains the shortest paths from c to b, a to e, etc.

I The recursive procedure described earlier would achieve the same efficiency as the equations approach if we maintained a cache of results of calling d(r, s), and consulted this cache for the answer each time the procedure was called. This is called memoization.
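As a rough illustration of memoization (not the lecture's own code), here is the recursive d(r, s) procedure in Java with a cache, assuming an acyclic graph stored as lists of incoming edges {x, weight(x, s)}; all names are invented for this sketch.

import java.util.*;

class MemoizedShortestPath {
    // Returns the shortest-path length from r to s, or Long.MAX_VALUE if s is unreachable.
    static long d(List<List<int[]>> inEdges, int r, int s) {
        long[] memo = new long[inEdges.size()];
        Arrays.fill(memo, Long.MIN_VALUE);          // MIN_VALUE marks "not yet computed"
        return d(inEdges, r, s, memo);
    }

    private static long d(List<List<int[]>> inEdges, int r, int s, long[] memo) {
        if (s == r) return 0;                        // base case
        if (memo[s] != Long.MIN_VALUE) return memo[s];
        long dist = Long.MAX_VALUE;
        for (int[] e : inEdges.get(s)) {             // e = {x, weight(x, s)}
            long sub = d(inEdges, r, e[0], memo);
            if (sub != Long.MAX_VALUE)
                dist = Math.min(dist, sub + e[1]);
        }
        return memo[s] = dist;                       // each subproblem is now solved only once
    }
}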

Dynamic Programming VII

I Finding shortest paths in a graph is a canonical example of dynamic programming. Dynamic programming can be applied when problems have the following properties:
1. Optimal substructure: an optimal solution is composed of optimal solutions to subproblems.
2. Overlapping subproblems: a recursive solution would solve the same subproblems repeatedly.

I A dynamic programming solution usually starts from some principle describing how solutions to subproblems can be combined into solutions to the entire problem. The Bellman equations shown earlier are an example of this.

I One exploits overlapping subproblems by remembering answers to subproblems. This can be done by memoization (caching function return values), or by maintaining a table or other data structure of answers to subproblems.

I Some applications of dynamic programming:

I route finding

Dynamic Programming VIII

I trajectory optimization

I solving discretized versions of variational problems

I finding optimal query strategies for relational databases

I Viterbi algorithm (Hidden Markov Models)

I metrical task systems (offline form)

I parsing

I approximation algorithms for NP-hard problems, for example, the knapsack problem

I calculating edit distances (e.g., the minimal number of local changes required to transform one string or tree into another.) Bibliography I

[1] Richard Bellman. On the theory of dynamic programming. Proc. Nat. Acad. Sci. U. S. A., 38:716–719, 1952.

[2] Allan Borodin, Nathan Linial, and Michael E. Saks. An optimal on-line algorithm for metrical task system. J. ACM, 39(4):745–763, 1992.

[3] Stuart Dreyfus. Richard Bellman on the birth of dynamic programming. Oper. Res., 50(1):48–51, 2002. 50th anniversary issue of Operations Research.

[4] Jack Edmonds. Matroids and the greedy algorithm. Math. Programming, 1:127–136, 1971.

Bibliography II

[5] Jean-Eric Pin. Tropical semirings. In Idempotency (Bristol, 1994), volume 11 of Publ. Newton Inst., pages 50–69. Cambridge Univ. Press, Cambridge, 1998.

ECE750-TXB Lecture 20: Amortization, Online algorithms


Todd L. Veldhuizen [email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 20, 2007



Part I

Amortized Analysis

Amortized Analysis I

I In accounting, amortization is a method for spreading a large lump-sum amount over a period of time by breaking it into smaller payments; for example, when one purchases a house the mortgage payments are arranged according to an amortization schedule.

I In algorithm analysis, amortization looks at the total cost of a sequence of operations, averaged over the number of operations. If a single operation is very expensive, that expense can be “amortized” over a long sequence of operations.

I If a sequence of m operations takes O(f (m)) time, we say the amortized cost per operation is O(m−1f (m)).

Amortized Analysis II

I If the worst case time per operation is O(g(m)), this implies the amortized time is O(g(m)) also, but the converse is not true: in amortized analysis we can allow a small number of very expensive operations, so long as that expense is "averaged out" (amortized) over a sufficiently long sequence of operations.

I Example: recall the binary tree iterator:

class BSTIterator implements Iterator {
    Stack stack;

    public BSTIterator(BSTNode t) {
        stack = new Stack();
        fathom(t);
    }

    public boolean hasNext() {
        return !stack.empty();
    }

    public Object next() {
        BSTNode t = (BSTNode) stack.pop();
        if (t.right_child != null)
            fathom(t.right_child);
        return t;
    }

    void fathom(BSTNode t) {
        do {
            stack.push(t);
            t = t.left_child;
        } while (t != null);
    }
}

Amortized Analysis IV

I If the binary tree contains n elements and is balanced, then the fathom() operation takes no more than O(log n) time; and there must be at least some nodes of depth ≥ c log n. Therefore each invocation of next() requires Θ(log n) time in the worst case.

I However, iterating through the entire tree by a sequence of next() operations requires O(1) amortized time:
1. The iterator visits each node at most twice: once when it is pushed onto the stack by fathom(), and once when it is popped from the stack by next().
2. The total time spent in the next() and fathom() methods is linear in the number of pushes and pops onto the stack.
3. Assuming push and pop operations take O(1) time (e.g., linked list implementation of stack), the total cost of iterating through the tree is O(n).
4. Therefore the amortized cost of the iteration is O(n⁻¹ · n) = O(1).

Dynamic Arrays I

I Arrays have two advantages over more complex data structures: (1) they are very fast to access, both for random access and for iterating through the contents of the array; (2) they are very efficient in memory use.

I For example, to store a set of n single-precision floating-point values (4 bytes apiece), 1. An array requires 4n + O(1) bytes; 2. A binary search tree requires ≈ 16n + O(1) bytes: for each tree node, need 4 bytes for the float, 2*4 bytes for the left/right child pointers. And typically small objects such as this are padded up to an alignment boundary, e.g., 16 bytes. So, a tree can take 3-4 times as much memory as an array, for storing small objects.

I However, appending items to an array can be very inefficient: if the array is full, one must usually allocate a new, larger array and copy the elements over, for a cost of O(n).

Dynamic Arrays II

I A dynamic array is one that keeps room for extra elements, resizing itself according to a schedule that yields an O(1) amortized time for append operations, despite the occasional operation taking O(n) time.

I The array maintains 1. A size (number of elements in the array) 2. A capacity (allocated size of the array) 3. The array itself.

I When an append operation is performed, size is incremented; if size exceeds capacity then:
1. Allocate a new array of size f(capacity), where f is to be determined;
2. Copy the elements 1..size to the new array;
3. Set capacity = the new capacity.
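A minimal Java sketch of such a dynamic array with geometric growth (the field and method names are mine; java.util.ArrayList works in essentially this way):

// Dynamic array of ints with doubling growth: O(1) amortized appends.
class DynamicIntArray {
    private int[] data = new int[1];   // allocated capacity
    private int size = 0;              // number of elements actually stored

    void append(int value) {
        if (size == data.length) {                        // out of room: resize
            int[] bigger = new int[data.length * 2];      // f(capacity) = 2 * capacity
            System.arraycopy(data, 0, bigger, 0, size);   // the occasional O(n) copy
            data = bigger;
        }
        data[size++] = value;
    }

    int get(int i) { return data[i]; }
    int size() { return size; }
}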

Todd L. Veldhuizen [email protected] I Now analyze the amortized time complexity. Consider a sequence of n insert operations, starting from an empty array. Each time we hit the array capacity, we incur a cost of O(n); other append operations incur only an O(1) cost. Time required for append

1 2 3 4 5 6 7 8 9 10 ...

Operation #

Dynamic Arrays IV

I Suppose the array has an initial capacity of 1. The cost of the resizings will be

Σ_{k : f^(k)(1) ≤ n} f^(k)(1)

where f^(0)(x) = x, and f^(i+1)(x) = f(f^(i)(x)).

I If we take f(k) = k + 16, i.e., we increase the capacity of the array by 16 elements each time we run out of room, then the total cost is O(n²).

I If however we choose f (k) = βk, with β > 1, then using the geometric series formula, we have a total cost of

Σ_{k : β^k ≤ n} β^k = (β^{m+1} − β)/(β − 1),  where m = log_β n
                    = β(n − 1)/(β − 1)

Todd L. Veldhuizen [email protected]

= O(n)

So, the amortized cost of appends into the dynamic array is O(n−1n) = O(1).

I Increasing the capacity of the array by, say, 5% each time we run out of space leads to an O(1) amortized time for appends.

ECE750-TXB Lecture 20: Amortization, Online algorithms

Todd L. Veldhuizen [email protected]

Bibliography Part II

Online Algorithms ECE750-TXB Online Algorithms I Lecture 20: Amortization, Online algorithms Consider the problem of assigning restaurant patrons to Todd L. I Veldhuizen tables. If you know in advance who wants to eat dinner [email protected]

at your restaurant, when they want to arrive, how long Bibliography they will stay, and how much they will spend, you can figure out in advance what subset of patrons to accept to maximize your revenue.

I In the real world, people just show up at restaurants expecting to be fed: when each party arrives, you must decide whether or not you can seat them; and once given a table, you can’t evict them before they are finished eating to make room for someone else.

I The difference is that between an offline and online problem.

I In an offline scenario, we know the entire sequence of requests that will be made in advance; in principle we can use this knowledge to find an optimal solution.

ECE750-TXB Online Algorithms II Lecture 20: Amortization, Online algorithms I In an online scenario, we are presented with requests one Todd L. at a time, and we must commit to a decision without Veldhuizen knowing what the subsequent requests might be. [email protected]

I Many realistic problems are online, for example: Bibliography

I Page replacement policies in operating systems and caches;

I Call routing in networks;

I Memory allocation;

I Data structure operations.

I The field of online algorithms studies such problems, in particular how online solutions compare to their optimal offline versions [1].

I In an online problem, one is presented with a request sequence I = (σ₁, σ₂, . . . , σₙ), and each σᵢ must be handled with no knowledge of σᵢ₊₁, σᵢ₊₂, . . . , σₙ. Once a decision of how to handle request σᵢ is made, it cannot be altered.

Online Algorithms III

I We characterize performance by assigning a cost to a sequence of decisions.

I Write OPT(I) for the optimal offline solution.

I If ALG(I) is an online algorithm, we say ALG is an (asymptotic) c-approximation algorithm if for all legal request sequences I,

ALG(I ) − c · OPT(I ) ≤ α

for some constant α not depending on I . If α = 0, ALG is a c-approximation algorithm, and we have

ALG(I ) ≤ c OPT(I )

The online algorithm yields a cost that is at most c times the optimal cost.

I c is called the competitive ratio.

Example: Load Balancing I

I Consider the following load balancing problem: we have n jobs to complete. Each job j ∈ {1,..., n} requires time T(j). We have m machines, each equally capable.

I We want to assign the jobs to machines so that we are finished all jobs as quickly as possible.

I In an offline version, we would know T [j] in advance. In an online version, we must assign job j to a machine knowing only the jobs 1..j − 1 and what machines they were assigned to.

I An assignment A : {1,..., n} → {1,..., m} where A(j) = i means job j is assigned to machine i.

Example: Load Balancing II

I The makespan is how long we must wait until all jobs are finished:

Makespan(A) = max_i Σ_{j : A(j)=i} T(j)

i.e., the largest total running time assigned to any one machine.

I There is an online greedy algorithm that provides a competitive ratio of 2 (i.e., the schedule chosen takes at most twice as long as the optimal offline version.) The algorithm is simple:

I Always assign job j to a machine with an earliest finishing time.
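A sketch of this greedy rule in Java, keeping the machines in a priority queue ordered by current finishing time (the representation and names are my own, not from the lecture):

import java.util.*;

class GreedyLoadBalancing {
    // times[j] is the duration of job j, presented online; returns the makespan of the schedule.
    static long schedule(int machines, int[] times) {
        // entries are {finishing time, machine id}, least finishing time first
        PriorityQueue<long[]> pq = new PriorityQueue<>(Comparator.comparingLong((long[] m) -> m[0]));
        for (int i = 0; i < machines; i++)
            pq.add(new long[]{0, i});
        long makespan = 0;
        for (int t : times) {
            long[] m = pq.remove();          // machine with the earliest finishing time
            m[0] += t;                       // assign the job to it
            makespan = Math.max(makespan, m[0]);
            pq.add(m);
        }
        return makespan;
    }
}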

I The proof of the competitive ratio stems from two observations:

Example: Load Balancing III

1. If the sum of the times of all jobs is, say, 60 minutes, and we have 3 machines, an optimal schedule can't possibly require less than 60/3 = 20 minutes. In general:

OPT ≥ (1/m) Σ_j T(j)

2. The optimal schedule has to be at least as long as the longest job:

OPT ≥ max_j T(j)

Example: Load Balancing IV

I Suppose machine i has the longest running time, and j is the last job assigned to machine i.


[Figure: a schedule with 3 machines; machine i finishes last, region (A) is the time before job j starts, and region (B) is job j itself.]

I During the time up to the beginning of job j, all the machines are in full use (region (A) in the figure above). Otherwise, there would be a machine finishing earlier than the one to which j has been assigned. The sum of all the times in region (A) is ≤ Σ_j T(j). Since OPT ≥ (1/m) Σ_j T(j), the length of time up until job j starts is ≤ OPT.

Example: Load Balancing V

I The length of region (B) in the above figure is the duration of the job j. Since OPT ≥ max_j T(j), region (B) is also ≤ OPT in duration.

I Therefore the time until all the jobs are finished is ≤ 2 · OPT.

I The greedy algorithm yields a schedule at most twice as long as the optimal offline solution; we say it is “2-competitive.”

I The current best known online algorithm is 1.9201-competitive; it is known that no deterministic algorithm can have a competitive ratio better than 1.88. There is a randomized algorithm achieving a competitive ratio of 1.916.

Bibliography I



[1] Allan Borodin and Ran El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, Cambridge, 1998.

ECE750-TXB Lecture 21: Memory hierarchy and locality

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering University of Waterloo Canada

March 22, 2007

I So far we have primarily been concerned with asymptotic efficiency of algorithms. Having chosen an algorithm that is efficient in theory, there are still serious engineering challenges to getting high performance in practice.

I To obtain decent performance on problems involving nontrivial amounts of data, you need to understand
1. How the memory hierarchy works;
2. How to take advantage of this.

I Historical performance of typical desktop machine:

Year   Clock cycle (ns)   DRAM latency (ns)   Disk latency (ms)
1980   500                375                 87
1990   50                 100                 28
2000   1                  60                  8

In two decades:

I CPU: 500x faster clock cycle.

I Main memory: 6x faster.

I Disk: 10x faster.

I It can now take hundreds of clock cycles to access data in main memory: “Memory is the new disk.”

I To mitigate the widening gap between CPU speed and memory access time, an elaborate memory hierarchy has evolved.

I Example memory hierarchy of a desktop machine:

              Bandwidth    Latency   Size     Block size
L1 Cache      20 Gb/s      1 ns      16 kb    -
L2 Cache      10 Gb/s      8 ns      1 Mb     -
Main memory   2 Gb/s       200 ns    2 Gb     64 bytes
Disk          0.08 Gb/s    10⁶ ns    400 Gb   1024 bytes

I Bandwidth is the rate at which data can be transferred in a sustained, bulk manner, scanning through contiguous memory locations. Main memory can supply data at 1/10th the rate of L1 cache.

I Latency is the amount of time that elapses between a request for data and the start of its arrival. Disk is a million times slower than L1 cache.

I Block size is the "chunk size" in which data is transferred up to the next-fastest level of the hierarchy.

I The speed at which a program can run is always determined by some bottleneck: the CPU, main memory, the disk. Programs where the bottleneck is main memory are called "memory bound" — most of the execution time is spent waiting for data to arrive from main memory or disk.

I A desktop machine with, say, a 2 GHz CPU effectively runs much slower if it is memory bound: e.g.,

I If L2 cache is the bottleneck: effective speed 1 GHz (about a Cray Y-MP supercomputer, circa 1988).

I If main memory is the bottleneck: effective speed 200 MHz (about a Cray 1 supercomputer, circa 1976).

I If disk is the bottleneck: effective speed between < 1 MHz (about a 1981 IBM PC) and ≈ 80 MHz, depending on access patterns.

Latency and throughput I

I Many operations (transfers of data from disk to memory, floating-point vector operations, network communication, etc.) follow a characteristic performance curve: slow for a small number of items, faster for a large number of items.

I Let R∞ be the asymptotic rate achievable (e.g., bandwidth).
I Let t₀ be the latency.
I Then, an operation on n items takes time ≈ t₀ + n/R∞.
I The effective rate is

R(n) = n / (t₀ + n/R∞)
     = R∞ / (1 + R∞ t₀ n⁻¹)
     ∼ R∞ − R∞² t₀ n⁻¹ + O(n⁻²)

Todd L. Veldhuizen I E.g. with R∞ = 1 and t0 = 10: [email protected]

1 Memory hierarchy 0.9

0.8 Memory layout

0.7 Locality 0.6

0.5 Bibliography

0.4

0.3

0.2

0.1

0 0 20 40 60 80 100 120 140 160 180 200

I A useful parameter: n₁/₂ is the value at which half the asymptotic rate is attained: R(n₁/₂) = (1/2) R∞.

n1/2 = t0R∞

Latency and throughput III

I n₁/₂ gives an approximate "chunk size" required to achieve close to the asymptotic performance. For example, a current typical disk has R∞ = 0.08 Gb/s and t₀ = 4 ms. For these parameters, n₁/₂ = 320000 bytes — about 320 kb.

I If you are dealing in chunks substantially smaller than n₁/₂, actual performance may be a tiny fraction of R∞.

I Example: performance of a FAXPY on a desktop machine. A FAXPY operation looks like this:

float X[N], Y[N];
for (int i = 0; i < N; ++i)
    Y[i] = Y[i] + a*X[i];

Latency and throughput IV

(The term FAXPY comes from the BLAS library, a living relic from Fortran 77.)

The following graph shows the millions of floating point operations per second (Mflops/s) versus the length of the vectors, N. Each plateau corresponds to a level of the memory hierarchy.

Latency and throughput V

[Figure: DAXPY benchmark — Mflops/s versus vector length (10⁰ to 10⁸), showing plateaus for each level of the memory hierarchy.]

The first part of the curve follows a typical latency-throughput curve (i.e., R∞, n₁/₂). It reaches a plateau corresponding to the L1 cache. Performance then drops to a second plateau, when the vectors fit in L2 cache. The third plateau is for main memory.

Latency and throughput VI

The abrupt drop at the far right of the graph indicates that the vectors no longer fit in memory, and the operating system is paging data to disk.

Units of memory transfer I

I Memory is transferred between levels of the hierarchy in minimal block sizes:

I Between CPU and cache: a word, typically 64 or 32 bits.

I Between cache and main memory: a cache line, typically 32, 64 or 128 bytes. (Pentium 4 has 64-byte cache lines.)

I Between main memory and disk: a disk block or page, often 512 bytes to 4096 bytes for desktop machines, but sometimes 32kb-256kb for serving large media files.

Units of memory transfer II

Units of memory transfer II

[Figure: relative sizes of transfer units: a byte, a word (64 bits), a cache line (64 bytes), and a disk block (1024 bytes).]

I The ideal memory access pattern for good performance:
  I Appropriate-sized chunks (cache lines, blocks);
  I in contiguous memory locations, e.g., reading a long sequence of disk pages that are stored consecutively;
  I doing as much work as possible on a given piece of data before moving on to other memory locations.

Units of memory transfer III

I Memory layout of arrays and data structures is a crucial performance issue.
I Memory layout is largely beyond your control in languages like Java that provide a programming model far removed from the actual machine.
I We'll look at some examples in C/C++, where memory layout can be controlled at quite a fine level of detail.

Matrix Multiplication I

I Example: how to write a matrix-multiplication algorithm that will perform well?
I NB: In general, you would be well advised to use a matrix multiplication routine from a high-performance library, and not waste time tuning your own. (But, the principles are worth knowing.)
I Here is a naive matrix multiplication:

    for i=1 to n
      for j=1 to n
        for k=1 to n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)

I If the matrices are small enough to fit in cache, this may perform acceptably.
I For large matrices, this is disastrously slow.

Matrix Multiplication II

I Assuming C-style (row major) arrays, the layout of the B matrix in memory looks like this:

[Figure: the elements of B laid out in memory row by row: B(1,1), B(1,2), B(1,3), ..., then B(2,1), B(2,2), ..., and so on. The innermost loop index k moves down a column, while a highlighted 64-byte cache line lies along a row.]

I Each element of the array is a double that takes 8 bytes.
I The innermost loop (k) is iterating down a column of the B matrix.
I Light-blue shows a cache line of 64 bytes. It lies along a row of the matrix.

Matrix Multiplication III

I If 64·n is greater than the cache size, each element of B accessed in the innermost loop will bring in an entire cache line from memory (64 bytes), of which only 8 bytes will be used. The remaining 56 bytes will be discarded before they are used.
I The innermost loop travels over a row of A, so it will travel along (rather than across) cache lines.
I Each step of the innermost loop will bring in ≈ 64 + 8 bytes from memory on average, but only use 16 bytes of it: an efficiency of only about 22%. (By transposing the B matrix, we could make it run 4.5 times faster.)

Matrix Multiplication IV

I A generally useful strategy is blocking (also called tiling): perform the matrix multiplication over submatrices of size m × m, where m is chosen so that 8m² ≪ cache size. Partition A and B into blocks:

    [ A11 A12 ... ]   [ B11 B12 ... ]
    [ A21 A22     ]   [ B21 B22     ]
    [  .          ]   [  .          ]

If each submatrix is of size m × m, then a multiplication such as A11·B11 requires 16m² bytes from memory, but performs 2m³ floating point operations. The number of floating point operations per byte of memory accessed is 2m³ / 16m² = m/8. As m is made bigger, the overhead of accessing main memory can be made very small: the expensive memory access is amortized over a large amount of computation.

Matrix Multiplication V

I Linear algebra libraries (e.g., ATLAS) can automatically tune themselves to a machine's memory hierarchy by choosing appropriate block sizes, unrollings of inner loops, etc. [2, 9].

Example: Linked lists I

I Designing data structures that will perform well on memory hierarchies is nontrivial:
  I Data structures are often built from small pieces (e.g., nodes, vertices) that may be smaller than a cache line.
  I Manipulation of data structures often entails following pointers unpredictably through memory, so that prefetching hardware is unable to anticipate memory locations that will be needed.
I Many techniques developed in the past for data structures on disk (called external memory data structures [1, 8]) are increasingly relevant as techniques for managing data structures in main memory!
I Example: Linked lists are useful for maintaining lists of unpredictable size, e.g., queues, hash table chaining, etc.
I Recall that a simple linked list has a structure such as

    struct ListNode {
      float data;
      ListNode* next;
    };

(Here we are storing a list of floating-point values.)

I The memory layout of this structure in a 64-byte cache line:

[Figure: a 64-byte cache line in which the 8-byte ListNode (data, next) occupies only the first portion.]

Example: Linked lists III

I Each ListNode occupies 8 bytes. But, if it is brought in from main memory we will read an entire cache line (say, 64 bytes). Unless we're lucky, those extra 56 bytes contain nothing of value; we only get 4 bytes of actual useful data (the data field) for 64 bytes read from memory, an efficiency of only 4/64 = 6.25%.
I This style of data structure is among the worst possible for reading from main memory:
  I Iterating through a list means "pointer-chasing": following pointers to unpredictable memory locations. Some processors can exploit constant-stride memory access patterns and prefetch data, but are powerless against irregular access patterns.
  I Each list node accessed brings in 64 bytes of memory, only 4 bytes of which may be useful!

Example: Linked lists IV

I A better layout is to compromise between a linked list and an array: have linked list nodes that contain little arrays, sized so that each list node fills a cache line:

    struct ListNode2 {
      float data[14];
      int count;
      ListNode2* next;
    };

This version stores up to 14 pieces of data per node, and uses the count field to keep track of how full it is. The ListNode2 structure is exactly 64 bytes long (assuming a 4-byte int and 4-byte pointers). We get 56 bytes of useful data for every 64-byte cache line: about 87.5% efficiency if most of the nodes are full.

Example: Linked lists V

I The following graph shows benchmark results comparing the performance of an array, a linked list, and linked lists of arrays (sized so that each element of the list fills 1, 4, or 16 cache lines). The operation being timed is to scan through the list and add all the data elements:

    ListNode* p = first;
    do {
      s += p->data;
      p = p->next;
    } while (p != 0);

Example: Linked lists VI

[Figure: access rate (millions of items per second, 0 to 450) versus number of items in the list (10^1 to 10^8), for an array, a linked list, and lists of arrays sized to 1, 4, and 16 cache lines.]

Example: Linked lists VII

I In the L1 cache, the linked list is very fast. But, compare performance to other regions of the memory hierarchy (numbers are millions of items per second):

                  Array   Linked list   Linked list of arrays (1 cache line)
    L1 cache        411       411           317
    L2 cache        406        90           285
    Main memory     377         7            84

  In L2 cache, the list-of-arrays is 3 times faster than the linked list; in main memory it is 12 times faster.
I In main memory, the linked list is 50 times slower than the array.
I The regular memory access pattern for the array allows the memory prefetching hardware to predict what memory will be needed in advance, resulting in the high performance for the array.
I Lessons:

  I Data structure nodes of size less than one cache line can cause serious performance problems when working out-of-cache.
  I Performance can sometimes be improved by making "fatter" nodes that are one or more cache lines long.
  I Nothing beats accessing a contiguous sequence of memory locations (an array). For this reason, many cache-efficient data structures "bottom out" to an array of some size at the last level.

Locality I

I Efficient operation of the memory hierarchy rests on two basic strategies:
  I Caching memory contents, in the hope that memory, once used, is likely to be soon revisited.
  I Anticipating where the attention of the program is likely to wander, and prefetching those memory contents into cache.
I The caches and prefetching strategies of the memory hierarchy are only effective if memory access patterns exhibit some degree of locality of reference.
  I Temporal locality: if a memory location is accessed at time t, it is more likely to be referenced at times close to t.
  I Spatial locality: if memory location x is referenced, one is more likely to access memory locations 'close' to x.
I Denning's notion of a working set [3, 4]:

Locality II

I Let W(t, τ) be the items accessed during the time interval (t − τ, t). This is called the working set (or locality set).

I Denning's thesis was that programs progress through a series of working sets or locales, and while in that locale do not make (many) references outside of it. Optimal cache management then consists of guaranteeing that the locale is present in high-speed memory when a program needs it.
I Working sets are a model that programs often adhere to. One of the early triumphs of the concept was explaining why multiprocess systems, when overloaded, ground to a halt (thrashing) instead of degrading gracefully: thrashing occurred when there was not room in memory for the working sets of all the processes at the same time.
I The memory hierarchy has evolved around the concept of a working set, with the result that "programs have small working sets" has gone from being a descriptive statement to a prescription for good performance.

Locality III

I Let ω(t, τ) = |W(t, τ)| be the number of distinct items accessed in the interval (t − τ, t), i.e., the size of the working set. If ω(t, τ) grows rapidly as a function of τ, then caching of recently used data will have little effect. If however it levels off, then a high "hit rate" in the cache can be achieved.

[Figure: ω(t, τ) versus τ. A curve that keeps growing past the cache size means caching is ineffective; a curve that levels off below the cache size indicates good locality of reference.]

Locality IV

I Algorithms can sometimes be manipulated to improve their locality of reference. The matrix multiplication example seen earlier was an example: by multiplying m × m blocks instead of columns, one can do O(m³) work while reading only O(m²) memory locations.

I Good compilers will try to do this automatically for certain simple arrangements of loops, but the general problem is too difficult for automated solutions.
I Some standard strategies for increasing locality:
  I Blocking and tiling: decompose multidimensional data sets into a collection of "blocks" or "tiles". Example: if a large image is stored on disk column-by-column, then rendering a small region (dotted line) requires

    retrieving many disk pages (left). If instead the image is broken into tiles, many fewer pages are needed (right).
  I Iteration-space tiling or pipelining: in certain multistep algorithms, instead of performing one step over the entire structure, one can perform several steps over each local region.
  I Space-filling curves: if an algorithm requires traversing some multidimensional space, one can sometimes improve locality by following a space-filling curve, for example:

[Figure: an example space-filling curve traversing a two-dimensional grid.]

  I Graph partitioning: if a dataset lacks a clear geometric structure, one can sometimes still achieve results similar to tiling by partitioning the dataset into regions that have small "perimeters" relative to the amount of data they contain. There are numerous good algorithms for this, in particular from the VLSI and parallel computing communities [7, 5, 6]. Indeed, the idea of "tiling" can

ECE750-TXB Locality VII Lecture 21: Memory hierarchy and locality

    be regarded as a special case of graph partitioning for very regular graphs.
  I Vertical partitioning: the term comes from databases, but the concept applies to the design of memory layouts also. A simple example is the layout of complex-valued arrays, where each element has a real and an imaginary component. One could store pairs (a, b) for each array element to represent a + bi, or one could have two separate arrays, one for the real and one for the imaginary components. These two layouts have very different performance characteristics! Another common situation where vertical partitioning can apply: suppose a computation proceeds in phases, and different information is needed during each phase. E.g., in a graph algorithm one might have:

    struct Edge {
      Node* from;
      Node* to;
      float weight;   /* Needed in phase one */
      Edge* parent;   /* Phase two: union-find */
    };

    Instead of having all the data needed for each phase in one structure, it might be better to break up the Edge object into several structures, one for each phase of the algorithm.
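As an illustrative sketch of such a split (our own example, not from the lecture; the type names EdgeTopology and EdgeUnionFind are invented), the per-phase fields of Edge can be kept in parallel arrays indexed by edge number:

    #include <vector>

    struct Node;  // declared elsewhere

    // Phase one only needs the topology and weights.
    struct EdgeTopology {
        Node* from;
        Node* to;
        float weight;
    };

    // Phase two (union-find) only needs the parent links.
    struct EdgeUnionFind {
        int parent;   // index of the parent edge, or -1
    };

    // The two arrays are indexed by the same edge number, so each
    // phase streams through only the data it actually touches.
    struct EdgeData {
        std::vector<EdgeTopology> topology;    // used in phase one
        std::vector<EdgeUnionFind> unionFind;  // used in phase two
    };

Storing the parent link as an index rather than a pointer is a design choice made here to keep the per-phase records compact.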

External Memory Data Structures I

I External memory data structures (also known as out-of-core data structures) are used when the data set is too large to fit in memory, and must be stored on disk. Databases are the primary application.
I Some good surveys: [1, 8].
I Basic concerns:
  I Disk has very high latency: often ≈ 10^6 clock cycles.
  I Bandwidth is lower than main memory, perhaps 1/10th or 1/20th the rate.
  I Block sizes are very large (multiples of 1 KB) compared to main memory.
  I Only a small fraction of the data can be in memory at once; only a minute fraction can be in cache.
  I Disk space is cheap (in dollar cost).
I Basic coping strategies:
  I High latency = very large n1/2 ⇒ "fat" data structure nodes that contain a lot of data, stored contiguously in arrays.
  I The high latency (≈ 10^6 clock cycles) means that a lot of computational resources can be expended to decide on the best strategy for storing and retrieving data:
    I External memory data structures often manage their own I/O, rather than relying on the operating system's page management, since they can better predict what pages to prefetch and evict.
    I Databases expend a lot of effort to find a good strategy for answering a query, since a bad strategy can be disastrously costly.

I To improve the rate of data transfer, data compression and parallel I/O (parallel disk model) are sometimes used.

I Since disk space is cheap, it can pay off to duplicate data in several different forms, each appropriate for a certain class of queries.

External Memory Data Structures III

I Classic example: the B-tree, a balanced multiway search tree suitable for external memory.
I Each node has Θ(B) children, and the tree is kept balanced, so that the height is roughly logB N.
I Supported operations: find, insert, delete, and range search (retrieve all records in an interval [k1, k2]).
I When insertions cause a node to overflow, it is split into two half-full nodes, which can cause the parent node to overflow and split, etc.
I Typically the branching factor is very large, so that a massive data set can be stored in a B-tree of height 3 or 4, for example.
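To connect this to the earlier discussion of block sizes, here is a minimal sketch (our own; the block size and branching factor are assumptions, not values from the lecture) of how a B-tree node can be laid out to fill one disk block:

    // Sketch of a B-tree node sized to fill a single disk block.
    // B is chosen so that sizeof(BTreeNode) is close to the block size
    // (a 4096-byte block is assumed here).
    const int BLOCK_SIZE = 4096;
    const int B = 255;   // keys per node (illustrative)

    struct BTreeNode {
        int   numKeys;          // how many keys are in use
        bool  isLeaf;
        long  keys[B];          // sorted keys
        long  children[B + 1];  // block numbers of child nodes (unused in leaves)
    };
    // One node holds hundreds of keys, so a tree of height 3 or 4 can index
    // a very large data set while each search reads only 3 or 4 blocks from disk.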

Summary: Principles for effective use of the memory hierarchy I

I Algorithms should exhibit locality of reference; when possible, rearrange the order in which computations are performed so as to reuse data that will be in cache.
I The ability of the memory hierarchy to cache and prefetch data depends on predictable access patterns. When working out-of-cache:
  I Performance is best when a long, contiguous sequence of memory locations is accessed (e.g., scanning an array).

  I Performance is worst when accesses touch many small pieces of data scattered through memory in an unpredictable pattern (e.g., pointer-chasing through linked lists, binary search trees, etc.).
I Design data layouts so that:
  I Items used together are stored together.
  I Items not used together are stored apart.
I If nodes of a data structure reside at a level of the memory hierarchy where transfer blocks are of size B, then those nodes should be of size k · B when possible, where k > 1; i.e., for main-memory data structures, use nodes that are several cache lines long; for data structures on disk, use nodes that are several pages in size.
I Maximize the amount of useful information in each block.

Bibliography

[1] Lars Arge. External memory data structures. In Handbook of Massive Data Sets, pages 313–357. Kluwer Academic Publishers, Norwell, MA, USA, 2002.

[2] Jeff Bilmes, Krste Asanović, Chee-whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.

[3] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, 1968.

[4] Peter J. Denning. The locality principle. In J. Barria, editor, Communication Networks and Computer Systems, pages 43–67. Imperial College Press, 2006.

[5] Josep Díaz, Jordi Petit, and Maria Serna. A survey of graph layout problems. ACM Comput. Surv., 34(3):313–356, 2002.

[6] Ulrich Elsner. Graph partitioning: A survey. Technical Report S393, Technische Universität Chemnitz, 1997.

[7] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, page 28, New York, NY, USA, 1995. ACM Press.

[8] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.

[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, January 2001.

ECE750-TXB Lecture 22: NP-Complete problems

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

March 27, 2007

Tractable problems I

I Recall that P is the class of decision problems (i.e., those with yes/no answers) that can be answered in time O(n^c) on a Turing machine, c a constant.
I In the 1960s, proposals were made to identify the class P with "feasible" problems, i.e., problems for which computers can obtain an answer in a reasonable amount of time. (One early proponent of this idea was Jack Edmonds, a professor at Waterloo and a pioneer of combinatorial optimization.)
I This idea of 'P = feasible' has gained widespread acceptance, of course modulo the common-sense caveats that must accompany such a blanket generalization: algorithms requiring 10^100 · n² or O(n^1000) time are unlikely to be practical. However, most 'natural' problems in P (those that arise from questions of interest, as opposed to those artificially constructed by, e.g., diagonalization) have low exponents such as O(n), O(n²), O(n³), etc. Very few interesting problems in P have best-known exponents of n^10 or more. So, equating P with practical algorithms is a quite reasonable generalization.
I However, many very important problems are not known to be in P. Some of these problems are conceptually very close to problems that are known to be in P:
  I We have seen a polynomial-time algorithm to find the shortest path between two vertices in a graph. What about finding the longest path between two vertices? Nobody knows if this problem can be solved in polynomial time.

Tractable problems III

  I An Euler tour of a connected graph is a path that starts and ends at the same vertex and traverses each edge exactly once. Testing whether a graph has an Euler tour is trivial: a graph has an Euler tour if and only if every vertex has an even number of edges incident on it. This can clearly be tested in polynomial time! On the other hand, a Hamiltonian circuit is a path that starts and ends at the same vertex and visits each vertex exactly once. It is not known whether Hamiltonian-circuit has a polynomial time decision procedure.
I Many of the interesting problems not known to be in P are easily seen to belong to the class NP, which informally is the class of problems whose answers we can check in polynomial time. More technically: NP is the class of decision problems where, if the answer is YES, there exists a certificate (a string in some alphabet) that can be checked in polynomial time by a Turing machine.

I The terms proof and witness are commonly used in place of certificate.

I NP stands for nondeterministic polynomial time.

I A deterministic machine is one whose state transition relation allows only one possible trace: every state is succeeded by a uniquely defined state, so that a trace looks like

[Diagram: a linear chain of states, each state followed by a uniquely defined successor.]

Tractable problems V

I A nondeterministic machine can branch simultaneously into several successor states, and the semantics are typically defined so that the machine accepts an input if any of its branches accept. A trace is a tree branching into the future:

[Diagram: a computation tree in which each state may branch into several successor states.]

Tractable problems VI

I Do not confuse nondeterminism with randomness! If you know what threads are, a good conceptual model is a multithreaded program where each thread can create another new thread whenever it wants, and all threads run simultaneously with no slowdown, as if there were an infinite number of processors and no resource contention.

I A nondeterministic machine can simultaneously explore all possible certificates, and halt with a YES answer if it finds one that certifies the answer is YES.

I The following diagram illustrates the containment relationships between the classes we will visit in this lecture:

Tractable problems VII

[Diagram: containment relationships among the classes P, NP, co-NP, NP-Complete, and NP-hard.]

I Note that NP contains the class P: every problem in P is automatically an NP problem also.

I Whether NP is a strict superset of P is a central problem in complexity theory, and one of the Clay Institute Millennium Problems. It could be that:
  1. P ≠ NP; or
  2. P = NP; or
  3. the truth of 'P = NP' is independent of the usual (ZF) set theory axioms, as the continuum hypothesis and axiom of choice were found to be.
I It is widely believed that P ≠ NP, i.e., that hard problems in NP require superpolynomial time.
I Note that 'superpolynomial' includes, but is not equivalent to, exponential time. The class of superpolynomial subexponential functions includes functions of the form n^f(n) where 1 ≺ f ≺ n/ln n. For example, n^(1 + log log n) is superpolynomial and subexponential.

I A didactic ditty on the theme of superpolylogarithmic subexponential functions may be found in [5].

Circuit-Satisfiability I

I Example: Circuit-satisfiability. Given a single-output boolean circuit composed of AND, OR, and NOT gates, is there a setting of the input signals causing the output to be true? (If so, the circuit is said to be satisfiable.)

[Figure: an example circuit with inputs a, b, c and output d.]

I Clearly we can answer this problem by enumerating a truth table of all possible inputs:

      a b c | d
      0 0 0 | 0
      0 1 0 | 0
      1 0 0 | 0
      1 1 0 | 0
      0 0 1 | 0
      0 1 1 | 1  ←

Circuit-Satisfiability II

I This requires 2^n time, where n is the number of input signals.
I A 'certificate' that the circuit is satisfiable is a setting of the input signals, {a = 0, b = 1, c = 1}, causing the output to be true. We can say that {a = 0, b = 1, c = 1} is a witness to the satisfiability of the circuit.
I A verifier can check very efficiently (on a RAM model in, say, O(m) time where m is the number of gates) that this combination of inputs does, in fact, produce d = 1. Hence the problem is in NP.
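To make the idea of a polynomial-time verifier concrete, here is a minimal sketch (our own; the circuit representation, type names, and function name are assumptions, not from the lecture) that checks a candidate assignment against a circuit given as a list of gates:

    #include <vector>

    // A circuit is a list of gates in topological order; each gate reads
    // either input signals or the outputs of earlier gates.
    enum GateOp { AND, OR, NOT };
    struct Gate { GateOp op; int in0, in1; };   // in1 is ignored for NOT

    // Verify a certificate: evaluate the circuit on the given inputs and
    // check that the last gate's output (the circuit output) is true.
    // Runs in O(m) time for m gates.
    bool check_certificate(const std::vector<bool>& inputs,
                           const std::vector<Gate>& gates) {
        std::vector<bool> val = inputs;          // signals 0..n-1 are inputs
        for (const Gate& g : gates) {
            bool a = val[g.in0];
            bool b = (g.op == NOT) ? false : val[g.in1];
            val.push_back(g.op == AND ? (a && b)
                        : g.op == OR  ? (a || b)
                        :               !a);
        }
        return val.back();                       // output of the final gate
    }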

ECE750-TXB The class co-NP I Lecture 22: NP-Complete problems

Todd L. I The class co-NP consists of decision problems having a Veldhuizen NO certificate (proof,witness) that can be checked in [email protected] polynomial time.

I Example: circuit equivalence. Decision problem: given two single-output boolean circuits, do they compute the same function? a b

d c

d'

Are d and d0 always the same? ECE750-TXB The class co-NP II Lecture 22: NP-Complete problems

I Again, we can decide whether the two circuits are Todd L. equivalent by enumerating the truth table: Veldhuizen [email protected] a b c d d 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1 0 ←

They differ on the input {a = 1, b = 0, c = 1}. The answer to the decision problem is NO.

I The input signal setting {a = 1, b = 0, c = 1} is a witness to the fact that the two circuits are not equivalent.

The class co-NP III

I Note that the complement of an NP set is a co-NP set, and vice versa.

I Example: the set of circuits that are not satisfiable is a co-NP set. (The complement of satisfiable circuits.)

I Example: the set of pairs of circuits that are inequivalent is an NP set. (The complement of equivalent pairs of circuits.)

I It is believed that P ≠ NP ∩ co-NP, but this is not known.


Part I

Some examples of NP problems

Decision vs. optimization problems I

I The classes P and NP are defined in terms of decision problems with YES/NO answers.
I Many of the interesting problems in NP are optimization problems, where one aims to minimize or maximize an objective function. Such problems can be expressed in two versions:
  1. The decision version: Does there exist a ... of cost less than/more than ...? (This has a YES/NO answer, and can be said to be in the class NP if the objective function can be evaluated in polynomial time, etc.)
  2. The optimization version: Find a ... of minimal/maximal cost/profit. (There is a class NPO of NP optimization problems we shall see next lecture, in which problems are expressed in this form.)
I The class of NP-hard optimization problems includes many of high industrial relevance:
  I VLSI layout

  I optimizing compilers (register allocation, alias analysis)

I choosing warehouse locations

I vehicle routing (e.g., designing delivery routes)

I designing networks (fibre, highways, subways, public transit)

I spectrum allocation on wireless networks

I airline schedule planning

I packing shipping containers

I choosing an investment portfolio

I verification of circuits

I mixed-integer programming

I (simplified models of) protein folding

I and thousands more!

Example: Graph colouring I

I Graph colouring: given a graph G, an r-colouring of G is a function χ : V → {1, 2, ..., r} from vertices to "colours" 1, 2, ..., r such that no two adjacent vertices have the same colour: if (v1, v2) ∈ E, then χ(v1) ≠ χ(v2).

I The least r such that an r-colouring is possible is called the chromatic number of G.

I Graph colouring is an NP problem: a witness to the decision problem “Does there exist a colouring with ≤ r colours?” is an r-colouring of the graph.

I Graph colouring has numerous practical applications. It is an excellent tool for reasoning about contention for resources.

Example: Graph colouring II

I For example, one of the last steps of compiling a program is to emit assembly code. The program will run faster if variables are placed in registers, rather than in stack- or heap-locations. In a typical register allocation scenario, one has a sequence of operations such as this:

    y := 3
    x := y + 1
    z := x + 2
    w := x + 1
    u := z * w

Each variable has a liveness range: from its initial assignment to its last use. We want to place these variables in registers in such a way that a register is never simultaneously being used to store two variables. (Here, the registers are the resource under contention.)

Example: Graph colouring III

I This problem can be solved by graph colouring [1]:
  I Construct a graph whose vertices are variables, and there is an edge (x, y) if the liveness ranges of x and y overlap.

[Figure: the code sequence (ending with return u) annotated with the liveness range of each variable, and the resulting interference graph on the vertices y, x, z, w, u.]

  I Registers are colours: a valid colouring ensures that if the liveness ranges of two variables overlap, they will be in different registers.

Example: Graph colouring IV

I In this example, three registers suffice: put the 'red' variables in r1, the 'green' variables in r2, and the 'blue' variable in r3.

    r1 ← 3
    r2 ← +(r1, 1)
    r1 ← +(r2, 2)
    r3 ← +(r2, 1)
    r2 ← *(r1, r3)

I Some other examples of colouring:

  I Spectrum allocation: given a set of transmitter locations, construct a graph where the vertices are transmitters and edges indicate pairs of transmitters close enough to interfere with each other if assigned the same frequency. Then, the chromatic number of this graph tells you how many different frequencies are required for none of the towers to interfere with each other. A colouring gives you a frequency assignment that achieves this.
  I Circuit layout: given a planar layout of a circuit, build a graph where vertices represent wires and there is an edge between two wires if they cross. Then, the chromatic number tells you how many layers suffice to route all the signals with none of them crossing. (This also works for subways!)
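Finding the chromatic number is NP-hard, but a simple greedy heuristic often gives a usable colouring in practice. Here is a minimal sketch (ours, not from the lecture) that colours vertices in index order, giving each vertex the smallest colour not already used by a neighbour:

    #include <vector>

    // Greedy colouring: colour vertices 0..n-1 in index order.  Uses at most
    // maxdegree+1 colours; it does not necessarily find the chromatic number.
    std::vector<int> greedy_colouring(int n,
            const std::vector<std::vector<int>>& adj) {
        std::vector<int> colour(n, -1);
        for (int v = 0; v < n; ++v) {
            std::vector<bool> used(n, false);
            for (int u : adj[v])
                if (colour[u] >= 0) used[colour[u]] = true;
            int c = 0;
            while (used[c]) ++c;    // smallest colour not used by a neighbour
            colour[v] = c;
        }
        return colour;
    }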

Example: Steiner networks I

I Suppose a company has a number of downtown locations it wants to connect with a high-speed fibre network. Suppose fibre costs ≈ $100/m to lay in a downtown area, and can only travel along streets.

[Figure: a map of streets with the company's locations marked.]

I Problem: find a minimal-cost network connecting the locations.

[Figure: a street map with locations and a candidate connecting network.]

Example: Steiner networks II


I This is the rectilinear Steiner tree problem, which has been used in VLSI layout to route signals. (‘Rectilinear’ here refers to the constraint that edges travel only up-down or left-right.)

I Finding a minimal Steiner tree is an optimization problem, the decision version of which (Does there exist a network of cost < x?) is in NP.

Example: Graph Partitioning I

I Problem: Given a weighted graph G = (V, E, w), where w : E → Q⁺, partition V into V1, V2 where ||V1| − |V2|| ≤ 1 and so that the total weight of cut edges is minimized:

    minimize   Σ w(v1, v2)   over (v1, v2) ∈ E with v1 ∈ V1, v2 ∈ V2

I This has numerous applications in parallel load balancing, circuit layout, improving locality of reference, etc.

Part II

NP-Complete problems

Propositional Satisfiability I

I Recall that propositional logic consists of:

    Boolean variables   V = {p1, p2, ...}
    Literals            pi, ¬pi
    Formulas            ϕ ::= pi | ϕ ∧ ϕ | ϕ ∨ ϕ | ¬ϕ

I A formula in conjunctive normal form (CNF) is a conjunction of disjunctions, e.g.,

(p1 ∨ p2) ∧ (p2 ∨ p3 ∨ p4) ∧ (p1 ∨ p3)

I ⊤ is true, ⊥ is false.

Propositional Satisfiability II

I A truth assignment is a substitution ρ : V → {⊤, ⊥}. We extend ρ to a function ρ̄ on formulas:

    ρ̄(pi)    = ρ(pi)
    ρ̄(¬ϕ)    = ⊤ if ρ̄(ϕ) = ⊥,  ⊥ if ρ̄(ϕ) = ⊤
    ρ̄(α ∨ β) = ⊤ if ρ̄(α) = ⊤ or ρ̄(β) = ⊤,  ⊥ otherwise
    ρ̄(α ∧ β) = ⊤ if ρ̄(α) = ⊤ and ρ̄(β) = ⊤,  ⊥ otherwise

I The propositional satisfiability problem, or SAT: given a propositional formula ϕ with free variables p1, ..., pk, decide whether there exists a truth assignment ρ such that ρ̄(ϕ) = ⊤.

Propositional Satisfiability III

I Obviously SAT and Circuit-satisfiability are closely related. We can formalize this by means of a "Karp reduction" (or many-one reduction):

  I Let us make the simplifying assumption that all decision problem instances can be expressed as binary strings. For example, a boolean circuit can be described by a binary string that enumerates the gates and their interconnections in some prefix-free code. Letting Z ⊆ Σ* be the set of all strings representing circuits that are satisfiable, the decision problem "Is this circuit satisfiable?" becomes "Is the string x (representing the circuit) in the set Z?"

I A many-one, or Karp, reduction from a problem Y to a problem Z is a function r :Σ∗ → Σ∗ computable in polynomial time such that y ∈ Y if and only if r(y) ∈ Z.

I Intuitively, we take a problem of one sort and turn it into a problem of a different sort, while preserving the YES/NO outcome of the decision.

Propositional Satisfiability IV

I For example, let Y be the set of binary strings representing satisfiable boolean formulas, and Z be the set of binary strings representing satisfiable circuits. To reduce Y to Z, we just convert the boolean formulas into circuits: r(α ∧ β) becomes an AND gate whose inputs are r(α) and r(β), and so forth. This translation from a boolean formula to a circuit can be done in polynomial time: SAT is many-one reducible to Circuit-SAT.

I Note: If Y is Karp-reducible to Z, then a polynomial-time solution for Z would imply a polynomial-time solution for Y. (We can turn instances of Y into instances of Z in polynomial time.)

Propositional Satisfiability V

I Karp-reducibility is a preorder: transitive and reflexive. We can write

    Y ≤ᴾₘ Z

  to mean Y is Karp-reducible to Z. (The P stands for polynomial time, and the m stands for many-one.)
I Recall that whenever we have a preorder, we can turn it into an equivalence relation and a partial order (a hierarchy):
  I Define X ∼ Y to mean (X ≤ᴾₘ Y) ∧ (Y ≤ᴾₘ X). (This is an equivalence relation, whose equivalence classes consist of problems that are interreducible. These equivalence classes are sometimes called degrees.)
  I Then define [X]∼ ≤ [Z]∼ on equivalence classes by [X]∼ ≤ [Z]∼ iff X ≤ᴾₘ Z.
  I The class P is the least element of the partial order.

Propositional Satisfiability VI

I If we restrict ourselves to problems in NP, this poset has a maximal element: the class NPC of NP-complete problems.
I It is known that if P ≠ NP, then there are equivalence classes strictly between P and NPC, called NP-intermediate problems. Proving the existence of such a class would immediately imply P ≠ NP.
I Defn: X is NP-hard if, for any Y ∈ NP, Y ≤ᴾₘ X. (Note: X is not required to be in NP. For example, the halting problem is NP-hard.)
I Defn: X is NP-complete if:
  1. X ∈ NP; and
  2. X is NP-hard.
I Roughly speaking, NPC (the set of NP-complete problems) consists of the very hardest problems known to be in NP; if we could solve just one of those problems in polynomial time, then every problem in NP could be solved in polynomial time, which would imply P = NP.

Propositional Satisfiability VII

I The first NP-completeness proof was the Cook-Levin theorem (1971): propositional satisfiability (SAT) is an NP-complete problem. (Stephen Cook is a professor at the University of Toronto.)

I Proof sketch: encode valid traces of a nondeterministic Turing machine running in s steps and s tape locations as a propositional formula. Set boundary conditions so that if the formula is satisfiable the machine halts with a YES output.

I Import: any problem in NP is Karp-reducible to SAT; a polynomial time algorithm for SAT would imply P = NP.

I To prove a problem X is NP-complete, it is necessary to prove that (i) the problem is in NP (this is usually easy); and that (ii) some problem already known to be in NPC can be reduced to X.

Propositional Satisfiability VIII

I Since the original NPC problems were discovered, many useful problems have been proven to be in this class; there are some 1000+ known NPC problems!
I The classic text on the subject: Garey and Johnson [2].
I Why is knowing whether a problem is in NPC important?

  I You know not to waste your time looking for a polynomial-time algorithm (unless you have a fondness for tilting at windmills).

  I There is a rich understanding of how NPC optimization problems can be approximately solved in polynomial time, e.g., to within a small difference from optimality. Identifying a problem as NPC is a first step to figuring out how best to approach its approximation.

Example NPC proof: Independent set I

I To determine if a problem is NP-complete, your first step should be to check a catalogue of known NP-complete problems, such as the online survey "A compendium of NP optimization problems".

I We will look at a simple example of an NP-completeness proof as an example of how such results are obtained.
I In a graph G = (V, E), a subset of vertices V′ ⊆ V is an independent set if for all v1, v2 ∈ V′, there is no edge (v1, v2) ∈ E.

[Figure: a graph in which the green vertices form an independent set of size 4.]

Example NPC proof: Independent set II

I An independent set V′ is the 'opposite' of a clique, in the sense that the vertices V′ form a clique in the complement graph (V, V² \ E).

I The Independent Set problem: given a graph G = (V , E) and a positive integer m, decide if G has an independent set on m vertices.

I Proving NP-completeness requires proving (i) the problem is in NP; (ii) a known NP-complete problem can be reduced to it.

I Clearly, independent set is in NP: a certificate is the set of m vertices forming the independent set, which can be checked in polynomial time.

I We will reduce 3-SAT to Independent Set. (This proof and example are from [3].)

Example NPC proof: Independent set III

I 3-SAT is a restricted form of propositional satisfiability, in which formulas are in CNF form, with each disjunction containing at most three literals, for example:

(x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3) ∧ (x1 ∨ x2 ∨ x3)

(We write x̄i as a shortform for ¬xi.)
I 3-SAT is known to be an NP-complete problem: every SAT problem can be rewritten into the above form.

I We demonstrate a translation from a 3-SAT problem to an independent set problem, such that an independent set of m vertices exists if and only if the 3-SAT formula is satisfiable.

I For a 3-SAT formula of m clauses (disjunctions), construct a graph where:

  I There is a vertex for each occurrence of a literal in the formula;

ECE750-TXB Example NPC proof: Independent set IV Lecture 22: NP-Complete problems

  I Within each disjunction, there is an edge between each pair of vertices corresponding to literals in that disjunction;
  I There is an edge between every occurrence of a literal and its negation, i.e., we connect each x1 to each x̄1, etc.

I This graph can be constructed in polynomial time.

[Figure: the graph constructed for the four-clause example formula, with a triangle of literal-occurrence vertices for each clause and edges joining complementary literal occurrences.]

I We then ask: Is there an independent set of size m?

I In the above graph, the green vertices show an independent set of size 4.

Example NPC proof: Independent set V

I Because each clause (disjunction) corresponds to a clique (e.g., a triangle) in the graph, any independent set on m vertices must have at most one vertex from each clique.

  I Choose a truth assignment so that the literal corresponding to each vertex of the independent set is ⊤ (true). E.g., if x̄2 is in the independent set, we choose x2 = ⊥ so that x̄2 = ⊤.

  I Since there is an edge between all literals and their negations, we cannot simultaneously choose xi = ⊤ and xi = ⊥; so a consistent truth assignment exists.
  I We will have one vertex from each clique, i.e., one literal from each clause made true. Therefore each disjunction will evaluate to ⊤, and their conjunction will be ⊤ also.
  I E.g., in the above example we choose x1 = ⊤, x2 = ⊥, and x3 can be either ⊤ or ⊥.

Example NPC proof: Independent set VI

I The graph has an independent set of size m if and only if the 3-SAT formula is satisfiable. Since the reduction can be done in polynomial time, this is a Karp reduction.

I This reduction demonstrates that if we could solve Independent Set in polynomial time, we could solve 3-SAT in polynomial time, which in turn implies we could solve any problem in NP in polynomial time.

I Hence Independent Set is NP-Complete.
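As a concrete illustration of the reduction, here is a minimal sketch (our own, not from [3]; the types Clause and Graph and the function name are invented) that builds the Independent Set instance from a 3-CNF formula given as a list of clauses of signed variable indices:

    #include <vector>
    #include <utility>

    // A clause is a list of at most 3 literals; literal +i means x_i, -i means its negation.
    typedef std::vector<int> Clause;

    struct Graph {
        int n;                                    // number of vertices
        std::vector<std::pair<int,int>> edges;
    };

    // Build the Independent Set instance for a 3-CNF formula.
    // One vertex per literal occurrence; a clique among the occurrences in each
    // clause, and edges between complementary occurrences.  The formula is
    // satisfiable iff the graph has an independent set of size m = clauses.size().
    Graph reduce_3sat_to_independent_set(const std::vector<Clause>& clauses) {
        Graph g;
        std::vector<int> lit;                     // literal stored at each vertex
        for (const Clause& c : clauses) {
            int first = lit.size();
            for (int l : c) lit.push_back(l);
            for (int i = first; i < (int)lit.size(); ++i)     // clique for this clause
                for (int j = i + 1; j < (int)lit.size(); ++j)
                    g.edges.push_back(std::make_pair(i, j));
        }
        for (int i = 0; i < (int)lit.size(); ++i)             // complementary literals
            for (int j = i + 1; j < (int)lit.size(); ++j)
                if (lit[i] == -lit[j])
                    g.edges.push_back(std::make_pair(i, j));
        g.n = lit.size();
        return g;
    }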

Bibliography

[1] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. Computer Languages, 6(1):47–57, 1981.

[2] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[3] H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.

[4]

[5] Alan T. Sherman. On superpolylogarithmic subexponential functions (Part I). SIGACT News, 22(1):65, 1991.

ECE750-TXB Lecture 23: Approximation Algorithms for NP-Complete problems

Todd L. Veldhuizen
[email protected]

Electrical & Computer Engineering
University of Waterloo
Canada

March 29, 2007

Approximation Algorithms I

I Recall that:
  I P = decision problems, decidable in polynomial time.
  I NP = decision problems, YES certificates checkable in polynomial time.
  I NPC = the 'hardest' problems in NP; if any of these can be solved in polynomial time then P = NP.
I It is believed that P ≠ NP: NPC problems are thought to require superpolynomial time.
I However, for many useful optimization problems we can obtain approximate solutions in polynomial time.
I In an optimization problem, we are trying to minimize or maximize some objective function; the corresponding NP decision version is a question "Is there a solution with cost ≤ k?" or "Is there a solution with profit ≥ k?"

Approximation Algorithms II

I The class NPO consists of optimization problems whose decision version is in NP. Technically, an NPO problem consists of:
  1. A set of valid problem instances D ⊆ Σ*, recognizable in polynomial time. (Here we take Σ* to be binary strings, for simplicity.)
  2. A set of feasible solutions S ⊆ Σ*, and a polynomial time algorithm that decides S given a problem instance.
  3. An objective function J(I, s) that maps an instance I ∈ D and a solution s ∈ S to a cost or profit in Q⁺.
  4. An indication of whether we aim to maximize or minimize J.

Approximation Algorithms III

I For example, the problem Maximum Independent Set asks for the largest set of vertices in a graph such that no two vertices have an edge between them. This is an NPO problem: the problem instances are graphs (suitably encoded as binary strings), feasible solutions are independent sets, and the objective function measures the size of an independent set.
I Note that a polynomial time algorithm for an NPO optimization problem implies a polynomial time algorithm for the decision version, e.g., to answer the question "Does this graph have an independent set of ≥ 4 vertices?" we can find the largest independent set and see how large it is.

I If the decision version of an optimization problem is NP-complete, then a polynomial time algorithm for the optimization problem would imply P = NP.

Approximation Algorithms IV

I Between P and NP-completeness there are several grades of optimization problems where we can obtain approximate solutions in polynomial time. Let OPT represent the optimal value of an objective function, and suppose for simplicity we seek to minimize the objective function.
  1. The class APX (approximable) contains NPO problems where it is possible to obtain ≤ δ · OPT in polynomial time, for some fixed constant δ.
  2. The class PTAS (Polynomial Time Approximation Scheme) contains NPO problems where for any fixed constant ε > 0, we can obtain ≤ (1 + ε) · OPT in polynomial time. (However, the time required might be n^f(ε) where f(ε) → ∞ as ε → 0. The algorithm is polynomial only if ε is fixed.)

  3. The class FPTAS (Fully Polynomial Time Approximation Scheme) contains NPO problems where for any fixed constant ε > 0, we can obtain ≤ (1 + ε) · OPT in time O(n^c ε^−d) where c, d are constants.
  4. For maximization problems, replace ≤ (1 + ε) · OPT with ≥ (1 − ε) · OPT.
I The containment relations between these classes are illustrated by this diagram:

[Diagram: nested classes P ⊆ FPTAS ⊆ PTAS ⊆ APX ⊆ NPO.]


Example: Vertex cover I

I Suppose that to improve the performance of a computer network, you want to collect statistics on packets being transmitted. Draw a graph where the vertices are computers/routers, and edges are communication links:

[Figure: a graph of computers/routers joined by communication links.]

  Collecting statistics can slow down the network, and requires installing custom software, etc. So, we want to monitor the traffic on as few nodes as possible. We choose as small a set of nodes as possible on which to install the monitoring software, so that each communication link has monitoring software on at least one end, e.g.:

[Figure: the same network with the monitored nodes shown in green.]

  If we install the monitoring software on the green nodes, every communication link has at least one end green.
I This is a vertex cover.
I Minimum Vertex Cover: Given a graph G = (V, E), find a set V′ ⊆ V of minimal size such that every edge has at least one vertex in V′.
I This problem is in NPO:
  1. Instances: graphs.
  2. Feasible solutions: subsets of vertices V′ such that every edge has at least one end in V′.
  3. Objective function: |V′|.
  4. It is a minimization problem.
I The decision version (Does G have a vertex cover of ≤ m vertices?) is in NP: a vertex cover V′ with |V′| ≤ m is a YES certificate.
I Minimum Vertex Cover is known to be NP-complete (in its decision version).

Introduction I The approximation we’ll see uses a maximal matching. Minimal Vertex I A matching of a graph G = (V , E) is a subset of edges Cover (APX) 0 E ⊆ E such that no vertex is the endpoint of more Steiner networks (APX) than one edge. Knapsack (FPTAS) I A matching is maximal if it cannot be enlarged. Bibliography

A graph (left) and a maximal matching (right). Picking any more edges would result in a vertex being the endpoint of two edges.

ECE750-TXB An approximation algorithm for vertex cover II Lecture 23: Approximation Algorithms for I The vertices that are endpoints of the edges in the NP-Complete matching form a vertex cover: problems Todd L. Veldhuizen [email protected]

Introduction

Minimal Vertex Cover (APX)

Steiner networks (APX)

The endpoints of the edges in the matching (green) form a vertex Knapsack cover of the graph. (FPTAS) Bibliography I Why? Suppose the endpoints of the maximal matching didn’t form a vertex cover. From the definition of a vertex cover, this

would imply there was an edge (v1, v2) such that neither v1 nor v2 are in the vertex cover. This would imply there was an edge

(v1, v2) such that neither v1 nor v2 are endpoints of an edge in the

maximal matching. So, (v1, v2) could be added to the matching. But, this contradicts the premiss that the matching is maximal. ECE750-TXB An approximation algorithm for vertex cover III Lecture 23: Approximation Algorithms for NP-Complete problems

Todd L. I This gives us an approximation algorithm for Minimum Veldhuizen Vertex Cover: [email protected] 0 1. Find a maximal matching E ⊆ E. (This can be done Introduction

by simply considering the edges in arbitrary order, and Minimal Vertex adding them to E 0 if they do not have an endpoint in Cover (APX) common with an edge already in E 0.) Steiner networks (APX) 2. Output the list of vertices that are endpoints of E 0. Knapsack (FPTAS) Theorem Bibliography This approximation algorithm yields a vertex cover of size ≤ 2 · OPT. Proof.

ECE750-TXB An approximation algorithm for vertex cover IV Lecture 23: Approximation Algorithms for NP-Complete Let E 0 ⊆ E be a maximal matching. We prove that problems |E 0| ≤ OPT. Since the vertex cover output is of size 2|E 0|, Todd L. Veldhuizen this establishes that the size of the vertex cover is ≤ 2 · OPT. [email protected] 0 ∗ Suppose to the contrary that |E | > OPT. Let V ⊆ V be a Introduction ∗ vertex cover with |V | = OPT. By pigeonhole there must be Minimal Vertex 0 ∗ Cover (APX) an edge (v1, v2) ∈ E such that neither v1 nor v2 are in V ; ∗ Steiner networks but this contradicts the premiss that V is a vertex (APX)

cover. Knapsack (FPTAS)

Bibliography

Illustration of a contradiction: if the maximal matching contains 4 edges, it is impossible that there could be a vertex cover of only 3 vertices (green), since this would leave an edge uncovered. ECE750-TXB An approximation algorithm for vertex cover V Lecture 23: Approximation Algorithms for NP-Complete problems
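Here is a minimal sketch (ours, not from the lecture) of the maximal-matching approximation: edges are scanned in arbitrary order, and the endpoints of the chosen matching edges form the cover.

    #include <vector>
    #include <utility>

    // 2-approximation for Minimum Vertex Cover via a maximal matching.
    // An edge joins the matching if neither endpoint is already matched;
    // the matched vertices form a vertex cover of size at most 2*OPT.
    std::vector<int> vertex_cover_2approx(int n,
            const std::vector<std::pair<int,int>>& edges) {
        std::vector<bool> matched(n, false);
        std::vector<int> cover;
        for (const auto& e : edges) {
            if (!matched[e.first] && !matched[e.second]) {
                matched[e.first] = matched[e.second] = true;
                cover.push_back(e.first);
                cover.push_back(e.second);
            }
        }
        return cover;
    }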

I The maximal matching approximation finds a vertex cover that is at most twice optimal. This establishes that Minimum Vertex Cover is in the class APX.

I Whether it is possible to do better than 2 · OPT is an open problem.
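The matching-based procedure described above takes only a few lines to implement. Below is a minimal Python sketch under the conventions used here (the graph is given as a list of edges); the function name is illustrative and not part of the lecture notes.

    def vertex_cover_2approx(edges):
        """2-approximate Minimum Vertex Cover via a maximal matching."""
        cover = set()  # endpoints of the matching built so far
        for u, v in edges:
            # Add (u, v) to the matching only if neither endpoint is already matched.
            if u not in cover and v not in cover:
                cover.add(u)
                cover.add(v)
        return cover

    # Example: the path a-b-c-d. The matching {(a,b), (c,d)} yields the cover
    # {a, b, c, d}; the optimal cover {b, c} has size 2, so we are within 2 * OPT.
    print(vertex_cover_2approx([("a", "b"), ("b", "c"), ("c", "d")]))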

Steiner trees I

I A network design problem is one of the following form: given a set of points/vertices, find a minimal network spanning them. For example, we might consider how to build roads between four cities so that the total road length is minimized.

Steiner trees II

I A more general setting of the problem is the Minimum Steiner Tree problem on graphs:

    Given a graph G = (V, E), a positive weight w(v1, v2) on each edge, and a subset V' ⊆ V of vertices to be connected, find a minimum cost subtree of G that includes all the vertices in V'.

I The cost is measured as the sum of the weights of the edges in the tree.
I The tree may contain vertices not in V'; these are called Steiner nodes.

Steiner trees III

I Example: a small weighted graph (figure omitted); the black vertices are the set V' to be connected.

I The decision version of this problem is known to be NP-complete.

I However, there is a fast approximation algorithm (a code sketch of its four steps appears after step 4 below):
  1. For each pair v1, v2 ∈ V', compute the shortest path between them, and call this d(v1, v2).

Steiner trees IV

  2. Construct a complete graph G' on V', having an edge for every pair v1, v2 ∈ V', with weight d(v1, v2).
  3. Find a minimum spanning tree on the graph G'.

Steiner trees V

  4. Construct a tree on the original graph G by taking the union of the paths represented by the edges of the minimum spanning tree on the graph G'. When necessary, prune edges to remove cycles.
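Putting steps 1-4 together, the following is a minimal Python sketch assuming the networkx library for shortest paths and minimum spanning trees; the function name, the "weight" attribute convention, and the small example graph are illustrative rather than taken from the lecture.

    import itertools
    import networkx as nx

    def steiner_tree_2approx(G, terminals, weight="weight"):
        """Approximate Minimum Steiner Tree via the metric closure (steps 1-4)."""
        # Steps 1-2: complete graph on the terminals, weighted by shortest-path distance in G.
        closure = nx.Graph()
        paths = {}
        for u, v in itertools.combinations(terminals, 2):
            paths[(u, v)] = nx.shortest_path(G, u, v, weight=weight)
            closure.add_edge(u, v, weight=nx.shortest_path_length(G, u, v, weight=weight))
        # Step 3: minimum spanning tree of the metric closure.
        mst = nx.minimum_spanning_tree(closure, weight=weight)
        # Step 4: union of the corresponding shortest paths in G, then prune cycles
        # by taking a spanning tree of the union.
        union = nx.Graph()
        for u, v in mst.edges():
            path = paths.get((u, v)) or paths[(v, u)]
            union.add_edges_from(zip(path, path[1:]))
        for a, b in union.edges():
            union[a][b][weight] = G[a][b][weight]
        return nx.minimum_spanning_tree(union, weight=weight)

    # Example: a 4-cycle with unit weights; connect terminals a and c.
    G = nx.Graph()
    G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1)])
    print(list(steiner_tree_2approx(G, ["a", "c"]).edges()))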

Theorem
This approximation algorithm produces a network of cost ≤ 2 · OPT.

The proof involves Euler tours and Hamiltonian cycles; see e.g. [1].

Steiner trees VI

I This approximation algorithm establishes that Minimum Steiner Tree is in APX. However, the problem is known to be APX-complete, i.e., it is among the 'hard problems' of APX; this implies that Minimum Steiner Tree is not in PTAS unless P = NP.

I There is an approximation algorithm for this problem that achieves (1 + (ln 3)/2) · OPT ≈ 1.55 · OPT.

Knapsack I

I So far we have seen two examples of problems in APX. Now let's see one in FPTAS, where we can get to within (1 + ε) · OPT in O(n^c ε^{-d}) time, i.e., time polynomial in n and 1/ε.
I In a knapsack problem, one has
  1. A set S = {a1, ..., an} of objects;
  2. A profit function profit : S → Z+;
  3. A size function size : S → Z+;
  4. A knapsack capacity B ∈ Z+.
I A feasible solution is a subset S' ⊆ S of objects with Σ_{s∈S'} size(s) ≤ B.
I The objective is to maximize the profit Σ_{s∈S'} profit(s).

Knapsack II

I Example: here is a knapsack problem with 3 objects {a1, a2, a3}.

    Object   size(ai)   profit(ai)
    a1          3           2
    a2          2           1
    a3          1           3

  If the capacity of the knapsack is B = 5, the best solution is to pick objects a1 and a3, which achieves a profit of 5 and a size of 4.

I Knapsack has numerous real, practical applications:

I Shipping (deciding how to pack shipping containers to maximize profit);

I Multirate networks (given a total available bandwidth and users who bid for various bandwidths at various prices, what subset of users should be chosen to maximize profit?)

Knapsack III

I Capital budgeting (given a fixed amount of money available for capital purchases, a set of objects that could be bought, and an estimate of how much their purchase would benefit the company, what subset of objects should be purchased?)
I Web caching (given a fixed amount of memory/disk space on a web cache, web pages of various sizes, and estimates of how much latency/bandwidth could be saved by caching those pages, what subset of pages should be cached?)

I The naive approach of greedily choosing objects with the best profit/size ratio can be made to perform arbitrarily badly. (Consider for example objects a, b with sizes 1, 100, profits 10, 900, and capacity 100: greedy takes a first and then cannot fit b, earning 10 rather than 900.)

Knapsack IV

I Knapsack (in its decision version) is an NP-complete problem. However, it has an FPTAS: we can obtain a solution that is ≥ (1 − ε) · OPT in time O(n³ ε⁻¹): with each doubling of the amount of time we are willing to spend, we can halve the distance between the approximate answer and the optimal answer.
I The approximation algorithm we will see is based on dynamic programming. Let P = max_{s∈S} profit(s) be the maximum profit of any item. We can find an exact solution to the knapsack problem in time O(n²P):
  1. Clearly the optimal solution has to have profit ≤ n · P: the number of objects times the maximum profit of any object.
  2. We will consider the objects in order, and figure out how much profit we can make with some subset of the first i objects.
  3. Let S(i, p) be a subset of {a1, ..., ai} with profit p and minimum size, or ∅ if no such set exists.

Knapsack V

  4. Let A(i, p) be the size of S(i, p), or +∞ if it doesn't exist:

         A(i, p) = +∞                          if S(i, p) = ∅
         A(i, p) = Σ_{s∈S(i,p)} size(s)        otherwise

  5. Given the solutions to A(i, p) for the first i objects, we can obtain the solutions for A(i + 1, p) by considering either taking or not taking object a_{i+1}, for each value of p:
     I If profit(a_{i+1}) ≤ p, then
           A(i + 1, p) = min( A(i, p), size(a_{i+1}) + A(i, p − profit(a_{i+1})) )
     I Otherwise, A(i + 1, p) = A(i, p).
  6. Once we have solved all the A(n, p), we choose the maximum p such that A(n, p) ≤ B. This is the optimal answer. (To recover the objects in the optimal set, we need to keep track of them as we build the table; but this is not difficult. A code sketch of this dynamic program appears after the example table below.)

Knapsack VI

I Example: Here is the table of A(i, p) for the example shown earlier; the maximum profit of any one object is P = 3, so we needn't look farther than a profit of n · P = 9.

    i \ p    0    1    2    3    4    5    6    7    8    9
      1      0    ∞    3    ∞    ∞    ∞    ∞    ∞    ∞    ∞
      2      0    2    3    5    ∞    ∞    ∞    ∞    ∞    ∞
      3      0    2    3    1    3    4    6    ∞    ∞    ∞

  (Entries marked ∞ mean no subset of the first i objects achieves profit exactly p.) Reading off the last row, the largest p with A(3, p) ≤ B = 5 is p = 5, matching the optimal solution found earlier.
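The table above can be computed mechanically. The following is a minimal Python sketch of the profit-indexed dynamic program, keeping the current row of A(i, p) in a single array and remembering one witness subset per profit value; the function and variable names are illustrative.

    import math

    def knapsack_exact(sizes, profits, B):
        """Exact 0/1 knapsack by the profit-indexed dynamic program above.

        A[p] is the minimum total size of a subset with profit exactly p
        (math.inf if none exists); choice[p] records one such subset.
        Runs in O(n^2 * P) time, where P = max(profits).
        """
        n, P = len(sizes), max(profits)
        A = [math.inf] * (n * P + 1)
        choice = [[] for _ in range(n * P + 1)]
        A[0] = 0  # the empty set: profit 0, size 0
        for i, (size, profit) in enumerate(zip(sizes, profits)):
            # Traverse profits downward so each object is used at most once.
            for p in range(n * P, profit - 1, -1):
                if size + A[p - profit] < A[p]:
                    A[p] = size + A[p - profit]
                    choice[p] = choice[p - profit] + [i]
        best = max(p for p in range(n * P + 1) if A[p] <= B)
        return best, choice[best]

    # The example from the table: sizes (3, 2, 1), profits (2, 1, 3), B = 5.
    print(knapsack_exact([3, 2, 1], [2, 1, 3], 5))   # -> (5, [0, 2]), i.e. a1 and a3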

I The running time is determined by the size of this table, which is controlled by n and P. If we could make either n or P smaller, the algorithm would run faster.

I We choose to make P smaller by scaling and rounding the profit values so the table A(i, p) becomes smaller; this yields an approximate answer.

I Approximation algorithm:

Knapsack VII

  1. Let K = εP/n, where P is the maximum profit of any item. We will be effectively rounding profits down to the nearest multiple of K; as ε → 0, the rounding has less and less effect.
  2. For each object ai, let profit'(ai) = ⌊profit(ai)/K⌋.
  3. Use dynamic programming to solve this new problem and find the most profitable set S'.
  4. Output S'.
  (A code sketch of this rounding wrapper appears after the proof below.)

Theorem
This approximation algorithm yields a profit of ≥ (1 − ε) · OPT.

Proof.
Let O be a set of objects achieving the optimal profit. Write OPT = profit(O) for the sum of the profits of the objects in O. Since for each object our maximum rounding error of the profit is ≤ K, we have

    profit(O) − K · profit'(O) ≤ nK                    (1)

where profit(O) is OPT and K · profit'(O) is the profit after rounding.

Knapsack VIII

Rearranging,

    K · profit'(O) ≥ profit(O) − nK                    (2)

Since the dynamic programming algorithm yields an optimal solution for the rounded problem, the answer we get (S') must be at least as good as O under the rounded profits:

    profit(S') ≥ K · profit'(O)                        (3)
               ≥ profit(O) − nK      from Eqn. (2)     (4)
               = OPT − εP            using K = εP/n    (5)

Since OPT ≥ P (assuming no objects are bigger than the knapsack!),

    profit(S') ≥ (1 − ε) OPT
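As a companion to the proof, here is a minimal Python sketch of the rounding wrapper (steps 1-4 above), reusing the knapsack_exact sketch given after the table; the function name and the eps parameter are illustrative.

    def knapsack_fptas(sizes, profits, B, eps):
        """Feasible set with profit >= (1 - eps) * OPT, via profit rounding."""
        n, P = len(profits), max(profits)
        K = eps * P / n
        # Steps 1-2: round profits down to multiples of K (floor division).
        rounded = [int(p // K) for p in profits]
        # Step 3: solve the rounded instance exactly with the dynamic program.
        _, chosen = knapsack_exact(sizes, rounded, B)
        # Step 4: report the chosen set and its true (unrounded) profit.
        return chosen, sum(profits[i] for i in chosen)

    # With eps = 0.5 the rounded instance is small, yet the answer is near-optimal.
    print(knapsack_fptas([3, 2, 1], [2, 1, 3], 5, eps=0.5))   # -> ([0, 2], 5)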

Knapsack IX

I This establishes that Knapsack is in FPTAS: there is an approximation algorithm yielding ≥ (1 − ε) · OPT in time O(n³ ε⁻¹).

Where to go from here?

I Vazirani's text Approximation Algorithms [1] is an excellent introduction to this area.
I There is another text, by Ausiello et al., Complexity and Approximation, which I have not yet had the opportunity to read, that includes a reference guide to known results on approximation. The reference guide is available online and is well worth browsing:
  I A Compendium of NP-Hard Optimization Problems
  I http://www.nada.kth.se/~viggo/wwwcompendium/
  The compendium is organized hierarchically by problem class (graph theory, network design, etc.) and for each problem lists the best known approximation algorithms. It includes almost 500 references to the literature.

Bibliography I

[1] Vijay V. Vazirani.
    Approximation Algorithms.
    Springer-Verlag, 2001.