DIVIDE AND CONQUER II

‣ master theorem ‣ integer multiplication ‣ matrix multiplication ‣ convolution and FFT

Lecture slides by Kevin Wayne Copyright © 2005 Pearson-Addison Wesley http://www.cs.princeton.edu/~wayne/kleinberg-tardos

Last updated on 2/6/21 8:29 PM

SECTIONS 4.4–4.6

Divide-and-conquer recurrences

Goal. Recipe for solving common divide-and-conquer recurrences:

T(n) = a T(n / b) + f(n)

with T(0) = 0 and T(1) = Θ(1).

Terms.
・a ≥ 1 is the number of subproblems.
・b ≥ 2 is the factor by which the subproblem size decreases.
・f(n) ≥ 0 is the work to divide and combine subproblems.

Recursion tree. [ assuming n is a power of b ]
・a = branching factor.
・a^i = number of subproblems at level i.
・1 + log_b n levels.
・n / b^i = size of subproblem at level i.

Divide-and-conquer recurrences: recursion tree

Suppose T(n) satisfies T(n) = a T(n / b) + n^c with T(1) = 1, for n a power of b.

[ Recursion tree with 1 + log_b n levels; work per level: ]

level 0          1 subproblem of size n                   n^c
level 1          a subproblems of size n / b              a (n / b)^c
level 2          a^2 subproblems of size n / b^2          a^2 (n / b^2)^c
⋮
level i          a^i subproblems of size n / b^i          a^i (n / b^i)^c
⋮
level log_b n    a^(log_b n) = n^(log_b a) leaves T(1)    n^(log_b a)

With r = a / b^c, level i costs n^c r^i, so

T(n) = n^c Σ_{i=0}^{log_b n} r^i

Divide-and-conquer recurrences: recursion tree analysis

Suppose T(n) satisfies T(n) = a T(n / b) + n^c with T(1) = 1, for n a power of b.

Let r = a / b^c. Note that r < 1 iff c > log_b a.

T(n) = n^c Σ_{i=0}^{log_b n} r^i =
   Θ(n^c)            if r < 1   (c > log_b a)    [ cost dominated by cost of root ]
   Θ(n^c log n)      if r = 1   (c = log_b a)    [ cost evenly distributed in tree ]
   Θ(n^(log_b a))    if r > 1   (c < log_b a)    [ cost dominated by cost of leaves ]

Geometric series.
・If 0 < r < 1, then 1 + r + r^2 + r^3 + … + r^k ≤ 1 / (1 − r).
・If r = 1, then 1 + r + r^2 + r^3 + … + r^k = k + 1.
・If r > 1, then 1 + r + r^2 + r^3 + … + r^k = (r^(k+1) − 1) / (r − 1).
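Sketch. The recursion-tree sum can be checked numerically against the recurrence itself. The Python below is a minimal illustration, assuming n is an exact power of b and c is a non-negative integer; the function names are hypothetical and not part of the slides.

```python
import math

def recursion_tree_cost(a, b, c, n):
    """Sum the per-level work a^i * (n / b^i)^c for T(n) = a T(n/b) + n^c, T(1) = 1.

    Assumes n is an exact power of b (as on the slide).
    """
    levels = round(math.log(n, b))            # log_b n
    assert b ** levels == n, "n must be a power of b"
    # Level i has a^i subproblems of size n / b^i; the last level is the leaves.
    return sum(a**i * (n // b**i) ** c for i in range(levels + 1))

def recurrence(a, b, c, n):
    """Evaluate T(n) = a T(n/b) + n^c with T(1) = 1 directly, for comparison."""
    return 1 if n == 1 else a * recurrence(a, b, c, n // b) + n**c

# r = a / b^c = 1: every level costs n, so the total is n (1 + log_2 n).
print(recursion_tree_cost(2, 2, 1, 8), recurrence(2, 2, 1, 8))
```

Both functions should agree for any valid input, since the level-by-level sum is just the recurrence unrolled.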

Divide-and-conquer recurrences: master theorem

Master theorem. Let a ≥ 1, b ≥ 2, and c ≥ 0 and suppose that T(n) is a function on the non-negative integers that satisfies the recurrence

T(n) = a T(n / b) + Θ(n^c)

with T(0) = 0 and T(1) = Θ(1), where n / b means either ⎣n / b⎦ or ⎡n / b⎤. Then,

Case 1. If c > log_b a, then T(n) = Θ(n^c).
Case 2. If c = log_b a, then T(n) = Θ(n^c log n).
Case 3. If c < log_b a, then T(n) = Θ(n^(log_b a)).

Pf sketch.
・Prove when b is an integer and n is an exact power of b.
・Extend domain of recurrences to reals (or rationals).
・Deal with floors and ceilings. [ at most 2 extra levels in recursion tree ]

⎡⎡⎡n / b⎤ / b⎤ / b⎤ < n / b^3 + (1 / b^2 + 1 / b + 1) ≤ n / b^3 + 2


Extensions.
・Can replace Θ with O everywhere.
・Can replace Θ with Ω everywhere.
・Can replace initial conditions with T(n) = Θ(1) for all n ≤ n_0 and require recurrence to hold only for all n > n_0.


Ex. [Case 3] T(n) = 3 T(⎣n / 2⎦) + 5 n.
・a = 3, b = 2, c = 1 < log_b a = 1.5849….
・T(n) = Θ(n^(log_2 3)) = O(n^1.585).


Ex. [Case 2] T(n) = T(⎣n / 2⎦) + T(⎡n / 2⎤) + 17 n. [ ok to intermix floor and ceiling ]
・a = 2, b = 2, c = 1 = log_b a.
・T(n) = Θ(n log n).


Ex. [Case 1] T(n) = 48 T(⎣n / 4⎦) + n^3.
・a = 48, b = 4, c = 3 > log_b a = 2.7924….
・T(n) = Θ(n^3).
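Sketch. The case analysis is mechanical once a, b, and c are known, so it can be written as a small classifier. The Python below is an illustration only: it assumes the case numbering of the theorem statement (Case 1 when c > log_b a, Case 3 when c < log_b a), and the function name master_theorem is hypothetical.

```python
import math

def master_theorem(a, b, c):
    """Classify T(n) = a T(n/b) + Θ(n^c); return (case number, bound as text)."""
    if a < 1 or b < 2 or c < 0:
        raise ValueError("need a >= 1, b >= 2, c >= 0")
    crit = math.log(a, b)                  # critical exponent log_b a
    if math.isclose(c, crit):
        return 2, f"Θ(n^{c:g} log n)"      # work evenly distributed over levels
    if c > crit:
        return 1, f"Θ(n^{c:g})"            # root dominates
    return 3, f"Θ(n^{crit:.4f})"           # leaves dominate

for a, b, c in [(3, 2, 1), (2, 2, 1), (48, 4, 3)]:
    print((a, b, c), master_theorem(a, b, c))
```

The comparison with math.isclose is a floating-point convenience; an exact test would compare a with b^c when all three are integers.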

Master theorem need not apply

Gaps in master theorem.

・Number of subproblems is not a constant.

T(n) = n T(n / 2) + n^2

・Number of subproblems is less than 1.

T(n) = (1/2) T(n / 2) + n^2

・Work to divide and combine subproblems is not Θ(n^c).

T(n) = 2 T(n / 2) + n log n

Divide-and-conquer II: quiz 1

Consider the following recurrence. Which case of the master theorem?

T(n) = Θ(1)                    if n = 1
T(n) = 3 T(n / 2) + Θ(n)       if n > 1

A. Case 1: T(n) = Θ(n).

B. Case 2: T(n) = Θ(n log n).

C. Case 3: T(n) = Θ(n^(log_2 3)) = O(n^1.585).

D. Master theorem not applicable.

Divide-and-conquer II: quiz 2

Consider the following recurrence. Which case of the master theorem?

T(n) = 0                                             if n ≤ 1
T(n) = T(⎣n / 5⎦) + T(n − 3⎣n / 10⎦) + (11/5) n      if n > 1

A. Case 1: T(n) = Θ(n).

B. Case 2: T(n) = Θ(n log n).

C. Case 3: T(n) = Θ(n).

D. Master theorem not applicable.

Akra–Bazzi theorem

Theorem. [Akra–Bazzi 1998] Given constants a_i > 0 and 0 < b_i < 1, functions |h_i(n)| = O(n / log^2 n) and g(n) = O(n^c). If T(n) satisfies the recurrence:

T(n) = Σ_{i=1}^{k} a_i T(b_i n + h_i(n)) + g(n)

[ a_i subproblems of size b_i n; h_i(n) is a small perturbation to handle floors and ceilings ]

then T(n) = Θ( n^p ( 1 + ∫_1^n g(u) / u^(p+1) du ) ), where p satisfies Σ_{i=1}^{k} a_i b_i^p = 1.

Ex. T(n) = T(⎣n / 5⎦) + T(n – 3⎣n / 10⎦) + 11/5 n, with T(0) = 0 and T(1) = 0.
・a_1 = 1, b_1 = 1/5, a_2 = 1, b_2 = 7/10 ⇒ p = 0.83978… < 1.
・h_1(n) = ⎣n / 5⎦ – n / 5, h_2(n) = 3/10 n – 3⎣n / 10⎦.
・g(n) = 11/5 n ⇒ T(n) = Θ(n).
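Sketch. The exponent p in the example has no closed form, but the left-hand side Σ a_i b_i^p is strictly decreasing in p, so bisection finds it. The Python below is a minimal illustration; the function name and the search bracket [−10, 10] are assumptions, not part of the theorem.

```python
def akra_bazzi_exponent(terms, lo=-10.0, hi=10.0, tol=1e-12):
    """Find p with sum(a_i * b_i**p) == 1 by bisection.

    terms is a list of (a_i, b_i) pairs with a_i > 0 and 0 < b_i < 1, so the
    sum is strictly decreasing in p. The bracket [lo, hi] is an assumption
    that is wide enough for small examples.
    """
    def f(p):
        return sum(a * b**p for a, b in terms) - 1.0

    assert f(lo) > 0 > f(hi), "widen the bracket"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:      # sum still too large: p must increase
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# a_1 = 1, b_1 = 1/5, a_2 = 1, b_2 = 7/10, as in the example.
p = akra_bazzi_exponent([(1, 1 / 5), (1, 7 / 10)])
print(p)
```

With g(u) = Θ(u) and p < 1, the integral ∫_1^n u / u^(p+1) du = (n^(1−p) − 1) / (1 − p) = Θ(n^(1−p)), so the bound is n^p ⋅ Θ(n^(1−p)) = Θ(n).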

SECTION 5.5

Integer addition and subtraction

Addition. Given two n-bit integers a and b, compute a + b.
Subtraction. Given two n-bit integers a and b, compute a – b.

Grade-school algorithm. Θ(n) bit operations. [ “bit complexity” (instead of word RAM) ]

  1 1 1 1 1 1 0 1         (carries)
    1 1 0 1 0 1 0 1
+   0 1 1 1 1 1 0 1
  -----------------
  1 0 1 0 1 0 0 1 0

Remark. Grade-school addition and subtraction are optimal.
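Sketch. The grade-school addition above can be written as a single right-to-left pass with a carry. The Python below is a minimal illustration on bit strings; the function name add_bits and the string representation are assumptions made for readability.

```python
def add_bits(x: str, y: str) -> str:
    """Add two equal-length binary strings in one pass: Θ(n) bit operations."""
    assert len(x) == len(y)
    carry, out = 0, []
    for bx, by in zip(reversed(x), reversed(y)):   # least significant bit first
        total = int(bx) + int(by) + carry
        out.append(str(total % 2))                 # sum bit
        carry = total // 2                         # carry into the next column
    out.append(str(carry))                         # final carry-out bit
    return "".join(reversed(out))

print(add_bits("11010101", "01111101"))
```

The result always has n + 1 bits, with a leading 0 when there is no carry out of the top column.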

Integer multiplication

Multiplication. Given two n-bit integers a and b, compute a × b.
Grade-school algorithm (long multiplication). Θ(n^2) bit operations.

                  1 1 0 1 0 1 0 1
              ×   0 1 1 1 1 1 0 1
                  ---------------
                  1 1 0 1 0 1 0 1
                0 0 0 0 0 0 0 0
              1 1 0 1 0 1 0 1
            1 1 0 1 0 1 0 1
          1 1 0 1 0 1 0 1
        1 1 0 1 0 1 0 1
      1 1 0 1 0 1 0 1
    0 0 0 0 0 0 0 0
  -------------------------------
  0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1

Conjecture. [Kolmogorov 1956] Grade-school algorithm is optimal.
Theorem. [Karatsuba 1960] Conjecture is false.

Divide-and-conquer multiplication

To multiply two n-bit integers x and y:
・Divide x and y into low- and high-order bits.
・Multiply four ½n-bit integers, recursively.
・Add and shift to obtain result.

m = ⎡n / 2⎤
a = ⎣x / 2^m⎦     b = x mod 2^m          [ use bit shifting to compute 4 terms ]
c = ⎣y / 2^m⎦     d = y mod 2^m

x y = (2^m a + b) (2^m c + d) = 2^(2m) ac + 2^m (bc + ad) + bd

[ four products: 1 = ac, 2 = bc, 3 = ad, 4 = bd ]

Ex. x = 1000 1101 (a = 1000, b = 1101), y = 1110 0001 (c = 1110, d = 0001).

Divide-and-conquer multiplication

MULTIPLY(x, y, n)
______

IF (n = 1)
   RETURN x × y.
ELSE
   m ← ⎡n / 2⎤.                                   Θ(n)
   a ← ⎣x / 2^m⎦; b ← x mod 2^m.
   c ← ⎣y / 2^m⎦; d ← y mod 2^m.
   e ← MULTIPLY(a, c, m).                          4 T(⎡n / 2⎤)
   f ← MULTIPLY(b, d, m).
   g ← MULTIPLY(b, c, m).
   h ← MULTIPLY(a, d, m).
   RETURN 2^(2m) e + 2^m (g + h) + f.              Θ(n)
______
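Sketch. The MULTIPLY pseudocode translates almost line for line into Python, using shifts and masks for the divisions and remainders by 2^m. This is a minimal illustration, assuming x and y are non-negative and fit in n bits; the lowercase function name is hypothetical.

```python
def multiply(x: int, y: int, n: int) -> int:
    """Multiply two non-negative n-bit integers with four half-size products."""
    if n == 1:
        return x * y                       # 1-bit base case
    m = (n + 1) // 2                       # m = ceil(n / 2)
    mask = (1 << m) - 1
    a, b = x >> m, x & mask                # x = 2^m a + b
    c, d = y >> m, y & mask                # y = 2^m c + d
    e = multiply(a, c, m)
    f = multiply(b, d, m)
    g = multiply(b, c, m)
    h = multiply(a, d, m)
    # 2^(2m) e + 2^m (g + h) + f, with shifts in place of multiplications by 2^m
    return (e << (2 * m)) + ((g + h) << m) + f

print(multiply(0b10001101, 0b11100001, 8))   # the slide example: 141 × 225
```

Python integers are arbitrary precision, so this only illustrates the recursion structure; it does not count bit operations.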

Divide-and-conquer II: quiz 3

How many bit operations to multiply two n-bit integers using the divide-and-conquer algorithm?

T(n) = Θ(1)                        if n = 1
T(n) = 4 T(⎡n / 2⎤) + Θ(n)         if n > 1

A. T(n) = Θ(n^(1/2)).

B. T(n) = Θ(n log n).

C. T(n) = Θ(n^(log_2 3)) = O(n^1.585).

D. T(n) = Θ(n^2). [ Case 3 of master theorem ]

Karatsuba trick

To multiply two n-bit integers x and y:
・Divide x and y into low- and high-order bits.
・To compute middle term bc + ad, use identity:

bc + ad = ac + bd – (a – b) (c – d)

・Multiply only three ½n-bit integers, recursively.

m = ⎡n / 2⎤
a = ⎣x / 2^m⎦     b = x mod 2^m
c = ⎣y / 2^m⎦     d = y mod 2^m

x y = (2^m a + b) (2^m c + d)
    = 2^(2m) ac + 2^m (bc + ad) + bd                      [ bc + ad is the middle term ]
    = 2^(2m) ac + 2^m (ac + bd – (a – b) (c – d)) + bd

[ three products: 1 = ac, 2 = (a – b) (c – d), 3 = bd ]

Ex. x = 1000 1101 (a = 1000, b = 1101), y = 1110 0001 (c = 1110, d = 0001).

Karatsuba multiplication

KARATSUBA-MULTIPLY(x, y, n)
______

IF (n = 1)
   RETURN x × y.
ELSE
   m ← ⎡n / 2⎤.                                               Θ(n)
   a ← ⎣x / 2^m⎦; b ← x mod 2^m.
   c ← ⎣y / 2^m⎦; d ← y mod 2^m.
   e ← KARATSUBA-MULTIPLY(a, c, m).                            3 T(⎡n / 2⎤)
   f ← KARATSUBA-MULTIPLY(b, d, m).
   g ← KARATSUBA-MULTIPLY(|a – b|, |c – d|, m).
   Flip sign of g if needed.
   RETURN 2^(2m) e + 2^m (e + f – g) + f.                      Θ(n)
______
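Sketch. The same translation works for KARATSUBA-MULTIPLY; the only new detail is the sign of g, which is negative exactly when one of a – b and c – d is negative and the other is positive. The Python below is a minimal illustration, assuming non-negative n-bit inputs; the function name karatsuba is hypothetical.

```python
def karatsuba(x: int, y: int, n: int) -> int:
    """Multiply two non-negative n-bit integers with three half-size products."""
    if n == 1:
        return x * y
    m = (n + 1) // 2                       # m = ceil(n / 2)
    mask = (1 << m) - 1
    a, b = x >> m, x & mask                # x = 2^m a + b
    c, d = y >> m, y & mask                # y = 2^m c + d
    e = karatsuba(a, c, m)                 # ac
    f = karatsuba(b, d, m)                 # bd
    g = karatsuba(abs(a - b), abs(c - d), m)
    if (a < b) != (c < d):                 # (a - b)(c - d) is not positive: flip sign
        g = -g
    # middle term bc + ad = ac + bd - (a - b)(c - d) = e + f - g
    return (e << (2 * m)) + ((e + f - g) << m) + f

print(karatsuba(0b10001101, 0b11100001, 8))
```

When the flip condition holds but one difference is zero, g is 0 and negating it is harmless.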

Karatsuba analysis

Proposition. Karatsuba’s algorithm requires O(n^1.585) bit operations to multiply two n-bit integers.

Pf. Apply Case 3 of the master theorem to the recurrence:

T(n) = Θ(1)                    if n = 1
T(n) = 3 T(n / 2) + Θ(n)       if n > 1

⇒ T(n) = Θ(n^(log_2 3)) = O(n^1.585)

Practice.
・Use base 32 or 64 (instead of base 2).
・Faster than grade-school algorithm once operands exceed about 320–640 bits.

Integer arithmetic reductions

Integer multiplication. Given two n-bit integers, compute their product.

arithmetic problem        formula               bit complexity
integer multiplication    a × b                 M(n)
integer square            a^2                   Θ(M(n))
integer division          ⎣a / b⎦, a mod b      Θ(M(n))
integer square root       ⎣√a⎦                  Θ(M(n))

[ integer arithmetic problems with the same bit complexity M(n) as integer multiplication ]

ab = ((a + b)^2 − a^2 − b^2) / 2
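Sketch. One direction of the multiplication/squaring equivalence follows from the identity ab = ((a + b)^2 − a^2 − b^2) / 2: three squarings plus Θ(n) extra work give a product. The Python below is a minimal illustration; the function name and the pluggable square parameter are hypothetical.

```python
def multiply_via_squaring(a: int, b: int, square=lambda v: v * v) -> int:
    """Compute a * b with three calls to a squaring routine.

    (a + b)^2 - a^2 - b^2 equals 2ab, which is always even, so the
    integer division by 2 is exact.
    """
    return (square(a + b) - square(a) - square(b)) // 2

print(multiply_via_squaring(213, 125))
```

Passing a different square routine shows that any fast squaring algorithm yields a multiplication algorithm with the same asymptotic bit complexity.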

History of asymptotic complexity of integer multiplication

year    algorithm                 bit operations
12xx    grade school              O(n^2)
1962    Karatsuba–Ofman           O(n^1.585)
1963    Toom-3, Toom-4            O(n^1.465), O(n^1.404)
1966    Toom–Cook                 O(n^(1 + ε))
1971    Schönhage–Strassen        O(n log n ⋅ log log n)
2007    Fürer                     n log n ⋅ 2^O(log* n)
2019    Harvey–van der Hoeven     O(n log n)
?       ?                         O(n)

[ number of bit operations to multiply two n-bit integers ]

Remark. GNU Multiple Precision library uses one of first five algorithms depending on n. [ used in Maple, Mathematica, gcc, cryptography, ... ]

SECTION 4.2

Dot product

Dot product. Given two length-n vectors a and b, compute c = a ⋅ b.
Grade-school. Θ(n) arithmetic operations.

a ⋅ b = Σ_{i=1}^{n} a_i b_i

a = [ .70 .20 .10 ]
b = [ .30 .40 .30 ]
a ⋅ b = (.70 × .30) + (.20 × .40) + (.10 × .30) = .32

Remark. “Grade-school” dot product algorithm is asymptotically optimal.

Matrix multiplication

Matrix multiplication. Given two n-by-n matrices A and B, compute C = AB.
Grade-school. Θ(n^3) arithmetic operations.

c_ij = Σ_{k=1}^{n} a_ik b_kj

[ c_11 c_12 … c_1n ]   [ a_11 a_12 … a_1n ]   [ b_11 b_12 … b_1n ]
[ c_21 c_22 … c_2n ] = [ a_21 a_22 … a_2n ] × [ b_21 b_22 … b_2n ]
[  ⋮    ⋮   ⋱   ⋮  ]   [  ⋮    ⋮   ⋱   ⋮  ]   [  ⋮    ⋮   ⋱   ⋮  ]
[ c_n1 c_n2 … c_nn ]   [ a_n1 a_n2 … a_nn ]   [ b_n1 b_n2 … b_nn ]

[ .59 .32 .41 ]   [ .70 .20 .10 ]   [ .80 .30 .50 ]
[ .31 .36 .25 ] = [ .30 .60 .10 ] × [ .10 .40 .10 ]
[ .45 .31 .42 ]   [ .50 .10 .40 ]   [ .10 .30 .40 ]
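Sketch. Grade-school matrix multiplication is n^2 dot products, one per entry of C. The Python below is a minimal illustration using lists of rows; the function names dot and matmul are hypothetical.

```python
def dot(a, b):
    """Dot product of two length-n vectors: Θ(n) arithmetic operations."""
    return sum(x * y for x, y in zip(a, b))

def matmul(A, B):
    """Grade-school product of two n-by-n matrices given as lists of rows."""
    cols = list(zip(*B))                          # columns of B
    # c_ij is the dot product of row i of A with column j of B.
    return [[dot(row, col) for col in cols] for row in A]

A = [[0.70, 0.20, 0.10], [0.30, 0.60, 0.10], [0.50, 0.10, 0.40]]
B = [[0.80, 0.30, 0.50], [0.10, 0.40, 0.10], [0.10, 0.30, 0.40]]
for row in matmul(A, B):
    print([round(v, 2) for v in row])
```

The rounding in the loop only tidies floating-point output; the entries match the 3-by-3 example up to rounding error.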

Q. Is “grade-school” matrix multiplication algorithm asymptotically optimal?

Block matrix multiplication

[  152  158  164  170 ]   [  0  1  2  3 ]   [ 16 17 18 19 ]
[  504  526  548  570 ] = [  4  5  6  7 ] × [ 20 21 22 23 ]
[  856  894  932  970 ]   [  8  9 10 11 ]   [ 24 25 26 27 ]
[ 1208 1262 1316 1370 ]   [ 12 13 14 15 ]   [ 28 29 30 31 ]

[ 2-by-2 blocks: C11 is the top-left block of C; A11 and A12 are the top blocks of A; B11 and B21 are the left blocks of B ]

C11 = A11 × B11 + A12 × B21

    = [ 0 1 ] × [ 16 17 ] + [ 2 3 ] × [ 24 25 ] = [ 152 158 ]
      [ 4 5 ]   [ 20 21 ]   [ 6 7 ]   [ 28 29 ]   [ 504 526 ]

Block matrix multiplication: warmup

To multiply two n-by-n matrices A and B:
・Divide: partition A and B into ½n-by-½n blocks.
・Conquer: multiply 8 pairs of ½n-by-½n matrices, recursively.
・Combine: add appropriate products using 4 matrix additions.

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] × [ B21 B22 ]

C11 = (A11 × B11) + (A12 × B21)
C12 = (A11 × B12) + (A12 × B22)
C21 = (A21 × B11) + (A22 × B21)
C22 = (A21 × B12) + (A22 × B22)

[ C, A, B are n-by-n matrices; the blocks are ½n-by-½n matrices; 8 matrix multiplications and 4 matrix additions (of ½n-by-½n matrices) ]

Running time. Apply Case 3 of the master theorem.

T(n) = 8 T(n / 2) + Θ(n^2)  ⇒  T(n) = Θ(n^3)
[ 8 T(n / 2): recursive calls; Θ(n^2): add, form submatrices ]

Strassen’s trick

Key idea. Can multiply two 2-by-2 matrices via 7 scalar multiplications (plus 11 additions and 7 subtractions).

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] × [ B21 B22 ]        [ entries are scalars ]

P1 ← A11 × (B12 – B22)
P2 ← (A11 + A12) × B22
P3 ← (A21 + A22) × B11
P4 ← A22 × (B21 – B11)
P5 ← (A11 + A22) × (B11 + B22)
P6 ← (A12 – A22) × (B21 + B22)
P7 ← (A11 – A21) × (B11 + B12)

[ 7 scalar multiplications ]

C11 = P5 + P4 – P2 + P6
C12 = P1 + P2
C21 = P3 + P4
C22 = P1 + P5 – P3 – P7

Pf. C12 = P1 + P2
        = A11 × (B12 – B22) + (A11 + A12) × B22
        = A11 × B12 + A12 × B22. ✔

Strassen’s trick

Key idea. The same formulas for P1, …, P7 and C11, C12, C21, C22 apply when A and B are n-by-n matrices and A11, …, B22 are their ½n-by-½n blocks: can multiply two n-by-n matrices via 7 matrix multiplications (of ½n-by-½n matrices), plus 11 additions and 7 subtractions (of ½n-by-½n matrices).

Strassen’s algorithm

STRASSEN(n, A, B)          [ assume n is a power of 2 ]
______

IF (n = 1) RETURN A × B.
Partition A and B into ½n-by-½n blocks.
P1 ← STRASSEN(n / 2, A11, (B12 – B22)).
P2 ← STRASSEN(n / 2, (A11 + A12), B22).
P3 ← STRASSEN(n / 2, (A21 + A22), B11).
P4 ← STRASSEN(n / 2, A22, (B21 – B11)).                  7 T(n / 2) + Θ(n^2)
P5 ← STRASSEN(n / 2, (A11 + A22), (B11 + B22)).
P6 ← STRASSEN(n / 2, (A12 – A22), (B21 + B22)).
P7 ← STRASSEN(n / 2, (A11 – A21), (B11 + B12)).
C11 = P5 + P4 – P2 + P6.
C12 = P1 + P2.                                            Θ(n^2)
C21 = P3 + P4.
C22 = P1 + P5 – P3 – P7.
RETURN C.

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] × [ B21 B22 ]
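Sketch. The STRASSEN pseudocode can be run directly on lists of rows. The Python below is a minimal illustration, assuming n is a power of 2 and recursing all the way down to 1-by-1 blocks (a practical version would cross over to the classical algorithm for small n); the helper names add, sub, and split are hypothetical.

```python
def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M):
    """Partition an n-by-n matrix into blocks M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B):
    """Multiply n-by-n matrices (n a power of 2) with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    P1 = strassen(A11, sub(B12, B22))
    P2 = strassen(add(A11, A12), B22)
    P3 = strassen(add(A21, A22), B11)
    P4 = strassen(A22, sub(B21, B11))
    P5 = strassen(add(A11, A22), add(B11, B22))
    P6 = strassen(sub(A12, A22), add(B21, B22))
    P7 = strassen(sub(A11, A21), add(B11, B12))
    C11 = add(sub(add(P5, P4), P2), P6)
    C12 = add(P1, P2)
    C21 = add(P3, P4)
    C22 = sub(sub(add(P1, P5), P3), P7)
    # Reassemble the four blocks into one n-by-n matrix.
    top = [l + r for l, r in zip(C11, C12)]
    bottom = [l + r for l, r in zip(C21, C22)]
    return top + bottom

A = [[4 * i + j for j in range(4)] for i in range(4)]          # 0 .. 15
B = [[16 + 4 * i + j for j in range(4)] for i in range(4)]     # 16 .. 31
for row in strassen(A, B):
    print(row)
```

The inputs reuse the 4-by-4 block multiplication example, so the output can be compared with that slide.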

Analysis of Strassen’s algorithm

Theorem. Strassen’s algorithm requires O(n^2.81) arithmetic operations to multiply two n-by-n matrices.

[ Figure: first page of Volker Strassen, “Gaussian Elimination is not Optimal,” Numer. Math. 13, 354–356 (1969). ]

Pf.
・When n is a power of 2, apply Case 3 of the master theorem:

T(n) = 7 T(n / 2) + Θ(n^2)  ⇒  T(n) = Θ(n^(log_2 7)) = O(n^2.81)
[ 7 T(n / 2): recursive calls; Θ(n^2): add, subtract ]

・When n is not a power of 2, pad matrices with zeros to be n′-by-n′, where n ≤ n′ < 2n and n′ is a power of 2.

[ 1 2 3 0 ]   [ 10 11 12 0 ]   [  84  90  96 0 ]
[ 4 5 6 0 ] × [ 13 14 15 0 ] = [ 201 216 231 0 ]
[ 7 8 9 0 ]   [ 16 17 18 0 ]   [ 318 342 366 0 ]
[ 0 0 0 0 ]   [  0  0  0 0 ]   [   0   0   0 0 ]
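Sketch. Padding with zeros is a few lines of code, and the product of the padded matrices carries the original product in its top-left corner. The Python below is a minimal illustration; the function name pad_to_power_of_two is hypothetical, and the small matmul helper is the classical algorithm, included only to check the padded example.

```python
def pad_to_power_of_two(M):
    """Embed an n-by-n matrix in the top-left corner of an n'-by-n' zero matrix,
    where n' is the smallest power of 2 with n' >= n (so n <= n' < 2n)."""
    n = len(M)
    size = 1
    while size < n:
        size *= 2
    padded_rows = [row + [0] * (size - n) for row in M]
    zero_rows = [[0] * size for _ in range(size - n)]
    return padded_rows + zero_rows

def matmul(A, B):
    """Classical product, used here only to verify the example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = pad_to_power_of_two([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = pad_to_power_of_two([[10, 11, 12], [13, 14, 15], [16, 17, 18]])
for row in matmul(A, B):
    print(row)
```

Since n′ < 2n, padding increases the operation count by at most a constant factor, so the asymptotic bound is unchanged.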

Strassen’s algorithm: practice

Implementation issues.
・Sparsity.
・Caching.
・n not a power of 2.
・Numerical stability.
・Non-square matrices.
・Storage for intermediate submatrices.
・Crossover to classical algorithm when n is “small.”
・Parallelism for multi-core and many-core architectures.

Common misperception. “Strassen’s algorithm is only a theoretical curiosity.”
・Apple reports 8x speedup when n ≈ 2,048.
・Range of instances where it’s useful is a subject of controversy.

[ Figure: first page of Jianyu Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de Geijn, “Strassen’s Algorithm Reloaded,” SC16, Salt Lake City, Utah, USA, November 2016. ]

Divide-and-conquer II: quiz 4

Suppose that you could multiply two 3-by-3 matrices with 21 scalar multiplications. How fast could you multiply two n-by-n matrices?

A. Θ(n log321)

B. Θ(n log221)

C. Θ(n log921) Θ(nlog3 21) = O(n 2.77 ) D. Θ(n 2)

Divide-and-conquer II: quiz 5

Is it possible to multiply two 3-by-3 matrices using only 21 scalar multiplications?

A. Yes.

B. No.

C. Unknown.

Fast matrix multiplication: theory

Q. Multiply two 2-by-2 matrices with 7 scalar multiplications?

A. Yes! [Strassen 1969] $\Theta(n^{\log_2 7}) = O(n^{2.81})$

Q. Multiply two 2-by-2 matrices with 6 scalar multiplications?

A. Impossible. [Hopcroft–Kerr, Winograd 1971] $\Theta(n^{\log_2 6}) = O(n^{2.59})$

Begun, the decimal wars have. [Pan 1978, Bini et al., Schönhage, …]
・Two 70-by-70 matrices with 143,640 scalar multiplications. $O(n^{2.7962})$
・Two 48-by-48 matrices with 47,217 scalar multiplications. $O(n^{2.7801})$
・A year later. $O(n^{2.7799})$
・December 1979. $O(n^{2.521813})$
・January 1980. $O(n^{2.521801})$

History of arithmetic complexity of matrix multiplication

year   algorithm               arithmetic operations
1858   “grade school”          $O(n^{3})$
1969   Strassen                $O(n^{2.808})$
1978   Pan                     $O(n^{2.796})$
1979   Bini                    $O(n^{2.780})$
1981   Schönhage               $O(n^{2.522})$
1982   Romani                  $O(n^{2.517})$
1982   Coppersmith–Winograd    $O(n^{2.496})$
1986   Strassen                $O(n^{2.479})$
1989   Coppersmith–Winograd    $O(n^{2.3755})$
2010   Stothers                $O(n^{2.3737})$
2011   Williams                $O(n^{2.372873})$
2014   Le Gall                 $O(n^{2.372864})$
                               $O(n^{2 + \varepsilon})$

Slide annotation beside the table: "galactic algorithms".

Table: number of arithmetic operations to multiply two n-by-n matrices.

Numeric linear algebra reductions

Matrix multiplication. Given two n-by-n matrices, compute their product.

linear algebra problem        expression              arithmetic complexity
matrix multiplication         $A \times B$            MM(n)
matrix squaring               $A^2$                   Θ(MM(n))
matrix inversion              $A^{-1}$                Θ(MM(n))
determinant                   $|A|$                   Θ(MM(n))
rank                          rank(A)                 Θ(MM(n))
system of linear equations    $Ax = b$                Θ(MM(n))
LU decomposition              $A = LU$                Θ(MM(n))
least squares                 $\min \|Ax - b\|_2$     Θ(MM(n))

Table: numerical linear algebra problems with the same arithmetic complexity MM(n) as matrix multiplication.

DIVIDE AND CONQUER II

‣ master theorem ‣ integer multiplication ‣ matrix multiplication ‣ convolution and FFT

SECTION 5.6

Fourier analysis

Fourier theorem. [Fourier, Dirichlet, Riemann] Any (sufficiently smooth) periodic function can be expressed as the sum of a series of sinusoids.

$$y(t) = \frac{2}{\pi} \sum_{k=1}^{n} \frac{\sin kt}{k}$$

[Figure: plots of y(t) against t for n = 1, 5, 10, 100.]

Euler’s identity

Euler’s identity. $e^{ix} = \cos x + i \sin x$.

Sinusoids. Sum of sines and cosines = sum of complex exponentials.
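Sketch. Both statements can be checked numerically. The following Python fragment is only an illustration; the helper names are hypothetical, and it uses the standard identity sin x = (e^{ix} − e^{−ix}) / (2i), which follows from Euler's identity.

```python
import cmath
import math

def euler_gap(x: float) -> float:
    """Absolute difference between e^{ix} and cos x + i sin x."""
    return abs(cmath.exp(1j * x) - complex(math.cos(x), math.sin(x)))

def sine_from_exponentials(x: float) -> complex:
    """sin x written as a sum of complex exponentials."""
    return (cmath.exp(1j * x) - cmath.exp(-1j * x)) / 2j

# Both gaps should be at the level of floating-point rounding error.
for x in (0.0, 1.0, math.pi / 3, 2.5):
    print(x, euler_gap(x), abs(sine_from_exponentials(x) - math.sin(x)))
```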

Time domain vs. frequency domain

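Signal. [touch tone button 1] $y(t) = \tfrac{1}{2} \sin(2\pi \cdot 697\, t) + \tfrac{1}{2} \sin(2\pi \cdot 1209\, t)$

Time domain. [Figure: sound pressure versus time.]

Frequency domain. [Figure: amplitude versus frequency; the amplitude axis is marked at 0.5.]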

Reference: Cleve Moler, Numerical Computing with MATLAB
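Sketch. The frequency-domain view of the touch-tone signal can be reproduced by computing individual DFT coefficients. This Python fragment is an illustration under assumptions: one second of the signal is sampled at 8192 samples per second (a rate chosen here, not stated for this signal), and the helper names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 8192  # samples per second (assumed for this sketch)

def touch_tone_1(t: float) -> float:
    """y(t) = 1/2 sin(2 pi 697 t) + 1/2 sin(2 pi 1209 t)."""
    return (0.5 * math.sin(2 * math.pi * 697 * t)
            + 0.5 * math.sin(2 * math.pi * 1209 * t))

def amplitude_at(freq_hz: int, samples: list) -> float:
    """Amplitude of the sinusoid at freq_hz, from one second of samples.

    This is one DFT coefficient, rescaled by 2/n so that a pure
    sinusoid of amplitude c reports c.
    """
    n = len(samples)
    coeff = sum(y * cmath.exp(-2j * math.pi * freq_hz * j / n)
                for j, y in enumerate(samples))
    return 2 * abs(coeff) / n

samples = [touch_tone_1(j / SAMPLE_RATE) for j in range(SAMPLE_RATE)]
for f in (697, 1000, 1209):
    print(f, round(amplitude_at(f, samples), 6))
```

Only the two frequencies present in the signal report an amplitude near 0.5; other frequencies report roughly 0.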

Time domain vs. frequency domain

Signal. [recording, 8192 samples per second]

Magnitude of discrete Fourier transform.

Reference: Cleve Moler, Numerical Computing with MATLAB

Fast Fourier transform

FFT. Fast way to convert between time domain and frequency domain.

Alternate viewpoint. Fast way to multiply and evaluate polynomials. [we take this approach]

“ If you speed up any nontrivial algorithm by a factor of a million or so the world will beat a path towards finding useful applications for it. ” — Numerical Recipes

Fast Fourier transform: applications

Applications.
・Optics, acoustics, quantum physics, telecommunications, radar, control systems, signal processing, speech recognition, data compression, image processing, seismology, mass spectrometry, …
・Digital media. [DVD, JPEG, MP3, H.264]
・Medical diagnostics. [MRI, CT, PET scans, ultrasound]
・Numerical solutions to Poisson’s equation.
・Integer and polynomial multiplication.
・Shor’s quantum factoring algorithm.
・…

“ The FFT is one of the truly great computational developments of [the 20th] century. It has changed the face of science and engineering so much that it is not an exaggeration to say that life as we know it would be very different without the FFT. ” — Charles van Loan

Fast Fourier transform: applications

https://xkcd.com/26

Fast Fourier transform: brief history

Gauss (1805, 1866). Analyzed periodic motion of asteroid Ceres.

Runge–König (1924). Laid theoretical groundwork.

Danielson–Lanczos (1942). Efficient algorithm, x-ray crystallography.

Cooley–Tukey (1965). Detect nuclear tests in Soviet Union and track submarines. Rediscovered and popularized FFT.

Importance not fully realized until emergence of digital computers.

Polynomials: coefficient representation

Univariate polynomial. [ coefficient representation ]

$$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$
$$B(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1}$$

Addition. O(n) arithmetic operations.

$$A(x) + B(x) = (a_0 + b_0) + (a_1 + b_1) x + \cdots + (a_{n-1} + b_{n-1}) x^{n-1}$$

Evaluation. O(n) using Horner’s method.

$$A(x) = a_0 + (x (a_1 + x (a_2 + \cdots + x (a_{n-2} + x (a_{n-1})) \cdots )))$$

    double val = 0.0;
    for (int j = n-1; j >= 0; j--)
        val = a[j] + (x * val);

Multiplication (linear convolution). $O(n^2)$ using brute force.

$$A(x) \times B(x) = \sum_{i=0}^{2n-2} c_i x^i, \quad \text{where } c_i = \sum_{j=0}^{i} a_j \, b_{i-j}$$

(with the convention that $a_j = b_j = 0$ for $j \ge n$).
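Sketch. Brute-force multiplication adds the product of every pair of coefficients into the matching power of x. A minimal Python illustration (the function name is hypothetical; `a[j]` holds the coefficient of x^j):

```python
def multiply_brute_force(a, b):
    """Coefficients of A(x) * B(x) by linear convolution, in O(n^2) time."""
    c = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk   # a_j x^j times b_k x^k contributes to x^(j+k)
    return c

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(multiply_brute_force([1, 2, 3], [4, 5]))  # [4, 13, 22, 15]
```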

Divide-and-conquer II: quiz 6

What was the subject of Gauss’ Ph.D thesis?

A. Gaussian elimination.

B. Fast Fourier transform.

C. Prime number theorem.

D. Cauchy integral theorem.

E. Fundamental theorem of algebra.

F. Angle-preserving maps.

G. Method of least squares.

H. Non-Euclidean geometry.

I. Constructing a regular heptadecagon with straightedge and compass.

A modest Ph.D. dissertation title

DEMONSTRATIO NOVA THEOREMATIS OMNEM FVNCTIONEM ALGEBRAICAM RATIONALEM INTEGRAM VNIVS VARIABILIS IN FACTORES REALES PRIMI VEL SECUNDI GRADVS RESOLVI POSSE

AVCTORE CAROLO FRIDERICO GAVSS HELMSTADII APVD C. G. FLECKEISEN. 1799

1.

Quaelibet aequatio algebraica determinata reduci potest ad formam $x^m + Ax^{m-1} + Bx^{m-2} +$ etc. $+ M = 0$, ita vt m sit numerus integer positiuus. Si partem primam huius aequationis per X denotamus, aequationique X=0 per plures valores inaequales ipsius x satisfieri supponimus, puta ponendo x=α, x=β, x=γ etc. functio X per productum e factoribus x-α, x-β, x-γ etc. diuisibilis erit. Vice versa, si productum e pluribus factoribus simplicibus x-α, x-β, x-γ etc. functionem X metitur: aequationi X=0 satisfiet, aequando ipsam x cuicunque quantitatum α, β, γ etc. Denique si X producto ex m factoribus talibus simplicibus aequalis est (siue omnes diuersi sint, siue quidam ex ipsis identici): alii factores simplices praeter hos functionem X metiri non poterunt. Quamobrem aequatio mti gradus plures quam m radices habere nequit; simul vero patet, aequationem mti gradus pauciores radices habere posse, etsi X in m factores simplices resolubilis sit:

“ New proof of the theorem that every algebraic rational integral function in one variable can be resolved into real factors of the first or the second degree. ”

Polynomials: point-value representation

Fundamental theorem of algebra. A degree n univariate polynomial with complex coefficients has exactly n complex roots (counted with multiplicity).

Corollary. A degree n – 1 univariate polynomial A(x) is uniquely specified by its evaluation at n distinct values of x.

[Figure: graph of y = A(x) with a marked point $(x_j, y_j)$, where $y_j = A(x_j)$.]

Polynomials: point-value representation

Univariate polynomial. [ point-value representation ]

$A(x)$: $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$

$B(x)$: $(x_0, z_0), \ldots, (x_{n-1}, z_{n-1})$

Addition. O(n) arithmetic operations.

$A(x) + B(x)$: $(x_0, y_0 + z_0), \ldots, (x_{n-1}, y_{n-1} + z_{n-1})$

Multiplication. O(n), but represent A(x) and B(x) using 2n points.

$A(x) \times B(x)$: $(x_0, y_0 \times z_0), \ldots, (x_{2n-1}, y_{2n-1} \times z_{2n-1})$
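Sketch. In point-value form, addition and multiplication are done value by value. The Python fragment below is an illustration with assumed example data: the polynomials, the evaluation points 0..5, and the helper names are all chosen here, and the product coefficients used for the cross-check were worked out by hand.

```python
def evaluate(coeffs, x):
    """Horner's method: value of a_0 + a_1 x + a_2 x^2 + ... at x."""
    val = 0
    for a in reversed(coeffs):
        val = a + x * val
    return val

def to_point_value(coeffs, xs):
    """Values of the polynomial at each point in xs."""
    return [evaluate(coeffs, x) for x in xs]

a = [1, 2, 3]          # A(x) = 1 + 2x + 3x^2, so n = 3
b = [4, 5, 0]          # B(x) = 4 + 5x
xs = list(range(6))    # 2n = 6 distinct points (an arbitrary choice)

ys = to_point_value(a, xs)
zs = to_point_value(b, xs)
sum_values = [y + z for y, z in zip(ys, zs)]       # A(x) + B(x), O(n)
product_values = [y * z for y, z in zip(ys, zs)]   # A(x) * B(x), O(n)

# Cross-check against A(x) * B(x) = 4 + 13x + 22x^2 + 15x^3.
print(product_values == to_point_value([4, 13, 22, 15], xs))  # True
```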

Evaluation. $O(n^2)$ using Lagrange’s formula. [not used]

$$A(x) = \sum_{k=0}^{n-1} y_k \, \frac{\prod_{j \neq k} (x - x_j)}{\prod_{j \neq k} (x_k - x_j)}$$

Converting between two representations

Tradeoff. Either fast evaluation or fast multiplication. We want both!

representation    multiply    evaluate
coefficient       $O(n^2)$    $O(n)$
point-value       $O(n)$      $O(n^2)$

Goal. Efficient conversion between two representations ⇒ all ops fast.

[Diagram: coefficient representation $a_0, a_1, \ldots, a_{n-1}$ ⇄ point-value representation $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$.]

Converting between two representations

Application. Polynomial multiplication (coefficient representation).

[Diagram:]
coefficient representation: $a_0, a_1, \ldots, a_{n-1}$ and $b_0, b_1, \ldots, b_{n-1}$
  ↓ FFT, O(n log n)
point-value representation: $(x_0, y_0), \ldots, (x_{2n-1}, y_{2n-1})$ and $(x_0, z_0), \ldots, (x_{2n-1}, z_{2n-1})$
  ↓ point-value multiplication, O(n)
point-value representation: $(x_0, y_0 \times z_0), \ldots, (x_{2n-1}, y_{2n-1} \times z_{2n-1})$
  ↓ inverse FFT, O(n log n)
coefficient representation: $c_0, c_1, \ldots, c_{2n-2}$

Converting between two representations: brute force

Coefficient ⇒ point-value. Given a polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$, evaluate it at n distinct points $x_0, \ldots, x_{n-1}$.

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{bmatrix}$$

Running time. $O(n^2)$ via matrix–vector multiply (or n applications of Horner’s method).

Converting between two representations: brute force

Point-value ⇒ coefficient. Given n distinct points $x_0, \ldots, x_{n-1}$ and values $y_0, \ldots, y_{n-1}$, find the unique polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$ that has the given values at the given points.

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{bmatrix}$$

[Vandermonde matrix is invertible iff the $x_i$ are distinct]

Running time. $O(n^3)$ via Gaussian elimination. [or $O(n^{2.38})$ via fast matrix multiplication]

Divide-and-conquer II: quiz 7

Which divide-and-conquer approach to use to multiply polynomials?

$$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7.$$

A. Divide polynomial into low- and high-degree terms.

$$A_{\text{low}}(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3.$$
$$A_{\text{high}}(x) = a_4 + a_5 x + a_6 x^2 + a_7 x^3.$$

B. Divide polynomial into even- and odd-degree terms.

$$A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3.$$
$$A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3.$$

C. Either A or B.

D. Neither A nor B.

Divide-and-conquer

Decimation in time. Divide into even- and odd-degree terms. [Cooley–Tukey radix-2 FFT]
・$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7$.
・$A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3$.
・$A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3$.
・$A(x) = A_{\text{even}}(x^2) + x \, A_{\text{odd}}(x^2)$.

Decimation in frequency. Divide into low- and high-degree terms. [Sande–Tukey radix-2 FFT]
・$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7$.
・$A_{\text{low}}(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$.
・$A_{\text{high}}(x) = a_4 + a_5 x + a_6 x^2 + a_7 x^3$.
・$A(x) = A_{\text{low}}(x) + x^4 A_{\text{high}}(x)$.

Coefficient to point-value representation: intuition

Coefficient ⇒ point-value. Given a polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$, evaluate it at n distinct points $x_0, \ldots, x_{n-1}$. [we get to choose which ones!]

Divide. Break up polynomial into even- and odd-degree terms.
・$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7$.
・$A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3$.
・$A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3$.
・$A(x) = A_{\text{even}}(x^2) + x \, A_{\text{odd}}(x^2)$.
・$A(-x) = A_{\text{even}}(x^2) - x \, A_{\text{odd}}(x^2)$.

Intuition. Choose two points to be ±1.
・$A(1) = A_{\text{even}}(1) + 1 \cdot A_{\text{odd}}(1)$.
・$A(-1) = A_{\text{even}}(1) - 1 \cdot A_{\text{odd}}(1)$.
Can evaluate a polynomial of degree n – 1 at 2 points by evaluating two polynomials of degree ½n – 1 at only 1 point.

Coefficient to point-value representation: intuition

Coefficient ⇒ point-value. Given a polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$, evaluate it at n distinct points $x_0, \ldots, x_{n-1}$. [we get to choose which ones!]

Divide. Break up polynomial into even- and odd-degree terms.
・$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7$.
・$A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + a_6 x^3$.
・$A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + a_7 x^3$.
・$A(x) = A_{\text{even}}(x^2) + x \, A_{\text{odd}}(x^2)$.
・$A(-x) = A_{\text{even}}(x^2) - x \, A_{\text{odd}}(x^2)$.

Intuition. Choose four complex points to be ±1, ±i.
・$A(1) = A_{\text{even}}(1) + 1 \cdot A_{\text{odd}}(1)$.
・$A(-1) = A_{\text{even}}(1) - 1 \cdot A_{\text{odd}}(1)$.
・$A(i) = A_{\text{even}}(-1) + i \, A_{\text{odd}}(-1)$.
・$A(-i) = A_{\text{even}}(-1) - i \, A_{\text{odd}}(-1)$.
Can evaluate a polynomial of degree n – 1 at 4 points by evaluating two polynomials of degree ½n – 1 at only 2 points.
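Sketch. The even/odd split lets one pair of half-size evaluations serve both x and −x. A small Python illustration with assumed example coefficients (the helper names are hypothetical):

```python
def evaluate(coeffs, x):
    """Horner's method."""
    val = 0
    for a in reversed(coeffs):
        val = a + x * val
    return val

def evaluate_pair(coeffs, x):
    """Return (A(x), A(-x)) from one evaluation each of A_even and A_odd at x^2."""
    even = evaluate(coeffs[0::2], x * x)   # A_even(x^2)
    odd = evaluate(coeffs[1::2], x * x)    # A_odd(x^2)
    return even + x * odd, even - x * odd

a = [1, 2, 3, 4, 5, 6, 7, 8]               # a_0, ..., a_7 (arbitrary example values)
at_1, at_minus_1 = evaluate_pair(a, 1)     # uses A_even(1) and A_odd(1)
at_i, at_minus_i = evaluate_pair(a, 1j)    # uses A_even(-1) and A_odd(-1)
print(at_1, at_minus_1, at_i, at_minus_i)
```

The four values agree with evaluating A directly at 1, −1, i, and −i, yet A_even and A_odd are each evaluated at only two points (1 and −1).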

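Discrete Fourier transform

Coefficient ⇒ point-value. Given a polynomial $A(x) = a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$, evaluate it at n distinct points $x_0, \ldots, x_{n-1}$. [we get to choose which ones!]

Key idea. Choose $x_k = \omega^k$ where $\omega$ is a principal $n$th root of unity.

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega^1 & \omega^2 & \omega^3 & \cdots & \omega^{n-1} \\ 1 & \omega^2 & \omega^4 & \omega^6 & \cdots & \omega^{2(n-1)} \\ 1 & \omega^3 & \omega^6 & \omega^9 & \cdots & \omega^{3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{n-1} & \omega^{2(n-1)} & \omega^{3(n-1)} & \cdots & \omega^{(n-1)(n-1)} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix}$$

Here $y_k = A(\omega^k)$. The vector on the left is the DFT; the matrix is the Fourier matrix $F_n$.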
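Sketch. Read literally, the matrix equation is a Θ(n²) algorithm: each y_k is a sum of n terms. A minimal Python illustration, assuming ω = e^{2πi/n} (the function name is hypothetical):

```python
import cmath

def dft_brute_force(a):
    """y_k = A(omega^k) for k = 0..n-1, with omega = e^{2 pi i / n}.

    Equivalent to multiplying the Fourier matrix F_n by the coefficient
    vector a, so it uses Theta(n^2) arithmetic operations.
    """
    n = len(a)
    omega = cmath.exp(2j * cmath.pi / n)
    return [sum(a[j] * omega ** (j * k) for j in range(n)) for k in range(n)]

# A(x) = 1 + x + x^2 + x^3 at the 4th roots of unity 1, i, -1, -i:
# A(1) = 4 and A vanishes at the other three roots.
y = dft_brute_force([1, 1, 1, 1])
print([round(abs(v), 9) for v in y])
```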

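Roots of unity

Def. An $n$th root of unity is a complex number x such that $x^n = 1$.

Fact. The $n$th roots of unity are: $\omega^0, \omega^1, \ldots, \omega^{n-1}$ where $\omega = e^{2\pi i / n}$. [common alternative: $\omega = e^{-2\pi i / n}$]
Pf. $(\omega^k)^n = (e^{2\pi i k / n})^n = (e^{\pi i})^{2k} = (-1)^{2k} = 1$.

Fact. The ½$n$th roots of unity are: $\nu^0, \nu^1, \ldots, \nu^{n/2-1}$ where $\nu = \omega^2 = e^{4\pi i / n}$.

[Figure: the unit circle for n = 8, with the points $\omega^0 = \nu^0 = 1$, $\omega^1$, $\omega^2 = \nu^1 = i$, $\omega^3$, $\omega^4 = \nu^2 = -1$, $\omega^5$, $\omega^6 = \nu^3 = -i$, $\omega^7$.]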
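Sketch. The facts on this slide can be checked numerically for n = 8. The Python fragment below is an illustration only; the helper name is hypothetical and the tolerance 1e-12 is an arbitrary allowance for rounding error.

```python
import cmath

def roots_of_unity(n):
    """[omega^0, ..., omega^(n-1)] with omega = e^{2 pi i / n}."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

n = 8
omega = roots_of_unity(n)
nu = roots_of_unity(n // 2)

# Every omega^k satisfies x^n = 1.
print(all(abs(w ** n - 1) < 1e-12 for w in omega))
# Squaring the nth roots gives the (n/2)th roots, each one twice:
# (omega^k)^2 = nu^(k mod n/2).
print(all(abs(omega[k] ** 2 - nu[k % (n // 2)]) < 1e-12 for k in range(n)))
# omega^(k + n/2) = -omega^k, so the roots come in +/- pairs.
print(all(abs(omega[k + n // 2] + omega[k]) < 1e-12 for k in range(n // 2)))
```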

Fast Fourier transform

Goal. Evaluate a degree n – 1 polynomial $A(x) = a_0 + \cdots + a_{n-1} x^{n-1}$ at its $n$th roots of unity: $\omega^0, \omega^1, \ldots, \omega^{n-1}$.

Divide. Break up polynomial into even- and odd-degree terms.
・$A_{\text{even}}(x) = a_0 + a_2 x + a_4 x^2 + \cdots + a_{n-2} x^{n/2-1}$.
・$A_{\text{odd}}(x) = a_1 + a_3 x + a_5 x^2 + \cdots + a_{n-1} x^{n/2-1}$.
・$A(x) = A_{\text{even}}(x^2) + x \, A_{\text{odd}}(x^2)$.
・$A(-x) = A_{\text{even}}(x^2) - x \, A_{\text{odd}}(x^2)$.

Conquer. Evaluate $A_{\text{even}}(x)$ and $A_{\text{odd}}(x)$ at the ½$n$th roots of unity: $\nu^0, \nu^1, \ldots, \nu^{n/2-1}$. [$\nu^k = (\omega^k)^2$]

Combine.
・$y_k = A(\omega^k) = A_{\text{even}}(\nu^k) + \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$.
・$y_{k + n/2} = A(\omega^{k + n/2}) = A_{\text{even}}(\nu^k) - \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$. [$A(\omega^{k+n/2}) = A(-\omega^k)$]

FFT: implementation

Goal. Evaluate a degree n – 1 polynomial $A(x) = a_0 + \cdots + a_{n-1} x^{n-1}$ at its $n$th roots of unity: $\omega^0, \omega^1, \ldots, \omega^{n-1}$.
・$y_k = A(\omega^k) = A_{\text{even}}(\nu^k) + \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$.
・$y_{k + n/2} = A(\omega^{k + n/2}) = A_{\text{even}}(\nu^k) - \omega^k A_{\text{odd}}(\nu^k)$, $0 \le k < n/2$.

FFT(n, a_0, a_1, a_2, …, a_{n–1})

______

IF (n = 1) RETURN a_0.

(e_0, e_1, …, e_{n/2–1}) ← FFT(n / 2, a_0, a_2, a_4, …, a_{n–2}).
(d_0, d_1, …, d_{n/2–1}) ← FFT(n / 2, a_1, a_3, a_5, …, a_{n–1}).
[the two recursive calls cost 2 T(n / 2)]

FOR k = 0 TO n / 2 – 1.
    ω^k ← e^{2πik/n}.
    y_k ← e_k + ω^k d_k.
    y_{k + n/2} ← e_k – ω^k d_k.
[the loop costs Θ(n)]

RETURN (y_0, y_1, y_2, …, y_{n–1}).

______

FFT: summary

Theorem. The FFT algorithm evaluates a degree n – 1 polynomial at each of the nth roots of unity in O(n log n) arithmetic operations and O(n) extra space.

Pf. [assumes n is a power of 2]

$$T(n) = \begin{cases} \Theta(1) & \text{if } n = 1 \\ 2T(n/2) + \Theta(n) & \text{if } n > 1 \end{cases}$$

which solves to $T(n) = \Theta(n \log n)$.

[Diagram: coefficient representation $a_0, a_1, \ldots, a_{n-1}$ → FFT, O(n log n) → point-value representation $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$; the reverse direction is marked "???".]
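Sketch. The FFT pseudocode translates almost line for line into Python. This is a minimal illustration, not a tuned implementation: it assumes the input length is a power of 2, uses ω = e^{2πi/n}, and includes a hypothetical brute-force reference only to cross-check the result.

```python
import cmath

def fft(a):
    """Values of A(x) = a[0] + a[1] x + ... at the nth roots of unity.

    Mirrors the FFT pseudocode; assumes len(a) is a power of 2.
    """
    n = len(a)
    if n == 1:
        return list(a)
    e = fft(a[0::2])                           # A_even at the (n/2)th roots of unity
    d = fft(a[1::2])                           # A_odd at the (n/2)th roots of unity
    y = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)   # omega^k
        y[k] = e[k] + w * d[k]
        y[k + n // 2] = e[k] - w * d[k]
    return y

def dft_brute_force(a):
    """Theta(n^2) reference: y_k = sum_j a_j omega^(jk)."""
    n = len(a)
    return [sum(a[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

a = [3, 1, 4, 1, 5, 9, 2, 6]                   # arbitrary example coefficients
fast, slow = fft(a), dft_brute_force(a)
print(max(abs(u - v) for u, v in zip(fast, slow)) < 1e-9)  # True
```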

Divide-and-conquer II: quiz 8

When computing the FFT of $(a_0, a_1, a_2, \ldots, a_7)$, which are the first two coefficients involved in an arithmetic operation?

A. $a_0$ and $a_1$.

B. $a_0$ and $a_2$.

C. $a_0$ and $a_4$.

D. $a_0$ and $a_7$.

E. None of the above.

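FFT: recursion tree

[Recursion tree; each level is obtained from the one above by an inverse perfect shuffle:]

a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7

a_0, a_2, a_4, a_6   |   a_1, a_3, a_5, a_7

a_0, a_4   |   a_2, a_6   |   a_1, a_5   |   a_3, a_7

a_0   a_4   a_2   a_6   a_1   a_5   a_3   a_7
000   100   010   110   001   101   011   111

“bit-reversed” order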
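Sketch. The leaf order of the recursion tree can be generated two ways: by repeatedly splitting into even and odd positions, or by reversing the bits of each index. A small Python illustration (both helper names are hypothetical; n is assumed to be a power of 2):

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the low `bits` bits of index, e.g. 001 -> 100 for bits = 3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def leaf_order(n: int) -> list:
    """Coefficient indices at the leaves of the FFT recursion tree, left to right."""
    def split(indices):
        if len(indices) == 1:
            return indices
        return split(indices[0::2]) + split(indices[1::2])
    return split(list(range(n)))

print(leaf_order(8))                          # [0, 4, 2, 6, 1, 5, 3, 7]
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```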

FFT: Fourier matrix decomposition

Alternative viewpoint. FFT is a recursive decomposition of the Fourier matrix.

$$F_n = \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega^1 & \omega^2 & \omega^3 & \cdots & \omega^{n-1} \\ 1 & \omega^2 & \omega^4 & \omega^6 & \cdots & \omega^{2(n-1)} \\ 1 & \omega^3 & \omega^6 & \omega^9 & \cdots & \omega^{3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{n-1} & \omega^{2(n-1)} & \omega^{3(n-1)} & \cdots & \omega^{(n-1)(n-1)} \end{bmatrix}, \qquad a = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{bmatrix}$$

$$I_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}, \qquad D_n = \begin{bmatrix} \omega^0 & 0 & 0 & \cdots & 0 \\ 0 & \omega^1 & 0 & \cdots & 0 \\ 0 & 0 & \omega^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \omega^{n-1} \end{bmatrix}$$

DFT.

$$y = F_n \, a = \begin{bmatrix} I_{n/2} & D_{n/2} \\ I_{n/2} & -D_{n/2} \end{bmatrix} \begin{bmatrix} F_{n/2} \, a_{\text{even}} \\ F_{n/2} \, a_{\text{odd}} \end{bmatrix}$$

In $D_{n/2}$, $\omega$ is still the principal $n$th root of unity; $F_{n/2}$ is built from $\nu = \omega^2$.

Inverse discrete Fourier transform

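Point-value ⇒ coefficient. Given n distinct points $x_0, \ldots, x_{n-1}$ and values $y_0, \ldots, y_{n-1}$, find the unique polynomial $a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$ that has the given values at the given points.

$$\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_{n-1} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega^1 & \omega^2 & \omega^3 & \cdots & \omega^{n-1} \\ 1 & \omega^2 & \omega^4 & \omega^6 & \cdots & \omega^{2(n-1)} \\ 1 & \omega^3 & \omega^6 & \omega^9 & \cdots & \omega^{3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{n-1} & \omega^{2(n-1)} & \omega^{3(n-1)} & \cdots & \omega^{(n-1)(n-1)} \end{bmatrix}^{-1} \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-1} \end{bmatrix}$$

The vector on the left is the inverse DFT; the matrix is the Fourier matrix inverse $(F_n)^{-1}$.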

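Inverse discrete Fourier transform

Claim. Inverse of Fourier matrix $F_n$ is given by following formula:

$$G_n = \frac{1}{n} \begin{bmatrix} 1 & 1 & 1 & 1 & \cdots & 1 \\ 1 & \omega^{-1} & \omega^{-2} & \omega^{-3} & \cdots & \omega^{-(n-1)} \\ 1 & \omega^{-2} & \omega^{-4} & \omega^{-6} & \cdots & \omega^{-2(n-1)} \\ 1 & \omega^{-3} & \omega^{-6} & \omega^{-9} & \cdots & \omega^{-3(n-1)} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \omega^{-(n-1)} & \omega^{-2(n-1)} & \omega^{-3(n-1)} & \cdots & \omega^{-(n-1)(n-1)} \end{bmatrix}$$

[$F_n / \sqrt{n}$ is a unitary matrix]

Consequence. To compute the inverse FFT, apply the same algorithm but use $\omega^{-1} = e^{-2\pi i / n}$ as principal $n$th root of unity (and divide the result by n).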

Inverse FFT: proof of correctness

Claim. $F_n$ and $G_n$ are inverses.
Pf.

$$(F_n G_n)_{k k'} = \frac{1}{n} \sum_{j=0}^{n-1} \omega^{kj} \, \omega^{-jk'} = \frac{1}{n} \sum_{j=0}^{n-1} \omega^{(k - k')j} = \begin{cases} 1 & \text{if } k = k' \\ 0 & \text{otherwise} \end{cases}$$

[the last equality uses the summation lemma (below)]

Summation lemma. Let $\omega$ be a principal $n$th root of unity. Then

$$\sum_{j=0}^{n-1} \omega^{kj} = \begin{cases} n & \text{if } k \equiv 0 \pmod{n} \\ 0 & \text{otherwise} \end{cases}$$

Pf.
・If k is a multiple of n, then $\omega^k = 1$ ⇒ series sums to n.
・Each $n$th root of unity $\omega^k$ is a root of $x^n - 1 = (x - 1)(1 + x + x^2 + \cdots + x^{n-1})$.
・If $\omega^k \neq 1$, then $1 + \omega^k + \omega^{k(2)} + \cdots + \omega^{k(n-1)} = 0$ ⇒ series sums to 0. ▪
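Sketch. The summation lemma, and the claim that F_n and G_n are inverses, can be checked numerically for a small n. The Python fragment below is an illustration only; n = 8 and the helper name `geometric_sum` are choices made here.

```python
import cmath

def geometric_sum(n, k):
    """sum_{j=0}^{n-1} omega^{kj} with omega = e^{2 pi i / n}."""
    return sum(cmath.exp(2j * cmath.pi * k * j / n) for j in range(n))

n = 8
# Summation lemma: the sum is n when k is a multiple of n, and 0 otherwise.
print([round(abs(geometric_sum(n, k)), 9) for k in range(-n, 2 * n + 1, 4)])

# Entry (k, k') of F_n G_n is (1/n) * sum_j omega^{(k - k') j}.
product = [[geometric_sum(n, k - kp) / n for kp in range(n)] for k in range(n)]
is_identity = all(abs(product[k][kp] - (1 if k == kp else 0)) < 1e-9
                  for k in range(n) for kp in range(n))
print(is_identity)  # True
```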

Inverse FFT: implementation

Note. Need to divide result by n.

INVERSE-FFT(n, y_0, y_1, y_2, …, y_{n–1})

______

IF (n = 1) RETURN y_0.

(e_0, e_1, …, e_{n/2–1}) ← INVERSE-FFT(n / 2, y_0, y_2, y_4, …, y_{n–2}).

(d_0, d_1, …, d_{n/2–1}) ← INVERSE-FFT(n / 2, y_1, y_3, y_5, …, y_{n–1}).

FOR k = 0 TO n / 2 – 1.

    ω^k ← e^{–2πik/n}.    [switch roles of a_i and y_i]

    a_k ← e_k + ω^k d_k.

    a_{k + n/2} ← e_k – ω^k d_k.

RETURN (a_0, a_1, a_2, …, a_{n–1}).

______
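Sketch. A minimal Python version of the inverse FFT, assuming the input length is a power of 2. The `sign` parameter is an illustrative device (not part of the pseudocode) that selects ω or ω⁻¹ so one recursive routine serves both directions; the division by n is done once, after the recursion.

```python
import cmath

def fft(a, sign=+1):
    """Recursive FFT with omega = e^{sign * 2 pi i / n}; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    e = fft(a[0::2], sign)
    d = fft(a[1::2], sign)
    y = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        y[k] = e[k] + w * d[k]
        y[k + n // 2] = e[k] - w * d[k]
    return y

def inverse_fft(y):
    """Same recursion with omega^{-1}, then one division by n at the end."""
    n = len(y)
    return [v / n for v in fft(y, sign=-1)]

a = [3, 1, 4, 1, 5, 9, 2, 6]               # arbitrary example coefficients
recovered = inverse_fft(fft(a))
print([round(v.real, 9) for v in recovered])
```

Applying `inverse_fft` to the output of `fft` recovers the original coefficients up to rounding error.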

Inverse FFT: summary

Theorem. The inverse FFT algorithm interpolates a degree n – 1 polynomial, given its values at each of the nth roots of unity, in O(n log n) arithmetic operations. [assumes n is a power of 2]

Corollary. Can convert between coefficient and point-value representations in O(n log n) arithmetic operations.

[Diagram: coefficient representation $a_0, a_1, \ldots, a_{n-1}$ → FFT, O(n log n) → point-value representation $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ → inverse FFT, O(n log n) → coefficient representation.]

Polynomial multiplication

n−1 Theorem. Given two polynomials A(x) = a0 + a1 x + … + an−1 x n−1 and B(x) = b0 + b1 x + … + bn−1 b of degree n – 1, can multiply pad with 0s to make n a power of 2 them in O(n log n) arithmetic operations.

Pf.

[Figure: coefficient representation ⟶ point-value representation ⟶ coefficient representation]

coefficient representation:   a_0, a_1, …, a_{n−1}   and   b_0, b_1, …, b_{n−1}
    ↓  two FFTs, O(n log n)
point-value representation:   A(ω^0), …, A(ω^{2n−1})   and   B(ω^0), …, B(ω^{2n−1})
    ↓  point-value multiplication, O(n)
point-value representation:   C(ω^0), …, C(ω^{2n−1})
    ↓  inverse FFT, O(n log n)
coefficient representation:   c_0, c_1, …, c_{2n−2}
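The three stages of the proof (two FFTs, point-value multiplication, inverse FFT) can be sketched in a few lines of Python. This is a minimal illustration, not the slides' code: the function names fft and poly_multiply are hypothetical, the FFT is a plain recursive radix-2 Cooley–Tukey, and the final rounding assumes integer coefficients small enough that floating-point error stays below 1/2.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of 2.
    With invert=True the conjugate roots are used (caller divides by n)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        # k-th power of the principal n-th root of unity (or its conjugate).
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def poly_multiply(a, b):
    """Multiply two non-empty integer-coefficient polynomials given as
    coefficient lists, lowest degree first."""
    result_len = len(a) + len(b) - 1
    size = 1
    while size < result_len:      # pad with 0s up to a power of 2
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))   # coefficient -> point-value
    fb = fft(list(b) + [0] * (size - len(b)))
    fc = [x * y for x, y in zip(fa, fb)]        # point-value multiplication
    c = fft(fc, invert=True)                    # point-value -> coefficient
    # Undo the scaling by size and round away floating-point noise.
    return [round((v / size).real) for v in c[:result_len]]
```

For example, poly_multiply([1, 2, 3], [4, 5]) multiplies 1 + 2x + 3x^2 by 4 + 5x. The sketch pads to the smallest power of 2 that holds all coefficients of the product, which is at most a constant factor away from the 2n points used in the figure.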

FFT in practice ?

FFT in practice

Fastest Fourier transform in the West. [Frigo–Johnson]
・Optimized C library.
・Features: DFT, DCT, real, complex, any size, any dimension.
・Won 1999 Wilkinson Prize for Numerical Software.
・Portable, competitive with vendor-tuned code.

Implementation details.
・Core algorithm is an in-place, nonrecursive version of Cooley–Tukey.
・Instead of executing a fixed algorithm, it evaluates the hardware and uses a special-purpose compiler to generate an optimized algorithm tailored to the “shape” of the problem.
・Runs in O(n log n) time, even when n is prime.
・Multidimensional FFTs.
・Parallelism.

http://www.fftw.org

Top 10 algorithms of the 20th century

[Figure: pages from Computing in Science & Engineering, January/February 2000: the editorial "The Joy of Algorithms" by Francis Sullivan, Associate Editor-in-Chief, and part of the issue's introduction to "the top 10 algorithms of the century."]

Excerpt on the FFT. Daniel Rockmore describes the FFT as an algorithm “the whole family can use.” The FFT is perhaps the most ubiquitous algorithm in use today to analyze and manipulate digital or discrete data. The FFT takes the operation count for discrete Fourier transform from O(N^2) to O(N log N).

Other algorithms named in the excerpt. Matrix decompositions, the Fortran I compiler, the QR algorithm, Quicksort, integer relation detection, and the Fast Multipole Algorithm.

Integer multiplication, redux

Integer multiplication. Given two n-bit integers a = a_{n−1} … a_1 a_0 and b = b_{n−1} … b_1 b_0, compute their product a ⋅ b.

Convolution algorithm.
・Form two polynomials. [ Note: a = A(2), b = B(2). ]
    A(x) = a_0 + a_1 x + a_2 x^2 + … + a_{n−1} x^{n−1}
    B(x) = b_0 + b_1 x + b_2 x^2 + … + b_{n−1} x^{n−1}
・Compute C(x) = A(x) ⋅ B(x).
・Evaluate C(2) = a ⋅ b.
・Running time: O(n log n) floating-point operations.
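The steps above can be traced on small inputs with the following Python sketch. All names are hypothetical. To keep the focus on the reduction itself (bits to coefficients, convolve, evaluate at 2), a quadratic-time schoolbook convolution stands in for the FFT step, so this sketch shows correctness of the reduction and does not achieve the O(n log n) bound.

```python
def bits_lsb_first(value):
    """Binary digits of a non-negative integer, least significant first."""
    digits = []
    while value > 0:
        digits.append(value & 1)
        value >>= 1
    return digits or [0]

def convolve(a, b):
    """Schoolbook convolution in O(n^2) time. Assumption of this sketch:
    it is a stand-in for the FFT-based O(n log n) convolution."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def multiply_via_convolution(a_int, b_int):
    """Multiply two non-negative integers by convolving their bit vectors."""
    a = bits_lsb_first(a_int)   # coefficients of A, so that A(2) = a_int
    b = bits_lsb_first(b_int)   # coefficients of B, so that B(2) = b_int
    c = convolve(a, b)          # coefficients of C(x) = A(x) * B(x)
    # Evaluate C(2) with Horner's rule. Coefficients of C may exceed 1;
    # evaluating at 2 is what resolves the carries.
    product = 0
    for coeff in reversed(c):
        product = 2 * product + coeff
    return product
```

For instance, 13 = 1101 in binary and 11 = 1011 in binary, and multiply_via_convolution(13, 11) evaluates the product polynomial at 2 to recover 13 ⋅ 11.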

Theory. [Schönhage–Strassen 1971]
・O(n log^2 n) bit operations. [ FFT over complex numbers; need O(log n) bits of precision ]
・O(n log n ⋅ log log n) bit operations. [ FFT over ring of integers (modulo a Fermat number) ]

Practice. [GNU Multiple Precision Arithmetic Library] Switches to FFT-based algorithm when n is large (≥ 5–10K).

3-SUM (REVISITED)

3-SUM. Given three sets X, Y, and Z of n integers each, determine whether there is a triple i ∈ X, j ∈ Y, k ∈ Z such that i + j = k.

Assumption. All integers are between 0 and m.

Goal. O(m log m + n log n) time.

m = 19, n = 3
X = { 4, 7, 10 }
Y = { 5, 8, 15 }
Z = { 4, 13, 19 }

a yes instance (4 + 15 = 19)

3-SUM (REVISITED)

An O(m log m + n) solution.
・Form polynomial A(x) = a_0 + a_1 x + … + a_m x^m with a_i = 1 iff i ∈ X.
・Form polynomial B(x) = b_0 + b_1 x + … + b_m x^m with b_j = 1 iff j ∈ Y.
・Compute product/convolution C(x) = A(x) ⋅ B(x).
・The coefficient c_k = number of ways to choose an integer i ∈ X and an integer j ∈ Y that sum to exactly k.
・For each k ∈ Z : check whether c_k > 0.

m = 19, n = 3
X = { 4, 7, 10 }     A(x) = x^4 + x^7 + x^{10}
Y = { 5, 8, 15 }     B(x) = x^5 + x^8 + x^{15}
Z = { 4, 13, 19 }    C(x) = x^9 + 2x^{12} + 2x^{15} + x^{18} + x^{19} + x^{22} + x^{25}

a yes instance (4 + 15 = 19)
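
A minimal Python sketch of this solution follows. The names fft and three_sum are hypothetical, the FFT is a simple recursive radix-2 version, and the sketch assumes every element of X, Y, and Z is an integer between 0 and m. Rounding the inverse transform relies on the coefficients c_k being small integers, which holds here because each c_k is at most n.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of 2.
    With invert=True the conjugate roots are used (caller divides by n)."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def three_sum(X, Y, Z, m):
    """Return True iff some i in X and j in Y satisfy i + j = k for a k in Z.
    Assumes all values lie in the range 0..m."""
    size = 1
    while size < 2 * m + 1:       # C(x) has degree at most 2m
        size *= 2
    a = [0] * size
    b = [0] * size
    for i in X:
        a[i] = 1                  # a_i = 1 iff i is in X
    for j in Y:
        b[j] = 1                  # b_j = 1 iff j is in Y
    fa, fb = fft(a), fft(b)
    c = fft([p * q for p, q in zip(fa, fb)], invert=True)
    # c_k = number of pairs (i, j) with i + j = k, after scaling and rounding.
    counts = [round((v / size).real) for v in c]
    return any(counts[k] > 0 for k in Z)
```

On the instance above, three_sum({4, 7, 10}, {5, 8, 15}, {4, 13, 19}, 19) finds c_19 = 1 and reports a yes instance; with Z = {4, 13} neither coefficient is positive, so the answer is no.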