CECS 228 Lectures

Darin Goldstein

1 Modular Arithmetic Introduction

The modulus relation is special and is used in computer science all the time. You can think of all numbers in the world of (mod n) as being represented by what is left over when they are divided by n; therefore, every number is related to some number between 0 and n - 1 in the world of (mod n). You are allowed to add, subtract, and multiply in the world of (mod n), but not divide! Why not? What is 4 divided by 2 in the world of (mod 4)? Is it 0 or 2? Calculate the following expressions:

• 7 + 9 (mod 11)
• 4 · 6 (mod 11)
• 4^3 · 17^2 (mod 5)
• 100^17 (mod 6)
• 9^7 · 102^3 · 5^2 (mod 3)
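These warm-up expressions can be checked in Python (a quick sketch, not part of the original notes). The built-in `pow` with a third argument does modular exponentiation, reducing mod n at every step instead of building a huge intermediate number:

```python
# Addition and multiplication reduce mod n at the end (or at any point).
print((7 + 9) % 11)     # 5
print((4 * 6) % 11)     # 2

# pow(base, exp, mod) performs fast modular exponentiation.
# For example, 100 ≡ 4 (mod 6) and every power of 4 is ≡ 4 (mod 6):
print(pow(100, 17, 6))  # 4
```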

1.1 Application

Consider the following network:

[Figure: the butterfly network, not reproduced in these notes.]

Imagine that there is a source that is sending data through the network. (Netflix is streaming a movie, for example, to two customers.) For the sake of clarity, assume that the source can send only a single bit through each line segment per time unit. If there is only one customer (either side), then it is clear that the source can simultaneously send two bits of information through the network, both aimed at the same destination. However, if both of the destination nodes are considered customers, then there is contention for the middle link. If we treat the data as if it were water flowing through a pipe, it is clear that the source can only send 1 bit to each destination. In other words, Netflix would have to send the same bit to both customers, one to the left and one to the right, cutting down the throughput from 2 bits with one customer to 1 bit for 2 customers. Now imagine that we send two different bits, b1 down the left and b2 down the right. (The bits don't actually have to be different, but they are allowed to be.) Assume the secondary node then copies the bits and sends one down the left and one down the right. The node in the center computes b1 + b2 (in the world of (mod 2), i.e., XOR). That value is then sent to both nodes along with the value of b1 on the left and b2 on the right. Observe that when the two customers get their bits, they can both retrieve the original two bits! So we now have 2 bits for 2 customers without changing the network at all!

1.2 Units

Let Z_n^* denote the set of numbers in the world of (mod n) such that x ∈ Z_n^* → ∃y ∈ Z_n^*, yx = 1. The number y is called the inverse of the number x.

Example: Z_10^* consists of the numbers {1, 3, 7, 9}. Z_11^* consists of {1, 2, ..., 10}. Z_12^* consists of {1, 5, 7, 11}.
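The examples can be regenerated with a short sketch (not from the notes): x has an inverse mod n exactly when gcd(x, n) = 1, so the units are the numbers coprime to n.

```python
from math import gcd

def units(n):
    """Elements of Z_n^*: the x in [1, n-1] that have an inverse mod n."""
    return [x for x in range(1, n) if gcd(x, n) == 1]

print(units(10))  # [1, 3, 7, 9]
print(units(12))  # [1, 5, 7, 11]
```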

1.3 Subgroups

Let S ⊆ Z_n^* be such that the following three rules hold: (a) 1 ∈ S; (b) x, y ∈ S → xy ∈ S; (c) x ∈ S → ∃y ∈ S, yx = 1. S is called a subgroup.

Example: All of these are subgroups: {1} ⊂ {1, 7} ⊂ {1, 5, 7, 11} = Z_12^*

For each x in Z_n^*, if S is a subgroup, define xS = {xs : s ∈ S}.

Exercise: Show that ∀s ∈ S, sS = S.

• Choose any s ∈ S. For any s' ∈ S, ss' ∈ S ⇒ sS ⊆ S.
• Now choose any s' ∈ S. Note that s^{-1} ∈ S ⇒ s^{-1}s' ∈ S ⇒ s(s^{-1}s') = s' ∈ sS ⇒ S ⊆ sS.

2 Lagrange’s theorem

Lagrange's theorem: For any subgroup S of Z_n^*, |S| divides |Z_n^*|.

Examples:

{1} ⊂ {1, 7} ⊂ {1, 5, 7, 11} = Z_12^*
{1} ⊂ {1, 4} ⊂ {1, 2, 4, 8} ⊂ {1, 2, 4, 7, 8, 11, 13, 14} = Z_15^*

Proof: Notice the following facts:

1. 1 ∈ S ⇒ ∪_{x ∈ Z_n^*} xS = Z_n^*

2. xS ∩ yS ≠ ∅ ⇒ ∃s, s' ∈ S, xs = ys' ⇒ x = ys's^{-1} ⇒ xS = ys's^{-1}S ⇒ (by the exercise from yesterday) xS = yS. (Either xS and yS do not intersect or they are the same!)

3. |xS| ≤ |S|

4. Is it possible for |xS| < |S|? In this case, by the Pigeonhole Principle, there must exist two s1 ≠ s2 ∈ S with xs1 = xs2 ⇒ x^{-1}xs1 = x^{-1}xs2 ⇒ s1 = s2, which is a contradiction. So |xS| ≥ |S| ⇒ |xS| = |S|.

We've sliced up Z_n^* into sets xS, all of which have size |S|, none of which overlap, and all of which, when taken together, make up Z_n^*. ∎
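The coset slicing can be watched concretely (a sketch, not part of the notes) using the subgroup {1, 2, 4, 8} of Z_15^* from the example above:

```python
from math import gcd

n = 15
Zn_star = {x for x in range(1, n) if gcd(x, n) == 1}   # {1,2,4,7,8,11,13,14}
S = {1, 2, 4, 8}                                       # a subgroup of Z_15^*

# Form the cosets xS = {x*s mod n : s in S} for every unit x.
cosets = {frozenset((x * s) % n for s in S) for x in Zn_star}

# Every coset has size |S|, the cosets are disjoint, and together they
# cover Z_15^* -- so |S| divides |Z_15^*|, as Lagrange's theorem says.
assert all(len(c) == len(S) for c in cosets)
assert len(cosets) * len(S) == len(Zn_star)
print(len(cosets))   # 2 cosets of size 4 partition the 8 units
```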

3 Euclidean algorithm

Find gcd(1001, 221) using the Euclidean algorithm. Find an integer solution for x, y: 5x + 3y = 1. Find an integer solution for x, y: 1001x + 221y = 39. In general, when can you expect a solution to exist? A solution exists if and only if the gcd of the two coefficients evenly divides the constant.

Why does the Euclidean algorithm work? First, notice that gcd(a, b) must divide the final nonzero remainder because it divides both a and b and therefore the first remainder (and this argument continues to the very bottom). Thus gcd(a, b) is less than or equal to the final remainder. Can gcd(a, b) be strictly smaller than the final remainder? No: it is clear that the final remainder evenly divides both a and b (start from the bottom and work your way up), and gcd(a, b) is the greatest number with that property.
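The back-substitution that produces the x and y is usually packaged as the extended Euclidean algorithm; a sketch (not from the notes):

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    # gcd(b, a % b) == g and b*x + (a % b)*y == g; rearrange for (a, b).
    return g, y, x - (a // b) * y

g, x, y = extended_gcd(1001, 221)
print(g)                                   # 13
assert 1001 * x + 221 * y == 13
# 1001x + 221y = 39 is solvable because 13 | 39: scale a solution by 3.
assert 1001 * (3 * x) + 221 * (3 * y) == 39
```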

4 Lamé's theorem and Inverses

4.1 Inverses

Find the inverse of 87 in the world of (mod 101). In general, when can you expect an inverse of a modulo b to exist? When gcd(a, b) = 1.
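The extended Euclidean algorithm gives the inverse directly; a sketch (not from the notes — note that Python 3.8+ also exposes this as `pow(a, -1, m)`):

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def modinv(a, m):
    """Inverse of a mod m; exists iff gcd(a, m) == 1."""
    g, x, _ = extended_gcd(a, m)
    if g != 1:
        raise ValueError("no inverse: gcd(a, m) != 1")
    return x % m

print(modinv(87, 101))                     # 36
assert (87 * modinv(87, 101)) % 101 == 1
```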

4.2 Recall

Assume that there is a breed of immortal rabbits. A mature pair of these rabbits makes a new pair of baby rabbits every month (but it takes the full month to produce them). Baby rabbits mature after a single month. Assume that we go to an initially rabbit-free island and drop a pair of baby rabbits from a helicopter. Let f(n) be the number of pairs of rabbits during the nth month. Then f(n) is the function that represents the Fibonacci numbers. Recall that we previously showed that f(0) = 0, f(1) = 1, and f(n) = f(n - 1) + f(n - 2) ⇒

f(n) = (1/√5)((1 + √5)/2)^n - (1/√5)((1 - √5)/2)^n

This implies that f(n) ~ γ^n where γ is the golden ratio.
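The closed form can be checked against the recurrence (a sketch, not part of the notes; the floating-point version is exact after rounding for moderate n):

```python
from math import sqrt

def fib(n):
    """Iterative Fibonacci with f(0) = 0, f(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_closed(n):
    """Binet's closed form; the (1-sqrt5)/2 term shrinks to 0 quickly."""
    s5 = sqrt(5)
    return round(((1 + s5) / 2) ** n / s5 - ((1 - s5) / 2) ** n / s5)

assert [fib(n) for n in range(10)] == [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
assert all(fib(n) == fib_closed(n) for n in range(40))
```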

• One Egyptian pyramid is remarkably close to a "golden pyramid": the Great Pyramid of Giza (also known as the Pyramid of Cheops or Khufu). Its slope is extremely close to the "golden" pyramid inclination; other pyramids at Giza are also quite close. Whether the relationship to the golden ratio in these pyramids is by design or by accident remains open to speculation. Adding fuel to controversy over the architectural authorship of the Great Pyramid, Eric Temple Bell, mathematician and historian, claimed in 1950 that Egyptian mathematics would not have supported the ability to calculate the slant height of the pyramids, or the ratio to the height, except in the case of the 3:4:5 pyramid, since the 3:4:5 triangle was the only right triangle known to the Egyptians and they did not know the Pythagorean theorem nor any way to reason about irrationals.

• Golden ratio appearances in art and architecture: The Parthenon's façade, as well as elements of its façade and elsewhere, are said by some to be circumscribed by golden rectangles. Leonardo da Vinci's illustrations of polyhedra in De divina proportione (On the Divine Proportion) and his views that some bodily proportions exhibit the golden ratio have led some scholars to speculate that he incorporated the golden ratio in his paintings. Salvador Dalí explicitly used the golden ratio in his masterpiece, The Sacrament of the Last Supper. The dimensions of the canvas are a golden rectangle. A huge dodecahedron, in perspective so that edges appear in golden ratio to one another, is suspended above and behind Jesus and dominates the composition.

• Some sources claim that the golden ratio is commonly used in everyday design, for example in the shapes of postcards, playing cards, posters, wide-screen televisions, photographs, light switch plates and cars.
• The golden ratio is expressed in the arrangement of parts such as leaves and branches along the stems of plants and of veins in leaves, and in the skeletons of animals and the branchings of their veins and nerves.

• Since 1991, several researchers have proposed connections between the golden ratio and the human genome. In 2010, the journal Science reported that the golden ratio is present at the atomic scale in the magnetic resonance of spins in cobalt niobate crystals.

4.3 Analysis of the Euclidean Algorithm

Lamé's Theorem: The Euclidean algorithm that is used to find gcd(a, b) takes a number of steps proportional to log_γ(min{a, b}), where γ is the golden ratio.

Proof: What is the worst possible sequence for the Euclidean algorithm? We claim that the quotient should be as small as possible at every step, namely one. Consider the worst possible sequence in reverse:

x = g(1) + 0, x + g = x(1) + g, 2x + g = (x + g)(1) + x, 3x + 2g = (2x + g)(1) + (x + g), etc.

Notice that the bottom line means that x = g so the sequence becomes

g = g, 2g = g(1) + g, 3g = 2g(1) + g, 5g = 3g(1) + 2g, etc.

Indexing in reverse, we claim that the nth step has left-hand side equal to f(n + 2)g. We use mathematical induction. The first two cases are trivial. To see why the induction holds, notice that the (n + 1)th step's left-hand side is equal to the sum of the two left-hand sides beneath it (because the quotient is always 1). Obviously, this is true independent of g. So the worst case for the Euclidean algorithm is to have the two numbers a and b be multiples of two successive Fibonacci numbers. So if we let a = f(n)g and b = f(n - 1)g, the total number of steps would be ~ n. To make the number of steps as large as possible relative to the size of the inputs, let g = 1. So we get that the number of steps to perform the Euclidean algorithm on two numbers a and b is at most the index of the Fibonacci number closest to min{a, b}. But because f(n) ~ γ^n, that index is ~ log_γ(min{a, b}), and we are done. ∎
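The worst case is easy to observe empirically (a sketch, not part of the notes): on consecutive Fibonacci numbers every quotient is 1, and the step count grows with the index.

```python
def euclid_steps(a, b):
    """Count the division steps the Euclidean algorithm takes on (a, b)."""
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Consecutive Fibonacci inputs force quotient 1 at every step:
print(euclid_steps(fib(20), fib(19)))   # 18 steps on (6765, 4181)
# Much larger inputs that are NOT Fibonacci-like finish far faster:
print(euclid_steps(10**6, 10**6 - 1))   # 2 steps
```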

5 Integer algorithms: Fermat’s Little Theorem, Chinese Remainder Theorem

Fermat's Little Theorem: For any prime number p, a^(p-1) ≡ 1 (mod p) if gcd(a, p) = 1.

Proof: Consider the set of numbers {1, 2, ..., p - 1} in the world of (mod p). If we multiply each number in this set by any number a such that gcd(a, p) = 1, then a{1, 2, ..., p - 1} = {1, 2, ..., p - 1} (i.e., the numbers in the set are simply permuted) because x1 ≠ x2 ⇒ ax1 ≠ ax2. Now multiply all the numbers in each set together on both sides of the equation. Because (p - 1)! is invertible in the world of (mod p), a^(p-1) · (p - 1)! ≡ (p - 1)! ⇒ a^(p-1) ≡ 1. ∎
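Both the theorem and the permutation fact used in its proof can be spot-checked (a sketch, not part of the notes):

```python
p = 101   # a prime; every a in [1, p-1] is coprime to p

# Fermat: a^(p-1) ≡ 1 (mod p) for every nonzero a.
assert all(pow(a, p - 1, p) == 1 for a in range(1, p))

# The proof's key step: multiplying {1,...,p-1} by a merely permutes it mod p.
a = 17
assert sorted((a * x) % p for x in range(1, p)) == list(range(1, p))
```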

Chinese Remainder Theorem: Let m1, m2, ..., mn be pairwise relatively prime positive integers. Then the system x ≡ a1 (mod m1), x ≡ a2 (mod m2), ..., x ≡ an (mod mn) has a unique solution modulo M = m1·m2···mn.[1] We are going to solve this problem using the following procedure.

1. Write down the relevant congruences.

x ≡ 2(mod 3)

x ≡ 3(mod 5)

x ≡ 2(mod 7)

2. Identify all the variables and solve for M and M1, M2, ..., Mn. (Mi = M/mi)

M = 3 · 5 · 7 = 105

M1 = 35,M2 = 21,M3 = 15

3. Find the inverses of Mi(mod mi) and call them yi. (We know how to do this using the Euclidean Algorithm if necessary.)

35y1 ≡ 1(mod 3) ⇒ y1 = 2

21y2 ≡ 1(mod 5) ⇒ y2 = 1

15y3 ≡ 1(mod 7) ⇒ y3 = 1

4. Write down the answer. The answer to the problem is

x ≡ a1M1y1 + a2M2y2 + ... + anMnyn(mod M)

So, for our example, we get

x ≡ 2 · 35 · 2 + 3 · 21 · 1 + 2 · 15 · 1(mod 105) ≡ 23(mod 105)

We can easily verify that this answer is correct, not just in this example, but in general.

So we know that we can construct the answer to a series of linear congruences using the above procedure, but how do we know that the answer we get is unique modulo M? Let's assume that there are two solutions x1 and x2 such that 0 ≤ x1, x2 < M. Then it is possible to show that x1 - x2 must be divisible by every single one of the mi, and hence by their product M. Because |x1 - x2| < M, the only possibility is x1 - x2 = 0.
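The four-step procedure translates directly into code; a sketch (not from the notes) that reproduces the worked example:

```python
from math import prod

def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def crt(residues, moduli):
    """Solve x ≡ a_i (mod m_i) for pairwise coprime m_i; returns x mod M."""
    M = prod(moduli)
    x = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        _, y_i, _ = extended_gcd(M_i, m_i)   # y_i is the inverse of M_i mod m_i
        x += a_i * M_i * y_i                 # the a_i * M_i * y_i sum from step 4
    return x % M

print(crt([2, 3, 2], [3, 5, 7]))   # 23, matching Sun-Tsu's puzzle
```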

[1] The reason that this is called the Chinese Remainder Theorem is that in the first century A.D., the Chinese mathematician and general Sun-Tsu asked the question: "There are a certain number of things that need to be divided among several people. If we try to divide them evenly between 3 people, then there are 2 things left over. If we try to divide them evenly between 5 people, then there are 3 things left over. If we try to divide them evenly between 7 people, there are 2 left over. How many things are there?"

6 Application of CRT: Asmuth-Bloom Secret Sharing

Secret sharing: Assume that we have N people and we want any subset of k (where 2 ≤ k ≤ N) of them to be able to launch a nuclear attack. However, no subset of k - 1 people should be able to launch an attack. What should we do?

Asmuth-Bloom (1983): Choose a set of pairwise coprime integers r < m1 < m2 < ... < mN such that

r · mN−k+2 . . . mN < m1 . . . mk

(This restriction implies that r times the product of any k - 1 of the mi is still smaller than the product of any k of them.)

Choose the secret launch code S to be uniformly random within the world of (mod r). Now choose a random integer α such that 0 ≤ α < ⌊(m1···mk)/r⌋ - 1.

Claim: 0 ≤ S + αr, and S + αr is strictly less than the product of any k of the mi's.

Proof:

0 ≤ α < ⌊(m1···mk)/r⌋ - 1 ⇒ 0 ≤ αr < r⌊(m1···mk)/r⌋ - r

From this, together with 0 ≤ S < r, we get

0 ≤ S + αr < r⌊(m1···mk)/r⌋ ≤ m1···mk ∎

Each person 1 ≤ i ≤ N is privately given the information

Si = (S + αr) (mod mi), together with mi

The values of r and k are publicly shared.
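Share generation is one line of modular arithmetic per person; a sketch (not from the notes) using the numbers from the example in section 6.3:

```python
# Parameters from the section 6.3 example: r = 89, five pairwise coprime
# moduli, threshold k = 3, secret S = 11, random alpha = 2013.
r = 89
m = [101, 103, 107, 109, 113]
k = 3
S, alpha = 11, 2013

# Sanity check the Asmuth-Bloom inequality:
# r times the largest k-1 moduli < product of the smallest k moduli.
assert r * m[-1] * m[-2] < m[0] * m[1] * m[2]

# Person i privately receives (S + alpha*r mod m_i, m_i).
shares = [((S + alpha * r) % mi, mi) for mi in m]
print(shares)   # [(95, 101), (51, 103), (50, 107), (81, 109), (63, 113)]
```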

6.1 What happens if k people want to know the secret?

If k people get together and share their secrets, then we get k congruences of the form x ≡ sj (mod mj).

There must be a unique solution modulo the product of the mj. So we know a number S + αr + c·∏mj where c is some constant. By the Claim, we know that S + αr is strictly less than the product of the mj (because there are k of them). So we take S + αr + c·∏mj in the world of (mod ∏mj) to yield the exact value of S + αr; it is the value where c = 0 because 0 ≤ S + αr < ∏mj. Now we take the value of S + αr in the world of (mod r) to recover the secret, and S will be the unique value between 0 and r - 1.

6.2 What happens if there aren't enough people?

If k - 1 people get together to share their secrets, how far have we narrowed down the secret? Assume that we are given k - 1 congruences of the form x ≡ sj (mod mj). The result will be unique modulo the product of the mj's. Thus, we get a number S + αr + cM'. The question now is "What are the possible values of the secret S?" The only information that we have about S + αr is that it must be strictly less than the product of any k of the mi's.

Claim: S can theoretically be any value in [0, r - 1], and because it was chosen uniformly at random in the first place, no information about the secret is released.

Proof: If the result of the Chinese Remainder Theorem calculation is S' and the set of indices for the mj is M with |M| = k - 1, let M' = ∏_{j ∈ M} mj. Then we have the information that

S + αr ≡ S' (mod M') where S' ∈ [0, M' - 1]

Note that α can be any value up to ⌊(m1···mk)/r⌋ - 1. This means that the choice of α could be as high as ⌊(r·mN-k+2···mN)/r⌋ - 1 = mN-k+2···mN - 1, and therefore as high as M' - 1. Let us assume that we wanted the value of the secret S to be x. Is there some way to set the value of α to make S = x? That is equivalent to asking whether there exists a solution for α to the equation

x + αr ≡ S' (mod M') ⇒ α ≡ r^{-1}(S' - x) (mod M')

(Note that we use the fact that r is coprime to the rest of the mi's to guarantee that the inverse exists.) Because α is allowed to be any value between 0 and M' - 1, there exists such an α. ∎

6.3 Example

Let k = 3 and N = 5. Choose r = 89 and (m1, m2, m3, m4, m5) = (101, 103, 107, 109, 113). You can verify that the appropriate inequalities are satisfied. We randomly choose our secret to be S = 11 and let α = 2013. The secret shares are then (11 + 2013·89) (mod 101) = 95 ⇒ (95, 101) for #1, (11 + 2013·89) (mod 103) = 51 ⇒ (51, 103) for #2, and (50, 107), (81, 109), (63, 113) respectively for #3, 4, 5.

What if #2, 3, 5 want to launch a nuclear strike? They share their private information to get the following problem:

x ≡ 51 (mod 103), x ≡ 50 (mod 107), x ≡ 63 (mod 113)

We solve this via the CRT. M = (103)(107)(113) = 1,245,373. M1 = (107)(113) = 12091, M2 = 11639, M3 = 11021 ⇒ y1 = 85, y2 = 49, y3 = 81. So the final answer is 179168 (mod 1245373).

Now we take this result modulo r = 89 to get the secret S = 11.

So what happens if we only have #2, 3 wanting to launch the strike? We get the following problem:

x ≡ 51 (mod 103), x ≡ 50 (mod 107)

We solve this via the CRT. M = (103)(107) = 11021. M1 = 107, M2 = 103 ⇒ y1 = 26, y2 = 80. So the final answer is S + αr ≡ 2832 (mod 11021). So we know that 2832 + 11021c = S + αr where c is some unknown constant. Unfortunately, we don't know what c should be. All we know is that S + αr < m1m2m3 ⇒ 2832 + 11021c < 1113121 ⇒ 0 ≤ c < 101.

It turns out that there is a solution for α and S for every c within the appropriate range! If c = 0, then S = 73, α = 31. If c = 1, then S = 58, α = 155. If c = 2, then S = 43, α = 279. And so on. If c = 16, then S = 11, α = 2013, but we have no idea that c = 16 is the correct value of c. We have no idea what the secret is!

There is one minor warning with this method of secret sharing: it is not perfect. In this case, we have 101 possibilities for the value of c, there is a unique solution for α in each of these cases, and we chose α uniformly at random, so no information can be gleaned about α. However, we chose S so that 0 ≤ S < 89. There are only 89 possibilities for S, which means that some of those values must appear more than once in our c computations. Because we chose α uniformly, those values of S that appear twice are twice as likely to be the secret as those that appear only once.

7 Cryptographic Signatures: ElGamal

ElGamal digital signature protocol: The basic idea behind this is as follows. Alice sends a document to Bob, and Bob wants to make sure that Alice is indeed the one that drafted and agreed to the document and that it has not been altered en route to him. Both have access to a shared public hash function (e.g., a 64-bit XOR of the file). Along with the document, Alice sends a cryptographic message. Bob unlocks the cryptographic message with Alice's public key (revealing the hash value that the document should have) and then hashes the document. If the hashes match, then he can say with extremely high probability that the document has been signed and authorized with Alice's private key. Perform the following steps:

1. Alice selects a large prime p and then randomly selects values for 2 ≤ g < p and 2 ≤ X < p. (p, g) should be made publicly available.

2. Alice calculates Y = g^X (mod p). She posts Y publicly too.

3. Alice chooses a one-time random number K such that 0 < K < p - 1 and gcd(K, p - 1) = 1. K will only ever be used once. After a single use, it needs to be thrown away privately.

4. Let M be the value of the hash of the document. The digital signature will have two parts: s1 ≡ g^K (mod p) and s2 ≡ K^{-1} · (M - X · s1) (mod p - 1). How do we know that K^{-1} exists? (Because gcd(K, p - 1) = 1.) The signature for the document will be (s1, s2) and should be sent to Bob along with the document.

5. Bob checks the following:

Y^{s1} · s1^{s2} (mod p) = g^M (mod p)

If that equation checks out, then Bob concludes that the document has been verified. Why is that equivalence correct? In the world of (mod p),

Y^{s1} · s1^{s2} = (g^X)^{s1} · (g^K)^{K^{-1}(M - X·s1)} = g^{X·s1} · g^{M - X·s1} = g^M

Something cool to note: Because K is used only once, thrown away, and specific to each signature, the ElGamal algorithm allows you to create truly time-unique signatures. If Alice signs a document today and sends it off to Bob, then Bob can consider it truly signed by Alice. Even if a thief stole Alice's laptop and got hold of her private key, he would not be able to recreate the signature without K. Another way of looking at it: if you signed a document with a pen, even if a thief stole your pen, he would not be able to recreate your signature; this digital signature scheme follows that idea.

Historical note: Taher ElGamal played a central role in the development of the SSL (Secure Socket Layer) protocol in his capacity as Chief Scientist of Netscape Communications in the late 1990s. SSL [and its later cousin TLS (Transport Layer Security)] form the security backbone for a large number of protocols.

7.1 Example

Let Alice select p = 11, g = 2, X = 5. Then Y = 10. Alice chooses K = 9. Assume that the hash of the document winds up being M = 4. Then we calculate (s1, s2) = (6, 6). Bob receives M = 4 and (s1, s2) = (6, 6) along with the values of p, g, Y. He calculates that the equation is true and verifies the digital signature.
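The example can be run end to end; a sketch (not from the notes — `pow(K, -1, p - 1)` for the modular inverse requires Python 3.8+):

```python
# ElGamal signing/verification with the toy numbers from the example:
# p = 11, g = 2, private key X = 5, one-time K = 9, document hash M = 4.
p, g, X, K, M = 11, 2, 5, 9, 4

Y = pow(g, X, p)                        # Alice's public key: 10
K_inv = pow(K, -1, p - 1)               # exists since gcd(K, p - 1) = 1
s1 = pow(g, K, p)
s2 = (K_inv * (M - X * s1)) % (p - 1)
print((s1, s2))                         # (6, 6)

# Bob's verification: Y^s1 * s1^s2 ≡ g^M (mod p)
assert (pow(Y, s1, p) * pow(s1, s2, p)) % p == pow(g, M, p)
```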

7.2 Reusing K

Assume that someone is stupid and reuses his value of K. Then from the two messages, an adversary Eve will know

s2 = K^{-1} · (M - X · s1) (mod p - 1) and s2' = K^{-1} · (M' - X · s1) (mod p - 1)

Notice that the value of s1 will not change between M and M' if the value of K does not change. Eve computes

s2 - s2' ≡ K^{-1}(M - M') (mod p - 1)

From this, Eve can recover K. Once she has K, she can solve for X using s2 ≡ K^{-1} · (M - X · s1) (mod p - 1).
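A sketch of the attack (the parameters below — p = 23, g = 5, X = 9, K = 7 and the two hash values — are made up here for illustration, not from the notes; as written, the recovery needs s2 - s2' and s1 to be invertible mod p - 1, otherwise Eve gets a small set of candidates instead of a unique answer):

```python
p, g, X, K = 23, 5, 9, 7       # K is (wrongly) reused for both signatures
n = p - 1

def sign(M):
    s1 = pow(g, K, p)
    s2 = (pow(K, -1, n) * (M - X * s1)) % n
    return s1, s2

M1, M2 = 13, 6                 # two different document hashes
s1, s2 = sign(M1)
_,  s2p = sign(M2)             # same s1, because K did not change

# Eve: s2 - s2' ≡ K^{-1}(M1 - M2) (mod p-1), so she solves for K, then X.
K_rec = ((M1 - M2) * pow(s2 - s2p, -1, n)) % n
X_rec = ((M1 - K_rec * s2) * pow(s1, -1, n)) % n
print(K_rec, X_rec)            # recovers K = 7 and the private key X = 9
```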

8 Diffie-Hellman Key Exchange

Most computer applications need to have some method for guaranteeing secu- rity built into them. E-mail (in particular, any Internet transaction) requires encryption so that some degree of privacy can be counted on (and so that mes- sages, etc. can’t be faked by a third party). This is especially true of military applications, many of which the government doesn’t even make public. The most widely used crypto system in the world today is the public-key crypto system. This section will outline how such a system works. Say that Alice and Bob wish to send messages to each other back and forth across a potentially unsecured line of communication. (Alice and Bob are in different countries and they are communicating on the Internet, for example. Most Internet backbone servers are owned/operated by the U.S. government and are regularly tapped by the NSA for anti-terrorism purposes. Embassy phone lines are regularly tapped by host countries, etc.) They can communicate together in secrecy as follows: Alice creates a private key and a public key. The private key is used for decoding, and the public key is used for encoding. She sends the public key over the line to Bob. Bob uses Alice’s public key to encode his message. He then sends the encrypted message over the line to Alice. Upon receiving the transmission, Alice uses her private key to decode Bob’s message. It is presumed that one needs the private key in order to decode the message, and thus anyone listening in on the line will not be able to decipher the random noise that is Bob’s encrypted message.

1. Alice and Bob choose a large prime p and a random g such that 2 ≤ g < p and g generates a large-order cyclic subgroup of Z_p^*. Both are public.

2. Alice secretly chooses a random a ∈ [1, p - 1], computes A = g^a (mod p), and sends that value to Bob.

3. Bob secretly chooses a random b ∈ [1, p - 1], computes B = g^b (mod p), and sends that value to Alice.

4. Alice computes the secret key: S = B^a (mod p). Bob computes the secret key: S = A^b (mod p).

Notice that both Alice and Bob have the same secret:

B^a (mod p) = (g^b)^a (mod p) = (g^a)^b (mod p) = A^b (mod p)

8.1 Example

Let p = 23, g = 5. Alice chooses a = 4 ⇒ A = 4. Bob chooses b = 3 ⇒ B = 10. Alice and Bob both compute S = 18.
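The full exchange with these numbers (a sketch, not part of the notes):

```python
# Diffie-Hellman with the example parameters: p = 23, g = 5.
p, g = 23, 5
a, b = 4, 3                 # Alice's and Bob's secret exponents

A = pow(g, a, p)            # Alice sends A = 4 over the open line
B = pow(g, b, p)            # Bob sends B = 10 over the open line

S_alice = pow(B, a, p)      # Alice: (g^b)^a mod p
S_bob   = pow(A, b, p)      # Bob:   (g^a)^b mod p
print(A, B, S_alice)        # 4 10 18
assert S_alice == S_bob == 18
```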

8.2 Extending to multiple secret sharing parties Assume that there are 2n people, all of whom want to share a mutual secret that nobody else should know. We can think of these people as being arranged in a tournament structure with each player at the leaf of a tree with height n.

1. All players agree on a large prime p and a g as before such that 2 ≤ g < p. These parameters are made public.

2. All players choose a random xi ∈ [1, p - 1] and compute Xi = g^{xi} (mod p).

3. All games at the leaves contain two players. Let their secrets be xi and xi+1. Via the two-player DH algorithm, they can agree on g^{xi·xi+1} (mod p). Let them do so.

4. Let the two players at the bottom merge into a single player. Their group secret key is the value computed above. All communication to the outside world is heard and sent by both players though they do not exchange any more information strictly between themselves. The tree therefore reduces in height by 1. Repeat the step above until the tree is reduced to a single node.

5. At the point that the tree is reduced to a single node, all players know the secret g^{∏i xi} (mod p).

8.3 How to break Diffie-Hellman

Consider the following problem: Given a large prime p, a number 2 ≤ g < p, and the result of an exponentiation g^a (mod p), determine the value of a. This is called the discrete logarithm problem. Humanity does not yet know how to solve this problem efficiently (unless quantum computation is used).

9 Integer algorithms: RSA

Arguably the most important use of modular arithmetic is in the field of cryptography. Given two numbers, p and q, it is a simple matter to multiply them and get the answer n = pq. However, it is a much more difficult matter to factor an arbitrary number n into its prime factors. As of today, no fast algorithm using classical computers has been discovered that factors an arbitrary number n, and 95% of the world's encryption (as of 1999) was based on the assumption that no fast algorithm exists.[2]

Approximately 95% of the crypto algorithms built into commercial software (as of roughly 1999) are based upon some variant of the 512-bit RSA cryptography protocol (according to Shamir). This protocol is considered secure because of the difficulty of factoring large integers efficiently. Factoring large numbers is considered an intractable problem, though nobody has presented a conclusive proof. However, in 1999, a 512-bit key was cracked in reasonable time (6-7 months). The fastest known theoretical algorithm for cracking RSA keys is the general number field sieve. Nowadays keys have to be much larger than 512 bits if you are using RSA to guarantee any security at all...

The steps of the RSA protocol are outlined below. Example numbers: Let p = 11, q = 17, N = 187, (p - 1)(q - 1) = 160, e = 23, d = 7, M = 2, M' = 162.

1. Alice chooses two large primes p and q. These are typically hundreds of binary digits long.

2. She then computes N = pq and selects a small integer e that has no factors in common with the number (p - 1)(q - 1).

3. Using Euclid's Algorithm, Alice then computes the number d, the multiplicative inverse of e (mod (p - 1)(q - 1)).

4. Alice then sends Bob the pair of numbers (e, N) and keeps the number d private.

5. Bob splits his message into bit blocks of size L = ⌊log2 N⌋. For each block M ≠ 0 (he and Alice can agree never to send a block of all 0's), he then encodes the message as follows:

M' ≡ M^e (mod N)

He then sends the blocks of encrypted text over the line to Alice.

6. For each bit block, Alice takes the encrypted message M' and deciphers it as follows: M ≡ (M')^d (mod N)

[2] There are several interesting things to notice about the things I've said here. First, no fast algorithm using classical computers has been discovered. That doesn't mean that one doesn't exist and we simply haven't found it yet. Even though researchers haven't been able to discover a fast algorithm for factoring, it doesn't mean that one doesn't necessarily exist! Thus far, there is no proof that such an algorithm doesn't exist... Another interesting point is that I've specifically made use of the term classical computers. If we allow ourselves to use computers that are able to use quantum physics in addition to standard electromagnetism (classical computation that you use at home), it is possible to factor large numbers in a small amount of time. Don't get worried that your bank accounts are unsafe: such a computer has never been built!
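The example numbers above make a complete round trip; a sketch (not from the notes — `pow(e, -1, phi)` for the modular inverse requires Python 3.8+):

```python
# RSA round trip with the notes' toy numbers: p = 11, q = 17, e = 23.
p, q, e = 11, 17, 23
N = p * q                    # 187
phi = (p - 1) * (q - 1)      # 160
d = pow(e, -1, phi)          # 7, since 23 * 7 = 161 ≡ 1 (mod 160)

M = 2                        # a message block
C = pow(M, e, N)             # encryption: M' = M^e mod N = 162
print(d, C)                  # 7 162
assert pow(C, d, N) == M     # decryption recovers the original block
```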

Are we sure that the RSA decoding actually transforms M' back to the message M? In other words, we need to prove that

(M')^d (mod N) ≡ (M^e)^d (mod N) ≡ M (mod N)

Note that ed ≡ 1 (mod (p - 1)(q - 1)) ⇒ ed = 1 + k(p - 1)(q - 1) for some number k. Then

M^{ed} (mod N) ≡ M^{1 + k(p-1)(q-1)} (mod N)

So if we show that M^{(p-1)(q-1)} ≡ 1 (mod N), we'll be done. We will do this in steps.

Claim: If p ≠ q are both primes, then for any nonzero M with no factors in common with the number pq, M^{(p-1)(q-1)} ≡ 1 (mod pq).

Proof: Well, for any nonzero M with no factor in common with pq, by Fermat's Theorem,

M^{p-1} ≡ 1 (mod p) ⇒ M^{(p-1)(q-1)} ≡ 1 (mod p)

and

M^{q-1} ≡ 1 (mod q) ⇒ M^{(q-1)(p-1)} ≡ 1 (mod q)

So the number M^{(p-1)(q-1)} = 1 + pk for some k, and M^{(p-1)(q-1)} = 1 + ql for some l. We get 1 + pk = 1 + ql ⇒ pk = ql. But because q is prime and q does not divide p, q must evenly divide into k. Thus, for some m, k = qm ⇒ M^{(p-1)(q-1)} = 1 + pqm ⇒ M^{(p-1)(q-1)} ≡ 1 (mod pq). ∎

It is not immediately obvious that this system is secure. In fact, this issue has not yet been put to rest. It has not been proven that you must factor the number N in order to break the system, but after decades of research, nobody has yet been able to prove otherwise.

How can we break the RSA cryptography system if we can factor numbers fast? Well, if we can factor the original number N, then we obviously can get the numbers p and q fast. It is then easy to compute the number (p - 1)(q - 1). From this point, we can compute the inverse to the number e via Euclid's Algorithm, and this will produce the private key d with barely any work. The whole system depends on the factoring problem being difficult to solve!

As of 2010, the largest factored RSA number was 768 bits long (232 decimal digits, see RSA-768). Its factorization, by a state-of-the-art distributed implementation, took around fifteen hundred CPU years (two years of real time, on many hundreds of computers). No larger RSA key is known publicly to have been factored.

In practice, RSA keys are typically 1024 to 4096 bits long. Some experts believe that 1024-bit keys may become breakable in the near future or may already be breakable by a sufficiently well-funded attacker, though this is disputed. Few people see any way that 4096-bit keys could be broken in the foreseeable future (unless quantum computers are built).

10 Miller-Rabin primality testing: algorithm

As we have seen, it is a very important problem to be able to find large primes. It turns out that the number of primes less than a large number n is asymptotically equal to n/ln(n). (This is called the Prime Number Theorem.) So if you need to find a prime of order e^1000, it will take around 1000 random tries before you get one. If you think about it, that's not too many... assuming that you can differentiate between a prime and a composite number extremely fast...

The Miller-Rabin algorithm, for probabilistically determining whether a large number n is prime or composite, is:

• Factor n - 1 into the form n - 1 = 2^s · d.

• Choose a number k ≥ 1. If the output of the program is prime, then the result will be true with probability at least 1 - 2^{-k}. If the output of the program is composite, then the result will be true with certainty.

• Do the following steps k times:

1. Pick a randomly in the range [1, n - 1].

2. If a^d ≠ 1 (mod n) and a^{2^r d} ≠ -1 (mod n) for all r in the range [0, s - 1], then return composite.

• Return prime (correct with probability at least 1 - 2^{-k}).

Now how and why does this work?

Example: Pretend we do not know that 101 is prime. Let k = 3. 100 = 2^2 · 25. Choose the number a = 16 randomly. Then, in the world of mod 101,

16^25 ≡ 1

So choose a = 11. 11^25 ≡ 10 ≠ 1, 10^2 ≡ 100 ≡ -1. So choose a = 39. 39^25 ≡ 91 ≠ 1, 91^2 ≡ 100 ≡ -1, so we return that 101 is prime.

Pretend we do not know that 217 is composite. Let k = 3. 216 = 2^3 · 27. Choose the number a = 13 randomly. Then, in the world of mod 217,

13^27 ≡ 209 ≠ 1, 209^2 ≡ 64 ≠ -1, 64^2 ≡ 190 ≠ -1

in the world of mod 217, so 217 is composite.

Lemma 1: Call 1 and -1 the trivial square roots of 1. If n is prime, then the only square roots of 1 are the trivial square roots.

Proof: Assume that x^2 ≡ 1 (mod n) and that n is prime. Then by the equation above, (x - 1)(x + 1) ≡ 0 (mod n). Because n is prime, n must evenly divide either x + 1 or x - 1. But then that means that either x ≡ 1 or x ≡ -1 (mod n), and these are trivial square roots. ∎

Lemma 2: If n is an odd prime and n - 1 = 2^s · d, then one of the following must be true for any number a ≠ 0:

1. a^d ≡ 1 (mod n)

2. There is some number r ∈ [0, s - 1] so that a^{2^r d} ≡ -1 (mod n).

Proof: Pick any number a ≠ 0. If n is prime, then Fermat's Theorem tells us that a^{n-1} ≡ 1 (mod n). By Lemma 1, the only valid "square roots" of a^{n-1} = a^{2^s d} are 1 and -1. Obviously a^{2^{s-1} d} is a square root of a^{n-1} = a^{2^s d}, so it must be either 1 or -1. If it is -1, then we are done. Otherwise, continue taking square roots in the same way... Eventually, either some a^{2^r d} ≡ -1, or you reach a^d ≡ 1. ∎

Theorem: The Miller-Rabin primality test works.

Proof: If n is prime, then by Lemma 2, either a^d ≡ 1 (mod n) or there is some number r ∈ [0, s - 1] so that a^{2^r d} ≡ -1 (mod n). If both of these statements are false, then n must be composite. Therefore, if we return composite, then our answer must be correct with 100% probability. On the other hand, what happens if we return prime? In order for the answer to be incorrect, n must be composite and we must have repeatedly found at least one of the two statements to be true. We will show that if n is composite, then the probability that both of these statements are false is at least 1/2.
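The algorithm above fits in a dozen lines; a sketch (not from the notes — it follows the notes' convention of drawing a from [1, n - 1] and assumes odd n > 3):

```python
import random

def miller_rabin(n, k=20):
    """Return 'composite' (always correct) or 'prime' (w.h.p.) for odd n > 3."""
    s, d = 0, n - 1
    while d % 2 == 0:          # factor n - 1 = 2^s * d with d odd
        s += 1
        d //= 2
    for _ in range(k):
        a = random.randrange(1, n)
        if pow(a, d, n) == 1:
            continue           # statement 1 holds for this a
        if any(pow(a, 2**r * d, n) == n - 1 for r in range(s)):
            continue           # statement 2 holds (n - 1 is -1 mod n)
        return "composite"     # a is a witness: n is certainly composite
    return "prime"             # probably prime; error probability <= 2^-k

print(miller_rabin(101), miller_rabin(217))
```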

11 Miller-Rabin primality testing: proof

Definition: A Carmichael number is a composite number n such that a^(n−1) ≡ 1 (mod n) for every a ∈ {1, . . . , n − 1} with gcd(a, n) = 1. The smallest Carmichael number is 561. All Carmichael numbers are required to be odd and square-free.

1. a^d ≡ 1 (mod n)

2. There is some number r ∈ [0, s − 1] so that a^(2^r · d) ≡ −1 (mod n).

Claim: Our goal is to show that if n is composite, the probability that both of the above statements are false is at least 1/2. (Because the Miller-Rabin primality test will yield composite in that case.)
Proof: Assume that n is composite. In both cases below, the goal will be to construct a subgroup B such that the probability of picking a random element of Z_n^* that is not in B is high, and such that if a is selected to be outside of B, both of the above statements are false.

1. Assume that n is not a Carmichael number. Let B ≡ {x ∈ Z_n^* : x^(n−1) ≡ 1 (mod n)}. Note that if a ∉ B, then
(a) a ∉ B ⇒ a^(n−1) ≢ 1 (mod n) ⇒ a^d ≢ 1 (mod n).
(b) If there were some number r ∈ [0, s − 1] so that a^(2^r · d) ≡ −1 (mod n), then squaring it enough times would yield a^(n−1) ≡ 1 (mod n) ⇒ a ∈ B, a contradiction. Thus, there is no r ∈ [0, s − 1] with a^(2^r · d) ≡ −1 (mod n).

So if we wind up choosing a ∉ B for our tests, then both statements are false. What is the probability that we choose a ∉ B?
Because n is not a Carmichael number, B is a proper subgroup of Z_n^*, and therefore |B| ≤ |Z_n^*|/2, because the order of a subgroup must divide the order of the group (Lagrange's Theorem).
But because we are choosing a random a ∈ [1, n − 1], the probability that we choose a ∉ B is ≥ 1/2.

2. Assume that n is a Carmichael number.
Note that −1 ∈ Z_n^* ⇒ −1 ∈ {x^d : x ∈ Z_n^*}. Let t be the largest element of [0, s − 1] such that −1 ∈ {x^(2^t · d) : x ∈ Z_n^*}. Such a t must exist because −1 ∈ {x^(2^0 · d) = x^d : x ∈ Z_n^*}.
Let B = {x ∈ Z_n^* : x^(2^t · d) ≡ ±1 (mod n)}. Note that B is again a subgroup.
To show that B ≠ Z_n^*, note that ∃v ∈ Z_n^* with v^(2^t · d) ≡ −1 (mod n) by construction (recall that −1 ∈ {x^(2^t · d) : x ∈ Z_n^*}). Because n is composite, we can write n = n1 · n2 where gcd(n1, n2) = 1. Then the following two congruences have a solution via the Chinese Remainder Theorem: w ≡ v (mod n1) and w ≡ 1 (mod n2). Raising both sides of both congruences to the power 2^t · d, we get that w^(2^t · d) ≡ −1 (mod n1) and w^(2^t · d) ≡ 1 (mod n2).
Is it possible for w^(2^t · d) ≡ 1 (mod n)? No; if so, then w^(2^t · d) ≡ 1 (mod n1). Is it possible for w^(2^t · d) ≡ −1 (mod n)? No; if so, then w^(2^t · d) ≡ −1 (mod n2). Thus, we have that w^(2^t · d) ≢ ±1 (mod n) ⇒ w ∉ B ⇒ B ≠ Z_n^*. By the same reasoning as in the previous case, we now know that |B| ≤ |Z_n^*|/2.
So what happens if we select a ∉ B in the test?
Is it possible for a^d ≡ 1 (mod n)? Well, a^d ≡ 1 (mod n) ⇒ a^(2^t · d) ≡ 1 (mod n) ⇒ a ∈ B, a contradiction.
Is it possible for a^(2^r · d) ≡ −1 (mod n) for any r ∈ [0, s − 1]? Note that a ∉ B ⇒ a^(2^t · d) ≢ ±1 (mod n). By the definition of t as the largest number with its property, if a^(2^r · d) ≡ −1 (mod n), then we must have that r ≤ t. But if a^(2^r · d) ≡ −1 (mod n) and r ≤ t, then a^(2^t · d) ≡ ±1 (mod n) ⇒ a ∈ B, a contradiction.

∎

12 Montgomery multiplication

Whenever we use RSA, we need to exponentiate very large numbers to the power of other very large numbers modulo yet other very large numbers. Your basic key can be up to 4k bits in PGP. (The US government won't let it be larger.) Computing a^b (mod p) requires many large-scale multiplications. Montgomery multiplication is the best known way to speed them up in practice.

Assume that we want to compute ab (mod m) where a, b < m. Introduce another number r > m such that gcd(r, m) = 1. Usually, m is a large odd prime, and we will be selecting r to be a power of 2. Generally speaking, if the register size is 64 bits, we let r = 2^64. If m can be represented in (for example) 4096 bits, then we let r = 2^4096. To multiply together two numbers ab (mod m), do the following steps. The overall idea of the method is to convert the two numbers a and b to a′ = ar and b′ = br. A Montgomery multiplication is not a standard multiplication. To multiply in Montgomery world, a multiplied by b is abr^−1 (mod m). If we convert to Montgomery form, we get

(ar)(br)r^−1 ≡ abr (mod m), which means that the Montgomery product of a and b in Montgomery form is equal to ab in Montgomery form.

1. Find two numbers r^−1 and m^−1 with 0 ≤ r^−1 < m and 0 ≤ m^−1 < r such that

r · r^−1 − m · m^−1 = 1

We can do this because gcd(r, m) = 1. The normal method of doing the Euclidean algorithm will yield numbers α and β such that αr + βm = 1. Let r^−1 ≡ α and m^−1 ≡ −β. (The Bézout coefficients are renamed α and β here so as not to clash with the numbers a and b being multiplied.)
2. Convert a and b to Montgomery form.

a′ = ar (mod m), b′ = br (mod m)

In general, we want to keep numbers in Montgomery form until we absolutely must convert them back.

3. Perform the Montgomery multiplication u = a′b′ · r^−1 (mod m) in the following way:
(a) t = a′b′
(b) u = (t + (t · m^−1 (mod r)) · m)/r
(c) if u ≥ m then return u − m else return u

Why is this extremely efficient? Note that the operation (mod r) is merely a restriction to the registers represented by r, an indication to the algorithm that the higher-order bits of the t · m^−1 multiplication can be ignored. Division by r can simply be thought of as a bit shift to the right. Most importantly, there is not a single division by m anywhere.

4. Convert the product a′b′ · r^−1 (mod m) out of Montgomery space by multiplying by r^−1.

Why does this work? The big computation occurs in step 3. Assume that we want to compute the value u = abr (mod m).

u ≡ abr ≡ (ar)(br)r^−1 ≡ a′b′ · r^−1 (mod m)
  ≡ a′b′ · r · r^−1 / r (mod m)
  ≡ a′b′(1 + m^−1 · m)/r (mod m)
  ≡ (a′b′ + a′b′ · m^−1 · m)/r + km (mod m) for any integer k
  ≡ (a′b′ + a′b′ · m^−1 · m + kmr)/r (mod m)
  ≡ (a′b′ + (a′b′ · m^−1 + kr) · m)/r (mod m)
  ≡ (a′b′ + (a′b′ · m^−1 (mod r)) · m)/r (mod m), choosing k conveniently.

This bottom line, with the exception of the final (mod m), mirrors the first two computational steps. How do we get rid of the (mod m)? Notice that a′b′ · m^−1 (mod r) < r, a′ < m, and b′ < m. Thus the inner expression is upper-bounded by

(m² + mr)/r = m(m + r)/r ≤ 2m

Thus the final line in the step determines whether we need to subtract an m to get within [0, m − 1].
Important note: If we are only doing a single multiplication between two numbers, then this is not much of a speed-up, but imagine what would happen if you needed to exponentiate a^c (mod m) using repeated squaring. RSA needs to take numbers to very high powers on a regular basis.

12.1 Example

Assume that we want to make computations (mod 2777105129). (That modulus will fit inside of a 32-bit register.) We will attempt to multiply a = 950537165 and b = 105888915. Without Montgomery multiplication, we would multiply ab to get 100651349069025975 (which requires 2 32-bit registers) and then do a division by the modulus. Notice that long division is going to require many smaller multiplications...
We assign r ≡ 2^32. We do the Euclidean algorithm on the equation 2^32 · α + 2777105129 · β = 1 to get α = 570044373 and β = −881609383 ⇒ r^−1 = 570044373 and m^−1 = 881609383. Now we convert a and b to Montgomery form: a′ = ar (mod m) = 2748955174 and b′ = br (mod m) = 1931242319. t = a′b′ = 5308898565062808506 (which requires 2 32-bit registers). u = 1495596648 (and u < m, so we can stop here). u · r^−1 (mod m) = 277135177.
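The steps above can be sketched as follows. Python's arbitrary-precision integers hide the register-level savings, so this sketch only demonstrates that the reduction is correct; the (mod r) and the division by r show up as a bit mask and a shift, exactly as promised (function names are mine):

```python
def montgomery_multiply(a, b, m, r_bits):
    """Compute a*b (mod m) via Montgomery reduction.
    Requires r = 2**r_bits > m with m odd (so gcd(r, m) = 1)."""
    r = 1 << r_bits
    # Bezout relation r*r_inv - m*m_inv = 1 with 0 <= r_inv < m, 0 <= m_inv < r.
    r_inv = pow(r, -1, m)            # modular inverse of r mod m
    m_inv = (r * r_inv - 1) // m     # exact integer by the relation above

    def redc(t):
        # Montgomery reduction: returns t * r^-1 (mod m).
        # (t * m_inv) & (r - 1) is the cheap "mod r"; >> r_bits is the "/ r".
        u = (t + ((t * m_inv) & (r - 1)) * m) >> r_bits
        return u - m if u >= m else u    # the final conditional subtraction

    a_prime = (a * r) % m            # convert into Montgomery form
    b_prime = (b * r) % m
    u = redc(a_prime * b_prime)      # a*b*r (mod m), i.e. ab in Montgomery form
    return redc(u)                   # convert back out (multiply by r^-1)

m = 2777105129
a, b = 950537165, 105888915
print(montgomery_multiply(a, b, m, 32) == (a * b) % m)   # True
```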

13 Vectors

A vector is an ordered n-tuple of numbers. For example, (1, 0, 3) and (0, 0, 0, 0, 0, 0, 0) are vectors. The entries of a vector can be elements of any set.

13.1 One-time pads

Take a message as a sequence of bits so that any message can be represented as a long vector with entries in Z_2. For a message of length n, we say that the vector is in Z_2^n.
Assume that we are given a message m = (0, 1, 0) ∈ Z_2^3. We want to encrypt it and send it so that someone who intercepts the message cannot read it. Choose a random vector in Z_2^3. Assume that it is r = (0, 1, 1) ∈ Z_2^3. We get the secret message s = m + r = (0, 1, 0) + (0, 1, 1) = (0, 0, 1) ∈ Z_2^3.
To show that the message s is truly secret, consider the first bit. What is Prob(first bit of m was 0 | first bit of s was 0)? The answer is clearly 1/2. So the first bit is completely secure. In fact, the same argument applies to all the bits.
Historical note: The one-time pad is perfectly secure, meaning that, even information-theoretically, there is no way to break it, even with potentially infinite computational resources. Claude Shannon proved this originally in 1945 (classified until 1949). Vladimir Kotelnikov supposedly proved this as well in 1941, but the report remains classified. People don't use one-time pads very much because they require:
• Truly random (as opposed to pseudorandom) one-time pad values, which is a non-trivial requirement.
• Secure generation and exchange of the one-time pad values, which must be at least as long as the message.
• Careful treatment to make sure that it continues to remain secret, and is disposed of correctly, preventing any reuse in whole or part³.
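In Z_2, addition is bitwise XOR, and adding the pad a second time strips it off, since r + r = 0. A small sketch (helper names are mine; a real pad must be truly random and used once):

```python
import secrets

def xor_bits(u, v):
    # Vector addition in Z_2 is componentwise XOR.
    return [a ^ b for a, b in zip(u, v)]

message = [0, 1, 0]
pad = [secrets.randbits(1) for _ in message]   # random one-time pad
ciphertext = xor_bits(message, pad)            # s = m + r
recovered = xor_bits(ciphertext, pad)          # s + r = m + r + r = m
print(recovered == message)                    # True
```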

13.2 Dot product

The dot product is a measure of similarity/difference between two vectors.

a · b ≡ Σ_i a_i b_i

³The Venona project was a counterintelligence program initiated by the United States Army's Signal Intelligence Service (later the National Security Agency) that ran for nearly four decades, spanning 1943 to 1980. The purpose of the Venona project was the decryption of messages transmitted by the intelligence agencies of the Soviet Union, e.g. the NKVD, the KGB (First Chief Directorate), and the GRU (military intelligence). Initially, the American government was worried that the Soviets might sign a peace treaty with Hitler and turn on the Allies. The Soviet company that manufactured the one-time pads produced around 35,000 pages of duplicate key numbers, as a result of pressures brought about by the German advance on Moscow during World War II. The duplication, which undermines the security of a one-time system, was discovered, and attempts to lessen its impact were made by sending the duplicates to widely separated users. The Venona project helped catch Julius and Ethel Rosenberg, the husband and wife responsible for passing American nuclear designs to the Soviet Union. The Venona project remained secret for more than 15 years after it concluded, and some of the decoded Soviet messages were not declassified and published until 1995.

For example, (1, 0, 1) · (1, 1, 0) = 1. Notice that the dot product is linear.
1. c(a · b) = (ca) · b = a · (cb) if c is a constant

2. a · (b1 + b2) = a · b1 + a · b2

13.3 The geometric definition If one considers a and b as separate vectors (anchored at the origin), then one can think of the two vectors a||b and a⊥b as the parallel and perpendicular components of a projected onto b.

Notice that a = a_∥b + a_⊥b and, by the Pythagorean Theorem,

||a||² = ||a_∥b||² + ||a_⊥b||²

Note that ||a_∥b|| = ||a|| cos θ, where θ is the angle between a and b. We make the following definition:

a · b ≡ ||a|| ||b|| cos θ, where θ is the angle between a and b. What happens if a and b are parallel? Perpendicular? What is a · a? Then we clearly have that

(a · b)/||b|| = ||a|| cos θ = ||a_∥b||

Denote by b̂ the unit vector in the direction of b.

a · b̂ = ||a_∥b|| ⇒ (a · b̂)b̂ = ||a_∥b|| b̂ = a_∥b

Does this definition match up with the other one? Fix two vectors a and b. We can represent a = Σ_i a_i e_i and b = Σ_i b_i e_i, where the e_i are the standard orthonormal basis vectors. Because the basis vectors are orthonormal, we get by the geometric definition that e_i · e_j = δ_ij ⇒ a · b = Σ_i a_i b_i, which is the algebraic definition.

13.4 Applications: Voting similarity, audio sample similarity

If you think of a politician's voting record as a sequence of −1's, 0's, and 1's (−1 for no, 1 for yes, and 0 for skipped), then the higher the dot product, the more similar the views of the two politicians. Audio samples are sound waves. The amplitude of the wave is measured every fraction of a second and stored in the audio file. If we consider this file as a long vector of measured values, then we can search for a small snippet of sound within the larger audio sample by taking dot products.
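A quick sketch of the voting application (the records below are made up for illustration):

```python
def dot(a, b):
    # Algebraic dot product: sum of componentwise products.
    return sum(x * y for x, y in zip(a, b))

# Hypothetical voting records: 1 = yes, -1 = no, 0 = skipped.
alice = [1, -1, 1, 0, 1]
bob   = [1, -1, 1, 1, 1]
carol = [-1, 1, -1, 0, -1]

print(dot(alice, bob))    # 4  -> very similar records
print(dot(alice, carol))  # -4 -> nearly opposite records
```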

13.5 Scalar multiplication and vector spaces

Vectors can be multiplied by constants. For example, 2(−1, 0, 2) = (−2, 0, 4). The constant is referred to as a scalar.
Definition: A vector space S is a set of vectors with the following 3 properties.
1. 0 ∈ S
2. ∀x ∈ S, ∀a ∈ R, ax ∈ S

3. ∀x1, x2 ∈ S, x1 + x2 ∈ S

Given two vectors x and y, one can form a linear combination of the two: ax + by where a, b ∈ R. The set of all such combinations is a vector space. (Prove it.) We say that x and y generate it.
Examples: Assume that we allow the implicit constants a, b described above to be in R.
1. Describe the vector space spanned by the vectors (1, 0) and (0, 1).
2. Describe the vector space spanned by the vectors (2, 3) and (−1, 1).
3. Describe the vector space spanned by the vectors (1, 2) and (1.5, 3).
4. Describe the vector space spanned by the vectors (1, 0, 1) and (1, 0, 0).

14 Matrices

A matrix can be thought of as a vector of vectors. For ease of use, we usually don't split up the vectors with commas, and we write the entries of the matrix in box form.

[ 1  0.5  −2  4 ]
[ 0   3    3  2 ]
[ 0   0    1  5 ]
[ 0   0    0  2 ]

For matrix multiplication, you can just use a formula:

(AB)_ij = Σ_k A_ik B_kj

22 3 0 −1 0 0  0 4 −1 · 3 −2 2 0 0 5 0 To multiply a matrix M (n × m) by a vector ~x (m × 1), we use the formula P Mx = j Mijxj. (A vector is really just a matrix where the number of columns is always 1.) Assume that we have the following equalities and we want to solve for a vector ~x that satisfies all of them:

[1, 0.5, −2, 4] · x = −8

[0, 3, 3, 2] · x = 3
[0, 0, 1, 5] · x = −4
[0, 0, 0, 2] · x = 6

This can be written as a matrix equation as follows:

1 0.5 −2 4 −8 0 3 3 2  3    · ~x =   0 0 1 5 −4 0 0 0 2 6

Note that the matrix is really just a list/vector of vectors. One can interpret matrix multiplication as vector of vector multiplication. For example,

Multiply the following two matrices and interpret what is happening:

3 0 −1 0 0 −1 0 4 −1 · 3 −2 −1 2 0 0 5 0 −2

You can also think of matrix multiplication as an extension of the definition of dot products to vectors.
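The formula (AB)_ij = Σ_k A_ik B_kj translates directly into code. A minimal sketch, reusing the 3 × 3 and 3 × 2 matrices multiplied above:

```python
def matmul(A, B):
    # (AB)_ij = sum over k of A_ik * B_kj
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[3, 0, -1],
     [0, 4, -1],
     [2, 0, 0]]
B = [[0, 0],
     [3, -2],
     [5, 0]]
print(matmul(A, B))   # [[-5, 0], [7, -8], [0, 0]]
```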

14.1 Properties of matrix multiplication

Definition: These two properties define what it means to be a linear function.
1. For any constant c, c(Av) = A(cv)
2. For any vectors x and y, A(x + y) = Ax + Ay
Remember that the dot product is linear, so matrix multiplication must be linear too.
Claim: Linear functions must map the origin/0-vector to itself.
Proof: Assume that f(x) is a linear function and let 0 be the origin vector. Then f(0) = f(0 + 0) = f(0) + f(0) ⇒ f(0) = 0. ∎
Claim: For any linear function f(x), we have that f(Σ_i a_i x_i) = Σ_i a_i f(x_i).
Proof: Induction. ∎

14.2 Transposition The transpose of a matrix A is written as AT . It is what happens when you reverse the rows and columns of a matrix.

T 0 0  0 3 5 3 −2 =   0 −2 0 5 0

Is it true that (AB)^T = A^T B^T ? NO! (In fact, most of the time, it doesn't even make sense. Can you use that formula on a 3 × 2 times 2 × 5 matrix multiplication?) So what is the correct formula?
((AB)^T)_ij is the (j, i)th entry of AB. Using the dot product interpretation of matrix multiplication, this is the dot product of the jth row vector of A with the ith column vector of B. Notice that the jth row vector of A is the jth column vector of A^T, and the ith column vector of B is the ith row vector of B^T. This is (B^T A^T)_ij. So we get the formula

(AB)^T = B^T A^T

14.3 Application: error correction

Hamming codes. Richard Hamming (from the Manhattan project, where he programmed computers)⁴: Turing Award 1968.
Imagine that you are sending information over a line. Our goal will be to transmit 4 bits across the line in such a way that a single bit-flip error can be both detected and corrected if it occurs.
Definition: The identity matrix I_n is the matrix of size n × n with 1's along the main diagonal and 0's elsewhere. You should verify that multiplying any matrix M by the matrix I_n does not change M at all.
We will create a matrix G ≡ (I_k | −A^T), where I_k is the k × k identity matrix and A is a matrix that does a parity check. For the case where k = 4, we get

    [ 1  0  0  0  1  1  0 ]
G = [ 0  1  0  0  1  0  1 ]
    [ 0  0  1  0  0  1  1 ]
    [ 0  0  0  1  1  1  1 ]

Assume that our goal is to transmit the 4-bit vector x^T = (1 0 1 1). We perform the multiplication in the world of Z_2.

1 0 0 0 1 1 0 T   0 1 0 0 1 0 1   ~x G = 1 0 1 1   = 1 0 1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 1 1 1

So we send this sequence of 7 bits across the line. We receive the bits on the other end of the line, and the data is assumed to be the first 4 bits. To check whether an error has occurred over the line, we multiply by H ≡ (A | I_{n−k}). In our case,

1 1 0 1 1 0 0 H = 1 0 1 1 0 1 0 0 1 1 1 0 0 1

To check whether there has been an error, we take our 7-bit message x′ and

⁴Shortly before the first field test (you realize that no small-scale experiment can be done; either you have a critical mass or you do not), a man asked me to check some arithmetic he had done, and I agreed, thinking to fob it off on some subordinate. When I asked what it was, he said, "It is the probability that the test bomb will ignite the whole atmosphere." I decided I would check it myself! The next day when he came for the answers I remarked to him, "The arithmetic was apparently correct but I do not know about the formulas for the capture cross sections for oxygen and nitrogen; after all, there could be no experiments at the needed energy levels." He replied, like a physicist talking to a mathematician, that he wanted me to check the arithmetic not the physics, and left. I said to myself, "What have you done, Hamming, you are involved in risking all of life that is known in the Universe, and you do not know much of an essential part?" I was pacing up and down the corridor when a friend asked me what was bothering me. I told him. His reply was, "Never mind, Hamming, no one will ever blame you."

right-multiply by H^T.

T 1 1 0 1 1 0 0     1 0 1 1 0 1 0 1 0 1 1 0 1 0 = 0 0 0 0 1 1 1 0 0 1

This indicates that no error has occurred. Why not? Because x^T G H^T = x^T (G H^T) = x^T · 0 = 0. Now what if there happens to be an error in transmission? Then (assuming that it is a single bit flip), we get that the result will be (x^T G + e^T) H^T, where e is a 7 × 1 vector that consists of all zeroes except for a single 1 bit somewhere. Say, for example, that e^T = (0 1 0 0 0 0 0). Then we get the result

(1 0 1)

This is not 0, indicating an error. But where is the error in transmission? Notice that every column of H is different. The result is the transposed copy of the second column of H! How did we know that would happen? Because

(x^T G + e^T) H^T = x^T G H^T + e^T H^T = 0 + (He)^T = (He)^T

The error simply picks out the correct column of H and transposes it!
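The whole scheme fits in a few lines of code. A sketch using the G and H above, with all arithmetic mod 2 (helper names are mine):

```python
# Hamming (7,4) over Z_2, with G and H as in the text.
G = [[1,0,0,0,1,1,0],
     [0,1,0,0,1,0,1],
     [0,0,1,0,0,1,1],
     [0,0,0,1,1,1,1]]
H = [[1,1,0,1,1,0,0],
     [1,0,1,1,0,1,0],
     [0,1,1,1,0,0,1]]

def encode(x):
    # Row vector times G, mod 2.
    return [sum(x[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome(y):
    # y * H^T mod 2; all zeroes means no single-bit error detected.
    return [sum(y[j] * H[i][j] for j in range(7)) % 2 for i in range(3)]

def correct(y):
    s = syndrome(y)
    if any(s):
        # The syndrome equals the column of H where the flip occurred.
        cols = [[H[i][j] for i in range(3)] for j in range(7)]
        y[cols.index(s)] ^= 1
    return y[:4]   # the data is the first 4 bits

word = encode([1, 0, 1, 1])
print(word)                      # [1, 0, 1, 1, 0, 1, 0]
word[1] ^= 1                     # flip the second bit in transit
print(correct(word))             # [1, 0, 1, 1]
```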

15 From abstract to concrete

If you are given a matrix M, then it is easy to construct the function f(x) = Mx. Given a function f(x) that is guaranteed to be linear, it is more difficult to reconstruct the matrix M. That will be our goal. Define the standard basis vectors as follows: e_i is the column vector with a 1 in the ith position and 0's everywhere else. Now assume that M is the matrix whose columns are the vectors v_1, v_2, . . . , v_n.

If the function f(x) = Mx, then it is clear that f(e_i) = v_i. Example:

• Assume that you have a picture in 2D space and your goal is to scale it up/stretch it out by a factor of 2 in the horizontal direction. What is the function f(x), where x ∈ R², that does this? f(1, 0) = (2, 0), f(0, 1) = (0, 1) ⇒ M = [ 2 0 ; 0 1 ]
• What is the function f(x) that rotates the picture 90 degrees counterclockwise? f(1, 0) = (0, 1), f(0, 1) = (−1, 0) ⇒ M = [ 0 −1 ; 1 0 ]
• What is the function f(x) that rotates the picture θ degrees counterclockwise? f(1, 0) = (cos θ, sin θ), f(0, 1) = (− sin θ, cos θ) ⇒ M = [ cos θ  − sin θ ; sin θ  cos θ ]
• Trick question: What is the function f(x) that translates the picture by 1 in the x direction and 2 in the y direction? f(1, 0) = (2, 2), f(0, 1) = (1, 3) ⇒ M = [ 2 1 ; 2 3 ]. But does the function Mx applied to (0, 0) map to (1, 2)? No. Why not? This function f(x) is not linear. (In fact, it violates both of the linear properties as defined above.)

Claim: A linear function f(x) is one-to-one iff {x : f(x) = 0} = {0}.
Proof: Assume that f(x) is a linear function. First, assume that ∃x ≠ 0 such that f(x) = 0. We also know that f(0) = 0, which means that f(x) is not one-to-one. Now assume that f(x) is not one-to-one. Then ∃x ≠ y such that f(x) = f(y) ⇒ f(x) − f(y) = 0 ⇒ f(x − y) = 0 ⇒ {x : f(x) = 0} ≠ {0}. ∎
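The recipe "column i of M is f(e_i)" can be sketched as code. Here it recovers the 90-degree rotation matrix from the function alone (function names are mine):

```python
def matrix_of(f, n):
    """Recover the matrix of a linear map f: R^n -> R^n.
    Column i of the matrix is f(e_i)."""
    cols = [f([1 if j == i else 0 for j in range(n)]) for i in range(n)]
    # Transpose the list of columns into a list of rows.
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def rotate90(v):
    # Counterclockwise rotation by 90 degrees: (x, y) -> (-y, x).
    return [-v[1], v[0]]

print(matrix_of(rotate90, 2))   # [[0, -1], [1, 0]]
```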

16 Composition, bijection, and inverse

Assume that we are given two linear functions f(x) = Ax, where A is r × s, and g(y) = By, where B is s × t.
Claim: (f ∘ g)(x) = f(g(x)) = ABx (where x is a column vector)
Proof: Let the columns of B be b_1, b_2, . . . , b_t. Before we begin the proof, note that the ith column of the matrix AB is A·b_i. So we have

g(x) = Σ_{i=1}^{t} b_i x_i ⇒ f(g(x)) = f(Σ_{i=1}^{t} b_i x_i) = Σ_{i=1}^{t} x_i f(b_i) = Σ_{i=1}^{t} x_i (A b_i)
= x_1 (column 1 of AB) + . . . + x_t (column t of AB) = ABx ∎

This is a major result. It means that function composition is equivalent to matrix multiplication.

Definition: We say that A^−1 is the inverse of the square matrix A iff A^−1 A = I. Example:

 1 1   3  6 0 2 0 0 2 1 1 −1 4 1 A = 0 − 2 3  ⇒ A =  3 −2 − 3  2 1 3 0 0 2 0 − 2 We define the elementary row operations as follows. 1. Interchange two rows.

The elementary matrix corresponding to this operation is its own inverse.
2. Multiply a row by a nonzero constant m.

The inverse of this elementary matrix is the one with m replaced by 1/m.
3. Add a constant multiple m of one row to another.

The inverse of this elementary matrix is the one with m replaced by −m.
Our goal is to determine what A^−1 should be in the equation A^−1 A = I. Place A and I on opposite sides of a dividing line. Perform the same elementary row operations on both sides until A turns into I (if possible). Whatever I becomes must be A^−1. (Why?)

Example: Find the inverse of

[ 1/6   0   1/2 ]
[  0  −1/2  1/3 ]
[ 2/3   0    0  ]
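The Gauss-Jordan procedure just described can be sketched with exact rational arithmetic. This is a minimal sketch (no safeguards beyond searching for a nonzero pivot), and it reproduces the inverse of the example matrix from the text:

```python
from fractions import Fraction as F

def invert(A):
    """Gauss-Jordan: row-reduce [A | I] until the left half is I;
    the right half is then A^-1."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [row[:] + [F(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]          # row interchange
        M[col] = [x / M[col][col] for x in M[col]]   # scale pivot row to 1
        for r in range(n):
            if r != col and M[r][col] != 0:          # clear the rest of the column
                M[r] = [x - M[r][col] * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

A = [[F(1, 6), F(0), F(1, 2)],
     [F(0), F(-1, 2), F(1, 3)],
     [F(2, 3), F(0), F(0)]]
print([[str(x) for x in row] for row in invert(A)])
# [['0', '0', '3/2'], ['4/3', '-2', '-1/3'], ['2', '0', '-1/2']]
```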

16.1 Application: Solving linear equations

Example: Find the solution to the following linear equations:

(1/6)x + (1/2)z = 1,  −(1/2)y + (1/3)z = −2,  (2/3)x = 3

Notice that it's possible for a system of linear equations to have no solutions or infinitely many.
Example:
(1/6)x = 1,  −(1/2)y + (1/3)z = −2,  (2/3)x = 3
Example:
(1/6)x = 3/4,  −(1/2)y + (1/3)z = −2,  (2/3)x = 3

17 Gaussian Elimination

It turns out that it is much better in practice to solve linear equations via Gaussian elimination, because it is always best to avoid performing a matrix inversion if you can (numerical instability, etc.). Example: Use Gaussian elimination to solve the following:

2x2 + x3 = −8, x1 − 2x2 − 3x3 = 0, −x1 + x2 + 2x3 = 3

The answer should work out to be x1 = −4, x2 = −5, x3 = 2.
Example:
(1/6)x = 3/4,  −(1/2)y + (1/3)z = −2,  (2/3)x = 3
Infinitely many solutions.
Example: Use Gaussian elimination to solve the following:

x1 − 2x2 − 6x3 = 12,  2x1 + 4x2 + 12x3 = −17,  x1 − 4x2 − 12x3 = 22
No solution.
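Gaussian elimination with back-substitution can be sketched in a few lines with exact fractions; it solves the first example system above (a sketch that assumes a unique solution exists):

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Solve Ax = b: forward elimination, then back-substitution.
    Illustrative sketch only; assumes a unique solution."""
    n = len(A)
    M = [[F(x) for x in row] + [F(v)] for row, v in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]          # bring a nonzero pivot up
        for r in range(col + 1, n):                  # eliminate below the pivot
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    x = [F(0)] * n
    for i in reversed(range(n)):                     # back-substitute
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

A = [[0, 2, 1], [1, -2, -3], [-1, 1, 2]]
b = [-8, 0, 3]
print([int(v) for v in gauss_solve(A, b)])   # [-4, -5, 2]
```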

18 The Determinant

The determinant of a matrix can be considered to be the scaling factor of what happens to a unit square under the operation of the matrix. We will derive a formula for the determinant based on this quasi-definition. First, before we begin, note that "area" in this context can mean length in one dimension, area in two dimensions, volume in three dimensions, and so on. We assume that if we have two basis vectors that make up a 2-dimensional parallelogram, then reversing the two vectors will cause the viewer to see the "back" of the parallelogram; thus, a reversal will yield the same area as before but negated. Yes, area can be negative. We begin with three standard properties:

1. |I_n| = 1
2. Exchanging two rows will flip the sign of the determinant.
3. Stretching one of the dimensions by t will multiply the area by t; mathematically,

det [ ta  tb ; c  d ] = t · det [ a  b ; c  d ]

The determinant also behaves linearly on the rows of the matrix:

det [ a+a′  b+b′ ; c  d ] = det [ a  b ; c  d ] + det [ a′  b′ ; c  d ]

Now we can find the formula for any 2 by 2 matrix:

det [ a  b ; c  d ] = det [ a  0 ; c  d ] + det [ 0  b ; c  d ]

= det [ a  0 ; c  0 ] + det [ a  0 ; 0  d ] + det [ 0  b ; c  0 ] + det [ 0  b ; 0  d ]
= 0 + ad · det [ 1  0 ; 0  1 ] + bc · det [ 0  1 ; 1  0 ] + 0
= ad − bc

In fact, we can find the formula for any matrix. When we separate out the matrices so that there is only a single nonzero element per row, only a comparatively few have a nonzero determinant. In the 3 by 3 case, even though there are a total of 27 potential matrices with a single nonzero entry in each row, only 6 of them will be nonzero.

det [ a11  a12  a13 ; a21  a22  a23 ; a31  a32  a33 ] =
det [ a11  0  0 ; 0  a22  0 ; 0  0  a33 ] + det [ a11  0  0 ; 0  0  a23 ; 0  a32  0 ] + det [ 0  a12  0 ; a21  0  0 ; 0  0  a33 ] +
det [ 0  a12  0 ; 0  0  a23 ; a31  0  0 ] + det [ 0  0  a13 ; a21  0  0 ; 0  a32  0 ] + det [ 0  0  a13 ; 0  a22  0 ; a31  0  0 ]

= a11·a22·a33 − a11·a23·a32 − a12·a21·a33 + a12·a23·a31 + a13·a21·a32 − a13·a22·a31

Note that we get a term for every permutation of 3 numbers, because we get a single matrix for every sequence of answers to the questions: Which row contains the nonzero term in the first, second, and third columns? If the permutation is even (i.e. the number of row swaps is even), then we get a positive sign; otherwise, negative. This general idea extends to any dimension.

Examples

det [ 2  1 ; −1  0 ] = 1

det [ 3  4  2 ; −3  −5  −1 ; 0  0  1 ] = −3

det [ −1  −1  −2  0 ; 2  −3  1  0 ; 3  3  −3  −1 ; −3  −1  2  −2 ] = 124
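Cofactor expansion along the first row (an equivalent way of organizing the signed-permutation sum described above) is easy to code for small matrices, and it reproduces all three examples:

```python
def det(M):
    # Cofactor expansion along the first row; fine for small matrices.
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j+1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

print(det([[2, 1], [-1, 0]]))                        # 1
print(det([[3, 4, 2], [-3, -5, -1], [0, 0, 1]]))     # -3
print(det([[-1, -1, -2, 0], [2, -3, 1, 0],
           [3, 3, -3, -1], [-3, -1, 2, -2]]))        # 124
```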

19 Markov Processes

Assume that you are observing a process with transition matrix as follows:

.5 0 .5 .2 .6 .2 1 0 0

If you are in state 1, where are you going to arrive in one time step and with what probabilities? If you are currently in state 3, where do you arrive next time? Once you leave state 2, can you ever return? If you start in a particular state, say state 1, then the probabilities that you arrive in each of the three possible states are computed as follows:

.5 0 .5 (1, 0, 0) .2 .6 .2 = (.5, 0,.5) 1 0 0 exactly as expected. Now assume that you take 2 steps starting from state 1:

.5 0 .5 3 1 (.5, 0,.5) .2 .6 .2 = ( , 0, )   4 4 1 0 0

Now 3 steps:

                [ .5  0   .5 ]
(.75, 0, .25) · [ .2  .6  .2 ] = (5/8, 0, 3/8)
                [ 1   0   0  ]

...and so on.

19.1 Steady state

What happens in the long run? We want a row vector x = (x1, x2, x3) such that

.5 0 .5 x .2 .6 .2 = x 1 0 0

This will indicate that the process does nothing to the limiting distribution. Of course, this yields a system of 3 equations with 3 unknowns:

.5x1 + .2x2 + x3 = x1

.6x2 = x2

.5x1 + .2x2 = x3

which yields the solution x2 = 0 and x1 = 2x3. Recalling that x1 + x2 + x3 = 1, we get that x1 = 2/3, x3 = 1/3.
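Iterating the chain numerically shows the distribution settling to this steady state (a quick sketch; 50 steps is plenty here since the other eigenvalue magnitudes are below 1):

```python
def step(dist, M):
    # One step of the chain: row vector times the transition matrix.
    return tuple(sum(dist[i] * M[i][j] for i in range(3)) for j in range(3))

M = [[0.5, 0.0, 0.5],
     [0.2, 0.6, 0.2],
     [1.0, 0.0, 0.0]]

dist = (1.0, 0.0, 0.0)               # start in state 1
for _ in range(50):
    dist = step(dist, M)
print([round(p, 6) for p in dist])   # [0.666667, 0.0, 0.333333]
```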

19.2 Eigensystems

Let's assume that we want a simple formula to tell us where the system is at any given point in time. If we assume that we start in the state s = (s1, s2, s3), then what is your probability distribution after n steps of the process? To determine this, we need to find the eigenvalues of the transition matrix M. These are the values λ such that for nonzero x, xM = λx ⇒ x(M − λI) = 0 ⇒ det(M − λI) = 0. So we get

det [ .5−λ  0  .5 ; .2  .6−λ  .2 ; 1  0  −λ ] = 0 ⇒

−λ³ + (11/10)λ² + (1/5)λ − 3/10 = 0 ⇒

λ1 = 1,  λ2 = 3/5,  λ3 = −1/2

Now we need to determine the x values that correspond with these λs.

1. λ1 = 1 ⇒ xM = x ⇒

(1/2)x1 + (1/5)x2 + x3 = x1,  (3/5)x2 = x2,  (1/2)x1 + (1/5)x2 = x3 ⇒

x1 = 2x3, x2 = 0 ⇒
Normalizing so that x1 + x2 + x3 = 1 ⇒ x = (2/3, 0, 1/3). This happens to be the same as the steady state distribution.

2. λ2 = 3/5 ⇒ xM = (3/5)x ⇒

x1 = (8/3)x3,  x2 = −(11/3)x3

It is interesting that this case is impossible to normalize... the sum of the variables must always equal 0. We can, for the sake of concreteness, assume that x3 = 3 ⇒ x = (8, −11, 3).

3. λ3 = −1/2 ⇒ xM = −(1/2)x ⇒

x1 = −x3, x2 = 0

Once again, it is not possible to normalize. Letting x3 = 1 ⇒ x = (−1, 0, 1).

Let us define the matrix X to have x1 along the first row, x2 along the second row, and x3 along the third row. So

 2 1  3 0 3 X ≡  8 −11 3 −1 0 1

Let us define

    [ λ1  0   0  ]
E ≡ [ 0   λ2  0  ]
    [ 0   0   λ3 ]

It is clear from all that we have done above that

XM = EX ⇒ M = X^−1 E X

Taking n steps from a start state s implies that we are calculating

sM^n = s(X^−1 E X)^n = sX^−1 E^n X

For concreteness, we have that

 1  1 0 − 3 −1 1 2 X = 1 − 11 − 33  ⇒ 2 1 0 3

 2 1 1 n 1 1 1 n  3 + 3 (− 2 ) 0 3 − 3 (− 2 ) −1 n 2 8 3 n 2 1 n 3 n 1 2 1 n 3 3 n X E X =  3 − 11 ( 5 ) + 33 (− 2 ) ( 5 ) 3 − 33 (− 2 ) − 11 ( 5 )  2 2 1 n 1 2 1 n 3 − 3 (− 2 ) 0 3 + 3 (− 2 ) To check that this is correct, plug in n = 0 and make sure that the correct matrix pops out (HINT: the identity). Plug in n = 1 and make sure that the correct matrix pops out (HINT: M). So, in general, if you start in some state s = (s1, s2, s3), then you can find the probability that you are in any given state exactly at any given time n by computing sX−1EnX.

To find the limiting distribution, let's find out what happens when n → ∞...

 2 1  3 0 3 −1 n 2 1 lim sX E X = s  3 0 3  = n→∞ 2 1 3 0 3 2 1 2 1 ( (s + s + s ), 0, (s + s + s )) = ( , 0, ) 3 1 2 3 3 1 2 3 3 3 So, eventually, independent of where you started, you will eventually spend two-thirds of your time in state 1 and one-third of your time in state 3.

20 Computing the SVD

(WARNING: There are some typos in this section. They will be corrected in a later version.)
Consider any m × n matrix A. Our goal is to decompose A in a similar way to the eigenvector method from before. We want to create three matrices U, Σ, V such that A = UΣV^T, where the matrices have the following properties. U will be m × m and V will be n × n. The center matrix Σ will be m × n, just like A. In addition, we need the following: The columns of U are orthonormal eigenvectors of AA^T (so U^T U = I). The columns of V are orthonormal eigenvectors of A^T A (so V^T V = I). Σ has the square roots of the eigenvalues (the singular values) along its diagonal in descending order.
Of course, because A is not necessarily square, we cannot proceed directly with an eigendecomposition. Instead, we work with AA^T. As an example, let

A = [ 3  2  2 ; 2  3  −2 ]

Then we get that

AA^T = [ 17  8 ; 8  17 ]

Note that this is both symmetric and square (and will always be so). We proceed with an eigendecomposition of this matrix. The characteristic polynomial is x² − 34x + 225 = (x − 25)(x − 9) = 0 ⇒ x = 25, 9. We define the singular values to be the square roots of the eigenvalues of AA^T (and, by convention, we label them from largest to smallest). Thus, σ1 = 5 and σ2 = 3.
We now proceed to compute A^T A and determine its eigenvectors. In our example, the three eigenvalues are 25, 9, and 0. The unit eigenvector corresponding to 25 is (1/√2, 1/√2, 0)^T (after normalization). The unit eigenvector corresponding to 9 is (1/√18, −1/√18, 4/√18)^T. The final eigenvector, corresponding to 0, is (2/3, −2/3, −1/3)^T. So now we know V.

34 know V .  √1 √1 2/3  2 18 V =  √1 − √1 −2/3  2 18  0 √4 −1/3 18 Now we compute AAT and determine its eigenvectors. The two eigenvalues " √1 # " √1 # and 25 and 9 with eigenvectors (normalized) 2 and 2 respectively. √1 − √1 2 2 These vectors should be placed along the rows of U. Now we know U. " # √1 − √1 U = 2 2 √1 √1 2 2

For the matrix Σ, we create a matrix of the appropriate size and put the square roots of the eigenvalues along the diagonal in descending order. It is an interesting fact that the nonzero eigenvalues of AA^T and A^T A will always be the same, and their square roots are the singular values.

Σ = [ 5  0  0 ; 0  3  0 ]

Now notice that A = UΣV T . Why is that true?

(UΣV^T)^T · (UΣV^T) = VΣ^T U^T UΣV^T = VΣ^T ΣV^T

shows that this is an eigendecomposition of A^T A because of the way we constructed Σ and V.

(UΣV^T) · (UΣV^T)^T = UΣΣ^T U^T

shows that this is an eigendecomposition of AA^T because of the way we constructed Σ and U.
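For this small example, the eigenvalues of AA^T, and hence the singular values, can be checked directly with the quadratic formula (a quick numeric sketch):

```python
import math

A = [[3, 2, 2], [2, 3, -2]]

# AA^T for the 2x3 matrix A: a 2x2 symmetric matrix.
AAT = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(2)]
       for i in range(2)]
print(AAT)   # [[17, 8], [8, 17]]

# Eigenvalues of [[a, b], [b, d]] are the roots of
# x^2 - (a+d)x + (ad - b^2) = 0.
a, b, d = AAT[0][0], AAT[0][1], AAT[1][1]
disc = math.sqrt((a + d) ** 2 - 4 * (a * d - b * b))
eig_hi, eig_lo = (a + d + disc) / 2, (a + d - disc) / 2
print(math.sqrt(eig_hi), math.sqrt(eig_lo))   # 5.0 3.0
```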

21 Singular Value Decomposition Application: Best Road

Example problem/Motivation: Assume that you are given the locations of m houses a1, a2, . . . , am ∈ R², and the goal is to design a main thoroughfare through the neighborhood. The constraints are that (a) it must pass through the origin and (b) the sum of the squared distances from each house to the road must be minimized among all possible roads with property (a). What road should be drawn? We will only specify the direction for the road and hence attempt to specify a unit vector v.
Given any unit vector v, for each vector ai, we can write ai = ai∥ + ai⊥, where ai∥ is the projection of ai onto v and ai⊥ ≡ ai − ai∥.

By the Pythagorean Theorem, we know that

$$\|a_i\|^2 = \|a_i^{\parallel v}\|^2 + \|a_i^{\perp v}\|^2 \;\Rightarrow\; \sum_i \|a_i^{\perp v}\|^2 = \sum_i \|a_i\|^2 - \sum_i \|a_i^{\parallel v}\|^2$$

So minimizing $\sum_i \|a_i^{\perp v}\|^2$ is equivalent to maximizing $\sum_i \|a_i^{\parallel v}\|^2$. We know by the definition of the dot product that $(a_i \cdot v)v = a_i^{\parallel v} \Rightarrow (a_i \cdot v)^2 = \|a_i^{\parallel v}\|^2$. Hence we are really attempting to maximize

$$\sum_i (a_i \cdot v)^2.$$

But notice that if A is the matrix whose rows are the vectors $a_i^T$, then Av can be represented as follows:

$$Av = \begin{bmatrix} a_1 \cdot v \\ a_2 \cdot v \\ \vdots \\ a_m \cdot v \end{bmatrix},$$

which means that maximizing $\sum_i (a_i \cdot v)^2$ is equivalent to maximizing $\|Av\|^2$, or equivalently, $\|Av\|$. So we take the SVD and choose the v with the maximum singular value.

Note: To remove the constraint that the road go through the origin, compute the centroid of the points and then subtract the centroid from each point to create a new problem. Find the optimal line through the origin that solves the new problem, and then adjust it by adding the centroid back to the line you get.

Example: Let the points be (3, 2), (2, 3), (−2, 2) and mandate that the road go through the origin. Then

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 3 \\ -2 & 2 \end{bmatrix}.$$

The singular value decomposition of A is

 1 1  √ − √ −2/3   " # 2 18 5 0 √1 − √1 A =  √1 √1 2/3  0 3 2 2  2 18    √1 √1 0 − √4 1/3 0 0 2 2 18

The v with the maximum singular value is

$$v_1 = \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}$$

with singular value 5. This shows that we should take the line with slope 1, that is, y = x, as the equation of the road.
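The road-fitting recipe is easy to run numerically. A minimal NumPy sketch for the example above (the points are the rows of A, and the best direction is the first right singular vector; its sign is arbitrary, but the slope is not):

```python
import numpy as np

# Rows of A are the house locations from the example.
A = np.array([[3.0, 2.0],
              [2.0, 3.0],
              [-2.0, 2.0]])

# The best direction is the right singular vector with the largest
# singular value: the first row of V^T.
_, s, Vt = np.linalg.svd(A)
v = Vt[0]

print(s[0])         # largest singular value: 5
print(v[1] / v[0])  # slope of the road: 1, i.e. the line y = x
```

To drop the through-the-origin constraint, subtract the centroid `A.mean(axis=0)` from every row before taking the SVD, then shift the fitted line back by the centroid, as in the Note above.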

22 Singular Value Decomposition Application: Picture Compression/Noise Cancellation

Our goal during this section will be to find the best matrix B approximating a general matrix A, where B has a special form: every row of B is a scalar multiple of a single row vector $v^T$, so that $B = uv^T$ for some column vectors u and v. We call a matrix of this form a rank-one matrix. (As an application, notice that a black-and-white picture can really be thought of as a matrix of white-intensities. Think about how much we could compress the picture if we could store all of its information in just two vectors u and v.)

We use the formula $a_i^{\parallel v} = (a_i \cdot v)v$, where $a_i^T$ is the i-th row of A, to write the rank-one matrix in the following form. (Recall that v is a column vector.)

$$\begin{bmatrix} (a_1 \cdot v)\,v^T \\ (a_2 \cdot v)\,v^T \\ \vdots \\ (a_m \cdot v)\,v^T \end{bmatrix} = \begin{bmatrix} a_1 \cdot v \\ a_2 \cdot v \\ \vdots \\ a_m \cdot v \end{bmatrix} v^T = (Av)\,v^T$$

Let’s use the SVD and let $A = U\Sigma V^T$. We can let v be one of the columns of V! Then we would get that $Av = U\Sigma V^T v = \sigma u$, where u is the corresponding column of U and σ is its associated singular value. Which vector should we pick? The one with the largest singular value: that choice maximizes $\|Av\| = \sigma$, which by the argument of the previous section gives the best rank-one approximation.

Note: If we want to send an image, we can use the singular value decomposition to send the entire image perfectly (while significantly increasing the amount of data that we are required to send). Reducing many of the singular values to 0 significantly decreases the amount of data to send and serves as a good method of compression (which is essentially what we did above). It’s a very cool fact that taking small singular values and reducing them down to 0 can very often eliminate noise in the picture. Noise that is introduced into a picture often just takes the intensity of a pixel and changes it slightly; noise can therefore often be regarded as a component of the picture with small singular values.
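The compression-and-denoising idea can be sketched with a toy "image": a rank-one pattern standing in for the signal, plus small pixel noise. This is a minimal illustration, not real image handling; the 8 × 8 size and the noise level 0.01 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "image": a rank-one pattern (the signal) plus small pixel noise.
u = rng.random(8)
v = rng.random(8)
image = np.outer(u, v) + 0.01 * rng.standard_normal((8, 8))

# Keep only the k largest singular values and zero out the rest.
U, s, Vt = np.linalg.svd(image)
k = 1
compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storage drops from 8*8 = 64 numbers to k*(8 + 1 + 8) = 17, and the
# discarded small singular values carry mostly noise: the truncated
# matrix stays close to the clean rank-one signal.
print(np.abs(compressed - np.outer(u, v)).max())  # small relative to the signal
```

Sending `U[:, :k]`, `s[:k]`, and `Vt[:k, :]` instead of the whole matrix is exactly the compression scheme described in the note above.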
